Simply return -EINVAL if j_fc_off is invalid to avoid repeated check of
ret.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240801013815.2393869-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After jbd2_mark_journal_empty, journal log is supposed to be empty.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240801013815.2393869-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Move escape handle to futher improve code readability and remove some
repeat check.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240801013815.2393869-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It's more intuitive to use jh_in->b_frozen_data directly instead of
done_copy_out variable. Simply remove unneeded done_copy_out variable
and use b_frozen_data instead.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240801013815.2393869-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove kmap for page of b_frozen_data from jbd2_alloc() which always
provides an address from the direct kernel mapping.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240801013815.2393869-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Remove unused return value of jbd2_fc_release_bufs.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240801013815.2393869-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We will alloc journal_head with __GFP_NOFAIL anyway, test for failure
is pointless.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240801013815.2393869-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Function jbd2_journal_shrink_checkpoint_list() assumes that '0' is not a
valid value for transaction IDs, which is incorrect. Don't assume that and
use two extra boolean variables to control the loop iterations and keep
track of the first and last tid.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240724161119.13448-4-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Function __jbd2_log_wait_for_space() assumes that '0' is not a valid value
for transaction IDs, which is incorrect. Don't assume that and invoke
jbd2_log_wait_commit() if the journal had a committing transaction instead.
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240724161119.13448-3-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
In __jbd2_log_wait_for_space(), we might call jbd2_cleanup_journal_tail()
to recover some journal space. But if an error occurs while executing
jbd2_cleanup_journal_tail() (e.g., an EIO), we don't stop waiting for free
space right away, we try other branches, and if j_committing_transaction
is NULL (i.e., the tid is 0), we will get the following complain:
============================================
JBD2: I/O error when updating journal superblock for sdd-8.
__jbd2_log_wait_for_space: needed 256 blocks and only had 217 space available
__jbd2_log_wait_for_space: no way to get more journal space in sdd-8
------------[ cut here ]------------
WARNING: CPU: 2 PID: 139804 at fs/jbd2/checkpoint.c:109 __jbd2_log_wait_for_space+0x251/0x2e0
Modules linked in:
CPU: 2 PID: 139804 Comm: kworker/u8:3 Not tainted 6.6.0+ #1
RIP: 0010:__jbd2_log_wait_for_space+0x251/0x2e0
Call Trace:
<TASK>
add_transaction_credits+0x5d1/0x5e0
start_this_handle+0x1ef/0x6a0
jbd2__journal_start+0x18b/0x340
ext4_dirty_inode+0x5d/0xb0
__mark_inode_dirty+0xe4/0x5d0
generic_update_time+0x60/0x70
[...]
============================================
So only if jbd2_cleanup_journal_tail() returns 1, i.e., there is nothing to
clean up at the moment, continue to try to reclaim free space in other ways.
Note that this fix relies on commit 6f6a6fda29 ("jbd2: fix ocfs2 corrupt
when updating journal superblock fails") to make jbd2_cleanup_journal_tail
return the correct error code.
Fixes: 8c3f25d895 ("jbd2: don't give up looking for space so easily in __jbd2_log_wait_for_space")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240718115336.2554501-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When a full journal commit is on-going, any fast commit has to be enqueued
into a different queue: FC_Q_STAGING instead of FC_Q_MAIN. This enqueueing
is done only once, i.e. if an inode is already queued in a previous fast
commit entry it won't be enqueued again. However, if a full commit starts
_after_ the inode is enqueued into FC_Q_MAIN, the next fast commit needs to
be done into FC_Q_STAGING. And this is not being done in function
ext4_fc_track_template().
This patch fixes the issue by re-enqueuing an inode into the STAGING queue
during the fast commit clean-up callback when doing a full commit. However,
to prevent a race with a fast-commit, the clean-up callback has to be called
with the journal locked.
This bug was found using fstest generic/047. This test creates several 32k
bytes files, sync'ing each of them after it's creation, and then shutting
down the filesystem. Some data may be loss in this operation; for example a
file may have it's size truncated to zero.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240717172220.14201-1-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Originally, we were quite conservative in limiting maximum transaction
size to a quarter of the journal because we were not accounting
transaction descriptor and revoke blocks. These days we do properly
account them and reserve space for them from the total transaction
credits. Thus there's no need to be so conservative and we can increase
the maximum transaction size to one third of the journal (even half
should work fine in principle but the performance will likely suffer in
that case). This also fixes failures to grow filesystems with tiny
journals.
Link: CA+hUFcuGs04JHZ_WzA1zGN57+ehL2qmHOt5a7RMpo+rv6Vyxtw@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240701132800.7158-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In jbd2_journal_init_common() we set batch size of a shrinker shrinking
checkpointed buffers to journal->j_max_transaction_buffers. But that is
guaranteed to be 0 at that point so we effectively stay with the default
shrinker batch size of 128. It has been like this since introduction of
jbd2 shrinkers so just drop the pointless initialization.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240624170127.3253-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Commit 9f356e5a4f ("jbd2: Account descriptor blocks into
t_outstanding_credits") started to account descriptor blocks into
transactions outstanding credits. However it didn't appropriately
decrease the maximum amount of credits available to userspace. Thus if
the filesystem requests a transaction smaller than
j_max_transaction_buffers but large enough that when descriptor blocks
are added the size exceeds j_max_transaction_buffers, we confuse
add_transaction_credits() into thinking previous handles have grown the
transaction too much and enter infinite journal commit loop in
start_this_handle() -> add_transaction_credits() trying to create
transaction with enough credits available.
Fix the problem by properly accounting for transaction space reserved
for descriptor blocks when verifying requested transaction handle size.
CC: stable@vger.kernel.org
Fixes: 9f356e5a4f ("jbd2: Account descriptor blocks into t_outstanding_credits")
Reported-by: Alexander Coffin <alex.coffin@maticrobots.com>
Link: https://lore.kernel.org/all/CA+hUFcuGs04JHZ_WzA1zGN57+ehL2qmHOt5a7RMpo+rv6Vyxtw@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240624170127.3253-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Instead of computing the number of descriptor blocks a transaction can
have each time we need it (which is currently when starting each
transaction but will become more frequent later) precompute the number
once during journal initialization together with maximum transaction
size. We perform the precomputation whenever journal feature set is
updated similarly as for computation of
journal->j_revoke_records_per_block.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240624170127.3253-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There's no reason to have jbd2_journal_get_max_txn_bufs() public
function. Currently all users are internal and can use
journal->j_max_transaction_buffers instead. This saves some unnecessary
recomputations of the limit as a bonus which becomes important as this
function gets more complex in the following patch.
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20240624170127.3253-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We encountered a problem that the file system could not be mounted in
the power-off scenario. The analysis of the file system mirror shows that
only part of the data is written to the last commit block.
The valid data of the commit block is concentrated in the first sector.
However, the data of the entire block is involved in the checksum calculation.
For different hardware, the minimum atomic unit may be different.
If the checksum of a committed block is incorrect, clear the data except the
'commit_header' and then calculate the checksum. If the checkusm is correct,
it is considered that the block is partially committed, Then continue to replay
journal.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240620072405.3533701-1-yebin@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
jbd2_transaction_committed() is used to check whether a transaction with
the given tid has already committed, it holds j_state_lock in read mode
and check the tid of current running transaction and committing
transaction, but holding the j_state_lock is expensive.
We have already stored the sequence number of the most recently
committed transaction in journal t->j_commit_sequence, we could do this
check by comparing it with the given tid instead. If the given tid isn't
smaller than j_commit_sequence, we can ensure that the given transaction
has been committed. That way we could drop the expensive lock and
achieve about 10% ~ 20% performance gains in concurrent DIOs on may
virtual machine with 100G ramdisk.
fio -filename=/mnt/foo -direct=1 -iodepth=10 -rw=$rw -ioengine=libaio \
-bs=4k -size=10G -numjobs=10 -runtime=60 -overwrite=1 -name=test \
-group_reporting
Before:
overwrite IOPS=88.2k, BW=344MiB/s
read IOPS=95.7k, BW=374MiB/s
rand overwrite IOPS=98.7k, BW=386MiB/s
randread IOPS=102k, BW=397MiB/s
After:
overwrite IOPS=105k, BW=410MiB/s
read IOPS=112k, BW=436MiB/s
rand overwrite IOPS=104k, BW=404MiB/s
randread IOPS=111k, BW=432MiB/s
CC: Dave Chinner <david@fromorbit.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Link: https://lore.kernel.org/linux-ext4/ZjILCPNZRHeazSqV@dread.disaster.area/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/20240520131831.2910790-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We only need to sleep if no running transaction is expired. Simply remove
unnecessary "should_sleep".
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-10-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We always set JBD2_UNMOUNT with j_state_lock held in journal_kill_thread.
In kjournald2, we check JBD2_UNMOUNT flag two times under the same
j_state_lock. Then the second check is unnecessary.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-9-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The j_commit_[sequence/request] are updated with j_state_lock held during
runtime. In kjournald2, two equality checks of j_commit_[sequence/request]
are under the same j_state_lock, then the second check is unnecessary.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-8-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
We save jh2bh(jh_in) to bh_in, so use bh_in directly instead of
jh2bh(jh_in) to simplify the code.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-7-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The data to do escape could be accessed directly from b_frozen_data,
just remove unneeded kmap.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If b_frozen_data is created concurrently, we can update new_folio and
new_offset with b_frozen_data and then move forward
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As we only need to copy out when we should do escape, need_copy_out
could be simply replaced by "do_escape".
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The done_copy_out info from jbd2_journal_write_metadata_buffer is not
used. Simply remove it.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The new_bh is from alloc_buffer_head, we should call free_buffer_head to
free it in error case.
Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240514112438.1269037-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Replacement of bdev->bd_inode with sane(r) set of primitives.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwjlgAKCRBZ7Krx/gZQ
66OmAP9nhZLASn/iM2+979I6O0GW+vid+uLh48uW3d+LbsmVIgD9GYpR+cuLQ/xj
mJESWfYKOVSpFFSrqlzKg9PQlU/GFgs=
=6LRp
-----END PGP SIGNATURE-----
Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull bdev bd_inode updates from Al Viro:
"Replacement of bdev->bd_inode with sane(r) set of primitives by me and
Yu Kuai"
* tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RIP ->bd_inode
dasd_format(): killing the last remaining user of ->bd_inode
nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode
block/bdev.c: use the knowledge of inode/bdev coallocation
gfs2: more obvious initializations of mapping->host
fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping
blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here...
grow_dev_folio(): we only want ->bd_inode->i_mapping there
use ->bd_mapping instead of ->bd_inode->i_mapping
block_device: add a pointer to struct address_space (page cache of bdev)
missing helpers: bdev_unhash(), bdev_drop()
block: move two helpers into bdev.c
block2mtd: prevent direct access of bd_inode
dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode)
blkdev_write_iter(): saner way to get inode and bdev
bcachefs: remove dead function bdev_sectors()
ext4: remove block_device_ejected()
erofs_buf: store address_space instead of inode
erofs: switch erofs_bread() to passing offset instead of block number
As 'shrink_type' is exported. The module prefix 'jbd2' is added to
distinguish from memory reclamation.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240407065355.1528580-3-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
"enum shrink_type" can clearly express the meaning of the parameter of
__jbd2_journal_clean_checkpoint_list(), and there is no need to use the
bool type.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://lore.kernel.org/r/20240407065355.1528580-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The variable err is being assigned a value that is never read, it
is being re-assigned inside the following while loop and also
after the while loop. The assignment is redundant and can be
removed.
Cleans up clang scan build warning:
fs/jbd2/commit.c:574:2: warning: Value stored to 'err' is never
read [deadcode.DeadStores]
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240410112803.232993-1-colin.i.king@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This is a replacement solution of commit bc71726c72 ("ext4: abort
the filesystem if failed to async write metadata buffer"), JBD2 can
detect metadata writeback error of fs dev by 'j_fs_dev_wb_err'.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-5-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since 'JBD2_CHECKPOINT_IO_ERROR' and j_atomic_flags' are not useful
anymore after fs dev's errseq is imported into jbd2, just remove them.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-4-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now JBD2 detects metadata writeback error of fs dev according to errseq.
Replace journal state flag by checking errseq.
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-3-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Add errseq in journal, so that JBD2 can detect whether metadata is
successfully written to fs bdev. This patch adds detection in recovery
process to replace original solution(using local variable wb_err).
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231213013224.2100050-2-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There's issue when do io test:
WARN: soft lockup - CPU#45 stuck for 11s! [jbd2/dm-2-8:4170]
CPU: 45 PID: 4170 Comm: jbd2/dm-2-8 Kdump: loaded Tainted: G OE
Call trace:
dump_backtrace+0x0/0x1a0
show_stack+0x24/0x30
dump_stack+0xb0/0x100
watchdog_timer_fn+0x254/0x3f8
__hrtimer_run_queues+0x11c/0x380
hrtimer_interrupt+0xfc/0x2f8
arch_timer_handler_phys+0x38/0x58
handle_percpu_devid_irq+0x90/0x248
generic_handle_irq+0x3c/0x58
__handle_domain_irq+0x68/0xc0
gic_handle_irq+0x90/0x320
el1_irq+0xcc/0x180
queued_spin_lock_slowpath+0x1d8/0x320
jbd2_journal_commit_transaction+0x10f4/0x1c78 [jbd2]
kjournald2+0xec/0x2f0 [jbd2]
kthread+0x134/0x138
ret_from_fork+0x10/0x18
Analyzed informations from vmcore as follows:
(1) There are about 5k+ jbd2_inode in 'commit_transaction->t_inode_list';
(2) Now is processing the 855th jbd2_inode;
(3) JBD2 task has TIF_NEED_RESCHED flag;
(4) There's no pags in address_space around the 855th jbd2_inode;
(5) There are some process is doing drop caches;
(6) Mounted with 'nodioread_nolock' option;
(7) 128 CPUs;
According to informations from vmcore we know 'journal->j_list_lock' spin lock
competition is fierce. So journal_finish_inode_data_buffers() maybe process
slowly. Theoretically, there is scheduling point in the filemap_fdatawait_range_keep_errors().
However, if inode's address_space has no pages which taged with PAGECACHE_TAG_WRITEBACK,
will not call cond_resched(). So may lead to soft lockup.
journal_finish_inode_data_buffers
filemap_fdatawait_range_keep_errors
__filemap_fdatawait_range
while (index <= end)
nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end, PAGECACHE_TAG_WRITEBACK);
if (!nr_pages)
break; --> If 'nr_pages' is equal zero will break, then will not call cond_resched()
for (i = 0; i < nr_pages; i++)
wait_on_page_writeback(page);
cond_resched();
To solve above issue, add scheduling point in the journal_finish_inode_data_buffers();
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231211112544.3879780-1-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Current jbd2 only add REQ_SYNC for descriptor block, metadata log
buffer, commit buffer and superblock buffer, the submitted IO could be
throttled by writeback throttle in block layer, that could lead to
priority inversion in some cases. The log IO looks like a kind of high
priority metadata IO, so it should not be throttled by WBT like QOS
policies in block layer, let's add REQ_SYNC | REQ_IDLE to exempt from
writeback throttle, and also add REQ_META together indicates it's a
metadata IO.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231129114740.2686201-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The write_flags print in the trace of jbd2_write_superblock() is not
real, so move the modification before the trace.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20231129114740.2686201-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory' feature.
To increase the feature's checking coverage. "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes is possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In thes series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA
jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y
FgeUPAD1oasg6CP+INZvCj34waNxwAc=
=E+Y4
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compation maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park i
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature. To increase the feature's checking coverage. 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmmemmep optimizaton code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes is possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In thes series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
JBD2 makes sure journal data is fallen on fs device by sync_blockdev(),
however, other process could intercept the EIO information from bdev's
mapping, which leads journal recovering successful even EIO occurs during
data written back to fs device.
We found this problem in our product, iscsi + multipath is chosen for block
device of ext4. Unstable network may trigger kpartx to rescan partitions in
device mapper layer. Detailed process is shown as following:
mount kpartx irq
jbd2_journal_recover
do_one_pass
memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal
mark_buffer_dirty // mark bh dirty
vfs_read
generic_file_read_iter // dio
filemap_write_and_wait_range
__filemap_fdatawrite_range
do_writepages
block_write_full_folio
submit_bh_wbc
>> EIO occurs in disk <<
end_buffer_async_write
mark_buffer_write_io_error
mapping_set_error
set_bit(AS_EIO, &mapping->flags) // set!
filemap_check_errors
test_and_clear_bit(AS_EIO, &mapping->flags) // clear!
err2 = sync_blockdev
filemap_write_and_wait
filemap_check_errors
test_and_clear_bit(AS_EIO, &mapping->flags) // false
err2 = 0
Filesystem is mounted successfully even data from journal is failed written
into disk, and ext4/ocfs2 could become corrupted.
Fix it by comparing the wb_err state in fs block device before recovering
and after recovering.
A reproducer can be found in the kernel bugzilla referenced below.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230919012525.1783108-1-chengzhihao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now, if check data block checksum failed only print data's block number
then skip write data. However, one data block may in more than one transaction.
In some scenarios, offline analysis is inconvenient. As a result, it is
difficult to locate the areas where data is faulty.
So print 'io_block' if check data block checksum failed.
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230904105817.1728356-2-yebin10@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In preparation for implementing lockless slab shrink, use new APIs to
dynamically allocate the jbd2-journal shrinker, so that it can be freed
asynchronously via RCU. Then it doesn't need to wait for RCU read-side
critical section when releasing the struct journal_s.
Link: https://lkml.kernel.org/r/20230911094444.68966-32-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Abhinav Kumar <quic_abhinavk@quicinc.com>
Cc: Alasdair Kergon <agk@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Bob Peterson <rpeterso@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Carlos Llamas <cmllamas@google.com>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Chris Mason <clm@fb.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Chuck Lever <cel@kernel.org>
Cc: Coly Li <colyli@suse.de>
Cc: Dai Ngo <Dai.Ngo@oracle.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Airlie <airlied@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Gao Xiang <hsiangkao@linux.alibaba.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Jeffle Xu <jefflexu@linux.alibaba.com>
Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Kirill Tkhai <tkhai@ya.ru>
Cc: Marijn Suijten <marijn.suijten@somainline.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nadav Amit <namit@vmware.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Olga Kornievskaia <kolga@netapp.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Rob Clark <robdclark@gmail.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Sean Paul <sean@poorly.run>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Song Liu <song@kernel.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Steven Price <steven.price@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com>
Cc: Tom Talpey <tom@talpey.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmUGh1YACgkQ8vlZVpUN
gaN9lQgAqmMWu3xLwOERgVbK3CYT8WMcv0m9/by+vSwghCoPVDWWENgEgAzo4YpK
Lsp4q62wHaWs6AzvJEaJ8ryedo7e4FUHxcvp2f6dCuOPadOEZZZTa4G5fAr0kYXS
TIoaFtv6F2QVnGU6Y5lhtfYzmgLRdLL0B6MfSTYGO2MSREqxapvfxyGBQdkOuXfO
UEtrUUEqQ2GdDcKp+FRRnaUvNaTPEESY8d5eVwrMmyUhQWUQL/N2BPbFkk1TP6RG
MLDNsUZpdhZvLs6qLuR7dvO5wa2fshvRJIXlPINM0R0as5LmHqVL/ifCNkCn4W+k
ZNvdSPhqew68KHHq3sYFtm9rbZ3YOA==
=DopS
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Regression and bug fixes for ext4"
* tag 'ext4_for_linus-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix rec_len verify error
ext4: do not let fstrim block system suspend
ext4: move setting of trimmed bit into ext4_try_to_trim_range()
jbd2: Fix memory leak in journal_init_common()
jbd2: Remove page size assumptions
buffer: Make bh_offset() work for compound pages
There is a memory leak reported by kmemleak:
unreferenced object 0xff11000105903b80 (size 64):
comm "mount", pid 3382, jiffies 4295032021 (age 27.826s)
hex dump (first 32 bytes):
04 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 ................
ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffffae86ac40>] __kmalloc_node+0x50/0x160
[<ffffffffaf2486d8>] crypto_alloc_tfmmem.isra.0+0x38/0x110
[<ffffffffaf2498e5>] crypto_create_tfm_node+0x85/0x2f0
[<ffffffffaf24a92c>] crypto_alloc_tfm_node+0xfc/0x210
[<ffffffffaedde777>] journal_init_common+0x727/0x1ad0
[<ffffffffaede1715>] jbd2_journal_init_inode+0x2b5/0x500
[<ffffffffaed786b5>] ext4_load_and_init_journal+0x255/0x2440
[<ffffffffaed8b423>] ext4_fill_super+0x8823/0xa330
...
The root cause was traced to an error handing path in journal_init_common()
when malloc memory failed in register_shrinker(). The checksum driver is
used to reference to checksum algorithm via cryptoapi and the user should
release the memory when the driver is no longer needed or the journal
initialization failed.
Fix it by calling crypto_free_shash() on the "err_cleanup" error handing
path in journal_init_common().
Fixes: c30713084b ("jbd2: move load_superblock() into journal_init_common()")
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20230911025138.983101-1-lizetao1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
jbd2_alloc() allocates a buffer from slab when the block size is smaller
than PAGE_SIZE, and slab may be using a compound page. Before commit
8147c4c454, we set b_page to the precise page containing the buffer
and this code worked well. Now we set b_page to the head page of the
allocation, so we can no longer use offset_in_page(). While we could
do a 1:1 replacement with offset_in_folio(), use the more idiomatic
bh_offset() and the folio APIs to map the buffer.
This isn't enough to support a b_size larger than PAGE_SIZE on HIGHMEM
machines, but this is good enough to fix the actual bug we're seeing.
Fixes: 8147c4c454 ("jbd2: use a folio in jbd2_journal_write_metadata_buffer()")
Reported-by: Zorro Lang <zlang@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
[converted to be more folio]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>