Commit Graph

90565 Commits

Author SHA1 Message Date
Baokun Li
b4b4fda34e ext4: fix uninitialized ratelimit_state->lock access in __ext4_fill_super()
In the following concurrency we will access the uninitialized rs->lock:

ext4_fill_super
  ext4_register_sysfs
   // sysfs registered msg_ratelimit_interval_ms
                             // Other processes modify rs->interval to
                             // non-zero via msg_ratelimit_interval_ms
  ext4_orphan_cleanup
    ext4_msg(sb, KERN_INFO, "Errors on filesystem, "
      __ext4_msg
        ___ratelimit(&(EXT4_SB(sb)->s_msg_ratelimit_state)
          if (!rs->interval)  // do nothing if interval is 0
            return 1;
          raw_spin_trylock_irqsave(&rs->lock, flags)
            raw_spin_trylock(lock)
              _raw_spin_trylock
                __raw_spin_trylock
                  spin_acquire(&lock->dep_map, 0, 1, _RET_IP_)
                    lock_acquire
                      __lock_acquire
                        register_lock_class
                          assign_lock_key
                            dump_stack();
  ratelimit_state_init(&sbi->s_msg_ratelimit_state, 5 * HZ, 10);
    raw_spin_lock_init(&rs->lock);
    // init rs->lock here

and get the following dump_stack:

=========================================================
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
CPU: 12 PID: 753 Comm: mount Tainted: G E 6.7.0-rc6-next-20231222 #504
[...]
Call Trace:
 dump_stack_lvl+0xc5/0x170
 dump_stack+0x18/0x30
 register_lock_class+0x740/0x7c0
 __lock_acquire+0x69/0x13a0
 lock_acquire+0x120/0x450
 _raw_spin_trylock+0x98/0xd0
 ___ratelimit+0xf6/0x220
 __ext4_msg+0x7f/0x160 [ext4]
 ext4_orphan_cleanup+0x665/0x740 [ext4]
 __ext4_fill_super+0x21ea/0x2b10 [ext4]
 ext4_fill_super+0x14d/0x360 [ext4]
[...]
=========================================================

Normally interval is 0 until s_msg_ratelimit_state is initialized, so
___ratelimit() does nothing. But registering sysfs precedes initializing
rs->lock, so it is possible to change rs->interval to a non-zero value
via the msg_ratelimit_interval_ms interface of sysfs while rs->lock is
uninitialized, and then a call to ext4_msg triggers the problem by
accessing an uninitialized rs->lock. Therefore register sysfs after all
initializations are complete to avoid such problems.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240102133730.1098120-1-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-09 10:40:07 -04:00
Matthew Wilcox (Oracle)
ea4fd933ab ext4: remove calls to to set/clear the folio error flag
Nobody checks this flag on ext4 folios, stop setting and clearing it.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240420025029.2166544-11-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-09 00:23:51 -04:00
Baokun Li
dc1c4663bc ext4: propagate errors from ext4_sb_bread() in ext4_xattr_block_cache_find()
In ext4_xattr_block_cache_find(), when ext4_sb_bread() returns an error,
we will either continue to find the next ea block or return NULL to try to
insert a new ea block. But whether ext4_sb_bread() returns -EIO or -ENOMEM,
the next operation is most likely to fail with the same error. So propagate
the error returned by ext4_sb_bread() to make ext4_xattr_block_set() fail
to reduce pointless operations.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240504075526.2254349-3-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:59:18 -04:00
Baokun Li
0c0b4a49d3 ext4: fix mb_cache_entry's e_refcnt leak in ext4_xattr_block_cache_find()
Syzbot reports a warning as follows:

============================================
WARNING: CPU: 0 PID: 5075 at fs/mbcache.c:419 mb_cache_destroy+0x224/0x290
Modules linked in:
CPU: 0 PID: 5075 Comm: syz-executor199 Not tainted 6.9.0-rc6-gb947cc5bf6d7
RIP: 0010:mb_cache_destroy+0x224/0x290 fs/mbcache.c:419
Call Trace:
 <TASK>
 ext4_put_super+0x6d4/0xcd0 fs/ext4/super.c:1375
 generic_shutdown_super+0x136/0x2d0 fs/super.c:641
 kill_block_super+0x44/0x90 fs/super.c:1675
 ext4_kill_sb+0x68/0xa0 fs/ext4/super.c:7327
[...]
============================================

This is because when finding an entry in ext4_xattr_block_cache_find(), if
ext4_sb_bread() returns -ENOMEM, the ce's e_refcnt, which has already grown
in the __entry_find(), won't be put away, and eventually trigger the above
issue in mb_cache_destroy() due to reference count leakage.

So call mb_cache_entry_put() on the -ENOMEM error branch as a quick fix.

Reported-by: syzbot+dd43bd0f7474512edc47@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=dd43bd0f7474512edc47
Fixes: fb265c9cb4 ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240504075526.2254349-2-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:59:18 -04:00
Colin Ian King
8b57de1c5e jbd2: remove redundant assignement to variable err
The variable err is being assigned a value that is never read, it
is being re-assigned inside the following while loop and also
after the while loop. The assignment is redundant and can be
removed.

Cleans up clang scan build warning:
fs/jbd2/commit.c:574:2: warning: Value stored to 'err' is never
read [deadcode.DeadStores]

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240410112803.232993-1-colin.i.king@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:52:20 -04:00
Zhang Yi
df0b5afc62 ext4: remove the redundant folio_wait_stable()
__filemap_get_folio() with FGP_WRITEBEGIN parameter has already wait
for stable folio, so remove the redundant folio_wait_stable() in
ext4_da_write_begin(), it was left over from the commit cc883236b7
("ext4: drop unnecessary journal handle in delalloc write") that
removed the retry getting page logic.

Fixes: cc883236b7 ("ext4: drop unnecessary journal handle in delalloc write")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240419023005.2719050-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:48:04 -04:00
Dan Carpenter
3f4830abd2 ext4: fix potential unnitialized variable
Smatch complains "err" can be uninitialized in the caller.

    fs/ext4/indirect.c:349 ext4_alloc_branch()
    error: uninitialized symbol 'err'.

Set the error to zero on the success path.

Fixes: 8016e29f43 ("ext4: fast commit recovery path")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/363a4673-0fb8-4adf-b4fb-90a499077276@moroto.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:44:40 -04:00
Matthew Wilcox (Oracle)
c84f1510fb ext4: convert ac_buddy_page to ac_buddy_folio
This just carries around the bd_buddy_folio so should also be a folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240416172900.244637-6-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:38:17 -04:00
Matthew Wilcox (Oracle)
ccedf35b5d ext4: convert ac_bitmap_page to ac_bitmap_folio
This just carries around the bd_bitmap_folio so should also be a folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240416172900.244637-5-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:38:14 -04:00
Matthew Wilcox (Oracle)
e1622a0d55 ext4: convert ext4_mb_init_cache() to take a folio
All callers now have a folio, so convert this function from operating on
a page to operating on a folio.  The folio is assumed to be a single page.

Signe-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240416172900.244637-4-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:38:10 -04:00
Matthew Wilcox (Oracle)
5eea586b47 ext4: convert bd_buddy_page to bd_buddy_folio
There is no need to make this a multi-page folio, so leave all the
infrastructure around it in pages.  But since we're locking it, playing
with its refcount and checking whether it's uptodate, it needs to move
to the folio API.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240416172900.244637-3-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:38:07 -04:00
Matthew Wilcox (Oracle)
99b150d84e ext4: convert bd_bitmap_page to bd_bitmap_folio
There is no need to make this a multi-page folio, so leave all the
infrastructure around it in pages.  But since we're locking it, playing
with its refcount and checking whether it's uptodate, it needs to move
to the folio API.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20240416172900.244637-2-willy@infradead.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-07 15:37:46 -04:00
Kemeng Shi
da5704eef7 ext4: open coding repeated check in next_linear_group
Open coding repeated check in next_linear_group.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240424061904.987525-6-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03 00:12:33 -04:00
Kemeng Shi
2caffb6a27 ext4: use correct criteria name instead stale integer number in comment
Use correct criteria name instead stale integer number in comment

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240424061904.987525-5-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03 00:12:32 -04:00
Kemeng Shi
d1a3924e43 ext4: call ext4_mb_mark_free_simple to free continuous bits in found chunk
In mb_mark_used, we will find free chunk and mark it inuse. For chunk
in mid of passed range, we could simply mark whole chunk inuse. For chunk
at end of range, we may need to mark a continuous bits at end of part of
chunk inuse and keep rest part of chunk free. To only mark a part of
chunk inuse, we firstly mark whole chunk inuse and then mark a continuous
range at end of chunk free.
Function mb_mark_used does several times of "mb_find_buddy; mb_clear_bit;
..." to mark a continuous range free which can be done by simply calling
ext4_mb_mark_free_simple which free continuous bits in a more effective
way.
Just call ext4_mb_mark_free_simple in mb_mark_used to use existing and
effective code to free continuous blocks in chunk at end of passed range.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240424061904.987525-4-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03 00:12:32 -04:00
Kemeng Shi
d0b88624f8 ext4: add test_mb_mark_used_cost to estimate cost of mb_mark_used
Add test_mb_mark_used_cost to estimate cost of mb_mark_used

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Link: https://lore.kernel.org/r/20240424061904.987525-3-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03 00:12:32 -04:00
Kemeng Shi
9c97c34a99 ext4: keep "prefetch_grp" and "nr" consistent
Keep "prefetch_grp" and "nr" consistent to avoid to call
ext4_mb_prefetch_fini with non-prefetched groups.
When we step into next criteria, "prefetch_grp" is set to prefetch start
of new criteria while "nr" is number of the prefetched group in previous
criteria. If previous criteria and next criteria are both inexpensive
(< CR_GOAL_LEN_SLOW) and prefetch_ios reachs sbi->s_mb_prefetch_limit
in previous criteria, "prefetch_grp" and "nr" will be inconsistent and
may introduce unexpected cost to do ext4_mb_init_group for non-prefetched
groups.
Reset "nr" to 0 when we reset "prefetch_grp" to goal group to keep them
consistent.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240424061904.987525-2-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03 00:12:32 -04:00
Kemeng Shi
a11adf7be9 ext4: implement filesystem specific alloc_inode in unit test
We expect inode with ext4_info_info type as following:
mbt_kunit_init
  mbt_mb_init
    ext4_mb_init
      ext4_mb_init_backend
        sbi->s_buddy_cache = new_inode(sb);
        EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;

Implement alloc_inode ionde with ext4_inode_info type to avoid
out-of-bounds write.

Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20240322165518.8147-1-shikemeng@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03 00:04:35 -04:00
Jan Kara
0a46ef2347 ext4: do not create EA inode under buffer lock
ext4_xattr_set_entry() creates new EA inodes while holding buffer lock
on the external xattr block. This is problematic as it nests all the
allocation locking (which acquires locks on other buffers) under the
buffer lock. This can even deadlock when the filesystem is corrupted and
e.g. quota file is setup to contain xattr block as data block. Move the
allocation of EA inode out of ext4_xattr_set_entry() into the callers.

Reported-by: syzbot+a43d4f48b8397d0e41a9@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240321162657.27420-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03 00:02:24 -04:00
Jan Kara
4f3e6db3c3 Revert "ext4: drop duplicate ea_inode handling in ext4_xattr_block_set()"
This reverts commit 7f48212678. We will
need the special cleanup handling once we move allocation of EA inode
outside of the buffer lock in the following patch.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240321162657.27420-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-03 00:02:08 -04:00
Justin Stitt
744a56389f ext4: replace deprecated strncpy with alternatives
strncpy() is deprecated for use on NUL-terminated destination strings
[1] and as such we should prefer more robust and less ambiguous string
interfaces.

in file.c:
s_last_mounted is marked as __nonstring meaning it does not need to be
NUL-terminated. Let's instead use strtomem_pad() to copy bytes from the
string source to the byte array destination -- while also ensuring to
pad with zeroes.

in ioctl.c:
We can drop the memset and size argument in favor of using the new
2-argument version of strscpy_pad() -- which was introduced with Commit
e6584c3964 ("string: Allow 2-argument strscpy()"). This guarantees
NUL-termination and NUL-padding on the destination buffer -- which seems
to be a requirement judging from this comment:

|	static int ext4_ioctl_getlabel(struct ext4_sb_info *sbi, char __user *user_label)
|	{
|		char label[EXT4_LABEL_MAX + 1];
|
|		/*
|		 * EXT4_LABEL_MAX must always be smaller than FSLABEL_MAX because
|		 * FSLABEL_MAX must include terminating null byte, while s_volume_name
|		 * does not have to.
|		 */

in super.c:
s_first_error_func is marked as __nonstring meaning we can take the same
approach as in file.c; just use strtomem_pad()

Link: https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings [1]
Link: https://manpages.debian.org/testing/linux-manual-4.8/strscpy.9.en.html [2]
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20240321-strncpy-fs-ext4-file-c-v1-1-36a6a09fef0c@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:55:10 -04:00
Baokun Li
e19089dff5 ext4: clean up s_mb_rb_lock to fix build warnings with C=1
Running sparse (make C=1) on mballoc.c we get the following warning:

 fs/ext4/mballoc.c:3194:13: warning: context imbalance in
  'ext4_mb_seq_structs_summary_start' - wrong count at exit

This is because __acquires(&EXT4_SB(sb)->s_mb_rb_lock) was called in
ext4_mb_seq_structs_summary_start(), but s_mb_rb_lock was removed in commit
83e80a6e35 ("ext4: use buckets for cr 1 block scan instead of rbtree"),
so remove the __acquires to silence the warning.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240319113325.3110393-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:31 -04:00
Baokun Li
261341a932 ext4: set the type of max_zeroout to unsigned int to avoid overflow
The max_zeroout is of type int and the s_extent_max_zeroout_kb is of
type uint, and the s_extent_max_zeroout_kb can be freely modified via
the sysfs interface. When the block size is 1024, max_zeroout may
overflow, so declare it as unsigned int to avoid overflow.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240319113325.3110393-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:31 -04:00
Baokun Li
9a9f3a9842 ext4: set type of ac_groups_linear_remaining to __u32 to avoid overflow
Now ac_groups_linear_remaining is of type __u16 and s_mb_max_linear_groups
is of type unsigned int, so an overflow occurs when setting a value above
65535 through the mb_max_linear_groups sysfs interface. Therefore, the
type of ac_groups_linear_remaining is set to __u32 to avoid overflow.

Fixes: 196e402adf ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240319113325.3110393-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:31 -04:00
Baokun Li
63bfe84105 ext4: add positive int attr pointer to avoid sysfs variables overflow
The following variables controlled by the sysfs interface are of type
int and are normally used in the range [0, INT_MAX], but are declared as
attr_pointer_ui, and thus may be set to values that exceed INT_MAX and
result in overflows to get negative values.

  err_ratelimit_burst
  msg_ratelimit_burst
  warning_ratelimit_burst
  err_ratelimit_interval_ms
  msg_ratelimit_interval_ms
  warning_ratelimit_interval_ms

Therefore, we add attr_pointer_pi (aka positive int attr pointer) with a
value range of 0-INT_MAX to avoid overflow.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240319113325.3110393-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:30 -04:00
Baokun Li
b7b2a5799b ext4: add new attr pointer attr_mb_order
The s_mb_best_avail_max_trim_order is of type unsigned int, and has a
range of values well beyond the normal use of the mb_order. Although the
mballoc code is careful enough that large numbers don't matter there, but
this can mislead the sysadmin into thinking that it's normal to set such
values. Hence add a new attr_id attr_mb_order with values in the range
[0, 64] to avoid storing garbage values and make us more resilient to
surprises in the future.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240319113325.3110393-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:30 -04:00
Baokun Li
13df4d44a3 ext4: fix slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists()
We can trigger a slab-out-of-bounds with the following commands:

    mkfs.ext4 -F /dev/$disk 10G
    mount /dev/$disk /tmp/test
    echo 2147483647 > /sys/fs/ext4/$disk/mb_group_prealloc
    echo test > /tmp/test/file && sync

==================================================================
BUG: KASAN: slab-out-of-bounds in ext4_mb_find_good_group_avg_frag_lists+0x8a/0x200 [ext4]
Read of size 8 at addr ffff888121b9d0f0 by task kworker/u2:0/11
CPU: 0 PID: 11 Comm: kworker/u2:0 Tainted: GL 6.7.0-next-20240118 #521
Call Trace:
 dump_stack_lvl+0x2c/0x50
 kasan_report+0xb6/0xf0
 ext4_mb_find_good_group_avg_frag_lists+0x8a/0x200 [ext4]
 ext4_mb_regular_allocator+0x19e9/0x2370 [ext4]
 ext4_mb_new_blocks+0x88a/0x1370 [ext4]
 ext4_ext_map_blocks+0x14f7/0x2390 [ext4]
 ext4_map_blocks+0x569/0xea0 [ext4]
 ext4_do_writepages+0x10f6/0x1bc0 [ext4]
[...]
==================================================================

The flow of issue triggering is as follows:

// Set s_mb_group_prealloc to 2147483647 via sysfs
ext4_mb_new_blocks
  ext4_mb_normalize_request
    ext4_mb_normalize_group_request
      ac->ac_g_ex.fe_len = EXT4_SB(sb)->s_mb_group_prealloc
  ext4_mb_regular_allocator
    ext4_mb_choose_next_group
      ext4_mb_choose_next_group_best_avail
        mb_avg_fragment_size_order
          order = fls(len) - 2 = 29
        ext4_mb_find_good_group_avg_frag_lists
          frag_list = &sbi->s_mb_avg_fragment_size[order]
          if (list_empty(frag_list)) // Trigger SOOB!

At 4k block size, the length of the s_mb_avg_fragment_size list is 14,
but an oversized s_mb_group_prealloc is set, causing slab-out-of-bounds
to be triggered by an attempt to access an element at index 29.

Add a new attr_id attr_clusters_in_group with values in the range
[0, sbi->s_clusters_per_group] and declare mb_group_prealloc as
that type to fix the issue. In addition avoid returning an order
from mb_avg_fragment_size_order() greater than MB_NUM_ORDERS(sb)
and reduce some useless loops.

Fixes: 7e170922f0 ("ext4: Add allocation criteria 1.5 (CR1_5)")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Link: https://lore.kernel.org/r/20240319113325.3110393-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:30 -04:00
Baokun Li
57341fe317 ext4: refactor out ext4_generic_attr_show()
Refactor out the function ext4_generic_attr_show() to handle the reading
of values of various common types, with no functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240319113325.3110393-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:30 -04:00
Baokun Li
f536808adc ext4: refactor out ext4_generic_attr_store()
Refactor out the function ext4_generic_attr_store() to handle the setting
of values of various common types, with no functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240319113325.3110393-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:30 -04:00
Baokun Li
9e8e819f8f ext4: avoid overflow when setting values via sysfs
When setting values of type unsigned int through sysfs, we use kstrtoul()
to parse it and then truncate part of it as the final set value, when the
set value is greater than UINT_MAX, the set value will not match what we
see because of the truncation. As follows:

  $ echo 4294967296 > /sys/fs/ext4/sda/mb_max_linear_groups
  $ cat /sys/fs/ext4/sda/mb_max_linear_groups
    0

So we use kstrtouint() to parse the attr_pointer_ui type to avoid the
inconsistency described above. In addition, a judgment is added to avoid
setting s_resv_clusters less than 0.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20240319113325.3110393-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 23:48:30 -04:00
Max Kellermann
c77194965d Revert "ext4: apply umask if ACL support is disabled"
This reverts commit 484fd6c1de.  The
commit caused a regression because now the umask was applied to
symlinks and the fix is unnecessary because the umask/O_TMPFILE bug
has been fixed somewhere else already.

Fixes: https://lore.kernel.org/lkml/28DSITL9912E1.2LSZUVTGTO52Q@mforney.org/
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Tested-by: Michael Forney <mforney@mforney.org>
Link: https://lore.kernel.org/r/20240315142956.2420360-1-max.kellermann@ionos.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 18:25:39 -04:00
Kent Overstreet
fb092d4072 ext4: add support for FS_IOC_GETFSSYSFSPATH
The new sysfs path ioctl lets us get the /sys/fs/ path for a given
filesystem in a fs agnostic way, potentially nudging us towards
standarizing some of our reporting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Link: https://lore.kernel.org/r/20240315035308.3563511-4-kent.overstreet@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 15:21:23 -04:00
Jan Kara
35a1f12f0c ext4: avoid excessive credit estimate in ext4_tmpfile()
A user with minimum journal size (1024 blocks these days) complained
about the following error triggered by generic/697 test in
ext4_tmpfile():

run fstests generic/697 at 2024-02-28 05:34:46
JBD2: vfstest wants too many credits credits:260 rsv_credits:0 max:256
EXT4-fs error (device loop0) in __ext4_new_inode:1083: error 28

Indeed the credit estimate in ext4_tmpfile() is huge.
EXT4_MAXQUOTAS_INIT_BLOCKS() is 219, then 10 credits from ext4_tmpfile()
itself and then ext4_xattr_credits_for_new_inode() adds more credits
needed for security attributes and ACLs. Now the
EXT4_MAXQUOTAS_INIT_BLOCKS() is in fact unnecessary because we've
already initialized quotas with dquot_init() shortly before and so
EXT4_MAXQUOTAS_TRANS_BLOCKS() is enough (which boils down to 3 credits).

Fixes: af51a2ac36 ("ext4: ->tmpfile() support")
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Luis Henriques <lhenriques@suse.de>
Tested-by: Disha Goel <disgoel@linux.ibm.com>
Link: https://lore.kernel.org/r/20240307115320.28949-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 14:49:16 -04:00
Thorsten Blum
ea7d09ad7c ext4: remove unneeded if checks before kfree
kfree already checks if its argument is NULL. This fixes two
Coccinelle/coccicheck warnings reported by ifnullfree.cocci.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://lore.kernel.org/r/20240317153638.2136-2-thorsten.blum@toblux.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 14:44:51 -04:00
Christoph Hellwig
a0c7cce824 ext4: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
Since commit a2ad63daa8 ("VFS: add FMODE_CAN_ODIRECT file flag") file
systems can just set the FMODE_CAN_ODIRECT flag at open time instead of
wiring up a dummy direct_IO method to indicate support for direct I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[RH: Rebased to upstream]
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/e5797bb597219a49043e53e4e90aa494b97dc328.1709215665.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 10:53:32 -04:00
Ritesh Harjani (IBM)
53c17fe55a ext4: Remove PAGE_MASK dependency on mpage_submit_folio
This patch simply removes the PAGE_MASK dependency since
mpage_submit_folio() is already converted to work with folio.

Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/d6eadb090334ea49ceef4e643b371fabfcea328f.1709182251.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 10:50:44 -04:00
Ritesh Harjani (IBM)
c2a09f3d78 ext4: Fixes len calculation in mpage_journal_page_buffers
Truncate operation can race with writeback, in which inode->i_size can get
truncated and therefore size - folio_pos() can be negative. This fixes the
len calculation. However this path doesn't get easily triggered even
with data journaling.

Cc: stable@kernel.org # v6.5
Fixes: 80be8c5cc9 ("Fixes: ext4: Make mpage_journal_page_buffers use folio")
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/cff4953b5c9306aba71e944ab176a5d396b9a1b7.1709182250.git.ritesh.list@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-05-02 10:50:44 -04:00
Linus Torvalds
72374d71c3 Get rid of lockdep false positives around sysfs/overlayfs
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZhu2kwAKCRBZ7Krx/gZQ
 62gzAP9eeADy6rQkzgWJ8d8sKzGfmd0nup9WlCOxZSR0XojTXwEAnue47dn7PlMx
 wQY0joZ0V5FO8PNTEbWc2P/dSQrANgc=
 =MshW
 -----END PGP SIGNATURE-----

Merge tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull sysfs fix from Al Viro:
 "Get rid of lockdep false positives around sysfs/overlayfs

  syzbot has uncovered a class of lockdep false positives for setups
  with sysfs being one of the backing layers in overlayfs. The root
  cause is that of->mutex allocated when opening a sysfs file read-only
  (which overlayfs might do) is confused with of->mutex of a file opened
  writable (held in write to sysfs file, which overlayfs won't do).

  Assigning them separate lockdep classes fixes that bunch and it's
  obviously safe"

* tag 'pull-sysfs-annotation-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  kernfs: annotate different lockdep class for of->mutex of writable files
2024-04-14 11:41:51 -07:00
Amir Goldstein
16b52bbee4 kernfs: annotate different lockdep class for of->mutex of writable files
The writable file /sys/power/resume may call vfs lookup helpers for
arbitrary paths and readonly files can be read by overlayfs from vfs
helpers when sysfs is a lower layer of overalyfs.

To avoid a lockdep warning of circular dependency between overlayfs
inode lock and kernfs of->mutex, use a different lockdep class for
writable and readonly kernfs files.

Reported-by: syzbot+9a5b0ced8b1bfb238b56@syzkaller.appspotmail.com
Fixes: 0fedefd4c4 ("kernfs: sysfs: support custom llseek method for sysfs entries")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-04-14 06:55:46 -04:00
Linus Torvalds
76b0e9c429 zonefs fixes for 6.9-rc4
- Suppress a coccicheck warning using str_plural().
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZhnikwAKCRDdoc3SxdoY
 ditrAP9s/r36Dl4c8BZxpoUyCmf44ww6oTMxGzjwi/4gA8Ry2gEA766WGVsDyJf2
 G0qjl7+bVneaOG6Xayzce69OjvZGZws=
 =U6kP
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs fix from Damien Le Moal:

 - Suppress a coccicheck warning using str_plural()

* tag 'zonefs-6.9-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: Use str_plural() to fix Coccinelle warning
2024-04-13 10:25:32 -07:00
Linus Torvalds
fa4022cb73 Four cifs.ko changesets, most also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmYaS/8ACgkQiiy9cAdy
 T1G3OAwAjRyhi2XHb0QtA0YpZN0FLFghoY/+hgAYaS20WokdqbdyPfZeievRn6Jq
 zgoRYmPF9ihEAZ/UnNIu5ficHoFFsNTxPodQKOTMKxtYg7RCWv7ZF87+/OafuG/k
 FbYue7LQ1CvloEBMsNg088uFQsCp4u8k/9IaHHf+wEDmti42JTqoLjg5YsgbcotI
 Qt8VtxTVL+FzPax/qQkXtBvHggxTfs7VWvwRkv21NsotQzBoPGxHxv3bUV9G54Ez
 56w6Qg+R4ZSwIGgrRGAIjTZ0E6lPuIBanmePJlFwFB0em7Ajl3K2BBJUXpTsxgEd
 aDgJKFZ+ZmDpp5GWlYRccmy/jm6uOiLdCxPye/EQk5KK/oUNpFUUQ95OJ1clH6Su
 pR8LLk1yPQH+e8eyx0Exq7H8l0lsy4pHzerVyUgdgd9E1IZ3+MIN261vTu2TTfGq
 CtML0WNcZFnQdYbMnkADAZ7zYgr32U0UiWr6WS+V0rugyguDrPIEiIneFl3rSZsq
 MQaOHyS5
 =AHB9
 -----END PGP SIGNATURE-----

Merge tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - fix for oops in cifs_get_fattr of deleted files

 - fix for the remote open counter going negative in some directory
   lease cases

 - fix for mkfifo to instantiate dentry to avoid possible crash

 - important fix to allow handling key rotation for mount and remount
   (ie cases that are becoming more common when password that was used
   for the mount will expire soon but will be replaced by new password)

* tag 'v6.9-rc3-SMB3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix broken reconnect when password changing on the server by allowing password rotation
  smb: client: instantiate when creating SFU files
  smb3: fix Open files on server counter going negative
  smb: client: fix NULL ptr deref in cifs_mark_open_handles_for_deleted_file()
2024-04-13 10:10:18 -07:00
Linus Torvalds
90d3eaaf4f Two CephFS fixes marked for stable and a MAINTAINERS update.
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmYZXkYTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHziyLNCACkJv5GjDeHC22P+UB6re3gO39wdFsL
 BvHmhyebz5iidjS1l6pC/RcLOmB+OwekOiDfFvkwu0E+3d5dZTBInUV4BWzbJVLR
 Hkjd+oCEThV+f58n2Ol/r6lBGIFCgJpgtW02DQurM8wHTLm/vqxpj2GsvJCRXNr6
 lfqWVvZBteEmqxHUGNWldOQh2jNPGyvOpDRCQTzdjEfUAijoc8arLyuwbxl9tFdA
 qVtCZk+Jv0Lz9uQUSsI7aM+SEXUgGW3pZXJL9h9PPjC9d811GCwQnUcL4hNE2B3X
 Nf5Qzo9jV2GKJh8+kHui0dmpt0dMQfFZ2YaeJDZl8c8dE2uqBL+zMO79
 =mvie
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two CephFS fixes marked for stable and a MAINTAINERS update"

* tag 'ceph-for-6.9-rc4' of https://github.com/ceph/ceph-client:
  MAINTAINERS: remove myself as a Reviewer for Ceph
  ceph: switch to use cap_delay_lock for the unlink delay list
  ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE
2024-04-12 10:15:46 -07:00
Linus Torvalds
5939d45155 Tracing fixes for 6.9:
- Fix the buffer_percent accounting as it is dependent on three variables:
   1) pages_read - number of subbuffers read
   2) pages_lost - number of subbuffers lost due to overwrite
   3) pages_touched - number of pages that a writer entered
   These three counters only increment, and to know how many active pages
   there are on the buffer at any given time, the pages_read and
   pages_lost are subtracted from pages_touched. But the pages touched
   was incremented whenever any writer went to the next subbuffer even
   if it wasn't the only one, so it was incremented more than it should
   be causing the counter for how many subbuffers currently have content
   incorrect, which caused the buffer_percent that holds waiters until
   the ring buffer is filled to a given percentage to wake up early.
 
 - Fix warning of unused functions when PERF_EVENTS is not configured in
 
 - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE
 
 - Fix to some kerneldoc function comments in eventfs code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZhk+khQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qvs0AP98c226UFU6Dha4wvgSulC/wKVvHN3X
 jeclMTdn8RGs2gD/b9OULKNv1//6fP16ZRun7ntRQkotVhlNhf9Ee0smiwU=
 =UYrk
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix the buffer_percent accounting as it is dependent on three
   variables:

     1) pages_read - number of subbuffers read
     2) pages_lost - number of subbuffers lost due to overwrite
     3) pages_touched - number of pages that a writer entered

   These three counters only increment, and to know how many active
   pages there are on the buffer at any given time, the pages_read and
   pages_lost are subtracted from pages_touched.

   But the pages touched was incremented whenever any writer went to the
   next subbuffer even if it wasn't the only one, so it was incremented
   more than it should be causing the counter for how many subbuffers
   currently have content incorrect, which caused the buffer_percent
   that holds waiters until the ring buffer is filled to a given
   percentage to wake up early.

 - Fix warning of unused functions when PERF_EVENTS is not configured in

 - Replace bad tab with space in Kconfig for FTRACE_RECORD_RECURSION_SIZE

 - Fix to some kerneldoc function comments in eventfs code.

* tag 'trace-v6.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Only update pages_touched when a new page is touched
  tracing: hide unused ftrace_event_id_fops
  tracing: Fix FTRACE_RECORD_RECURSION_SIZE Kconfig entry
  eventfs: Fix kernel-doc comments to functions
2024-04-12 09:02:24 -07:00
Yang Li
a8fa658eeb eventfs: Fix kernel-doc comments to functions
This commit fix kernel-doc style comments with complete parameter
descriptions for the lookup_file(),lookup_dir_entry() and
lookup_file_dentry().

Link: https://lore.kernel.org/linux-trace-kernel/20240322062604.28862-1-yang.lee@linux.alibaba.com

Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-04-11 17:42:09 -04:00
Steve French
35f834265e smb3: fix broken reconnect when password changing on the server by allowing password rotation
There are various use cases that are becoming more common in which password
changes are scheduled on a server(s) periodically but the clients connected
to this server need to stay connected (even in the face of brief network
reconnects) due to mounts which can not be easily unmounted and mounted at
will, and servers that do password rotation do not always have the ability
to tell the clients exactly when to the new password will be effective,
so add support for an alt password ("password2=") on mount (and also
remount) so that we can anticipate the upcoming change to the server
without risking breaking existing mounts.

An alternative would have been to use the kernel keyring for this but the
processes doing the reconnect do not have access to the keyring but do
have access to the ses structure.

Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-11 16:03:48 -05:00
Paulo Alcantara
c6ff459037 smb: client: instantiate when creating SFU files
In cifs_sfu_make_node(), on success, instantiate rather than leave it
with dentry unhashed negative to support callers that expect mknod(2)
to always instantiate.

This fixes the following test case:

  mount.cifs //srv/share /mnt -o ...,sfu
  mkfifo /mnt/fifo
  ./xfstests/ltp/growfiles -b -W test -e 1 -u -i 0 -L 30 /mnt/fifo
  ...
  BUG: unable to handle page fault for address: 000000034cec4e58
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 1 PREEMPT SMP PTI
  CPU: 0 PID: 138098 Comm: growfiles Kdump: loaded Not tainted
  5.14.0-436.3987_1240945149.el9.x86_64 #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  RIP: 0010:_raw_callee_save__kvm_vcpu_is_preempted+0x0/0x20
  Code: e8 15 d9 61 00 e9 63 ff ff ff 41 bd ea ff ff ff e9 58 ff ff ff e8
  d0 71 c0 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 <48> 8b 04
  fd 60 2b c1 99 80 b8 90 50 03 00 00 0f 95 c0 c3 cc cc cc
  RSP: 0018:ffffb6a143cf7cf8 EFLAGS: 00010206
  RAX: ffff8a9bc30fb038 RBX: ffff8a9bc666a200 RCX: ffff8a9cc0260000
  RDX: 00000000736f622e RSI: ffff8a9bc30fb038 RDI: 000000007665645f
  RBP: ffffb6a143cf7d70 R08: 0000000000001000 R09: 0000000000000001
  R10: 0000000000000001 R11: 0000000000000000 R12: ffff8a9bc666a200
  R13: 0000559a302a12b0 R14: 0000000000001000 R15: 0000000000000000
  FS: 00007fbed1dbb740(0000) GS:ffff8a9cf0000000(0000)
  knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000034cec4e58 CR3: 0000000128ec6006 CR4: 0000000000770ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  PKRU: 55555554
  Call Trace:
   <TASK>
   ? show_trace_log_lvl+0x1c4/0x2df
   ? show_trace_log_lvl+0x1c4/0x2df
   ? __mutex_lock.constprop.0+0x5f7/0x6a0
   ? __die_body.cold+0x8/0xd
   ? page_fault_oops+0x134/0x170
   ? exc_page_fault+0x62/0x150
   ? asm_exc_page_fault+0x22/0x30
   ? _pfx_raw_callee_save__kvm_vcpu_is_preempted+0x10/0x10
   __mutex_lock.constprop.0+0x5f7/0x6a0
   ? __mod_memcg_lruvec_state+0x84/0xd0
   pipe_write+0x47/0x650
   ? do_anonymous_page+0x258/0x410
   ? inode_security+0x22/0x60
   ? selinux_file_permission+0x108/0x150
   vfs_write+0x2cb/0x410
   ksys_write+0x5f/0xe0
   do_syscall_64+0x5c/0xf0
   ? syscall_exit_to_user_mode+0x22/0x40
   ? do_syscall_64+0x6b/0xf0
   ? sched_clock_cpu+0x9/0xc0
   ? exc_page_fault+0x62/0x150
   entry_SYSCALL_64_after_hwframe+0x6e/0x76

Cc: stable@vger.kernel.org
Fixes: 72bc63f5e2 ("smb3: fix creating FIFOs when mounting with "sfu" mount option")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-11 16:03:40 -05:00
Steve French
28e0947651 smb3: fix Open files on server counter going negative
We were decrementing the count of open files on server twice
for the case where we were closing cached directories.

Fixes: 8e843bf38f ("cifs: return a single-use cfid if we did not get a lease")
Cc: stable@vger.kernel.org
Acked-by: Bharath SM <bharathsm@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-11 16:02:02 -05:00
Xiubo Li
17f8dc2db5 ceph: switch to use cap_delay_lock for the unlink delay list
The same list item will be used in both cap_delay_list and
cap_unlink_delay_list, so it's buggy to use two different locks
to protect them.

Cc: stable@vger.kernel.org
Fixes: dbc347ef7f ("ceph: add ceph_cap_unlink_work to fire check_caps() immediately")
Link: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AODC76VXRAMXKLFDCTK4TKFDDPWUSCN5
Reported-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Tested-by: Marc Ruhmann <ruhmann@luis.uni-hannover.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-04-11 22:56:28 +02:00
Linus Torvalds
e1dc191dbf bcachefs fixes for v6.9-rc4
Notable user impacting bugs
 
 - On multi device filesystems, recovery was looping in
   btree_trans_too_many_iters(). This checks if a transaction has touched
   too many btree paths (because of iteration over many keys), and isuses
   a restart to drop unneeded paths. But it's now possible for some paths
   to exceed the previous limit without iteration in the interior btree
   update path, since the transaction commit will do alloc updates for
   every old and new btree node, and during journal replay we don't use
   the btree write buffer for locking reasons and thus those updates use
   btree paths when they wouldn't normally.
 
 - Fix a corner case in rebalance when moving extents on a durability=0
   device. This wouldn't be hit when a device was formatted with
   durability=0 since in that case we'll only use it as a write through
   cache (only cached extents will live on it), but durability can now be
   changed on an existing device.
 
 - bch2_get_acl() could rarely forget to handle a transaction restart;
   this manifested as the occasional missing acl that came back after
   dropping caches.
 
 - Fix a major performance regression on high iops multithreaded write
   workloads (only since 6.9-rc1); a previous fix for a deadlock in the
   interior btree update path to check the journal watermark introduced a
   dependency on the state of btree write buffer flushing that we didn't
   want.
 
 - Assorted other repair paths and recovery fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmYXTd8ACgkQE6szbY3K
 bna4MA//Y/CSB2JupxJPFUAb69+WmNDMnJJV3FlD/Hwo19kOR7aRbKQMUxsH51nb
 dfv5o/58n39QIBqMcMtTipnoND6jrwv7l5NimFKQmj/YehxosdsOf2BgbD/M4Ozz
 84IUmHXSs5+zezjF8IAw/bjR/p13XNJGSYTfl1RWbGUMERLpfZcLn70FvoCR/qpQ
 Bcp+z70K6bBhrPZqFYd2mEC+Cfo42aCD1lqWUQ/e0FHiNEZnCNH2lNca4phBzGt4
 f9sBxBcwmDfizQxqpyZ4izRzbS9ZwZ3ega336L2DrPpwgjMgTRLKLjdXqJDjGvDW
 ngvnaUw7SgICs+q48g3f67cGhw/4lSPdVu/9a/ldqDEP9PynQiUS1G8aDofLdboW
 xCZM6toX86p0DMNY2kP9vc5kxr377cSXAL7VeKbE+ZV5vGyEz1qFDLRAYVixHDfr
 Q7KgYvoJq8CgjfK7rIZbO2tqKhz2TP4+2rJa6tDLwKXs5ice++w/aF5d/nBZ7iCe
 +dyJ+aJiosjHEVG+QscACFrjBJdNNspJqBWDP396XeOsdl+iCeIRP0VHOGytLLRf
 gisE0Lhj4pz6bv7OhAAkdxvegVQ8HfqN1E+f/WM7Kogqos0NS2skEhy9cTuDCHUm
 qtPTUq5XNibiE6J+NOK86pu6o+6sqpWIpfTPuTbMid5sevLsomE=
 =BQC0
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs

Pull more bcachefs fixes from Kent Overstreet:
 "Notable user impacting bugs

   - On multi device filesystems, recovery was looping in
     btree_trans_too_many_iters(). This checks if a transaction has
     touched too many btree paths (because of iteration over many keys),
     and isuses a restart to drop unneeded paths.

     But it's now possible for some paths to exceed the previous limit
     without iteration in the interior btree update path, since the
     transaction commit will do alloc updates for every old and new
     btree node, and during journal replay we don't use the btree write
     buffer for locking reasons and thus those updates use btree paths
     when they wouldn't normally.

   - Fix a corner case in rebalance when moving extents on a
     durability=0 device. This wouldn't be hit when a device was
     formatted with durability=0 since in that case we'll only use it as
     a write through cache (only cached extents will live on it), but
     durability can now be changed on an existing device.

   - bch2_get_acl() could rarely forget to handle a transaction restart;
     this manifested as the occasional missing acl that came back after
     dropping caches.

   - Fix a major performance regression on high iops multithreaded write
     workloads (only since 6.9-rc1); a previous fix for a deadlock in
     the interior btree update path to check the journal watermark
     introduced a dependency on the state of btree write buffer flushing
     that we didn't want.

   - Assorted other repair paths and recovery fixes"

* tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits)
  bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()
  bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()
  bcachefs: Fix a race in btree_update_nodes_written()
  bcachefs: btree_node_scan: Respect member.data_allowed
  bcachefs: Don't scan for btree nodes when we can reconstruct
  bcachefs: Fix check_topology() when using node scan
  bcachefs: fix eytzinger0_find_gt()
  bcachefs: fix bch2_get_acl() transaction restart handling
  bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list
  bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()
  bcachefs: Rename struct field swap to prevent macro naming collision
  MAINTAINERS: Add entry for bcachefs documentation
  Documentation: filesystems: Add bcachefs toctree
  bcachefs: JOURNAL_SPACE_LOW
  bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE
  bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems
  bcachefs: fix rand_delete unit test
  bcachefs: fix ! vs ~ typo in __clear_bit_le64()
  bcachefs: Fix rebalance from durability=0 device
  bcachefs: Print shutdown journal sequence number
  ...
2024-04-11 11:24:55 -07:00
NeilBrown
b372e96bd0 ceph: redirty page before returning AOP_WRITEPAGE_ACTIVATE
The page has been marked clean before writepage is called.  If we don't
redirty it before postponing the write, it might never get written.

Cc: stable@vger.kernel.org
Fixes: 503d4fa6ee ("ceph: remove reliance on bdi congestion")
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-04-11 19:17:02 +02:00