-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaG0jUACgkQxWXV+ddt
WDsg4Q//f7xRooomsRaRvIDN4Add17jyZWuhreMhxKKvdtoRzLq/YZUldGJt1XXu
tVpCvKkVvrJt5UY3b1czfA8GTHeOInZSoqyV8ZaxOAcg6cfKrUtLFi1wAKGU6BK/
S9Zz/9lfbjXhG4PnZW7GRFmH0n68p/dmS+kXPVVPhj7CibxuZJtifflK1r6rHw+Y
PDDnMIPqbrdK+xBfR+SEpG+1uIEFvI6SgsAZG6p0mzcbbUINGp7SpU3v2NVuvdIX
tjTlZhz+D2do4dPk3RJ4Z6dpL+VlZnR1F9CZn6SmPcdGWn6mTkywoukJPP62iHYO
b8Xr4j1mKDPcu55OEPWwhYWOvRP4FvltrGB+35xsQoPlKBBKxSMFgNyEkdfvjtKU
qL9qRYJvm8D2OBqdv5BnawKRSouPcR4KPiM9JSGWRcW6vdOiY3Cd1mnAR8EA1LSd
SRYmBlrZIIjOpEeGkEy2MeNKEH9L5B1wAU21VQd+cAlCKBFUmFb1csZ/vCJJ1B34
swedmirNx0lyZTNBGQsDipkxSM0c/0xj/ysG8N41bLvPUM35x0K9xw5ZqBlTXrrg
Lln51s6TeqSzfEvKdDGikBL3dVhT5I6acRUchQU9YSZ3Bqxj3QCTNnOZMZQsCFkp
CMohduRVHiK4LHQTkU++q8bT8xX20kh/1HIJhAjdnDbrlRY4oTU=
=tyXO
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix folio refcounting when releasing them (encoded write, dummy
extent buffer)
- fix out of bounds read when checking qgroup inherit data
- fix how configurable chunk size is handled in zoned mode
- in the ref-verify tool, fix uninitialized return value when checking
extent owner ref and simple quota are not enabled
* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix folio refcount in __alloc_dummy_extent_buffer()
btrfs: fix folio refcount in btrfs_do_encoded_write()
btrfs: fix uninitialized return value in the ref-verify tool
btrfs: always do the basic checks for btrfs_qgroup_inherit structure
btrfs: zoned: fix calc_available_free_space() for zoned mode
three unrelated MM fixes.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZoYyTAAKCRDdBJ7gKXxA
ju7KAQCxpW3zGuZmFiBJNYYor+GAPBsqPgQ8RrSusklJiVDMlQD+LuV2gI1ARIm/
6f1bZjEYomswEZyCq6FHZQuhWqLVhw4=
=1oCx
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-07-03-22-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from, Andrew Morton:
"6 hotfies, all cc:stable. Some fixes for longstanding nilfs2 issues
and three unrelated MM fixes"
* tag 'mm-hotfixes-stable-2024-07-03-22-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
nilfs2: fix incorrect inode allocation from reserved inodes
nilfs2: add missing check for inode numbers on directory entries
nilfs2: fix inode number range checks
mm: avoid overflows in dirty throttling logic
Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"
mm: optimize the redundant loop of mm_update_owner_next()
Another improper use of __folio_put() in an error path after freshly
allocating pages/folios which returns them with the refcount initialized
to 1. The refactor from __free_pages() -> __folio_put() (instead of
folio_put) removed a refcount decrement found in __free_pages() and
folio_put but absent from __folio_put().
Fixes: 13df3775ef ("btrfs: cleanup metadata page pointer usage")
CC: stable@vger.kernel.org # 6.8+
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The conversion to folios switched __free_page() to __folio_put() in the
error path in btrfs_do_encoded_write().
However, this gets the page refcounting wrong. If we do hit that error
path (I reproduced by modifying btrfs_do_encoded_write to pretend to
always fail in a way that jumps to out_folios and running the fstests
case btrfs/281), then we always hit the following BUG freeing the folio:
BUG: Bad page state in process btrfs pfn:40ab0b
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: nonzero _refcount
Call Trace:
<TASK>
dump_stack_lvl+0x3d/0xe0
bad_page+0xea/0xf0
free_unref_page+0x8e1/0x900
? __mem_cgroup_uncharge+0x69/0x90
__folio_put+0xe6/0x190
btrfs_do_encoded_write+0x445/0x780
? current_time+0x25/0xd0
btrfs_do_write_iter+0x2cc/0x4b0
btrfs_ioctl_encoded_write+0x2b6/0x340
It turns out __free_page() decreases the page reference count while
__folio_put() does not. Switch __folio_put() to folio_put() which
decreases the folio reference count first.
Fixes: 400b172b8c ("btrfs: compression: migrate compression/decompression paths to folios")
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the bitmap block that manages the inode allocation status is corrupted,
nilfs_ifile_create_inode() may allocate a new inode from the reserved
inode area where it should not be allocated.
Previous fix commit d325dc6eb7 ("nilfs2: fix use-after-free bug of
struct nilfs_root"), fixed the problem that reserved inodes with inode
numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to
bitmap corruption, but since the start number of non-reserved inodes is
read from the super block and may change, in which case inode allocation
may occur from the extended reserved inode area.
If that happens, access to that inode will cause an IO error, causing the
file system to degrade to an error state.
Fix this potential issue by adding a wraparound option to the common
metadata object allocation routine and by modifying
nilfs_ifile_create_inode() to disable the option so that it only allocates
inodes with inode numbers greater than or equal to the inode number read
in "nilfs->ns_first_ino", regardless of the bitmap status of reserved
inodes.
Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Syzbot reported that mounting and unmounting a specific pattern of
corrupted nilfs2 filesystem images causes a use-after-free of metadata
file inodes, which triggers a kernel bug in lru_add_fn().
As Jan Kara pointed out, this is because the link count of a metadata file
gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(),
tries to delete that inode (ifile inode in this case).
The inconsistency occurs because directories containing the inode numbers
of these metadata files that should not be visible in the namespace are
read without checking.
Fix this issue by treating the inode numbers of these internal files as
errors in the sanity check helper when reading directory folios/pages.
Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer
analysis.
Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8
Reported-by: Jan Kara <jack@suse.cz>
Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "nilfs2: fix potential issues related to reserved inodes".
This series fixes one use-after-free issue reported by syzbot, caused by
nilfs2's internal inode being exposed in the namespace on a corrupted
filesystem, and a couple of flaws that cause problems if the starting
number of non-reserved inodes written in the on-disk super block is
intentionally (or corruptly) changed from its default value.
This patch (of 3):
In the current implementation of nilfs2, "nilfs->ns_first_ino", which
gives the first non-reserved inode number, is read from the superblock,
but its lower limit is not checked.
As a result, if a number that overlaps with the inode number range of
reserved inodes such as the root directory or metadata files is set in the
super block parameter, the inode number test macros (NILFS_MDT_INODE and
NILFS_VALID_INODE) will not function properly.
In addition, these test macros use left bit-shift calculations using with
the inode number as the shift count via the BIT macro, but the result of a
shift calculation that exceeds the bit width of an integer is undefined in
the C specification, so if "ns_first_ino" is set to a large value other
than the default value NILFS_USER_INO (=11), the macros may potentially
malfunction depending on the environment.
Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and
by preventing bit shifts equal to or greater than the NILFS_USER_INO
constant in the inode number test macros.
Also, change the type of "ns_first_ino" from signed integer to unsigned
integer to avoid the need for type casting in comparisons such as the
lower bound check introduced this time.
Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
cifs_expand_read() is causing a performance regression of around 30% by
causing extra pagecache to be allocated for an inode in the readahead path
before we begin actually dispatching RPC requests, thereby delaying the
actual I/O. The expansion is sized according to the rsize parameter, which
seems to be 4MiB on my test system; this is a big step up from the first
requests made by the fio test program.
Simple repro (look at read bandwidth number):
fio --name=writetest --filename=/xfstest.test/foo --time_based --runtime=60 --size=16M --numjobs=1 --rw=read
Fix this by removing cifs_expand_readahead(). Readahead expansion is
mostly useful for when we're using the local cache if the local cache has a
block size greater than PAGE_SIZE, so we can dispense with it when not
caching.
Fixes: 69c3c023af ("cifs: Implement netfslib hooks")
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZoRT7gAKCRCRxhvAZXjc
opNXAP9dXdBWK0LqpLlN5Y01UQ8Kd7AqFCAEvFL5SCX3U3dJ8QD+IZbzIM2qhPJJ
f2gVyw7drWTfqJvWhzFch616QyGVNwA=
=daVI
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"VFS:
- Improve handling of deep ancestor chains in is_subdir()
- Release locks cleanly when fctnl_setlk() races with close().
When setting a file lock fails the VFS tries to cleanup the already
created lock. The helper used for this calls back into the LSM
layer which may cause it to fail, leaving the stale lock accessible
via /proc/locks.
AFS:
- Fix a comma/semicolon typo"
* tag 'vfs-6.10-rc7.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
afs: Convert comma to semicolon
fs: better handle deep ancestor chains in is_subdir()
filelock: Remove locks reliably when fcntl/close race is detected
Jan reported that 'cd ..' may take a long time in deep directory
hierarchies under a bind-mount. If concurrent renames happen it is
possible to livelock in is_subdir() because it will keep retrying.
Change is_subdir() from simply retrying over and over to retry once and
then acquire the rename lock to handle deep ancestor chains better. The
list of alternatives to this approach were less then pleasant. Change
the scope of rcu lock to cover the whole walk while at it.
A big thanks to Jan and Linus. Both Jan and Linus had proposed
effectively the same thing just that one version ended up being slightly
more elegant.
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
- Fix possible global buffer memory leak when unloading EROFS module;
- Fix FS_IOC_GETFSUUID ioctl by using super_set_uuid();
- Reset m_llen to 0 so then it can retry if metadata is invalid.
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmaEAJwRHHhpYW5nQGtl
cm5lbC5vcmcACgkQUXZn5Zlu5qoaSg//cHQSVWZ5joXjTEsSk5a1cROsEOz6I8PG
lY0jrYNxk/xrAapdsr3SzJ+5evz2f3bsKNAy6qHk1LUQwZOMoOWxVZfMmUjirGHy
GncsEbD1XPgW7W/GdbYdg1/fxvBd7z88G5JNv62F07e8MEWWme1YLgV2AykTgWux
FuzUPdunJj10NA235530GWcwsTf2GC774HXIiJdbh+l+iTfC4ESpJJf6+e+aQKaA
GCRAxTV8TQtY98+HKIKFNpXvR31iApRQ4crqkXIzmTiT9Num87vOzzzLo9dYTprX
/rG652hWjYMzOPCDyR4z/W8LM9oUdVjQdb3VITBfQbzz5z0EjJfNlAug8y5PAXKO
sJPTCEd3ayN9Hbn6XFXUc1MzNLMu7+wPop/Oui2dpszMpcLYffJ2cG6MRvdYByTD
cU7aqyKotcwFJYCDInSwFBIYb0sJ4tdz/NpeFmRPwRhHBjKmqn/wM8Rc8pewx7/3
TmnyeTzrD0SPIQi2RbHUl2JTq3zwJ9Ur8+FQKYY/ZqeeJcSkutjoQGybtxN4PkYZ
Gn/qLZHJ3SWY1Najhf85OmzyDWV4lxDoGzz+83E5Z8GkPT/Pa7DXdYql2rQJfGlk
sJ/Bx/+OK0dN1vylGkEiKnAiF/HSZlSQKX6v/swN15bEmpOHWSMn0NmW5qaf5I3Q
0Ix/PbO1OH0=
=it8r
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-6.10-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"The most important one fixes possible infinite loops reported by a
smartphone vendor OPPO recently due to some unexpected zero-sized
compressed pcluster out of interrupted I/Os, storage failures, etc.
Another patch fixes global buffer memory leak on unloading, and the
remaining one switches to use super_set_uuid() to keep with the other
filesystems.
Summary:
- Fix possible global buffer memory leak when unloading EROFS module
- Fix FS_IOC_GETFSUUID ioctl by using super_set_uuid()
- Reset m_llen to 0 so then it can retry if metadata is invalid"
* tag 'erofs-for-6.10-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: ensure m_llen is reset to 0 if metadata is invalid
erofs: convert to use super_set_uuid to support for FS_IOC_GETFSUUID
erofs: fix possible memory leak in z_erofs_gbuf_exit()
When fcntl_setlk() races with close(), it removes the created lock with
do_lock_file_wait().
However, LSMs can allow the first do_lock_file_wait() that created the lock
while denying the second do_lock_file_wait() that tries to remove the lock.
In theory (but AFAIK not in practice), posix_lock_file() could also fail to
remove a lock due to GFP_KERNEL allocation failure (when splitting a range
in the middle).
After the bug has been triggered, use-after-free reads will occur in
lock_get_status() when userspace reads /proc/locks. This can likely be used
to read arbitrary kernel memory, but can't corrupt kernel memory.
This only affects systems with SELinux / Smack / AppArmor / BPF-LSM in
enforcing mode and only works from some security contexts.
Fix it by calling locks_remove_posix() instead, which is designed to
reliably get rid of POSIX locks associated with the given file and
files_struct and is also used by filp_flush().
Fixes: c293621bbf ("[PATCH] stale POSIX lock handling")
Cc: stable@kernel.org
Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563
Signed-off-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/r/20240702-fs-lock-recover-2-v1-1-edd456f63789@google.com
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
In the ref-verify tool, when processing the inline references of an extent
item, we may end up returning with uninitialized return value, because:
1) The 'ret' variable is not initialized if there are no inline extent
references ('ptr' == 'end' before the while loop starts);
2) If we find an extent owner inline reference we don't initialize 'ret'.
So fix these cases by initializing 'ret' to 0 when declaring the variable
and set it to -EINVAL if we find an extent owner inline references and
simple quotas are not enabled (as well as print an error message).
Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/59b40ebe-c824-457d-8b24-0bbca69d472b@gmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reports the following regression detected by KASAN:
BUG: KASAN: slab-out-of-bounds in btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
Read of size 8 at addr ffff88814628ca50 by task syz-executor318/5171
CPU: 0 PID: 5171 Comm: syz-executor318 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
print_address_description mm/kasan/report.c:377 [inline]
print_report+0x169/0x550 mm/kasan/report.c:488
kasan_report+0x143/0x180 mm/kasan/report.c:601
btrfs_qgroup_inherit+0x42e/0x2e20 fs/btrfs/qgroup.c:3277
create_pending_snapshot+0x1359/0x29b0 fs/btrfs/transaction.c:1854
create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1922
btrfs_commit_transaction+0xf20/0x3740 fs/btrfs/transaction.c:2382
create_snapshot+0x6a1/0x9e0 fs/btrfs/ioctl.c:875
btrfs_mksubvol+0x58f/0x710 fs/btrfs/ioctl.c:1029
btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1075
__btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1340
btrfs_ioctl_snap_create_v2+0x1f2/0x3a0 fs/btrfs/ioctl.c:1422
btrfs_ioctl+0x99e/0xc60
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fcbf1992509
RSP: 002b:00007fcbf1928218 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fcbf1a1f618 RCX: 00007fcbf1992509
RDX: 0000000020000280 RSI: 0000000050009417 RDI: 0000000000000003
RBP: 00007fcbf1a1f610 R08: 00007ffea1298e97 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fcbf19eb660
R13: 00000000200002b8 R14: 00007fcbf19e60c0 R15: 0030656c69662f2e
</TASK>
And it also pinned it down to commit b5357cb268 ("btrfs: qgroup: do not
check qgroup inherit if qgroup is disabled").
[CAUSE]
That offending commit skips the whole qgroup inherit check if qgroup is
not enabled.
But that also skips the very basic checks like
num_ref_copies/num_excl_copies and the structure size checks.
Meaning if a qgroup enable/disable race is happening at the background,
and we pass a btrfs_qgroup_inherit structure when the qgroup is
disabled, the check would be completely skipped.
Then at the time of transaction commitment, qgroup is re-enabled and
btrfs_qgroup_inherit() is going to use the incorrect structure and
causing the above KASAN error.
[FIX]
Make btrfs_qgroup_check_inherit() only skip the source qgroup checks.
So that even if invalid btrfs_qgroup_inherit structure is passed in, we
can still reject invalid ones no matter if qgroup is enabled or not.
Furthermore we do already have an extra safety inside
btrfs_qgroup_inherit(), which would just ignore invalid qgroup sources,
so even if we only skip the qgroup source check we're still safe.
Reported-by: syzbot+a0d1f7e26910be4dc171@syzkaller.appspotmail.com
Fixes: b5357cb268 ("btrfs: qgroup: do not check qgroup inherit if qgroup is disabled")
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
calc_available_free_space() returns the total size of metadata (or
system) block groups, which can be allocated from unallocated disk
space. The logic is wrong on zoned mode in two places.
First, the calculation of data_chunk_size is wrong. We always allocate
one zone as one chunk, and no partial allocation of a zone. So, we
should use zone_size (= data_sinfo->chunk_size) as it is.
Second, the result "avail" may not be zone aligned. Since we always
allocate one zone as one chunk on zoned mode, returning non-zone size
aligned bytes will result in less pressure on the async metadata reclaim
process.
This is serious for the nearly full state with a large zone size device.
Allowing over-commit too much will result in less async reclaim work and
end up in ENOSPC. We can align down to the zone size to avoid that.
Fixes: cb6cbab790 ("btrfs: adjust overcommit logic when very close to full")
CC: stable@vger.kernel.org # 6.9
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaC45kACgkQxWXV+ddt
WDt6yRAAkn3n/nkapAvQbtOEIAV9GOc+DYecQXLM+E6m85vsvOBO6OeO/QDfIGvI
ALNE4EEKTkmqk6AOLNX9rQUvo8aOaDXj/9bNZYuVSCzG2hfwLijv6DlRShxuEIE8
kzOisuIZ4w7aL7G7OsUa50j/BLdZ47+iVF/79N+odhdaCDhhK/CcfLbemiLUS9AF
OYkYDyl2WCo9HLduSlVHDWXNUKs7I6/S29UWQpkTKTkmLlMk7rdkgbpjgpKjFxsd
/CuVW5NEGs+4dkV2OdOJ9t+f4qGJ56YuJanvrV3R1bGh+sgDzrcA0kP8318nFzgG
KBYTXjAZoe5RAi4IYfMhSrEExo2JJYFeei0B7Dv3M4IvxLOF7NdMvppBFdODF0A4
20gZ8EgNtZ0sMEafK2WAZSI23sjX0TH+/P3FFKQszxX0fRH3NV6al2drhuBTeIKr
UzxDqeDuqpDHSZskHIMR9nMEymcov5JTg5v2ds64WiT8kr3L23u3j1KO+8UrZ8j4
eB5vYEXE9VJDNPoJQ/qhbqDduzIJ1s+RU8lhEVIUJRIThRk/oHc7YcXgXKqvS6RM
ilJeAzjEhp2t3CQDfx+FJ9jjgi5jhOoCkbITg6iiiExIt++XOXJtL0y2BH514LA5
5jQXoXl/YnukfCaKLNRsxYXCw7JAQxWQiVq3s9g1tG5679zrfNA=
=RkBU
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A fixup for a recent fix that prevents an infinite loop during block
group reclaim.
Unfortunately it introduced an unsafe way of updating block group list
and could race with relocation. This could be hit on fast devices when
relocation/balance does not have enough space"
* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix adding block group to a reclaim list and the unused list during reclaim
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZoJSmAAKCRCRxhvAZXjc
ot3tAQCUjJh7jZvmmkUV0pF51JI1jEumk8d8vPORGsm1A6oMawEA+tyiWYkcIU3t
JUFGZSDce5MuJEI/frDPb98CW2dLkQA=
=fVtx
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"Misc:
- Don't misleadingly warn during filesystem thaw operations.
It's possible that a block device which was frozen before it was
mounted can cause a failing thaw operation if someone concurrently
tried to mount it while that thaw operation was issued and the
device had already been temporarily claimed for the mount (The
mount will of course be aborted because the device is frozen).
netfs:
- Fix io_uring based write-through. Make sure that the total request
length is correctly set.
- Fix partial writes to folio tail.
- Remove some xarray helpers that were intended for bounce buffers
which got defered to a later patch series.
- Make netfs_page_mkwrite() whether folio->mapping is vallid after
acquiring the folio lock.
- Make netfs_page_mkrite() flush conflicting data instead of waiting.
fsnotify:
- Ensure that fsnotify creation events are generated before fsnotify
open events when a file is created via ->atomic_open(). The
ordering was broken before.
- Ensure that no fsnotify events are generated for O_PATH file
descriptors. While no fsnotify open events were generated, fsnotify
close events were. Make it consistent and don't produce any"
* tag 'vfs-6.10-rc7.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix netfs_page_mkwrite() to flush conflicting data, not wait
netfs: Fix netfs_page_mkwrite() to check folio->mapping is valid
netfs: Delete some xarray-wangling functions that aren't used
netfs: Fix early issue of write op on partial write to folio tail
netfs: Fix io_uring based write-through
vfs: generate FS_CREATE before FS_OPEN when ->atomic_open used.
fsnotify: Do not generate events for O_PATH file descriptors
fs: don't misleadingly warn during thaw operations
There is a potential parallel list adding for retrying in
btrfs_reclaim_bgs_work and adding to the unused list. Since the block
group is removed from the reclaim list and it is on a relocation work,
it can be added into the unused list in parallel. When that happens,
adding it to the reclaim list will corrupt the list head and trigger
list corruption like below.
Fix it by taking fs_info->unused_bgs_lock.
[177.504][T2585409] BTRFS error (device nullb1): error relocating ch= unk 2415919104
[177.514][T2585409] list_del corruption. next->prev should be ff1100= 0344b119c0, but was ff11000377e87c70. (next=3Dff110002390cd9c0)
[177.529][T2585409] ------------[ cut here ]------------
[177.537][T2585409] kernel BUG at lib/list_debug.c:65!
[177.545][T2585409] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[177.555][T2585409] CPU: 9 PID: 2585409 Comm: kworker/u128:2 Tainted: G W 6.10.0-rc5-kts #1
[177.568][T2585409] Hardware name: Supermicro SYS-520P-WTR/X12SPW-TF, BIOS 1.2 02/14/2022
[177.579][T2585409] Workqueue: events_unbound btrfs_reclaim_bgs_work[btrfs]
[177.589][T2585409] RIP: 0010:__list_del_entry_valid_or_report.cold+0x70/0x72
[177.624][T2585409] RSP: 0018:ff11000377e87a70 EFLAGS: 00010286
[177.633][T2585409] RAX: 000000000000006d RBX: ff11000344b119c0 RCX:0000000000000000
[177.644][T2585409] RDX: 000000000000006d RSI: 0000000000000008 RDI:ffe21c006efd0f40
[177.655][T2585409] RBP: ff110002e0509f78 R08: 0000000000000001 R09:ffe21c006efd0f08
[177.665][T2585409] R10: ff11000377e87847 R11: 0000000000000000 R12:ff110002390cd9c0
[177.676][T2585409] R13: ff11000344b119c0 R14: ff110002e0508000 R15:dffffc0000000000
[177.687][T2585409] FS: 0000000000000000(0000) GS:ff11000fec880000(0000) knlGS:0000000000000000
[177.700][T2585409] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[177.709][T2585409] CR2: 00007f06bc7b1978 CR3: 0000001021e86005 CR4:0000000000771ef0
[177.720][T2585409] DR0: 0000000000000000 DR1: 0000000000000000 DR2:0000000000000000
[177.731][T2585409] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:0000000000000400
[177.742][T2585409] PKRU: 55555554
[177.748][T2585409] Call Trace:
[177.753][T2585409] <TASK>
[177.759][T2585409] ? __die_body.cold+0x19/0x27
[177.766][T2585409] ? die+0x2e/0x50
[177.772][T2585409] ? do_trap+0x1ea/0x2d0
[177.779][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.788][T2585409] ? do_error_trap+0xa3/0x160
[177.795][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.805][T2585409] ? handle_invalid_op+0x2c/0x40
[177.812][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.820][T2585409] ? exc_invalid_op+0x2d/0x40
[177.827][T2585409] ? asm_exc_invalid_op+0x1a/0x20
[177.834][T2585409] ? __list_del_entry_valid_or_report.cold+0x70/0x72
[177.843][T2585409] btrfs_delete_unused_bgs+0x3d9/0x14c0 [btrfs]
There is a similar retry_list code in btrfs_delete_unused_bgs(), but it is
safe, AFAICS. Since the block group was in the unused list, the used bytes
should be 0 when it was added to the unused list. Then, it checks
block_group->{used,reserved,pinned} are still 0 under the
block_group->lock. So, they should be still eligible for the unused list,
not the reclaim list.
The reason it is safe there it's because because we're holding
space_info->groups_sem in write mode.
That means no other task can allocate from the block group, so while we
are at deleted_unused_bgs() it's not possible for other tasks to
allocate and deallocate extents from the block group, so it can't be
added to the unused list or the reclaim list by anyone else.
The bug can be reproduced by btrfs/166 after a few rounds. In practice
this can be hit when relocation cannot find more chunk space and ends
with ENOSPC.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Suggested-by: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Fixes: 4eb4e85c4f ("btrfs: retry block group reclaim without infinite loop")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sometimes, the on-disk metadata might be invalid due to user
interrupts, storage failures, or other unknown causes.
In that case, z_erofs_map_blocks_iter() may still return a valid
m_llen while other fields remain invalid (e.g., m_plen can be 0).
Due to the return value of z_erofs_scan_folio() in some path will
be ignored on purpose, the following z_erofs_scan_folio() could
then use the invalid value by accident.
Let's reset m_llen to 0 to prevent this.
Link: https://lore.kernel.org/r/20240629185743.2819229-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
* Always free only post-EOF delayed allocations for files with the
XFS_DIFLAG_PREALLOC or APPEND flags set.
* Do not align cow fork delalloc to cowextsz hint when running low on space.
* Allow zero-size symlinks and directories as long as the link count is
zero.
* Change XFS_IOC_EXCHANGE_RANGE to be a _IOW only ioctl. This was ioctl was
introduced during v6.10 developement cycle.
* xfs_init_new_inode() now creates an attribute fork on a newly created
inode even if ATTR feature flag is not enabled.
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZnvYdwAKCRAH7y4RirJu
9DRmAP9VwmSgBrVGZ459K6LluP12FoIpzUljEYSiQiyjhxuQJgD/fou/8G+/TTQH
3TtdmC8Xo7SWRMq9+wPpH5OywbsvZQM=
=fV8d
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Always free only post-EOF delayed allocations for files with the
XFS_DIFLAG_PREALLOC or APPEND flags set.
- Do not align cow fork delalloc to cowextsz hint when running low on
space.
- Allow zero-size symlinks and directories as long as the link count is
zero.
- Change XFS_IOC_EXCHANGE_RANGE to be a _IOW only ioctl. This was ioctl
was introduced during v6.10 developement cycle.
- xfs_init_new_inode() now creates an attribute fork on a newly created
inode even if ATTR feature flag is not enabled.
* tag 'xfs-6.10-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: honor init_xattrs in xfs_init_new_inode for !ATTR fs
xfs: fix direction in XFS_IOC_EXCHANGE_RANGE
xfs: allow unlinked symlinks and dirs with zero size
xfs: restrict when we try to align cow fork delalloc to cowextsz hints
xfs: fix freeing speculative preallocations for preallocated files
- Due to a late review, revert and re-fix a recent crasher fix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmZ+zksACgkQM2qzM29m
f5ey0Q/+O8Z//rVg8L8V1EI44gZ1LRTiJEeDFoJU1Hktn0dOwTvGQJNzNbfpTBbT
zTvvDBd9+cbHQ7Jl2r73+wkqgZbGXsRhMRUCmlA5ur0O7ujkYDo3rpOgQhP4V3LV
1G6o2uOJT9d9UykUF74Demlg97kNrc8CY1QNujXjKrj8jYpQY4KcRCDb91KdYgc9
GSque9zMD2jTJkTkSVRUyfWGEDvYtDxs7/4rdErjixaY30dgkPHae6euRzoQd2sC
v3aoKHSt5Y/pvOX+3/c2QzDqOiKPOnm3dLazB+rK6lYrnt4iUeP3CiPcFmqNJZk2
kZ3hpCPN8Le3oQFlMe450EljB8wLj2DqLix+Y/iOGyA732GLxNW6Hu8PkwRVFFwx
cGGp6KryrcwO/BNHgxceErfhBG4IPKSgnY6C8Io+mPYKwSRAOKA4su02RoJDWJkA
PTf67JgZiKv+bn3EGojeTo9YAvb04EYstttK2yBFUR9xylyew9vaZ9X8mYkqfyHN
pZEWf/SI8SExZteo3Ymx9HLcX5FYsBjgvigpow33kUCn2hMhuhfBXEtmLNsxOSYC
aEICcAYqnBovFJ+3vbz11IDS04ERIFdprlLQuExtJ21pujIlIM9pOxrIGvgu5pPl
++rsDP31Kh6imuZbSaJ5TFTdtDqcpah587m+mQtlaXAjOQL1qfw=
=gwat
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Due to a late review, revert and re-fix a recent crasher fix
* tag 'nfsd-6.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
Revert "nfsd: fix oops when reading pool_stats before server is started"
nfsd: initialise nfsd_info.mutex early.
simple stuff:
- null ptr/err ptr deref fixes
- fix for getting wedged on shutdown after journal error
- fix missing recalc_capacity() call, capacity now changes correctly
after a device goes read only
however: our capacity calculation still doesn't take into account when
we have mixed ro/rw devices and the ro devices have data on them,
that's going to be a more involved fix to separate accounting for
"capacity used on ro devices" and "capacity used on rw devices"
- boring syzbot stuff
slightly more involved:
- discard, invalidate workers are now per device
this has the effect of simplifying how we take device refs in these
paths, and the device ref cleanup fixes a longstanding race between
the device removal path and the discard path
- fixes for how the debugfs code takes refs on btree_trans objects
we have debugfs code that prints in use btree_trans objects. It uses
closure_get() on trans->ref, which is mainly for the cycle detector,
but the debugfs code was using it on a closure that may have hit 0,
which is not allowed; for performance reasons we cannot avoid having
not-in-use transactions on the global list.
introduce some new primitives to fix this and make the synchronization
here a whole lot saner
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmZ+ye4ACgkQE6szbY3K
bnb4shAAqkKdgB2abIaD1t8+KjUiXwt7seRY4EmzwrEaWniW5bDUYMBvV+tew93j
uvGmSKMs4ML/r24hcg0zGPJ9GoWrFb3MWhPYizzRS8QspsUjsECJuehNPCe3RPaf
QBgQtKahTge1e41y1frzkiGKqaOGOTtUVLOfPIebe+oJAhRCYRnrGY2dkZTms7Ue
aXNtBmnlX3Fkmlm0GiKYrTHpAZz3d0kzdX11Pc2vTXvqo/znuJTTVGnjJkdrHzyv
6cz6YnMKFdxLVbYO1KlB/3Hu9y9qt815g1rjvaqym8pDk9ltsGHNM3LcCCCyp7Of
btnbLQ6TdfggK5Kf2hNYuJRY2pnjNyfcNxupQF3RNaw/D/4G5EU16zfFElORC6Mw
eGwXLvDIGqOSSIvevoRZrgJKAvVptXNg9EtCI5Z5ujQ4ExW8ti1lPHp/r5SVOhyz
x0Am14H2ERuz7Vt5jUas3k74+tAck6JWc5OemMQawA5waeH1inMT7QZuBt+Bmrhx
Av0zbhaq4aTsHXmm+Xi6ofj3UBaOQ2rNzT7Au0kxdvJgDPe/USjw4tejV5DmjmHA
SyRsTG7Zn5xJBi7jc47fcwUgUzlxlffVQGFCVjRUU1vF6u/Ldn7K0zfYbkwSCiKp
iWSEyg3j5z5N69Vrgdadma4xTDjL/C5+XsMWh8G8ohf+crhUeSo=
=svIi
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-06-28' of https://evilpiepirate.org/git/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Simple stuff:
- NULL ptr/err ptr deref fixes
- fix for getting wedged on shutdown after journal error
- fix missing recalc_capacity() call, capacity now changes correctly
after a device goes read only
however: our capacity calculation still doesn't take into account
when we have mixed ro/rw devices and the ro devices have data on
them, that's going to be a more involved fix to separate accounting
for "capacity used on ro devices" and "capacity used on rw devices"
- boring syzbot stuff
Slightly more involved:
- discard, invalidate workers are now per device
this has the effect of simplifying how we take device refs in these
paths, and the device ref cleanup fixes a longstanding race between
the device removal path and the discard path
- fixes for how the debugfs code takes refs on btree_trans objects we
have debugfs code that prints in use btree_trans objects.
It uses closure_get() on trans->ref, which is mainly for the cycle
detector, but the debugfs code was using it on a closure that may
have hit 0, which is not allowed; for performance reasons we cannot
avoid having not-in-use transactions on the global list.
Introduce some new primitives to fix this and make the
synchronization here a whole lot saner"
* tag 'bcachefs-2024-06-28' of https://evilpiepirate.org/git/bcachefs:
bcachefs: Fix kmalloc bug in __snapshot_t_mut
bcachefs: Discard, invalidate workers are now per device
bcachefs: Fix shift-out-of-bounds in bch2_blacklist_entries_gc
bcachefs: slab-use-after-free Read in bch2_sb_errors_from_cpu
bcachefs: Add missing bch2_journal_do_writes() call
bcachefs: Fix null ptr deref in journal_pins_to_text()
bcachefs: Add missing recalc_capacity() call
bcachefs: Fix btree_trans list ordering
bcachefs: Fix race between trans_put() and btree_transactions_read()
closures: closure_get_not_zero(), closure_return_sync()
bcachefs: Make btree_deadlock_to_text() clearer
bcachefs: fix seqmutex_relock()
bcachefs: Fix freeing of error pointers
These are some bugfixes for system call ABI issues I found while
working on a cleanup series. None of these are urgent since these
bugs have gone unnoticed for many years, but I think we probably
want to backport them all to stable kernels, so it makes sense
to have the fixes included as early as possible.
One more fix addresses a compile-time warning in kallsyms that was
uncovered by a patch I did to enable additional warnings in 6.10. I had
mistakenly thought that this fix was already merged through the module
tree, but as Geert pointed out it was still missing.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmZ9iRQACgkQYKtH/8kJ
UicHIxAA0ej8dMJ3znHovc/CQYkZMpb88bxLlqLotOYuOItEzvR6wd7vnu4cPeZf
nHguBiP9RAnzCZhL3F7AS3p8NNJ+P1OZo+sj6tZOANO955mzj1VQ5p2fbSRw+WI3
4Oc1HKvP6UMhHGjU3wHY0+Odd5bpoepN9/fnoiQcHPzq0LbUFM8e4D9KGr51I7fV
r7tuDMy9xykEfs6umuDu9wOXih3JkpV9eSmefmjvzgxG3hKLdsvTbWVsVmnKXhZm
xdFiTROOmiNvttfkQh0ruBd0drBl8aVhzCKPqIe0vQqS9rBmcf9WTkcJzpihq/fI
BA3QjVQFvmHeXs+viaLZf4r/y0qabaTPRBMQxZyEFE0QgtwfxT4/ZnNEbH2s3pIC
Pcm0JltLlHLbZs7V63drL6txCoFVndiPXdEBTBsqBwnuDHXCj/tvDcO3tuVTfYoz
9G8TTOsYNEDLYmn8AmzzhJOh75gp6O6A2ui3TtcD9KFNaoTQqqzPJWp8IoxBfxcb
3+rzRWQvXAhfSRBIaejv1quo2ZxoZk3KO3i+ysRITTUF1MLz7b0/Yy/8r74CqmOu
8Iw2Q0BaFPtj1x+VjneQnL++iYWYPEh+ZBEg7AD/z6QHwMLz33SyHlD+/RgRkthV
J/L9xUBs5HagWJxRYkVc+l0LOVclTqVJieKD2AWONZ5OFRB+CCI=
=ieQy
-----END PGP SIGNATURE-----
Merge tag 'asm-generic-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic fixes from Arnd Bergmann:
"These are some bugfixes for system call ABI issues I found while
working on a cleanup series. None of these are urgent since these bugs
have gone unnoticed for many years, but I think we probably want to
backport them all to stable kernels, so it makes sense to have the
fixes included as early as possible.
One more fix addresses a compile-time warning in kallsyms that was
uncovered by a patch I did to enable additional warnings in 6.10. I
had mistakenly thought that this fix was already merged through the
module tree, but as Geert pointed out it was still missing"
* tag 'asm-generic-fixes-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
kallsyms: rework symbol lookup return codes
linux/syscalls.h: add missing __user annotations
syscalls: mmap(): use unsigned offset type consistently
s390: remove native mmap2() syscall
hexagon: fix fadvise64_64 calling conventions
csky, hexagon: fix broken sys_sync_file_range
sh: rework sync_file_range ABI
powerpc: restore some missing spu syscalls
parisc: use generic sys_fanotify_mark implementation
parisc: use correct compat recv/recvfrom syscalls
sparc: fix compat recv/recvfrom syscalls
sparc: fix old compat_sys_select()
syscalls: fix compat_sys_io_pgetevents_time64 usage
ftruncate: pass a signed offset
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmZ9gV0ACgkQxWXV+ddt
WDsK2hAAmbOArbK5QHprawdOqqvJL46yoGCMba798EjYo53+hO1F8/lb531+zCUI
GDhdIC2mHdRIkARJ8Cde5POUjID1Kv3Rlc0rdHy3nOw38WZmA/+HdkcKzQhsDFSR
/FX9RKSWiu0xl6JdCLh4KkIWE+2m1v1kybhvRHCKb+70iBua1e+OSoM33BeiIhrP
yoFwMwIbzG2CoZOHoobDxUjs9ZMUHm4wH0csJYG9R59Vv7uLBOgpWuQB46iqpoj4
EYR8Sg8PscI7YXa0y8VTP3pdrNMW48IC6jerIAKHUeWRWRoTCh9He+3/E4xjNHxz
3Pm+Aat5QYdsqmE68IbeN5c7QB1YAdUCgoJJJwAFjwe9WtTn8RS9RiMjWIr+VHYm
E3REibQI151p0yHwAl8xPHDiTecmlNisof0eg6gzHdvODm/NYFuFapD+aDxWribX
G63dOa8Fy0h4pwDoF73Rd2YYbtO6tDVSNVIG3bWpPep3r+SI/oC4JMHbmn1aqmqF
c0/ZYVbsx/Hm066l4LCrpgi7kJ8en2zQ8MmcZHHt+2gXe1AyAON7kvRHaEizUCaA
fnLVhQvaOofC4g7DJc1JkwyLc9VF5hMTYUhldoJvqj1wm2qsT/siJgKVvllRGoMs
FU2qlWYaN/fPRylyEySyZPq/sWKHOAZZSOhyWM8SB2nYoBXNkO0=
=qpEp
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix quota root leak after quota disable failure
- fix condition when checking if a zone can be added as free
- allocate inode in NOFS context during logging or tree-log replay
- handle raid-stripe-tree lookup correctly during scrub
* tag 'for-6.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: qgroup: fix quota root leak after quota disable failure
btrfs: scrub: handle RST lookup error correctly
btrfs: zoned: fix initial free space detection
btrfs: use NOFS context when getting inodes during logging and log replay
Fix netfs_page_mkwrite() to check that folio->mapping is valid once it has
taken the folio lock (as filemap_page_mkwrite() does). Without this,
generic/247 occasionally oopses with something like the following:
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
RIP: 0010:trace_event_raw_event_netfs_folio+0x61/0xc0
...
Call Trace:
<TASK>
? __die_body+0x1a/0x60
? page_fault_oops+0x6e/0xa0
? exc_page_fault+0xc2/0xe0
? asm_exc_page_fault+0x22/0x30
? trace_event_raw_event_netfs_folio+0x61/0xc0
trace_netfs_folio+0x39/0x40
netfs_page_mkwrite+0x14c/0x1d0
do_page_mkwrite+0x50/0x90
do_pte_missing+0x184/0x200
__handle_mm_fault+0x42d/0x500
handle_mm_fault+0x121/0x1f0
do_user_addr_fault+0x23e/0x3c0
exc_page_fault+0xc2/0xe0
asm_exc_page_fault+0x22/0x30
This is due to the invalidate_inode_pages2_range() issued at the end of the
DIO write interfering with the mmap'd writes.
Fixes: 102a7e2c59 ("netfs: Allow buffered shared-writeable mmap through netfs_page_mkwrite()")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/780211.1719318546@warthog.procyon.org.uk
Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Matthew Wilcox <willy@infradead.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Delete some xarray-based buffer wangling functions that are intended for
use with bounce buffering, but aren't used because bounce-buffering got
deferred to a later patch series. Now, however, the intention is to use
something other than an xarray to do this.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-9-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
During the writeback procedure, at the end of netfs_write_folio(), pending
write operations are flushed if the amount of write-streaming data stored
in a page is less than the size of the folio because if we haven't modified
a folio to the end, it cannot be contiguous with the following folio...
except if the dirty region of the folio is right at the end of the folio
space.
Fix the test to take the offset into the folio into account as well, such
that if the dirty region runs right up to the end of the folio, we leave
the flushing for later.
Fixes: 288ace2f57 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com> (DFS, global name space)
cc: v9fs@lists.linux.dev
cc: linux-afs@lists.infradead.org
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20240620173137.610345-4-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
FS_IOC_GETFSUUID ioctl exposes the uuid of a filesystem. To support
the ioctl, init sb->s_uuid with super_set_uuid().
Signed-off-by: Huang Xiaojia <huangxiaojia2@huawei.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240624063704.2476070-1-huangxiaojia2@huawei.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Because we incorrectly reused of variable `i` in `z_erofs_gbuf_exit()`
for inner loop, we may exit early from outer loop resulting in memory
leak. Fix this by using separate variable for iterating through inner loop.
Fixes: f36f3010f6 ("erofs: rename per-CPU buffers to global buffer pool and make it configurable")
Signed-off-by: Sandeep Dhavale <dhavale@google.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20240624220206.3373197-1-dhavale@google.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
xfs_init_new_inode ignores the init_xattrs parameter for filesystems
that do not have ATTR enabled. As a result, the first init_xattrs file
to be created by the kernel will not have an attr fork created to store
acls. Storing that first acl will add ATTR to the superblock flags, so
subsequent files will be created with attr forks. The overhead of this
is so small that chances are that nobody has noticed this behavior.
However, this is disastrous on a filesystem with parent pointers because
it requires that a new linkable file /must/ have a pre-existing attr
fork, and the parent pointers code uses init_xattrs to create that fork.
The preproduction version of mkfs.xfs used to set this, but the V5 sb
verifier only requires ATTR2, not ATTR. There is no guard for
filesystems with (PARENT && !ATTR).
It turns out that I misunderstood the two flags -- ATTR means that we at
some point created an attr fork to store xattrs in a file; ATTR2
apparently means only that inodes have dynamic fork offsets or that the
filesystem was mounted with the "attr2" option.
Fixes: 2442ee15bb ("xfs: eager inode attr fork init needs attr feature awareness")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
The kernel reads userspace's buffer but does not write it back.
Therefore this is really an _IOW ioctl. Change this before 6.10 final
releases.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd changed the inode verifier to
prohibit zero-length symlinks and directories. If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.
Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.
Fixes: 3c6f46eacd ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
xfs/205 produces the following failure when always_cow is enabled:
--- a/tests/xfs/205.out 2024-02-28 16:20:24.437887970 -0800
+++ b/tests/xfs/205.out.bad 2024-06-03 21:13:40.584000000 -0700
@@ -1,4 +1,5 @@
QA output created by 205
*** one file
+ !!! disk full (expected)
*** one file, a few bytes at a time
*** done
This is the result of overly aggressive attempts to align cow fork
delalloc reservations to the CoW extent size hint. Looking at the trace
data, we're trying to append a single fsblock to the "fred" file.
Trying to create a speculative post-eof reservation fails because
there's not enough space.
We then set @prealloc_blocks to zero and try again, but the cowextsz
alignment code triggers, which expands our request for a 1-fsblock
reservation into a 39-block reservation. There's not enough space for
that, so the whole write fails with ENOSPC even though there's
sufficient space in the filesystem to allocate the single block that we
need to land the write.
There are two things wrong here -- first, we shouldn't be attempting
speculative preallocations beyond what was requested when we're low on
space. Second, if we've already computed a posteof preallocation, we
shouldn't bother trying to align that to the cowextsize hint.
Fix both of these problems by adding a flag that only enables the
expansion of the delalloc reservation to the cowextsize if we're doing a
non-extending write, and only if we're not doing an ENOSPC retry. This
requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.
I probably should have caught this six years ago when 6ca30729c2 was
being reviewed, but oh well. Update the comments to reflect what the
code does now.
Fixes: 6ca30729c2 ("xfs: bmap code cleanup")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
xfs_can_free_eofblocks returns false for files that have persistent
preallocations unless the force flag is passed and there are delayed
blocks. This means it won't free delalloc reservations for files
with persistent preallocations unless the force flag is set, and it
will also free the persistent preallocations if the force flag is
set and the file happens to have delayed allocations.
Both of these are bad, so do away with the force flag and always free
only post-EOF delayed allocations for files with the XFS_DIFLAG_PREALLOC
or APPEND flags set.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
There's no reason for discards to be single threaded across all devices;
this will improve performance on multi device setups.
Additionally, making them per-device simplifies the refcounting on
bch_dev->io_ref; we now hold it for the duration that the discard path
is running, which fixes a race between the discard path and device
removal.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This series fix the shift-out-of-bounds issue in
bch2_blacklist_entries_gc().
Instead of passing 0 to eytzinger0_first() when iterating the entries,
we explicitly check 0 and initialize i to be 0.
syzbot has tested the proposed patch and the reproducer did not trigger
any issue:
Reported-and-tested-by: syzbot+835d255ad6bc7f29ee12@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=835d255ad6bc7f29ee12
Signed-off-by: Pei Li <peili.dev@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
nfsd_info.mutex can be dereferenced by svc_pool_stats_start()
immediately after the new netns is created. Currently this can
trigger an oops.
Move the initialisation earlier before it can possibly be dereferenced.
Fixes: 7b207ccd98 ("svc: don't hold reference for poolstats, only mutex.")
Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com>
Closes: https://lore.kernel.org/all/c2e9f6de-1ec4-4d3a-b18d-d5a6ec0814a0@linux.ibm.com/
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Since commit 2282679fb2 ("mm: submit multipage write for SWP_FS_OPS
swap-space"), we can plug multiple pages then unplug them all together.
That means iov_iter_count(iter) could be way bigger than PAGE_SIZE, it
actually equals the size of iov_iter_npages(iter, INT_MAX).
Note this issue has nothing to do with large folios as we don't support
THP_SWPOUT to non-block devices.
[v-songbaohua@oppo.com: figure out the cause and correct the commit message]
Link: https://lkml.kernel.org/r/20240618065647.21791-1-21cnbao@gmail.com
Fixes: 2282679fb2 ("mm: submit multipage write for SWP_FS_OPS swap-space")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Closes: https://lore.kernel.org/linux-mm/20240617053201.GA16852@lst.de/
Reviewed-by: Martin Wege <martin.l.wege@gmail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Trond Myklebust <trondmy@kernel.org>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The code in ocfs2_dio_end_io_write() estimates number of necessary
transaction credits using ocfs2_calc_extend_credits(). This however does
not take into account that the IO could be arbitrarily large and can
contain arbitrary number of extents.
Extent tree manipulations do often extend the current transaction but not
in all of the cases. For example if we have only single block extents in
the tree, ocfs2_mark_extent_written() will end up calling
ocfs2_replace_extent_rec() all the time and we will never extend the
current transaction and eventually exhaust all the transaction credits if
the IO contains many single block extents. Once that happens a
WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in
jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to
this error. This was actually triggered by one of our customers on a
heavily fragmented OCFS2 filesystem.
To fix the issue make sure the transaction always has enough credits for
one extent insert before each call of ocfs2_mark_extent_written().
Heming Zhao said:
------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"
PID: xxx TASK: xxxx CPU: 5 COMMAND: "SubmitThread-CA"
#0 machine_kexec at ffffffff8c069932
#1 __crash_kexec at ffffffff8c1338fa
#2 panic at ffffffff8c1d69b9
#3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
#4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
#5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
#6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
#7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
#8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
#9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
#11 dio_complete at ffffffff8c2b9fa7
#12 do_blockdev_direct_IO at ffffffff8c2bc09f
#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
#14 generic_file_direct_write at ffffffff8c1dcf14
#15 __generic_file_write_iter at ffffffff8c1dd07b
#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
#17 aio_write at ffffffff8c2cc72e
#18 kmem_cache_alloc at ffffffff8c248dde
#19 do_io_submit at ffffffff8c2ccada
#20 do_syscall_64 at ffffffff8c004984
#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba
Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz
Fixes: c15471f795 ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If during the quota disable we fail when cleaning the quota tree or when
deleting the root from the root tree, we jump to the 'out' label without
ever dropping the reference on the quota root, resulting in a leak of the
root since fs_info->quota_root is no longer pointing to the root (we have
set it to NULL just before those steps).
Fix this by always doing a btrfs_put_root() call under the 'out' label.
This is a problem that exists since qgroups were first added in 2012 by
commit bed92eae26 ("Btrfs: qgroup implementation and prototypes"), but
back then we missed a kfree on the quota root and free_extent_buffer()
calls on its root and commit root nodes, since back then roots were not
yet reference counted.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/060 with forced RST feature, it would crash the
following ASSERT() inside scrub_read_endio():
ASSERT(sector_nr < stripe->nr_sectors);
Before that, we would have tree dump from
btrfs_get_raid_extent_offset(), as we failed to find the RST entry for
the range.
[CAUSE]
Inside scrub_submit_extent_sector_read() every time we allocated a new
bbio we immediately called btrfs_map_block() to make sure there was some
RST range covering the scrub target.
But if btrfs_map_block() fails, we immediately call endio for the bbio,
while the bbio is newly allocated, it's completely empty.
Then inside scrub_read_endio(), we go through the bvecs to find
the sector number (as bi_sector is no longer reliable if the bio is
submitted to lower layers).
And since the bio is empty, such bvecs iteration would not find any
sector matching the sector, and return sector_nr == stripe->nr_sectors,
triggering the ASSERT().
[FIX]
Instead of calling btrfs_map_block() after allocating a new bbio, call
btrfs_map_block() first.
Since our only objective of calling btrfs_map_block() is only to update
stripe_len, there is really no need to do that after btrfs_alloc_bio().
This new timing would avoid the problem of handling empty bbio
completely, and in fact fixes a possible race window for the old code,
where if the submission thread is the only owner of the pending_io, the
scrub would never finish (since we didn't decrease the pending_io
counter).
Although the root cause of RST lookup failure still needs to be
addressed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a new block group, it calls btrfs_add_new_free_space() to add
the entire block group range into the free space accounting.
__btrfs_add_free_space_zoned() checks if size == block_group->length to
detect the initial free space adding, and proceed that case properly.
However, if the zone_capacity == zone_size and the over-write speed is fast
enough, the entire zone can be over-written within one transaction. That
confuses __btrfs_add_free_space_zoned() to handle it as an initial free
space accounting. As a result, that block group becomes a strange state: 0
used bytes, 0 zone_unusable bytes, but alloc_offset == zone_capacity (no
allocation anymore).
The initial free space accounting can properly be checked by checking
alloc_offset too.
Fixes: 98173255bd ("btrfs: zoned: calculate free space from zone capacity")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>