Commit Graph

94541 Commits

Author SHA1 Message Date
Namjae Jeon
a77e0e02af ksmbd: add support for supplementary groups
Even though system user has a supplementary group, It gets
NT_STATUS_ACCESS_DENIED when attempting to create file or directory.
This patch add KSMBD_EVENT_LOGIN_REQUEST_EXT/RESPONSE_EXT netlink events
to get supplementary groups list. The new netlink event doesn't break
backward compatibility when using old ksmbd-tools.

Co-developed-by: Atte Heikkilä <atteh.mailbox@gmail.com>
Signed-off-by: Atte Heikkilä <atteh.mailbox@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-11 11:02:14 -05:00
Jaegeuk Kim
332fade75d f2fs: allow parallel DIO reads
This fixes a regression which prevents parallel DIO reads.

Fixes: 0cac51185e ("f2fs: fix to avoid racing in between read and OPU dio write")
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2024-10-11 15:12:07 +00:00
Darrick J. Wong
0fb823f1cf xfs: fix integer overflow in xrep_bmap
The variable declaration in this function predates the merge of the
nrext64 (aka 64-bit extent counters) feature, which means that the
variable declaration type is insufficient to avoid an integer overflow.
Fix that by redeclaring the variable to be xfs_extnum_t.

Coverity-id: 1630958
Fixes: 8f71bede8e ("xfs: repair inode fork block mapping data structures")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-11 12:32:48 +02:00
Gao Xiang
ae54567eaa erofs: get rid of kaddr in struct z_erofs_maprecorder
`kaddr` becomes useless after switching to metabuf.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241010235830.1535616-1-hsiangkao@linux.alibaba.com
2024-10-11 13:36:58 +08:00
Gao Xiang
2402082e53 erofs: get rid of z_erofs_try_to_claim_pcluster()
Just fold it into the caller for simplicity.

Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241010090420.405871-1-hsiangkao@linux.alibaba.com
2024-10-11 13:36:58 +08:00
Gao Xiang
416a8b2c02 erofs: ensure regular inodes for file-backed mounts
Only regular inodes are allowed for file-backed mounts, not directories
(as seen in the original syzbot case) or special inodes.

Also ensure that .read_folio() is implemented on the underlying fs
for the primary device.

Fixes: fb17675026 ("erofs: add file-backed mount support")
Reported-by: syzbot+001306cd9c92ce0df23f@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/r/00000000000011bdde0622498ee3@google.com
Tested-by: syzbot+001306cd9c92ce0df23f@syzkaller.appspotmail.com
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240917130803.32418-1-hsiangkao@linux.alibaba.com
2024-10-11 13:36:41 +08:00
Linus Torvalds
eb952c47d1 for-6.12-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcH4qAACgkQxWXV+ddt
 WDtDig//czntfO+iRvDERZWTIB6vdVExfLd3r3ZNYlO1pIvgCuvqx3iYva+0ZhGW
 8A+gcRax7cz0jCaxDp/+5lIRfdNxZH6/LwjZsDgU8Ly7himeRmwhtn2fCgNeiH/K
 bUl92+ZMo2vwqTKXYa3xF1g3Hz6cRXVW7gJrMwNhb1hpPTGx+lgYJU02m/Io/vjK
 1jcrZ84OEPIOY5uiAoDyO2hgsT/zVEeuuOiSTpKSzrghPbo0vmjLiYJ5T+CE5Uw3
 u3w7/Fqnw49NwucqtncvyFFDXY9EWNuQhowi3hqJgOYTInqwwJigIpQV0hDDwYxb
 ohGUGjazGfAEf/cy1jZXMbwCVgg8/Nj9x0eDKKhfs19VYUbMkEYQ8LKRTUlCeBwS
 H/2AmqpqHEEO+tPY3P+w6MVwkNho8JNpWPdP5OzJs7XrD067IViOjD06HPM/k5ci
 TU3zp9NYvgHVtmfZK1Aqsg9OYVhI1klVXejmlAzOLxejRPWXK/1hBw3kXbC6I+k1
 50l0Yh1dgEnclMI3yWsKoj8IYUAkh2eudt0pNsot4a5vICMY++NVS2eukdz5UcEz
 ix7hcpYcCcmzoOaelyEgmdAncWVGJT5w2Nzy85YaOp+Z1C65Ywb41utU+sSY+swB
 kZfwl9vrsfu754vX7UKBherCvvYo+Lnj3GeX8Oe+1LoT2BP0TPk=
 =lTqc
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - update fstrim loop and add more cancellation points, fix reported
   delayed or blocked suspend if there's a huge chunk queued

 - fix error handling in recent qgroup xarray conversion

 - in zoned mode, fix warning printing device path without RCU
   protection

 - again fix invalid extent xarray state (6252690f7e), lost due to
   refactoring

* tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
  btrfs: zoned: fix missing RCU locking in error message when loading zone info
  btrfs: fix missing error handling when adding delayed ref with qgroups enabled
  btrfs: add cancellation points to trim loops
  btrfs: split remaining space to discard in chunks
2024-10-10 10:02:59 -07:00
Linus Torvalds
5870963f6c nfsd-6.12 fixes:
- Fix NFSD bring-up / shutdown
 - Fix a UAF when releasing a stateid
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcHx6IACgkQM2qzM29m
 f5ecExAAheSpSP9D+1n3hKeOlNfhY8FUzd41Arn4NYV+jIBJbtx94/FlSNOxA0mp
 Ovm1I8uAxy4TR8TLt7tsxfT7JDOStKwXFl3QlOUZT/+uyyJr7/q5R959R3oMiccR
 Rfrpj6j2yYWrI8qGDGHca4Vv2bDSxr4mzztWwDe0SHSsjwf4OAcv+XF5vcfZ/CJN
 Bxulb9WfNU8XvdFcRDHokMfk6jiY6/+FCTwX8ckvbVEG6gHT8+CRYSUJ05j0LJGo
 xKZV913NgzcuV7PH0vq6vExJE6+rEPt/ejDAT5FM5yeNe+WJ4RTDgsYyIr9iLbHF
 mWB9M4NnP+EZhejtOCbZ9RZjjKro09ilEPpqILuuGQPtcHSeWmhNbFz0kwLe+zYZ
 CdtjnPZhjB0ITWgZ1HCtoJ8k/ZcMa7iiM/kApMLGr9fVj8/BHHFzS95PK7K/Fqur
 FLdhvo6CzZCnRd16e2kqWsG7wO2lPWcz4NWTf9wxIG5GCunXoVCEnK1VfHvnldbH
 BIFXZ+ib5qnL2i3Qmz7bQxmfIp5ryZnNx1mF0OM8imR9K/rsnARd7JfQ99lpMy8D
 mD4coZVTMMk/Zg9zuH8k5GBzB2zXXqgngp4IJIxqrKR7/AsuSU3R7r+O9CWN91GQ
 GKpRtMn/rVUg81jxDr3qoKquyxONoyVrVXAKsj1PgUSQdjUJgqU=
 =Rud7
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix NFSD bring-up / shutdown

 - Fix a UAF when releasing a stateid

* tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix possible badness in FREE_STATEID
  nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
  NFSD: Mark filecache "down" if init fails
2024-10-10 09:52:49 -07:00
Linus Torvalds
825ec756af Bug fixes for 6.12-rc3
* A few small typo fixes
 * fstests xfs/538 DEBUG-only fix
 * Performance fix on blockgc on COW'ed files,
   by skipping trims on cowblock inodes currently
   opened for write
 * Prevent cowblocks to be freed under dirty pagecache
   during unshare
 * Update MAINTAINERS file to quote the new maintainer
 
 Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZwY6mgAKCRBGdaER5Qtf
 poE8AYCZzMJr9wMrs2RsWRnaEhMRJNZIPQmSKXgHAK3mV5AbXtdHRc8yGVNHf+mW
 Nh0fwAkBf1Ix0VJWkXOSFHZI9O2lLRsCogbNjFhwYF0MHZch2/mq1Wa4Tj1SDlfg
 Ny2PJBNHyA==
 =OkRo
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - A few small typo fixes

 - fstests xfs/538 DEBUG-only fix

 - Performance fix on blockgc on COW'ed files, by skipping trims on
   cowblock inodes currently opened for write

 - Prevent cowblocks to be freed under dirty pagecache during unshare

 - Update MAINTAINERS file to quote the new maintainer

* tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix a typo
  xfs: don't free cowblocks from under dirty pagecache on unshare
  xfs: skip background cowblock trims on inodes open for write
  xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
  xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc
  xfs: don't ifdef around the exact minlen allocations
  xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate
  xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
  xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
  xfs: return bool from xfs_attr3_leaf_add
  xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname
  xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
  xfs: scrub: convert comma to semicolon
  xfs: Remove empty declartion in header file
  MAINTAINERS: add Carlos Maiolino as XFS release manager
2024-10-10 09:45:45 -07:00
Aleksa Sarai
f92f0a1b05
openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
While we do currently return -EFAULT in this case, it seems prudent to
follow the behaviour of other syscalls like clone3. It seems quite
unlikely that anyone depends on this error code being EFAULT, but we can
always revert this if it turns out to be an issue.

Cc: stable@vger.kernel.org # v5.6+
Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20241010-extensible-structs-check_fields-v3-3-d2833dfe6edd@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10 12:09:03 +02:00
Namjae Jeon
7aa8804c0b ksmbd: fix user-after-free from session log off
There is racy issue between smb2 session log off and smb2 session setup.
It will cause user-after-free from session log off.
This add session_lock when setting SMB2_SESSION_EXPIRED and referece
count to session struct not to free session while it is being used.

Cc: stable@vger.kernel.org # v5.15+
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25282
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-09 21:23:17 -05:00
Linus Torvalds
d3d1556696 12 hotfixes, 5 of which are c:stable. All singletons, about half of which
are MM.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZwcILgAKCRDdBJ7gKXxA
 jjnMAQDRl+UfscRUeMippi7wnL3ee6MKyhhZVOhoxP24uB7yBwD/Ulq4oE+mLHml
 YTlK/wj5qTZIsdxGaBzM1yifqp3L7gU=
 =lFmJ
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "12 hotfixes, 5 of which are c:stable. All singletons, about half of
  which are MM"

* tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: zswap: delete comments for "value" member of 'struct zswap_entry'.
  CREDITS: sort alphabetically by name
  secretmem: disable memfd_secret() if arch cannot set direct map
  .mailmap: update Fangrui's email
  mm/huge_memory: check pmd_special() only after pmd_present()
  resource, kunit: fix user-after-free in resource_test_region_intersects()
  fs/proc/kcore.c: allow translation of physical memory addresses
  selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test
  device-dax: correct pgoff align in dax_set_mapping()
  kthread: unpark only parked kthread
  Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
  bcachefs: do not use PF_MEMALLOC_NORECLAIM
2024-10-09 16:01:40 -07:00
Kent Overstreet
3b80552e70 bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry
inode_bit_waitqueue() is changing - this update clears the way for
sched changes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:58:18 -04:00
Kent Overstreet
a7e2dd58fb bcachefs: Check if stuck in journal_res_get()
Like how we already do when the allocator seems to be stuck, check if
we're waiting too long for a journal reservation and print some debug
info.

This is specifically to track down
https://github.com/koverstreet/bcachefs/issues/656

which is showing up in userspace where we don't have sysfs/debugfs to
get the journal debug info.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:57:59 -04:00
Alan Huang
9205d24cf7 bcachefs: Fix state lock involved deadlock
We increased write ref, if the fs went to RO, that would lead to
a deadlock, it actually happens:

00171 ========= TEST   generic/279
00171
00172 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=nocow
00172 bcachefs (vdb): recovering from clean shutdown, journal seq 35
00172 bcachefs (vdb): accounting_read... done
00172 bcachefs (vdb): alloc_read... done
00172 bcachefs (vdb): stripes_read... done
00172 bcachefs (vdb): snapshots_read... done
00172 bcachefs (vdb): journal_replay... done
00172 bcachefs (vdb): resume_logged_ops... done
00172 bcachefs (vdb): going read-write
00172 bcachefs (vdb): done starting filesystem
00172 FSTYP         -- bcachefs
00172 PLATFORM      -- Linux/aarch64 farm3-kvm 6.11.0-rc1-ktest-g3e290a0b8e34 #7030 SMP Tue Oct  8 14:15:12 UTC 2024
00172 MKFS_OPTIONS  -- --nocow /dev/vdc
00172 MOUNT_OPTIONS -- /dev/vdc /mnt/scratch
00172
00172 bcachefs (vdc): starting version 1.12: rebalance_work_acct_fix opts=nocow
00172 bcachefs (vdc): initializing new filesystem
00172 bcachefs (vdc): going read-write
00172 bcachefs (vdc): marking superblocks
00172 bcachefs (vdc): initializing freespace
00172 bcachefs (vdc): done initializing freespace
00172 bcachefs (vdc): reading snapshots table
00172 bcachefs (vdc): reading snapshots done
00172 bcachefs (vdc): done starting filesystem
00173 bcachefs (vdc): shutting down
00173 bcachefs (vdc): going read-only
00173 bcachefs (vdc): finished waiting for writes to stop
00173 bcachefs (vdc): flushing journal and stopping allocators, journal seq 4
00173 bcachefs (vdc): flushing journal and stopping allocators complete, journal seq 6
00173 bcachefs (vdc): shutdown complete, journal seq 7
00173 bcachefs (vdc): marking filesystem clean
00173 bcachefs (vdc): shutdown complete
00173 bcachefs (vdb): shutting down
00173 bcachefs (vdb): going read-only
00361 INFO: task umount:6180 blocked for more than 122 seconds.
00361 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030
00361 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00361 task:umount          state:D stack:0     pid:6180  tgid:6180  ppid:6176   flags:0x00000004
00361 Call trace:
00362 __switch_to (arch/arm64/kernel/process.c:556)
00362 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529)
00363 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621)
00365 bch2_fs_read_only (fs/bcachefs/super.c:346 (discriminator 41))
00367 __bch2_fs_stop (fs/bcachefs/super.c:620)
00368 bch2_put_super (fs/bcachefs/fs.c:1942)
00369 generic_shutdown_super (include/linux/list.h:373 (discriminator 2) fs/super.c:650 (discriminator 2))
00371 bch2_kill_sb (fs/bcachefs/fs.c:2170)
00372 deactivate_locked_super (fs/super.c:434 fs/super.c:475)
00373 deactivate_super (fs/super.c:508)
00374 cleanup_mnt (fs/namespace.c:250 fs/namespace.c:1374)
00376 __cleanup_mnt (fs/namespace.c:1381)
00376 task_work_run (include/linux/sched.h:2024 kernel/task_work.c:224)
00377 do_notify_resume (include/linux/resume_user_mode.h:50 arch/arm64/kernel/entry-common.c:151)
00377 el0_svc (arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:171 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713)
00377 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731)
00378 el0t_64_sync (arch/arm64/kernel/entry.S:598)
00378 INFO: task tee:6182 blocked for more than 122 seconds.
00378 Not tainted 6.11.0-rc1-ktest-g3e290a0b8e34 #7030
00378 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
00378 task:tee             state:D stack:0     pid:6182  tgid:6182  ppid:533    flags:0x00000004
00378 Call trace:
00378 __switch_to (arch/arm64/kernel/process.c:556)
00378 __schedule (kernel/sched/core.c:5191 kernel/sched/core.c:6529)
00378 schedule (include/asm-generic/bitops/generic-non-atomic.h:128 include/linux/thread_info.h:192 include/linux/sched.h:2084 kernel/sched/core.c:6608 kernel/sched/core.c:6621)
00378 schedule_preempt_disabled (kernel/sched/core.c:6680)
00379 rwsem_down_read_slowpath (kernel/locking/rwsem.c:1073 (discriminator 1))
00379 down_read (kernel/locking/rwsem.c:1529)
00381 bch2_gc_gens (fs/bcachefs/sb-members.h:77 fs/bcachefs/sb-members.h:88 fs/bcachefs/sb-members.h:128 fs/bcachefs/btree_gc.c:1240)
00383 bch2_fs_store_inner (fs/bcachefs/sysfs.c:473)
00385 bch2_fs_internal_store (fs/bcachefs/sysfs.c:417 fs/bcachefs/sysfs.c:580 fs/bcachefs/sysfs.c:576)
00386 sysfs_kf_write (fs/sysfs/file.c:137)
00387 kernfs_fop_write_iter (fs/kernfs/file.c:334)
00389 vfs_write (fs/read_write.c:497 fs/read_write.c:590)
00390 ksys_write (fs/read_write.c:643)
00391 __arm64_sys_write (fs/read_write.c:652)
00391 invoke_syscall.constprop.0 (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54)
00392 do_el0_svc (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2) arch/arm64/kernel/syscall.c:151 (discriminator 2))
00392 el0_svc (arch/arm64/include/asm/irqflags.h:55 arch/arm64/include/asm/irqflags.h:76 arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713)
00392 el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:731)
00392 el0t_64_sync (arch/arm64/kernel/entry.S:598)

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:54 -04:00
Mohammed Anees
a30f32222d bcachefs: Fix NULL pointer dereference in bch2_opt_to_text
This patch adds a bounds check to the bch2_opt_to_text function to prevent
NULL pointer dereferences when accessing the opt->choices array. This
ensures that the index used is within valid bounds before dereferencing.
The new version enhances the readability.

Reported-and-tested-by: syzbot+37186860aa7812b331d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=37186860aa7812b331d5
Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:53 -04:00
Alan Huang
a154154148 bcachefs: Release transaction before wake up
We will get this if we wake up first:

Kernel panic - not syncing: btree_node_write_done leaked btree_trans

since there are still transactions waiting for cycle detectors after
BTREE_NODE_write_in_flight is cleared.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:53 -04:00
Piotr Zalewski
0151d10a48 bcachefs: add check for btree id against max in try read node
Add check for read node's btree_id against BTREE_ID_NR_MAX in
try_read_btree_node to prevent triggering EBUG_ON condition in
bch2_btree_id_root[1].

[1] https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00

Reported-by: syzbot+cf7b2215b5d70600ec00@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=cf7b2215b5d70600ec00
Fixes: 4409b8081d ("bcachefs: Repair pass for scanning for btree nodes")
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:53 -04:00
Kent Overstreet
19773ec997 bcachefs: Disk accounting device validation fixes
- Fix failure to validate that accounting replicas entries point to
  valid devices: this wasn't a real bug since they'd be cleaned up by
  GC, but is still something we should know about

- Fix failure to validate that dev_data_type entries point to valid
  devices: this does fix a real bug, since bch2_accounting_read() would
  then try to copy the counters to that device and pop an inconsistent
  error when the device didn't exist

- Remove accounting entries that are zeroed or invalid: if we're not
  validating them we need to get rid of them: they might not exist in
  the superblock, so we need the to trigger the superblock mark path
  when they're readded.

  This fixes the replication.ktest rereplicate test, which was failing
  with "superblock not marked for replicas..."

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:53 -04:00
Kent Overstreet
9d86178782 bcachefs: bch2_inode_or_descendents_is_open()
fsck can now correctly check if inodes in interior snapshot nodes are
open/in use.

- Tweak the vfs inode rhashtable so that the subvolume ID isn't hashed,
  meaning inums in different subvolumes will hash to the same slot. Note
  that this is a hack, and will cause problems if anyone ever has the
  same file in many different snapshots open all at the same time.

- Then check if any of those subvolumes is a descendent of the snapshot
  ID being checked

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:53 -04:00
Kent Overstreet
84878e8245 bcachefs: Kill bch2_propagate_key_to_snapshot_leaves()
Dead code now.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:53 -04:00
Kent Overstreet
9b23fdbd5d bcachefs: bcachefs_metadata_version_inode_has_child_snapshots
There's an inherent race in taking a snapshot while an unlinked file is
open, and then reattaching it in the child snapshot.

In the interior snapshot node the file will appear unlinked, as though
it should be deleted - it's not referenced by anything in that snapshot
- but we can't delete it, because the file data is referenced by the
child snapshot.

This was being handled incorrectly with
propagate_key_to_snapshot_leaves() - but that doesn't resolve the
fundamental inconsistency of "this file looks like it should be deleted
according to normal rules, but - ".

To fix this, we need to fix the rule for when an inode is deleted. The
previous rule, ignoring snapshots (there was no well-defined rule
for with snapshots) was:
  Unlinked, non open files are deleted, either at recovery time or
  during online fsck

The new rule is:
  Unlinked, non open files, that do not exist in child snapshots, are
  deleted.

To make this work transactionally, we add a new inode flag,
BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when
considering whether to delete an inode, or put it on the deleted list.

For transactional consistency, clearing it handled by the inode trigger:
when deleting an inode we check if there are parent inodes which can now
have the BCH_INODE_has_child_snapshot flag cleared.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-09 16:42:51 -04:00
Alexander Gordeev
3d5854d75e fs/proc/kcore.c: allow translation of physical memory addresses
When /proc/kcore is read an attempt to read the first two pages results in
HW-specific page swap on s390 and another (so called prefix) pages are
accessed instead.  That leads to a wrong read.

Allow architecture-specific translation of memory addresses using
kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks similarily
to /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks.  That
way an architecture can deal with specific physical memory ranges.

Re-use the existing /dev/mem callback implementation on s390, which
handles the described prefix pages swapping correctly.

For other architectures the default callback is basically NOP.  It is
expected the condition (vaddr == __va(__pa(vaddr))) always holds true for
KCORE_RAM memory type.

Link: https://lkml.kernel.org/r/20240930122119.1651546-1-agordeev@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09 12:47:19 -07:00
Michal Hocko
9897713fe1 bcachefs: do not use PF_MEMALLOC_NORECLAIM
Patch series "remove PF_MEMALLOC_NORECLAIM" v3.


This patch (of 2):

bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
inode to achieve GFP_NOWAIT semantic while holding locks. If this
allocation fails it will drop locks and use GFP_NOFS allocation context.

We would like to drop PF_MEMALLOC_NORECLAIM because it is really
dangerous to use if the caller doesn't control the full call chain with
this flag set. E.g. if any of the function down the chain needed
GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
cause unexpected failure.

While this is not the case in this particular case using the scoped gfp
semantic is not really needed bacause we can easily pus the allocation
context down the chain without too much clutter.

[akpm@linux-foundation.org: fix kerneldoc warnings]
Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org
Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: James Morris <jmorris@namei.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09 12:47:18 -07:00
Dai Ngo
7ef6010806 NFS: remove revoked delegation from server's delegation list
After the delegation is returned to the NFS server remove it
from the server's delegations list to reduce the time it takes
to scan this list.

Network trace captured while running the below script shows the
time taken to service the CB_RECALL increases gradually due to
the overhead of traversing the delegation list in
nfs_delegation_find_inode_server.

The NFS server in this test is a Solaris server which issues
CB_RECALL when receiving the all-zero stateid in the SETATTR.

mount=/mnt/data
for i in $(seq 1 20)
do
   echo $i
   mkdir $mount/testtarfile$i
   time  tar -C $mount/testtarfile$i -xf 5000_files.tar
done

Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-09 15:39:22 -04:00
Linus Torvalds
ff9d4099e6 unicode updates
* Patch to handle code-points with the Ignorable property as regular
 character instead of treating them as an empty string. (Me)
 
 Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQS3XO7QfvpFoONBhH1OwQgI3t8RJgUCZwbNtQAKCRBOwQgI3t8R
 JrlrAP4yCrZCp4YPlXO6oQGfS9RIeYpmcMzGmp1IAeqlzpB5qwD/YS53kiAzF4qV
 +eD2fl/O4qNhZcWqBZKSH4shZBbXJAg=
 =XCsY
 -----END PGP SIGNATURE-----

Merge tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode

Pull unicode fix from Gabriel Krisman Bertazi:

 - Handle code-points with the Ignorable property as regular character
   instead of treating them as an empty string (me)

* tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
  unicode: Don't special case ignorable code points
2024-10-09 12:22:02 -07:00
Gabriel Krisman Bertazi
5c26d2f1d3 unicode: Don't special case ignorable code points
We don't need to handle them separately. Instead, just let them
decompose/casefold to themselves.

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
2024-10-09 13:34:01 -04:00
Naohiro Aota
e761be2a07 btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
This commit is a replay of commit 6252690f7e ("btrfs: fix invalid
mapping of extent xarray state"). We need to call
btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that
xarray DIRTY tag is cleared.

With a refactoring commit 8189197425 ("btrfs: refactor
__extent_writepage_io() to do sector-by-sector submission"), it screwed
up and the order is reversed and causing the same hang. Fix the ordering
now in submit_one_sector().

Fixes: 8189197425 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-09 13:23:51 +02:00
Filipe Manana
fe4cd7ed12 btrfs: zoned: fix missing RCU locking in error message when loading zone info
At btrfs_load_zone_info() we have an error path that is dereferencing
the name of a device which is a RCU string but we are not holding a RCU
read lock, which is incorrect.

Fix this by using btrfs_err_in_rcu() instead of btrfs_err().

The problem is there since commit 08e11a3db0 ("btrfs: zoned: load zone's
allocation offset"), back then at btrfs_load_block_group_zone_info() but
then later on that code was factored out into the helper
btrfs_load_zone_info() by commit 09a46725cc ("btrfs: zoned: factor out
per-zone logic from btrfs_load_block_group_zone_info").

Fixes: 08e11a3db0 ("btrfs: zoned: load zone's allocation offset")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-09 13:23:51 +02:00
Andrew Kreimer
77bfe1b11e xfs: fix a typo
Fix a typo in comments.

Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-09 10:05:26 +02:00
Brian Foster
4390f019ad xfs: don't free cowblocks from under dirty pagecache on unshare
fallocate unshare mode explicitly breaks extent sharing. When a
command completes, it checks the data fork for any remaining shared
extents to determine whether the reflink inode flag and COW fork
preallocation can be removed. This logic doesn't consider in-core
pagecache and I/O state, however, which means we can unsafely remove
COW fork blocks that are still needed under certain conditions.

For example, consider the following command sequence:

xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
	-c "pwrite 0 32k" -c "funshare 0 1k" <file>

This allocates a data block at offset 0, shares it, and then
overwrites it with a larger buffered write. The overwrite triggers
COW fork preallocation, 32 blocks by default, which maps the entire
32k write to delalloc in the COW fork. All but the shared block at
offset 0 remains hole mapped in the data fork. The unshare command
redirties and flushes the folio at offset 0, removing the only
shared extent from the inode. Since the inode no longer maps shared
extents, unshare purges the COW fork before the remaining 28k may
have written back.

This leaves dirty pagecache backed by holes, which writeback quietly
skips, thus leaving clean, non-zeroed pagecache over holes in the
file. To verify, fiemap shows holes in the first 32k of the file and
reads return different data across a remount:

$ xfs_io -c "fiemap -v" <file>
<file>:
 EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
   ...
   1: [8..511]:        hole               504
   ...
$ xfs_io -c "pread -v 4k 8" <file>
00001000:  cd cd cd cd cd cd cd cd  ........
$ umount <mnt>; mount <dev> <mnt>
$ xfs_io -c "pread -v 4k 8" <file>
00001000:  00 00 00 00 00 00 00 00  ........

To avoid this problem, make unshare follow the same rules used for
background cowblock scanning and never purge the COW fork for inodes
with dirty pagecache or in-flight I/O.

Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-09 10:05:10 +02:00
Linus Torvalds
5b7c893ed5 Changes for 6.12-rc3
New:
 	implement fallocate for compressed files;
 	add support for the compression attribute;
 	optimize large writes to sparse files.
 Fixed:
 	fix several potential deadlock scenarios;
 	fix various internal bugs detected by syzbot;
 	add checks before accessing NTFS structures during parsing;
 	correct the format of output messages.
 Refactored:
 	replace fsparam_flag_no with fsparam_flag in options parser;
 	remove unused functions and macros.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmcFRkgACgkQqbAzH4Mk
 B7aQdQ//VT6lzBaxSZ0dm3xs/ygQYE3ter1OczAMQxh70O5px+Z0li5WCXQQe31W
 I74MmAy2G7C+JdzMkSc4F7+pFF21DO+1WQjODl9rP7yoHiLteiLFkMI+P5tHuFFT
 VPupAiCDQgm6QCtW2wofp/WR0ooxpLzC/yQsILrhUXJHdEY21oHLy/MQSOhVnv2k
 VeFdWA70+YS+JEQ+YyZBSeGeMVMQG9VROA57dZ+eAy1vgBUJGcKhYp9v4eiWG1uT
 A8rnV81FiHu+49OrE+ZwStguHkxu30dvlHuOqKW0pLW9/XoKCFf9w0WqLOGrAxLs
 lKznVgQZ521+QW+lNURMHBZt/bqQxQQInuOOH7jwLyBh0tmKoZlIU9Vw4V3yrm3t
 9NELb0FR/FeH02O89nMHl0SeaxDJvgFHDnfoOShuHaffur61d4RjChxqB4QcibOl
 Ibvtns71Uo89ngPIXdeTJK+7XZQVmAZHBXXYL+Rwr84Yv0tyn2q8DbrqzmphiYlR
 wbLyVBGqtx0ZcN/geuFLiVt2W/HSMPocXqrxv791ALX6cryL+CZdolQP7vWjt4Vh
 rKoaRZYN8fdcfT+BIYy95Kz0k12T7ColOoMfxVwugxK9ypF6tH6BJBwYXjCW4MOv
 AhkszJcsISxMKidsbzZA9xESM6M3sLHcW0swjAwWWRhc5zbSe6E=
 =j71V
 -----END PGP SIGNATURE-----

Merge tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3

Pull ntfs3 updates from Konstantin Komarov:
"New:
   - implement fallocate for compressed files
   - add support for the compression attribute
   - optimize large writes to sparse files

 Fixes:
   - fix several potential deadlock scenarios
   - fix various internal bugs detected by syzbot
   - add checks before accessing NTFS structures during parsing
   - correct the format of output messages

  Refactoring:
   - replace fsparam_flag_no with fsparam_flag in options parser
   - remove unused functions and macros"

* tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3: (25 commits)
  fs/ntfs3: Format output messages like others fs in kernel
  fs/ntfs3: Additional check in ntfs_file_release
  fs/ntfs3: Fix general protection fault in run_is_mapped_full
  fs/ntfs3: Sequential field availability check in mi_enum_attr()
  fs/ntfs3: Additional check in ni_clear()
  fs/ntfs3: Fix possible deadlock in mi_read
  ntfs3: Change to non-blocking allocation in ntfs_d_hash
  fs/ntfs3: Remove unused al_delete_le
  fs/ntfs3: Rename ntfs3_setattr into ntfs_setattr
  fs/ntfs3: Replace fsparam_flag_no -> fsparam_flag
  fs/ntfs3: Add support for the compression attribute
  fs/ntfs3: Implement fallocate for compressed files
  fs/ntfs3: Make checks in run_unpack more clear
  fs/ntfs3: Add rough attr alloc_size check
  fs/ntfs3: Stale inode instead of bad
  fs/ntfs3: Refactor enum_rstbl to suppress static checker
  fs/ntfs3: Fix sparse warning in ni_fiemap
  fs/ntfs3: Fix warning possible deadlock in ntfs_set_state
  fs/ntfs3: Fix sparse warning for bigendian
  fs/ntfs3: Separete common code for file_read/write iter/splice
  ...
2024-10-08 10:53:06 -07:00
Christian Brauner
dad1b6c805
Merge patch series "fsdax/xfs: unshare range fixes for 6.12"
Darrick J. Wong <djwong@kernel.org> says:

This patchset fixes multiple data corruption bugs in the fallocate
unshare range implementation for fsdax.

* patches from https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs:
  fsdax: dax_unshare_iter needs to copy entire blocks
  fsdax: remove zeroing code from dax_unshare_iter
  iomap: share iomap_unshare_iter predicate code with fsdax
  xfs: don't allocate COW extents when unsharing a hole

Link: https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-08 12:27:14 +02:00
Filipe Manana
6ef8fbce01 btrfs: fix missing error handling when adding delayed ref with qgroups enabled
When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(),
if we fail to insert the qgroup record we don't error out, we ignore it.
In fact we treat it as if there was no error and there was already an
existing record - we don't distinguish between the cases where
btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already
existed and we can free the given record, and the case where it returns
a negative error value, meaning the insertion into the xarray that is
used to track records failed.

Effectively we end up ignoring that we are lacking qgroup record in the
dirty extents xarray, resulting in incorrect qgroup accounting.

Fix this by checking for errors and return them to the callers.

Fixes: 3cce39a8ca ("btrfs: qgroup: use xarray to track dirty extents in transaction")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-07 23:22:30 +02:00
Luca Stefani
69313850dc btrfs: add cancellation points to trim loops
There are reports that system cannot suspend due to running trim because
the task responsible for trimming the device isn't able to finish in
time, especially since we have a free extent discarding phase, which can
trim a lot of unallocated space. There are no limits on the trim size
(unlike the block group part).

Since trime isn't a critical call it can be interrupted at any time,
in such cases we stop the trim, report the amount of discarded bytes and
return an error.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-07 23:21:56 +02:00
Luca Stefani
a99fcb0158 btrfs: split remaining space to discard in chunks
Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
mostly empty although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a large delay.

Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in
preparation of introduction of cancellation points to trim. The value
of the chunk size is arbitrary, it can be higher or derived from actual
device capabilities but we can't easily read that using
bio_discard_limit().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-07 23:21:53 +02:00
Darrick J. Wong
50793801fc
fsdax: dax_unshare_iter needs to copy entire blocks
The code that copies data from srcmap to iomap in dax_unshare_iter is
very very broken, which bfoster's recent fsx changes have exposed.

If the pos and len passed to dax_file_unshare are not aligned to an
fsblock boundary, the iter pos and length in the _iter function will
reflect this unalignment.

dax_iomap_direct_access always returns a pointer to the start of the
kmapped fsdax page, even if its pos argument is in the middle of that
page.  This is catastrophic for data integrity when iter->pos is not
aligned to a page, because daddr/saddr do not point to the same byte in
the file as iter->pos.  Hence we corrupt user data by copying it to the
wrong place.

If iter->pos + iomap_length() in the _iter function not aligned to a
page, then we fail to copy a full block, and only partially populate the
destination block.  This is catastrophic for data confidentiality
because we expose stale pmem contents.

Fix both of these issues by aligning copy_pos/copy_len to a page
boundary (remember, this is fsdax so 1 fsblock == 1 base page) so that
we always copy full blocks.

We're not done yet -- there's no call to invalidate_inode_pages2_range,
so programs that have the file range mmap'd will continue accessing the
old memory mapping after the file metadata updates have completed.

Be careful with the return value -- if the unshare succeeds, we still
need to return the number of bytes that the iomap iter thinks we're
operating on.

Cc: ruansy.fnst@fujitsu.com
Fixes: d984648e42 ("fsdax,xfs: port unshare to fsdax")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813328.1131942.16777025316348797355.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-07 13:51:47 +02:00
Darrick J. Wong
95472274b6
fsdax: remove zeroing code from dax_unshare_iter
Remove the code in dax_unshare_iter that zeroes the destination memory
because it's not necessary.

If srcmap is unwritten, we don't have to do anything because that
unwritten extent came from the regular file mapping, and unwritten
extents cannot be shared.  The same applies to holes.

Furthermore, zeroing to unshare a mapping is just plain wrong because
unsharing means copy on write, and we should be copying data.

This is effectively a revert of commit 13dd4e0462 ("fsdax: unshare:
zero destination if srcmap is HOLE or UNWRITTEN")

Cc: ruansy.fnst@fujitsu.com
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813311.1131942.16033376284752798632.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-07 13:51:47 +02:00
Darrick J. Wong
6ef6a0e821
iomap: share iomap_unshare_iter predicate code with fsdax
The predicate code that iomap_unshare_iter uses to decide if it's really
needs to unshare a file range mapping should be shared with the fsdax
version, because right now they're opencoded and inconsistent.

Note that we simplify the predicate logic a bit -- we no longer allow
unsharing of inline data mappings, but there aren't any filesystems that
allow shared inline data currently.

This is a fix in the sense that it should have been ported to fsdax.

Fixes: b53fdb215d ("iomap: improve shared block detection in iomap_unshare_iter")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813294.1131942.15762084021076932620.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-07 13:51:47 +02:00
Darrick J. Wong
b8c4076db5
xfs: don't allocate COW extents when unsharing a hole
It doesn't make sense to allocate a COW extent when unsharing a hole
because holes cannot be shared.

Fixes: 1f1397b721 ("xfs: don't allocate into the data fork for an unshare request")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813277.1131942.5486112889531210260.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-07 13:51:46 +02:00
David Howells
796a404964
netfs: In readahead, put the folio refs as soon extracted
netfslib currently defers dropping the ref on the folios it obtains during
readahead to after it has started I/O on the basis that we can do it whilst
we wait for the I/O to complete, but this runs the risk of the I/O
collection racing with this in future.

Furthermore, Matthew Wilcox strongly suggests that the refs should be
dropped immediately, as readahead_folio() does (netfslib is using
__readahead_batch() which doesn't drop the refs).

Fixes: ee4cdf7ba8 ("netfs: Speed up buffered reading")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3771538.1728052438@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-07 13:48:22 +02:00
Brian Foster
90a71daaf7 xfs: skip background cowblock trims on inodes open for write
The background blockgc scanner runs on a 5m interval by default and
trims preallocation (post-eof and cow fork) from inodes that are
otherwise idle. Idle effectively means that iolock can be acquired
without blocking and that the inode has no dirty pagecache or I/O in
flight.

This simple mechanism and heuristic has worked fairly well for
post-eof speculative preallocations. Support for reflink and COW
fork preallocations came sometime later and plugged into the same
mechanism, with similar heuristics. Some recent testing has shown
that COW fork preallocation may be notably more sensitive to blockgc
processing than post-eof preallocation, however.

For example, consider an 8GB reflinked file with a COW extent size
hint of 1MB. A worst case fully randomized overwrite of this file
results in ~8k extents of an average size of ~1MB. If the same
workload is interrupted a couple times for blockgc processing
(assuming the file goes idle), the resulting extent count explodes
to over 100k extents with an average size <100kB. This is
significantly worse than ideal and essentially defeats the COW
extent size hint mechanism.

While this particular test is instrumented, it reflects a fairly
reasonable pattern in practice where random I/Os might spread out
over a large period of time with varying periods of (in)activity.
For example, consider a cloned disk image file for a VM or container
with long uptime and variable and bursty usage. A background blockgc
scan that races and processes the image file when it happens to be
clean and idle can have a significant effect on the future
fragmentation level of the file, even when still in use.

To help combat this, update the heuristic to skip cowblocks inodes
that are currently opened for write access during non-sync blockgc
scans. This allows COW fork preallocations to persist for as long as
possible unless otherwise needed for functional purposes (i.e. a
sync scan), the file is idle and closed, or the inode is being
evicted from cache. While here, update the comments to help
distinguish performance oriented heuristics from the logic that
exists to maintain functional correctness.

Suggested-by: Darrick Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:12 +02:00
Christoph Hellwig
6aac770598 xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation
variant fails to drop into the lowmode last resort allocator, and
thus can sometimes fail allocations for which the caller has a
transaction block reservation.

Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:12 +02:00
Christoph Hellwig
405ee87c69 xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc
xfs_bmap_exact_minlen_extent_alloc duplicates the args setup in
xfs_bmap_btalloc.  Switch to call it from xfs_bmap_btalloc after
doing the basic setup.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
b611fddc04 xfs: don't ifdef around the exact minlen allocations
Exact minlen allocations only exist as an error injection tool for debug
builds.  Currently this is implemented using ifdefs, which means the code
isn't even compiled for non-XFS_DEBUG builds.  Enhance the compile test
coverage by always building the code and use the compilers' dead code
elimination to remove it from the generated binary instead.

The only downside is that the alloc_minlen_only field is unconditionally
added to struct xfs_alloc_args now, but by moving it around and packing
it tightly this doesn't actually increase the size of the structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
865469cd41 xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate
Userdata and metadata allocations end up in the same allocation helpers.
Remove the separate xfs_bmap_alloc_userdata function to make this more
clear.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
b3f4e84e2f xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return
-ENOSPC both for an actual failure to allocate a disk block, but also
to signal the caller to convert the format of the attr fork.  Use magic
1 to ask for the conversion here as well.

Note that unlike the similar issue in xfs_attr3_leaf_split, this one was
only found by code review.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
a5f73342ab xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
xfs_attr3_leaf_split propagates the need for an extra btree split as
-ENOSPC to it's only caller, but the same return value can also be
returned from xfs_da_grow_inode when it fails to find free space.

Distinguish the two cases by returning 1 for the extra split case instead
of overloading -ENOSPC.

This can be triggered relatively easily with the pending realtime group
support and a file system with a lot of small zones that use metadata
space on the main device.  In this case every about 5-10th run of
xfs/538 runs into the following assert:

	ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC);

in xfs_attr3_leaf_split caused by an allocation failure.  Note that
the allocation failure is caused by another bug that will be fixed
subsequently, but this commit at least sorts out the error handling.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
346c1d46d4 xfs: return bool from xfs_attr3_leaf_add
xfs_attr3_leaf_add only has two potential return values, indicating if the
entry could be added or not.  Replace the errno return with a bool so that
ENOSPC from it can't easily be confused with a real ENOSPC.

Remove the return value from the xfs_attr3_leaf_add_work helper entirely,
as it always return 0.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Christoph Hellwig
b1c649da15 xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname
xfs_attr_leaf_try_add is only called by xfs_attr_leaf_addname, and
merging the two will simplify a following error handling fix.

To facilitate this move the remote block state save/restore helpers up in
the file so that they don't need forward declarations now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Uros Bizjak
20195d011c xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
Use !try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in
xlog_cil_insert_pcp_aggregate().  x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg.

Also, try_cmpxchg implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.

Note that the value from *ptr should be read using READ_ONCE to
prevent the compiler from merging, refetching or reordering the read.

No functional change intended.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Yan Zhen
6148b77960 xfs: scrub: convert comma to semicolon
Replace a comma between expression statements by a semicolon.

Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Zhang Zekun
f6225eebd7 xfs: Remove empty declartion in header file
The definition of xfs_attr_use_log_assist() has been removed since
commit d9c61ccb3b ("xfs: move xfs_attr_use_log_assist out of xfs_log.c").
So, Remove the empty declartion in header files.

Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-07 08:00:11 +02:00
Pali Rohár
63271b7d56 cifs: Fix creating native symlinks pointing to current or parent directory
Calling 'ln -s . symlink' or 'ln -s .. symlink' creates symlink pointing to
some object name which ends with U+F029 unicode codepoint. This is because
trailing dot in the object name is replaced by non-ASCII unicode codepoint.

So Linux SMB client currently is not able to create native symlink pointing
to current or parent directory on Windows SMB server which can be read by
either on local Windows server or by any other SMB client which does not
implement compatible-reverse character replacement.

Fix this problem in cifsConvertToUTF16() function which is doing that
character replacement. Function comment already says that it does not need
to handle special cases '.' and '..', but after introduction of native
symlinks in reparse point form, this handling is needed.

Note that this change depends on the previous change
"cifs: Improve creating native symlinks pointing to directory".

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-06 22:57:12 -05:00
Pali Rohár
3eb4051253 cifs: Improve creating native symlinks pointing to directory
SMB protocol for native symlinks distinguish between symlink to directory
and symlink to file. These two symlink types cannot be exchanged, which
means that symlink of file type pointing to directory cannot be resolved at
all (and vice-versa).

Windows follows this rule for local filesystems (NTFS) and also for SMB.

Linux SMB client currenly creates all native symlinks of file type. Which
means that Windows (and some other SMB clients) cannot resolve symlinks
pointing to directory created by Linux SMB client.

As Linux system does not distinguish between directory and file symlinks,
its API does not provide enough information for Linux SMB client during
creating of native symlinks.

Add some heuristic into the Linux SMB client for choosing the correct
symlink type during symlink creation. Check if the symlink target location
ends with slash, or last path component is dot or dot-dot, and check if the
target location on SMB share exists and is a directory. If at least one
condition is truth then create a new SMB symlink of directory type.
Otherwise create it as file type symlink.

This change improves interoperability with Windows systems. Windows systems
would be able to resolve more SMB symlinks created by Linux SMB client
which points to existing directory.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-06 22:57:12 -05:00
Kent Overstreet
cba31b7eee bcachefs: Delete vestigal check_inode() checks
BCH_INODE_i_size_dirty dates from before we had logged operations for
truncate (as well as finsert) - it hasn't been needed since before
bcachefs was mainlined.

BCH_INODE_i_sectors_dirty hasn't been needed since we started always
updating i_sectors transactionally - it's been unused for even longer.

BCH_INODE_backptr_untrusted also hasn't been used since prior to
mainlining; when unlinking a hardling, we zero out the backpointer
fields if they're for the dirent being removed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-06 03:03:45 -04:00
Kent Overstreet
12f286085b bcachefs: btree_iter_peek_upto() now handles BTREE_ITER_all_snapshots
end_pos now compares against snapshot ID when required

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-06 03:03:45 -04:00
Kent Overstreet
38864eccf7 bcachefs: reattach_inode() now correctly handles interior snapshot nodes
When we find an unreachable inode, we now reattach it in the oldest
version that needs to be reattached (thus avoiding redundant work
reattaching every single version), and we now fix up inode -> dirent
backpointers in newer versions as needed - or white out the reattaching
dirent in newer versions, if the newer version isn't supposed to be
reattached.

This results in the second verify fsck now passing cleanly after
repairing on a user-provided filesystem image with thousands of
different snapshots.

Reported-by: Christopher Snowhill <chris@kode54.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-06 03:03:45 -04:00
Kent Overstreet
bade9711e0 bcachefs: Split out check_unreachable_inodes() pass
With inode backpointers, we can write a very simple
check_unreachable_inodes() pass that only looks for non-unlinked inodes
that are missing backpointers, and reattaches them.

This simplifies check_directory_structure() so that it's now only
checking for directory structure loops,

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-06 03:03:45 -04:00
Kent Overstreet
bf4baaa087 bcachefs: Fix lockdep splat in bch2_accounting_read
We can't take sb_lock while holding mark_lock, so split out
replicas_entry_validate() and replicas_entry_sb_validate() -
replicas_entry_validate() now uses the normal online device interface.

00039 ========= TEST   set_option
00039
00039 WATCHDOG 30
00040 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=errors=panic
00040 bcachefs (vdb): initializing new filesystem
00040 bcachefs (vdb): going read-write
00040 bcachefs (vdb): marking superblocks
00040 bcachefs (vdb): initializing freespace
00040 bcachefs (vdb): done initializing freespace
00040 bcachefs (vdb): reading snapshots table
00040 bcachefs (vdb): reading snapshots done
00040 bcachefs (vdb): done starting filesystem
00040 zstd
00041 bcachefs (vdb): shutting down
00041 bcachefs (vdb): going read-only
00041 bcachefs (vdb): finished waiting for writes to stop
00041 bcachefs (vdb): flushing journal and stopping allocators, journal seq 3
00041 bcachefs (vdb): flushing journal and stopping allocators complete, journal seq 11
00041 bcachefs (vdb): shutdown complete, journal seq 12
00041 bcachefs (vdb): marking filesystem clean
00041 bcachefs (vdb): shutdown complete
00041 Setting option on offline fs
00041 bch2_write_super(): fatal error : attempting to write superblock that wasn't version downgraded (1.12: (unknown version) > 1.10: disk_accounting_v3)
00041 fatal error - emergency read only
00041 bch2_write_super(): fatal error : attempting to write superblock that wasn't version downgraded (1.12: (unknown version) > 1.10: disk_accounting_v3)
00042 bcachefs (vdb): starting version 1.12: rebalance_work_acct_fix opts=errors=panic,compression=zstd
00042 bcachefs (vdb): recovering from clean shutdown, journal seq 12
00042 bcachefs (vdb): accounting_read...
00042
00042 ======================================================
00042 WARNING: possible circular locking dependency detected
00042 6.12.0-rc1-ktest-g805e938a8502 #6807 Not tainted
00042 ------------------------------------------------------
00042 mount.bcachefs/665 is trying to acquire lock:
00045 ffffff80cc280908 (&c->sb_lock){+.+.}-{3:3}, at: bch2_replicas_entry_validate (fs/bcachefs/replicas.c:102)
00045
00045 but task is already holding lock:
00048 ffffff80cc284870 (&c->mark_lock){++++}-{0:0}, at: bch2_accounting_read (fs/bcachefs/disk_accounting.c:670 (discriminator 1))
00048
00048 which lock already depends on the new lock.
00048
00048
00048 the existing dependency chain (in reverse order) is:
00048
00048 -> #1 (&c->mark_lock){++++}-{0:0}:
00049 percpu_down_write (kernel/locking/percpu-rwsem.c:232)
00052 bch2_sb_replicas_to_cpu_replicas (fs/bcachefs/replicas.c:583)
00055 bch2_sb_to_fs (fs/bcachefs/super-io.c:614)
00057 bch2_fs_open (fs/bcachefs/super.c:828 fs/bcachefs/super.c:2050)
00060 bch2_fs_get_tree (fs/bcachefs/fs.c:2067)
00062 vfs_get_tree (fs/super.c:1801)
00064 path_mount (fs/namespace.c:3507 fs/namespace.c:3834)
00066 __arm64_sys_mount (fs/namespace.c:3847 fs/namespace.c:4055 fs/namespace.c:4032 fs/namespace.c:4032)
00067 invoke_syscall.constprop.0 (arch/arm64/include/asm/syscall.h:61 arch/arm64/kernel/syscall.c:54)
00068 do_el0_svc (include/linux/thread_info.h:127 (discriminator 2) arch/arm64/kernel/syscall.c:140 (discriminator 2) arch/arm64/kernel/syscall.c:151 (discriminator 2))
00069 el0_svc (arch/arm64/include/asm/irqflags.h:82 arch/arm64/include/asm/irqflags.h:123 arch/arm64/include/asm/irqflags.h:136 arch/arm64/kernel/entry-common.c:165 arch/arm64/kernel/entry-common.c:178 arch/arm64/kernel/entry-common.c:713)
00069 ========= FAILED TIMEOUT set_option in 30s

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-06 03:03:45 -04:00
Linus Torvalds
8f602276d3 bcachefs fixes for 6.12-rc2
A lot of little fixes, bigger ones include:
 
 - bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
   changes, now fixed along with another lost wakeup
 - fragmentation LRU fixes; fsck now repairs successfully (this is the
   data structure copygc uses); along with some nice simplification.
 - Rework logged op error handling, so that if logged op replay errors
   (due to another filesystem error) we delete the logged op instead of
   going into an infinite loop)
 - Various small filesystem connectivitity repair fixes
 
 The final part of this patch series, fixing snapshots + unlinked file
 handling, is now out on the list - I'm giving that part of the series
 more time for user testing.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcBhkIACgkQE6szbY3K
 bnYt8RAAqZo6RcN91sgz6xGsJkUvE6DS4Rtj1J4vlVAmuiIa5NUhRhqFnS6j8V9A
 AWZw63JwTizrglbLk4Z4knfiViT4GOeiKX4sttaJk7cLW7bxwCUddlho1G5Q7q0I
 PFurYevqG1ltcl5oZpD6LhZiqEhndQI3XnkpEvKsmoXy9TSB4KEqaU8Y+cewjq4q
 KCFuxTBhbmatxP9eTGuDhd6uWw5h0EVDGQyMitEcSutIaernGlSsBQ8gZ5n9dWSd
 lP91qFT5iypmCMo9Arf8Fq1YBvOpV6P91eq8YPa4A3sKDfzHn3CCzsSyjUiGK0RM
 Wcl+kNwqYJa7Fwtb7aGgTVhaMkqLzPTI+XYye3FXrXjJ6B0JKpl2QvvDoFhDxop9
 ZPb57QyRgRBtOvofvFz8fWQOr67n+HNvaMbeG1iwGvqm6/MrgdSLsN6OaRh80uAE
 5P0qX7rwTTOfJj5T6dKLxr3KuXKXNrM5AAIG0MjOMsha232+XUAZvofYNmqx7BMi
 juJvqZc9/GXrcXqdPTYDyBs4UXDkwHsKdr744ooZ64VNiIYFs6eTvXp7V0XuajYH
 ExLrEEjhO2UGPM5N9R9jw9AMsEhJstexgylHQsiiADtdi+jY4LKa/NZAJSJQQC+C
 QQyE3Q7ZCpzRPiGPkkpIY/D7IRoIHL2H+LhbXV/K3oMGdbA7hS4=
 =XnG4
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "A lot of little fixes, bigger ones include:

   - bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
     changes, now fixed along with another lost wakeup

   - fragmentation LRU fixes; fsck now repairs successfully (this is the
     data structure copygc uses); along with some nice simplification.

   - Rework logged op error handling, so that if logged op replay errors
     (due to another filesystem error) we delete the logged op instead
     of going into an infinite loop)

   - Various small filesystem connectivitity repair fixes"

* tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs:
  bcachefs: Rework logged op error handling
  bcachefs: Add warn param to subvol_get_snapshot, peek_inode
  bcachefs: Kill snapshot arg to fsck_write_inode()
  bcachefs: Check for unlinked, non-empty dirs in check_inode()
  bcachefs: Check for unlinked inodes with dirents
  bcachefs: Check for directories with no backpointers
  bcachefs: Kill alloc_v4.fragmentation_lru
  bcachefs: minor lru fsck fixes
  bcachefs: Mark more errors AUTOFIX
  bcachefs: Make sure we print error that causes fsck to bail out
  bcachefs: bkey errors are only AUTOFIX during read
  bcachefs: Create lost+found in correct snapshot
  bcachefs: Fix reattach_inode()
  bcachefs: Add missing wakeup to bch2_inode_hash_remove()
  bcachefs: Fix trans_commit disk accounting revert
  bcachefs: Fix bch2_inode_is_open() check
  bcachefs: Fix return type of dirent_points_to_inode_nowarn()
  bcachefs: Fix bad shift in bch2_read_flag_list()
2024-10-05 15:18:04 -07:00
Olga Kornievskaia
c88c150a46 nfsd: fix possible badness in FREE_STATEID
When multiple FREE_STATEIDs are sent for the same delegation stateid,
it can lead to a possible either use-after-free or counter refcount
underflow errors.

In nfsd4_free_stateid() under the client lock we find a delegation
stateid, however the code drops the lock before calling nfs4_put_stid(),
that allows another FREE_STATE to find the stateid again. The first one
will proceed to then free the stateid which leads to either
use-after-free or decrementing already zeroed counter.

Fixes: 3f29cc82a8 ("nfsd: split sc_status out of sc_type")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-05 15:44:25 -04:00
Linus Torvalds
fdd0a94dcf Fix some ext4 bug fixes and regressions relating to oneline resize and
fast commits.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmcAYaIACgkQ8vlZVpUN
 gaO8fQf+IOHRpyqVBXd2WwDJ05vBJAbbaKdhaI1r6gFPF1AzyIZ+lpCnEV9f9B2u
 IrDZaigb/CiJxO9mLRLrp0z/O9s6Z/+NjNdVFkoynJo2Crfr5mi07DXR9EQ7XkvT
 WYcT0IXg7xVswOzVosgBOUA76yMLJwtuTPp7YsAmCmp5EksMk1LSk1VzUEJcKRgJ
 BHSWSn2BqHxwMStB5L0eBnYtdaW8l1RLeWlTPdcPeCrD7UrYCXThvxfyz2KdztBz
 YfjcQEZhY6OD/uKdNFE6oQc54kL8RpqGDpF3YGMBixiZ8h99PpGUUz3bB5VLTfco
 WlxQwOD9dofPQ5Yh+s5icY8q67XXow==
 =Cnp+
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix some ext4 bugs and regressions relating to oneline resize and fast
  commits"

* tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix off by one issue in alloc_flex_gd()
  ext4: mark fc as ineligible using an handle in ext4_xattr_set()
  ext4: use handle to mark fc as ineligible in __track_dentry_update()
2024-10-05 10:47:00 -07:00
Kent Overstreet
0f25eb4b60 bcachefs: Rework logged op error handling
Initially it was thought that we just wanted to ignore errors from
logged op replay, but it turns out we do need to catch -EROFS, or we'll
go into an infinite loop.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:32 -04:00
Kent Overstreet
1f73cb4d34 bcachefs: Add warn param to subvol_get_snapshot, peek_inode
These shouldn't always be fatal errors - logged op resume, in
particular, and we want it as a parameter there.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:32 -04:00
Kent Overstreet
72350ee0ea bcachefs: Kill snapshot arg to fsck_write_inode()
It was initially believed that it would be better to be explicit about
the snapshot we're updating when writing inodes in fsck; however, it
turns out that passing around the snapshot separately is more error
prone and we're usually updating the inode in the same snapshow we read
it from.

This is different from normal filesystem paths, where we do the update
in the snapshot of the subvolume we're in.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:32 -04:00
Kent Overstreet
c9306a91c3 bcachefs: Check for unlinked, non-empty dirs in check_inode()
We want to check for this early so it can be reattached if necessary in
check_unreachable_inodes(); better than letting it be deleted and having
the children reattached, losing their filenames.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:32 -04:00
Kent Overstreet
c7da5ee2e5 bcachefs: Check for unlinked inodes with dirents
link count works differently in bcachefs - it's only nonzero for files
with multiple hardlinks, which means we can also avoid checking it
except for files that are known to have hardlinks.

That means we need a few different checks instead; in particular, we
don't want fsck to delet a file that has a dirent pointing to it.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:32 -04:00
Kent Overstreet
1c6051bbd7 bcachefs: Check for directories with no backpointers
It's legal for regular files to have missing backpointers (due to
hardlinks), and fsck should automatically add them, but for directories
this is an error that should be flagged.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:32 -04:00
Kent Overstreet
260af1562e bcachefs: Kill alloc_v4.fragmentation_lru
The fragmentation_lru field hasn't been needed since we reworked the LRU
btrees to use the btree write buffer; previously it was used to resolve
collisions, but the revised LRU btree uses the backpointer (the bucket)
as part of the key.

It should have been deleted at the time of the LRU rework; since it
wasn't, that left places for bugs to hide, in check/repair.

This fixes LRU fsck on a filesystem image helpfully provided by a user
who disappeared before I could get his name for the reported-by.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:32 -04:00
Kent Overstreet
01bf5e3bd2 bcachefs: minor lru fsck fixes
check_lru_key() wasn't using write buffer updates for deleting bad lru
entries - dating from before the lru btree used the btree write buffer.

And when possibly flushing the btree write buffer (to make sure we're
seeing a real inconsistency), we need to be using the modern
bch2_btree_write_buffer_maybe_flush().

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:31 -04:00
Kent Overstreet
1bea714c53 bcachefs: Mark more errors AUTOFIX
Errors are getting marked as AUTOFIX once they've been (re)-tested and
audited.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:31 -04:00
Kent Overstreet
492e24d760 bcachefs: Make sure we print error that causes fsck to bail out
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:31 -04:00
Kent Overstreet
658c82f41e bcachefs: bkey errors are only AUTOFIX during read
Newly generated keys, in the transaction commit path or write path,
should not be AUTOFIX; those indicate bugs that we need to fail fast
for.

Fixes: 5612daafb7 ("bcachefs: Fix fsck warnings from bkey validation")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:31 -04:00
Kent Overstreet
fda7b1ffde bcachefs: Create lost+found in correct snapshot
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:31 -04:00
Kent Overstreet
20826fe6b8 bcachefs: Fix reattach_inode()
Ensure a copy of the lost+found inode exists in the snapshot that we're
reattaching, so that we don't trigger warnings in
lookup_inode_for_snapshot() later.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:31 -04:00
Kent Overstreet
6b63a948a7 bcachefs: Add missing wakeup to bch2_inode_hash_remove()
This fixes two different bugs:

- Looser locking with the rhashtable means we need to recheck if the
  inode is still hashed after prepare_to_wait(), and add a corresponding
  wakeup after removing from the hash table.

- da18ecbf0f ("fs: add i_state helpers") changed the bit waitqueues
  used for inodes, and bcachefs wasn't updated and thus broke; this
  updates bcachefs to the new helper.

Fixes: 112d21fd1a ("bcachefs: switch to rhashtable for vfs inodes hash")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-04 20:25:31 -04:00
Baokun Li
6121258c2b ext4: fix off by one issue in alloc_flex_gd()
Wesley reported an issue:

==================================================================
EXT4-fs (dm-5): resizing filesystem from 7168 to 786432 blocks
------------[ cut here ]------------
kernel BUG at fs/ext4/resize.c:324!
CPU: 9 UID: 0 PID: 3576 Comm: resize2fs Not tainted 6.11.0+ #27
RIP: 0010:ext4_resize_fs+0x1212/0x12d0
Call Trace:
 __ext4_ioctl+0x4e0/0x1800
 ext4_ioctl+0x12/0x20
 __x64_sys_ioctl+0x99/0xd0
 x64_sys_call+0x1206/0x20d0
 do_syscall_64+0x72/0x110
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
==================================================================

While reviewing the patch, Honza found that when adjusting resize_bg in
alloc_flex_gd(), it was possible for flex_gd->resize_bg to be bigger than
flexbg_size.

The reproduction of the problem requires the following:

 o_group = flexbg_size * 2 * n;
 o_size = (o_group + 1) * group_size;
 n_group: [o_group + flexbg_size, o_group + flexbg_size * 2)
 o_size = (n_group + 1) * group_size;

Take n=0,flexbg_size=16 as an example:

              last:15
|o---------------|--------------n-|
o_group:0    resize to      n_group:30

The corresponding reproducer is:

img=test.img
rm -f $img
truncate -s 600M $img
mkfs.ext4 -F $img -b 1024 -G 16 8M
dev=`losetup -f --show $img`
mkdir -p /tmp/test
mount $dev /tmp/test
resize2fs $dev 248M

Delete the problematic plus 1 to fix the issue, and add a WARN_ON_ONCE()
to prevent the issue from happening again.

[ Note: another reproucer which this commit fixes is:

  img=test.img
  rm -f $img
  truncate -s 25MiB $img
  mkfs.ext4 -b 4096 -E nodiscard,lazy_itable_init=0,lazy_journal_init=0 $img
  truncate -s 3GiB $img
  dev=`losetup -f --show $img`
  mkdir -p /tmp/test
  mount $dev /tmp/test
  resize2fs $dev 3G
  umount $dev
  losetup -d $dev

  -- TYT ]

Reported-by: Wesley Hershberger <wesley.hershberger@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2081231
Reported-by: Stéphane Graber <stgraber@stgraber.org>
Closes: https://lore.kernel.org/all/20240925143325.518508-1-aleksandr.mikhalitsyn@canonical.com/
Tested-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
Tested-by: Eric Sandeen <sandeen@redhat.com>
Fixes: 665d3e0af4 ("ext4: reduce unnecessary memory allocation in alloc_flex_gd()")
Cc: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240927133329.1015041-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-04 17:36:28 -04:00
Luis Henriques (SUSE)
04e6ce8f06 ext4: mark fc as ineligible using an handle in ext4_xattr_set()
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible.  This patch moves the call to this function so that an handle
can be used.  If a transaction fails to start, then there's not point in
trying to mark the filesystem as ineligible, and an error will eventually be
returned to user-space.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240923104909.18342-3-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2024-10-04 17:36:09 -04:00
Luis Henriques (SUSE)
faab35a037 ext4: use handle to mark fc as ineligible in __track_dentry_update()
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible.  This patch fixes the calls to this function in
__track_dentry_update() by adding an extra parameter to the callback used in
ext4_fc_track_template().

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240923104909.18342-2-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2024-10-04 17:35:54 -04:00
Mike Snitzer
76f5af9952 nfsd/localio: fix nfsd_file tracepoints to handle NULL rqstp
Otherwise nfsd_file_acquire, nfsd_file_insert_err, and
nfsd_file_cons_err will hit a NULL pointer when they are enabled and
LOCALIO used.

Example trace output (note xid is 0x0 and LOCALIO flag set):
 nfsd_file_acquire: xid=0x0 inode=0000000069a1b2e7
 may_flags=WRITE|LOCALIO ref=1 nf_flags=HASHED|GC nf_may=WRITE
 nf_file=0000000070123234 status=0

Fixes: c63f0e48fe ("nfsd: add nfsd_file_acquire_local()")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-04 14:52:04 -04:00
Linus Torvalds
e02f08e217 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmb//Q8ACgkQnJ2qBz9k
 QNlBoQf8CwTW4/BTbt8Pq3RIKHYH8HFyFurwxrlnKXoB/4E4QNAfd/iDbwxqkILJ
 mbKHGVyk96St83PH5H9b2MYnwQ6PP6WEHIe7RZjxOrfrrLTSwuuOwdNR9RZbRWPd
 rLQ1/nbnLiTWPR29RLZVbhFAhDiw1hyag8YKPklEw+7SQ/wSxFUTJ8DTQ8TZYq8f
 O8SoTTJACulAP1CfoNwPlhm472JBe5jJP8WygaYeKg5wky6RPQYn8J+JSUmHs+rZ
 qNY+RuxLZ/HauXA+vBgae34G1GAdWc3sqUEU2TrGBuy/x28K/Di84+eEdQXA74sr
 +8+ACt6RRlm/w+FrjvgxJ0JUV/OAMA==
 =CoCV
 -----END PGP SIGNATURE-----

Merge tag 'fsnotify_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fixes from Jan Kara:
 "Fixes for an inotify deadlock and a data race in fsnotify"

* tag 'fsnotify_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  inotify: Fix possible deadlock in fsnotify_destroy_mark
  fsnotify: Avoid data race between fsnotify_recalc_mask() and fsnotify_object_watched()
2024-10-04 10:31:59 -07:00
Linus Torvalds
4770119d63 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmb//cYACgkQnJ2qBz9k
 QNmrxAgApRJqSQ0Koilm4Mtz+6k7CAM6MR/+JKRQLoJwl27cA4syw9REBPPdcNJP
 9Gbvlcenf8ISrtZStZehw/eyOuNvE9uBzjjmRcMc3+0esDfgFqfteDCanJ/MQDdZ
 NX3OB8pSMxwTtcf9iXJUyD/hvFIq58H378r8eOKAwI9S673r2zDZ5qGmske3S6DM
 gyMwfWepzPigKOYhwt1Frih4GmjOaD3637bSoGBfRdhT1PvAH9bJaKCFtwucNP7Y
 MG+on/JLrpPgegzbWQe2u4WTzQihkRDnEfhC+52Erdhb+Z4rv3IJUarVO2iIKzwF
 3CS7q3NyArbYN3QbvOiIKR+rORNnZg==
 =gbY2
 -----END PGP SIGNATURE-----

Merge tag 'fs_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull UDF fixes from Jan Kara:
 "A couple of UDF error handling fixes for issues spotted by syzbot"

* tag 'fs_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: fix uninit-value use in udf_get_fileshortad
  udf: refactor inode_bmap() to handle error
  udf: refactor udf_next_aext() to handle error
  udf: refactor udf_current_aext() to handle error
2024-10-04 10:24:06 -07:00
Linus Torvalds
a3a37691e6 A fix from Patrick for a variety of CephFS lockup scenarios caused by
a regression in cap handling which sneaked in through the netfs helper
 library in 5.18 (marked for stable) and an unrelated one-line cleanup.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmcACf8THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi3DiCACDP9ZubGLZ2GKv3fEyXEURnOyrSBAd
 M4Ci3cc31W0Q1rWtQndGxW9QF/9smcp4kN/xgelWfSZc6FJ1HLR1RHUPjMJpBPx2
 dQs+RaaBTIe0coVaMmiDnBQ1wnfuNhEZH8opnLonjHygANnoR2aYZpyIxpgYD7eZ
 KHfk1F/2rmE+QzYGB+tE7sCpRs3bhbtL2xa/2yQJyw+EWfgyJUjPW1cZe0liiWHi
 dXTMxJqfZLLUW2otLdGnaaZS+7yF3ItcoAWh6cUoiZ9cpqDu43B0RGXOV7Kegui1
 2e8W0sK3QIIYbUKoHo/4VPqrmN4puSjeNteumDvpZE4SgG8BfKWzQnmh
 =3OKU
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix from Patrick for a variety of CephFS lockup scenarios caused by
  a regression in cap handling which sneaked in through the netfs helper
  library in 5.18 (marked for stable) and an unrelated one-line cleanup"

* tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client:
  ceph: fix cap ref leak via netfs init_request
  ceph: use struct_size() helper in __ceph_pool_perm_get()
2024-10-04 10:10:23 -07:00
Linus Torvalds
79eb2c07af for-6.12-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmb/9agACgkQxWXV+ddt
 WDtKFA/5AW88osk/1k/NVhOvOa0xPr5XyDLq1n8Gaxfy8uHlHAc8wdsvJzCDMS0M
 qUOD/tOPhRI0HGXPiKD767erwbyXiZAcCTkSd8x5jlXy1hVjUHQSKO//JxD0vtAZ
 jOscoUA1wJJutopCXcppnoUUFE2753edEg0w2EtUXMpfqivqOmMCR+1ZtKkfaNJo
 oRuZCq3Oi8hu7Wsvmh4Etq/9MvGM+xovXAMAji6Op8nsP1jJlzWztpEUogLOQH2S
 IhDFFxP9shBV9JjV+HSyXcAYr8VArH6HtYjapR9oajCH0pvSjLYQQPq/qQ0//8Hb
 SHr4YP6RBbpEIvvaQeA7vwsckBHNBOrSAbEgRQ2+zdmiwha6SRIEVyXYh5LUwcYT
 WnVozbk7ZX9rD1jOVhvgouFG6vUz8A6/qt3BD028bVcyMvBXW4gsEduCMVlFGmlN
 D6+hNY6J08j4HUEGnPk7fAYi/lk5OEK1p5yarUgsOQ3GqWWS0ywkrfbmbWwyC+Ff
 AxggFTl9YodU5RMs7EU2GeHkLU6LnXgevk6FFm0JzsLtT/BbEP7pj6tOot4Msl1e
 2ovqFiSbuPNg5Wr70ZBRO9LDIAYtTZy1UrVR/YCSLzm+wZsXMelDIHMQfE7+Yodp
 O5Cud23AanRTwjErvNl3X4rhut8rrI1FsR89gDyKK1EPG+mwofo=
 =yslr
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - in incremental send, fix invalid clone operation for file that got
   its size decreased

 - fix __counted_by() annotation of send path cache entries, we do not
   store the terminating NUL

 - fix a longstanding bug in relocation (and quite hard to hit by
   chance), drop back reference cache that can get out of sync after
   transaction commit

 - wait for fixup worker kthread before finishing umount

 - add missing raid-stripe-tree extent for NOCOW files, zoned mode
   cannot have NOCOW files but RST is meant to be a standalone feature

 - handle transaction start error during relocation, avoid potential
   NULL pointer dereference of relocation control structure (reported by
   syzbot)

 - disable module-wide rate limiting of debug level messages

 - minor fix to tracepoint definition (reported by checkpatch.pl)

* tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: disable rate limiting when debug enabled
  btrfs: wait for fixup workers before stopping cleaner kthread during umount
  btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
  btrfs: send: fix invalid clone operation for file that got its size decreased
  btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class
  btrfs: drop the backref cache during relocation if we commit
  btrfs: also add stripe entries for NOCOW writes
  btrfs: send: fix buffer overflow detection when copying path to cache entry
2024-10-04 10:05:13 -07:00
Linus Torvalds
b7a838ee7e nine smb3 client fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmb+z50ACgkQiiy9cAdy
 T1FS9Av+N4943ciID42gNZEL/33t+NuNymFHDeC1H4txMNEi6MZXb+H4HFQMCF22
 ZWdb1IIUJ7dUjfX68hD+sIs7o+QsCUIGriyNLZvlg7xo7NIBXED2UVmWfLeiQZyh
 DVhCC62vVCYCvvyrlLMBKuTIM/mNzRe7JUTpFBN+wiiakLAgf5G/9ifoFyC6cmA3
 854V3z5644WM2mPOsLxWr0CkV+pELRWwgvWeMcHXnjjrljjIi6jpCX2jthkEJgkR
 6rcFYwfnS74VqfzjZl7sMD4Oc/blaTuNjj0iwZz5ThJMUIN6p/RzUvSN+2qySPqJ
 7ENhcElJwr9lslDfL7X412bLRAhma+vduofHW9IPNvD3Q3okXQDBZ/FRHx5q+mZR
 ziBFCFT/OrotSSy8MZGOl7tD1bs29B1R98hO91qGUzo/rhzNzM93DiGAgYDxVLRE
 /cw08lQzkFn468MIKHrXV3qKLqAN4Wp8h2tZ80oNAMjc3MwELoKzE5pljQc8vhVI
 Oh3qN8IE
 =Ky9w
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - statfs fix (e.g. when limited access to root directory of share)

 - special file handling fixes: fix packet validation to avoid buffer
   overflow for reparse points, fixes for symlink path parsing (one for
   reparse points, and one for SFU use case), and fix for cleanup after
   failed SET_REPARSE operation.

 - fix for SMB2.1 signing bug introduced by recent patch to NFS symlink
   path, and NFS reparse point validation

 - comment cleanup

* tag 'v6.12-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Do not convert delimiter when parsing NFS-style symlinks
  cifs: Validate content of NFS reparse point buffer
  cifs: Fix buffer overflow when parsing NFS reparse points
  smb: client: Correct typos in multiple comments across various files
  smb: client: use actual path when queryfs
  cifs: Remove intermediate object of failed create reparse call
  Revert "smb: client: make SHA-512 TFM ephemeral"
  smb: Update comments about some reparse point tags
  cifs: Check for UTF-16 null codepoint in SFU symlink target location
2024-10-04 09:56:05 -07:00
Linus Torvalds
6cca119588 close_range(): fix the logics in descriptor table trimming
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv90qwAKCRBZ7Krx/gZQ
 629DAP9NT9/ndByazKrP03nQ/ITNhYVd0cby2iPcXZiHGj/fsQD7BF4yUUiBy7EO
 seIY8rGxc8S5TDJAG4HYa7t1Nuksnw4=
 =oji5
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull close_range() fix from Al Viro:
 "Fix the logic in descriptor table trimming"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  close_range(): fix the logics in descriptor table trimming
2024-10-04 09:46:16 -07:00
Linus Torvalds
263a25de5b fix ufs_rename() braino introduced this cycle
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv8vaAAKCRBZ7Krx/gZQ
 61tVAP9IpmZX+sJ0sVI6kesjlZsTBjMfw6VuWB7sJuFm4KOWagD+NHYbma8NSIGX
 cyGKXaePWYUFww7Y0nh3Q+Dz0VXk9Q8=
 =ccvV
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes.ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull ufs fix from Al Viro:
 "Fix ufs_rename() braino introduced this cycle.

  The 'folio_release_kmap(dir_folio, new_dir)' in ufs_rename() part of
  folio conversion should've been getting a pointer to ufs directory
  entry within the page, rather than a pointer to directory struct
  inode..."

* tag 'pull-fixes.ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs_rename(): fix bogus argument of folio_release_kmap()
2024-10-03 17:03:18 -07:00
Mike Snitzer
009b15b574 nfs_common: fix Kconfig for NFS_COMMON_LOCALIO_SUPPORT
The 'default n' that was in NFS_COMMON_LOCALIO_SUPPORT caused these
extra defaults to be missed:
        default y if NFSD=y || NFS_FS=y
	default m if NFSD=m && NFS_FS=m

Remove the 'default n' for NFS_COMMON_LOCALIO_SUPPORT so that the
correct tristate is selected based on how NFSD and NFS_FS are
configured.  This fixes the reported case where NFS_FS=y but
NFS_COMMON_LOCALIO_SUPPORT=m, it is now correctly set to =y.

In addition, add extra 'depends on NFS_LOCALIO' to
NFS_COMMON_LOCALIO_SUPPORT so that if NFS_LOCALIO isn't set then
NFS_COMMON_LOCALIO_SUPPORT will not be either.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410031944.hMCFY9BO-lkp@intel.com/
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-03 16:19:51 -04:00
Mike Snitzer
65f2a5c366 nfs_common: fix race in NFS calls to nfsd_file_put_local() and nfsd_serv_put()
Add nfs_to_nfsd_file_put_local() interface to fix race with nfsd
module unload.  Similarly, use RCU around nfs_open_local_fh()'s error
path call to nfs_to->nfsd_serv_put().  Holding RCU ensures that NFS
will safely _call and return_ from its nfs_to calls into the NFSD
functions nfsd_file_put_local() and nfsd_serv_put().

Otherwise, if RCU isn't used then there is a narrow window when NFS's
reference for the nfsd_file and nfsd_serv are dropped and the NFSD
module could be unloaded, which could result in a crash from the
return instruction for either nfs_to->nfsd_file_put_local() or
nfs_to->nfsd_serv_put().

Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-03 16:19:43 -04:00
Yanjun Zhang
a848c29e34 NFSv4: Prevent NULL-pointer dereference in nfs42_complete_copies()
On the node of an NFS client, some files saved in the mountpoint of the
NFS server were copied to another location of the same NFS server.
Accidentally, the nfs42_complete_copies() got a NULL-pointer dereference
crash with the following syslog:

[232064.838881] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232064.839360] NFSv4: state recovery failed for open file nfs/pvc-12b5200d-cd0f-46a3-b9f0-af8f4fe0ef64.qcow2, error = -116
[232066.588183] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000058
[232066.588586] Mem abort info:
[232066.588701]   ESR = 0x0000000096000007
[232066.588862]   EC = 0x25: DABT (current EL), IL = 32 bits
[232066.589084]   SET = 0, FnV = 0
[232066.589216]   EA = 0, S1PTW = 0
[232066.589340]   FSC = 0x07: level 3 translation fault
[232066.589559] Data abort info:
[232066.589683]   ISV = 0, ISS = 0x00000007
[232066.589842]   CM = 0, WnR = 0
[232066.589967] user pgtable: 64k pages, 48-bit VAs, pgdp=00002000956ff400
[232066.590231] [0000000000000058] pgd=08001100ae100003, p4d=08001100ae100003, pud=08001100ae100003, pmd=08001100b3c00003, pte=0000000000000000
[232066.590757] Internal error: Oops: 96000007 [#1] SMP
[232066.590958] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm vhost_net vhost vhost_iotlb tap tun ipt_rpfilter xt_multiport ip_set_hash_ip ip_set_hash_net xfrm_interface xfrm6_tunnel tunnel4 tunnel6 esp4 ah4 wireguard libcurve25519_generic veth xt_addrtype xt_set nf_conntrack_netlink ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_bitmap_port ip_set_hash_ipport dummy ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs iptable_filter sch_ingress nfnetlink_cttimeout vport_gre ip_gre ip_tunnel gre vport_geneve geneve vport_vxlan vxlan ip6_udp_tunnel udp_tunnel openvswitch nf_conncount dm_round_robin dm_service_time dm_multipath xt_nat xt_MASQUERADE nft_chain_nat nf_nat xt_mark xt_conntrack xt_comment nft_compat nft_counter nf_tables nfnetlink ocfs2 ocfs2_nodemanager ocfs2_stackglue iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_ssif nbd overlay 8021q garp mrp bonding tls rfkill sunrpc ext4 mbcache jbd2
[232066.591052]  vfat fat cas_cache cas_disk ses enclosure scsi_transport_sas sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler ip_tables vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio dm_mirror dm_region_hash dm_log dm_mod nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc fuse xfs libcrc32c ast drm_vram_helper qla2xxx drm_kms_helper syscopyarea crct10dif_ce sysfillrect ghash_ce sysimgblt sha2_ce fb_sys_fops cec sha256_arm64 sha1_ce drm_ttm_helper ttm nvme_fc igb sbsa_gwdt nvme_fabrics drm nvme_core i2c_algo_bit i40e scsi_transport_fc megaraid_sas aes_neon_bs
[232066.596953] CPU: 6 PID: 4124696 Comm: 10.253.166.125- Kdump: loaded Not tainted 5.15.131-9.cl9_ocfs2.aarch64 #1
[232066.597356] Hardware name: Great Wall .\x93\x8e...RF6260 V5/GWMSSE2GL1T, BIOS T656FBE_V3.0.18 2024-01-06
[232066.597721] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[232066.598034] pc : nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.598327] lr : nfs4_reclaim_open_state+0x12c/0x800 [nfsv4]
[232066.598595] sp : ffff8000f568fc70
[232066.598731] x29: ffff8000f568fc70 x28: 0000000000001000 x27: ffff21003db33000
[232066.599030] x26: ffff800005521ae0 x25: ffff0100f98fa3f0 x24: 0000000000000001
[232066.599319] x23: ffff800009920008 x22: ffff21003db33040 x21: ffff21003db33050
[232066.599628] x20: ffff410172fe9e40 x19: ffff410172fe9e00 x18: 0000000000000000
[232066.599914] x17: 0000000000000000 x16: 0000000000000004 x15: 0000000000000000
[232066.600195] x14: 0000000000000000 x13: ffff800008e685a8 x12: 00000000eac0c6e6
[232066.600498] x11: 0000000000000000 x10: 0000000000000008 x9 : ffff8000054e5828
[232066.600784] x8 : 00000000ffffffbf x7 : 0000000000000001 x6 : 000000000a9eb14a
[232066.601062] x5 : 0000000000000000 x4 : ffff70ff8a14a800 x3 : 0000000000000058
[232066.601348] x2 : 0000000000000001 x1 : 54dce46366daa6c6 x0 : 0000000000000000
[232066.601636] Call trace:
[232066.601749]  nfs4_reclaim_open_state+0x220/0x800 [nfsv4]
[232066.601998]  nfs4_do_reclaim+0x1b8/0x28c [nfsv4]
[232066.602218]  nfs4_state_manager+0x928/0x10f0 [nfsv4]
[232066.602455]  nfs4_run_state_manager+0x78/0x1b0 [nfsv4]
[232066.602690]  kthread+0x110/0x114
[232066.602830]  ret_from_fork+0x10/0x20
[232066.602985] Code: 1400000d f9403f20 f9402e61 91016003 (f9402c00)
[232066.603284] SMP: stopping secondary CPUs
[232066.606936] Starting crashdump kernel...
[232066.607146] Bye!

Analysing the vmcore, we know that nfs4_copy_state listed by destination
nfs_server->ss_copies was added by the field copies in handle_async_copy(),
and we found a waiting copy process with the stack as:
PID: 3511963  TASK: ffff710028b47e00  CPU: 0   COMMAND: "cp"
 #0 [ffff8001116ef740] __switch_to at ffff8000081b92f4
 #1 [ffff8001116ef760] __schedule at ffff800008dd0650
 #2 [ffff8001116ef7c0] schedule at ffff800008dd0a00
 #3 [ffff8001116ef7e0] schedule_timeout at ffff800008dd6aa0
 #4 [ffff8001116ef860] __wait_for_common at ffff800008dd166c
 #5 [ffff8001116ef8e0] wait_for_completion_interruptible at ffff800008dd1898
 #6 [ffff8001116ef8f0] handle_async_copy at ffff8000055142f4 [nfsv4]
 #7 [ffff8001116ef970] _nfs42_proc_copy at ffff8000055147c8 [nfsv4]
 #8 [ffff8001116efa80] nfs42_proc_copy at ffff800005514cf0 [nfsv4]
 #9 [ffff8001116efc50] __nfs4_copy_file_range.constprop.0 at ffff8000054ed694 [nfsv4]

The NULL-pointer dereference was due to nfs42_complete_copies() listed
the nfs_server->ss_copies by the field ss_copies of nfs4_copy_state.
So the nfs4_copy_state address ffff0100f98fa3f0 was offset by 0x10 and
the data accessed through this pointer was also incorrect. Generally,
the ordered list nfs4_state_owner->so_states indicate open(O_RDWR) or
open(O_WRITE) states are reclaimed firstly by nfs4_reclaim_open_state().
When destination state reclaim is failed with NFS_STATE_RECOVERY_FAILED
and copies are not deleted in nfs_server->ss_copies, the source state
may be passed to the nfs42_complete_copies() process earlier, resulting
in this crash scene finally. To solve this issue, we add a list_head
nfs_server->ss_src_copies for a server-to-server copy specially.

Fixes: 0e65a32c8a ("NFS: handle source server reboot")
Signed-off-by: Yanjun Zhang <zhangyanjun@cestc.cn>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-03 16:19:13 -04:00
Dan Carpenter
6dbf1f341b SUNRPC: Fix integer overflow in decode_rc_list()
The math in "rc_list->rcl_nrefcalls * 2 * sizeof(uint32_t)" could have an
integer overflow.  Add bounds checking on rc_list->rcl_nrefcalls to fix
that.

Fixes: 4aece6a19c ("nfs41: cb_sequence xdr implementation")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-10-03 16:19:13 -04:00
Pali Rohár
d3a49f6091 cifs: Do not convert delimiter when parsing NFS-style symlinks
NFS-style symlinks have target location always stored in NFS/UNIX form
where backslash means the real UNIX backslash and not the SMB path
separator.

So do not mangle slash and backslash content of NFS-style symlink during
readlink() syscall as it is already in the correct Linux form.

This fixes interoperability of NFS-style symlinks with backslashes created
by Linux NFS3 client throw Windows NFS server and retrieved by Linux SMB
client throw Windows SMB server, where both Windows servers exports the
same directory.

Fixes: d5ecebc490 ("smb3: Allow query of symlinks stored as reparse points")
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-03 12:06:36 -05:00
Pali Rohár
556ac52bb1 cifs: Validate content of NFS reparse point buffer
Symlink target location stored in DataBuffer is encoded in UTF-16. So check
that symlink DataBuffer length is non-zero and even number. And check that
DataBuffer does not contain UTF-16 null codepoint because Linux cannot
process symlink with null byte.

DataBuffer for char and block devices is 8 bytes long as it contains two
32-bit numbers (major and minor). Add check for this.

DataBuffer buffer for sockets and fifos zero-length. Add checks for this.

Signed-off-by: Pali Rohár <pali@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-03 12:06:13 -05:00
Pali Rohár
e2a8910af0 cifs: Fix buffer overflow when parsing NFS reparse points
ReparseDataLength is sum of the InodeType size and DataBuffer size.
So to get DataBuffer size it is needed to subtract InodeType's size from
ReparseDataLength.

Function cifs_strndup_from_utf16() is currentlly accessing buf->DataBuffer
at position after the end of the buffer because it does not subtract
InodeType size from the length. Fix this problem and correctly subtract
variable len.

Member InodeType is present only when reparse buffer is large enough. Check
for ReparseDataLength before accessing InodeType to prevent another invalid
memory access.

Major and minor rdev values are present also only when reparse buffer is
large enough. Check for reparse buffer size before calling reparse_mkdev().

Fixes: d5ecebc490 ("smb3: Allow query of symlinks stored as reparse points")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-03 12:05:12 -05:00
Linus Torvalds
9c02404b52 three small ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmb+HqAACgkQiiy9cAdy
 T1EQbgv/aybFhucbglNe1QIjQ12DqBUUJoRRbV0xLX2TmvbpxxBuehbD11pkTqeH
 c7zvCQE+Ank3PfSGvFjM77iY++AuhHtvDg5ugMtdZEUzqNtEdT6a1fnVcAsqmuhM
 5ROER0IheSwSbIha6FJwgodwKAeJuPmmEmbU9e0PZ4ZZLqetAnuhpKNOEurMMxoa
 G0K7hknyuG9/gOiyBfmVTysuorA9jP1IgWjnwBOANKJo+IbQdifaLd535XWaY/7+
 sabRy+0QAmMejcrP6XHT5KUUjw63YODmhnFKo0MRaG3GODg4RO/7JRJLdD9FMDCY
 DyL5at0Ro33zhzif7i0vFUn7VhvkWuivfXQBLL+ALk2xhHw+5Yk/zqce84fTbzQj
 KOeFeevG5B2P0uxGbShjxVqxbaPUgIKD7f1N6SmwkAnCE3+zXcGwRTENKFR4C5mF
 iMFd22hYPUMD3ED/yR6+1fEtLpGtHof9erHH99x1bRU4fL+Am+C6fHTy+klaFMPP
 K3xXe/1i
 =xo5T
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc1-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

 - small cleanup patches leveraging struct size to improve access bounds checking

* tag 'v6.12-rc1-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: Use struct_size() to improve smb_direct_rdma_xmit()
  ksmbd: Annotate struct copychunk_ioctl_req with __counted_by_le()
  ksmbd: Use struct_size() to improve get_file_alternate_info()
2024-10-03 09:38:16 -07:00
Linus Torvalds
20c2474fa5 vfs-6.12-rc2.fixes.2
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZv5Y3gAKCRCRxhvAZXjc
 ojFPAP45kz5JgVKFn8iZmwfjPa7qbCa11gEzmx0SbUt3zZ3mJAD/fL9k9KaNU+qA
 LIcZW5BJn/p5fumUAw8/fKoz4ajCWQk=
 =LIz1
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12-rc2.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "vfs:

   - Ensure that iter_folioq_get_pages() advances to the next slot
     otherwise it will end up using the same folio with an out-of-bound
     offset.

  iomap:

   - Dont unshare delalloc extents which can't be reflinked, and thus
     can't be shared.

   - Constrain the file range passed to iomap_file_unshare() directly in
     iomap instead of requiring the callers to do it.

  netfs:

   - Use folioq_count instead of folioq_nr_slot to prevent an
     unitialized value warning in netfs_clear_buffer().

   - Fix missing wakeup after issuing writes by scheduling the write
     collector only if all the subrequest queues are empty and thus no
     writes are pending.

   - Fix two minor documentation bugs"

* tag 'vfs-6.12-rc2.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  iomap: constrain the file range passed to iomap_file_unshare
  iomap: don't bother unsharing delalloc extents
  netfs: Fix missing wakeup after issuing writes
  Documentation: add missing folio_queue entry
  folio_queue: fix documentation
  netfs: Fix a KMSAN uninit-value error in netfs_clear_buffer
  iov_iter: fix advancing slot in iter_folioq_get_pages()
2024-10-03 09:22:50 -07:00
Darrick J. Wong
a311a08a42
iomap: constrain the file range passed to iomap_file_unshare
File contents can only be shared (i.e. reflinked) below EOF, so it makes
no sense to try to unshare ranges beyond EOF.  Constrain the file range
parameters here so that we don't have to do that in the callers.

Fixes: 5f4e5752a8 ("fs: add iomap_file_dirty")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20241002150213.GC21853@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-03 10:22:28 +02:00
Darrick J. Wong
f7a4874d97
iomap: don't bother unsharing delalloc extents
If unshare encounters a delalloc reservation in the srcmap, that means
that the file range isn't shared because delalloc reservations cannot be
reflinked.  Therefore, don't try to unshare them.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20241002150040.GB21853@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-03 10:22:25 +02:00
Patrick Donnelly
ccda9910d8 ceph: fix cap ref leak via netfs init_request
Log recovered from a user's cluster:

    <7>[ 5413.970692] ceph:  get_cap_refs 00000000958c114b ret 1 got Fr
    <7>[ 5413.970695] ceph:  start_read 00000000958c114b, no cache cap
    ...
    <7>[ 5473.934609] ceph:   my wanted = Fr, used = Fr, dirty -
    <7>[ 5473.934616] ceph:  revocation: pAsLsXsFr -> pAsLsXs (revoking Fr)
    <7>[ 5473.934632] ceph:  __ceph_caps_issued 00000000958c114b cap 00000000f7784259 issued pAsLsXs
    <7>[ 5473.934638] ceph:  check_caps 10000000e68.fffffffffffffffe file_want - used Fr dirty - flushing - issued pAsLsXs revoking Fr retain pAsLsXsFsr  AUTHONLY NOINVAL FLUSH_FORCE

The MDS subsequently complains that the kernel client is late releasing
caps.

Approximately, a series of changes to this code by commits 4987005600
("ceph: convert ceph_readpages to ceph_readahead"), 2de1604173
("netfs: Change ->init_request() to return an error code") and
a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
resulted in subtle resource cleanup to be missed. The main culprit is
the change in error handling in 2de1604173 which meant that a failure
in init_request() would no longer cause cleanup to be called. That
would prevent the ceph_put_cap_refs() call which would cleanup the
leaked cap ref.

Cc: stable@vger.kernel.org
Fixes: a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
Link: https://tracker.ceph.com/issues/67008
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-10-03 09:31:08 +02:00
Thorsten Blum
7264745d55 ceph: use struct_size() helper in __ceph_pool_perm_get()
Use struct_size() to calculate the number of bytes to be allocated.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2024-10-03 09:31:08 +02:00
Kent Overstreet
d28786606a bcachefs: Fix trans_commit disk accounting revert
We only are applying JSET_ENTRY_TYPE_write_buffer_keys, revert path was
missed.

Fixes: a3581ca35d ("bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-02 21:37:42 -04:00
Kent Overstreet
3b1425a4eb bcachefs: Fix bch2_inode_is_open() check
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-02 21:31:31 -04:00
Kent Overstreet
abaa6d4f6a bcachefs: Fix return type of dirent_points_to_inode_nowarn()
we're returning an error code now, not a bool

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-02 21:30:55 -04:00
Linus Torvalds
7ec462100e Getting rid of asm/unaligned.h includes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv3NAgAKCRBZ7Krx/gZQ
 68kbAP0YzQxUgl0/o7Soda8XwKSPZTM9ls6kRk1UHTTG/i4ZigEA/G+i/mBQctL0
 AB911kK8mxfXppfOXzstFBjoJSqiigQ=
 =IE7D
 -----END PGP SIGNATURE-----

Merge tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull generic unaligned.h cleanups from Al Viro:
 "Get rid of architecture-specific <asm/unaligned.h> includes, replacing
  them with a single generic <linux/unaligned.h> header file.

  It's the second largest (after asm/io.h) class of asm/* includes, and
  all but two architectures actually end up using exact same file.

  Massage the remaining two (arc and parisc) to do the same and just
  move the thing to from asm-generic/unaligned.h to linux/unaligned.h"

[ This is one of those things that we're better off doing outside the
  merge window, and would only cause extra conflict noise if it was in
  linux-next for the next release due to all the trivial #include line
  updates.  Rip off the band-aid.   - Linus ]

* tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  move asm/unaligned.h to linux/unaligned.h
  arc: get rid of private asm/unaligned.h
  parisc: get rid of private asm/unaligned.h
2024-10-02 16:42:28 -07:00
Shen Lichuan
e9f49feefb smb: client: Correct typos in multiple comments across various files
Fixed some confusing typos that were currently identified witch codespell,
the details are as follows:

-in the code comments:
fs/smb/client/cifsacl.h:58: inheritence ==> inheritance
fs/smb/client/cifsencrypt.c:242: origiginal ==> original
fs/smb/client/cifsfs.c:164: referece ==> reference
fs/smb/client/cifsfs.c:292: ned ==> need
fs/smb/client/cifsglob.h:779: initital ==> initial
fs/smb/client/cifspdu.h:784: altetnative ==> alternative
fs/smb/client/cifspdu.h:2409: conrol ==> control
fs/smb/client/cifssmb.c:1218: Expirement ==> Experiment
fs/smb/client/cifssmb.c:3021: conver ==> convert
fs/smb/client/cifssmb.c:3998: asterik ==> asterisk
fs/smb/client/file.c:2505: useable ==> usable
fs/smb/client/fs_context.h:263: timemout ==> timeout
fs/smb/client/misc.c:257: responsbility ==> responsibility
fs/smb/client/netmisc.c:1006: divisable ==> divisible
fs/smb/client/readdir.c:556: endianess ==> endianness
fs/smb/client/readdir.c:818: bu ==> by
fs/smb/client/smb2ops.c:2180: snaphots ==> snapshots
fs/smb/client/smb2ops.c:3586: otions ==> options
fs/smb/client/smb2pdu.c:2979: timestaps ==> timestamps
fs/smb/client/smb2pdu.c:4574: memmory ==> memory
fs/smb/client/smb2transport.c:699: origiginal ==> original
fs/smb/client/smbdirect.c:222: happenes ==> happens
fs/smb/client/smbdirect.c:1347: registartions ==> registrations
fs/smb/client/smbdirect.h:114: accoutning ==> accounting

Signed-off-by: Shen Lichuan <shenlichuan@vivo.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-02 17:52:24 -05:00
Al Viro
5f60d5f6bb move asm/unaligned.h to linux/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.

auto-generated by the following:

for i in `git grep -l -w asm/unaligned.h`; do
	sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
	sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
2024-10-02 17:23:23 -04:00
Linus Torvalds
27af290f16 zonefs fixes for 6.12-rc2
- Add support for the FS_IOC_GETFSSYSFSPATH ioctl.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZvu6SQAKCRDdoc3SxdoY
 dk/OAQD0wLQqxdrRFe3sbcgVnqKrGxBa59Pj2jQwy70bM9kN4QEAowFKia7PxiFH
 JZ20Z1a6om/ksjYouKuYc3QR+yRFkgE=
 =vZIR
 -----END PGP SIGNATURE-----

Merge tag 'zonefs-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs

Pull zonefs update from Damien Le Moal:

 - Add support for the FS_IOC_GETFSSYSFSPATH ioctl

* tag 'zonefs-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
  zonefs: add support for FS_IOC_GETFSSYSFSPATH
2024-10-02 12:02:15 -07:00
David Howells
1ca4169c39
netfs: Fix missing wakeup after issuing writes
After dividing up a proposed write into subrequests, netfslib sets
NETFS_RREQ_ALL_QUEUED to indicate to the collector that it can move on to
the final cleanup once it has emptied the subrequest queues.

Now, whilst the collector will normally end up running at least once after
this bit is set just because it takes a while to process all the write
subrequests before the collector runs out of subrequests, there exists the
possibility that the issuing thread will be forced to sleep and the
collector thread will clean up all the subrequests before ALL_QUEUED gets
set.

In such a case, the collector thread will not get triggered again and will
never clear NETFS_RREQ_IN_PROGRESS thus leaving a request uncompleted and
causing a potential futute hang.

Fix this by scheduling the write collector if all the subrequest queues are
empty (and thus no writes pending issuance).

Note that we'd do this ideally before queuing the subrequest, but in the
case of buffered writeback, at least, we can't find out that we've run out
of folios until after we've called writeback_iter() and it has returned
NULL - at which point we might not actually have any subrequests still
under construction.

Fixes: 288ace2f57 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3317784.1727880350@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-02 16:56:15 +02:00
Lizhi Xu
cad3f4a22c inotify: Fix possible deadlock in fsnotify_destroy_mark
[Syzbot reported]
WARNING: possible circular locking dependency detected
6.11.0-rc4-syzkaller-00019-gb311c1b497e5 #0 Not tainted
------------------------------------------------------
kswapd0/78 is trying to acquire lock:
ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline]
ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578

but task is already holding lock:
ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6841 [inline]
ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xbb4/0x35a0 mm/vmscan.c:7223

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (fs_reclaim){+.+.}-{0:0}:
       ...
       kmem_cache_alloc_noprof+0x3d/0x2a0 mm/slub.c:4044
       inotify_new_watch fs/notify/inotify/inotify_user.c:599 [inline]
       inotify_update_watch fs/notify/inotify/inotify_user.c:647 [inline]
       __do_sys_inotify_add_watch fs/notify/inotify/inotify_user.c:786 [inline]
       __se_sys_inotify_add_watch+0x72e/0x1070 fs/notify/inotify/inotify_user.c:729
       do_syscall_x64 arch/x86/entry/common.c:52 [inline]
       do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
       entry_SYSCALL_64_after_hwframe+0x77/0x7f

-> #0 (&group->mark_mutex){+.+.}-{3:3}:
       ...
       __mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
       fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline]
       fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578
       fsnotify_destroy_marks+0x14a/0x660 fs/notify/mark.c:934
       fsnotify_inoderemove include/linux/fsnotify.h:264 [inline]
       dentry_unlink_inode+0x2e0/0x430 fs/dcache.c:403
       __dentry_kill+0x20d/0x630 fs/dcache.c:610
       shrink_kill+0xa9/0x2c0 fs/dcache.c:1055
       shrink_dentry_list+0x2c0/0x5b0 fs/dcache.c:1082
       prune_dcache_sb+0x10f/0x180 fs/dcache.c:1163
       super_cache_scan+0x34f/0x4b0 fs/super.c:221
       do_shrink_slab+0x701/0x1160 mm/shrinker.c:435
       shrink_slab+0x1093/0x14d0 mm/shrinker.c:662
       shrink_one+0x43b/0x850 mm/vmscan.c:4815
       shrink_many mm/vmscan.c:4876 [inline]
       lru_gen_shrink_node mm/vmscan.c:4954 [inline]
       shrink_node+0x3799/0x3de0 mm/vmscan.c:5934
       kswapd_shrink_node mm/vmscan.c:6762 [inline]
       balance_pgdat mm/vmscan.c:6954 [inline]
       kswapd+0x1bcd/0x35a0 mm/vmscan.c:7223

[Analysis]
The problem is that inotify_new_watch() is using GFP_KERNEL to allocate
new watches under group->mark_mutex, however if dentry reclaim races
with unlinking of an inode, it can end up dropping the last dentry reference
for an unlinked inode resulting in removal of fsnotify mark from reclaim
context which wants to acquire group->mark_mutex as well.

This scenario shows that all notification groups are in principle prone
to this kind of a deadlock (previously, we considered only fanotify and
dnotify to be problematic for other reasons) so make sure all
allocations under group->mark_mutex happen with GFP_NOFS.

Reported-and-tested-by: syzbot+c679f13773f295d2da53@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c679f13773f295d2da53
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240927143642.2369508-1-lizhi.xu@windriver.com
2024-10-02 15:14:29 +02:00
Jan Kara
35ceae4474 fsnotify: Avoid data race between fsnotify_recalc_mask() and fsnotify_object_watched()
When __fsnotify_recalc_mask() recomputes the mask on the watched object,
the compiler can "optimize" the code to perform partial updates to the
mask (including zeroing it at the beginning). Thus places checking
the object mask without conn->lock such as fsnotify_object_watched()
could see invalid states of the mask. Make sure the mask update is
performed by one memory store using WRITE_ONCE().

Reported-by: syzbot+701037856c25b143f1ad@syzkaller.appspotmail.com
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/all/CACT4Y+Zk0ohwwwHSD63U2-PQ=UuamXczr1mKBD6xtj2dyYKBvA@mail.gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://patch.msgid.link/20240717140623.27768-1-jack@suse.cz
2024-10-02 15:11:40 +02:00
Gianfranco Trad
264db9d666 udf: fix uninit-value use in udf_get_fileshortad
Check for overflow when computing alen in udf_current_aext to mitigate
later uninit-value use in udf_get_fileshortad KMSAN bug[1].
After applying the patch reproducer did not trigger any issue[2].

[1] https://syzkaller.appspot.com/bug?extid=8901c4560b7ab5c2f9df
[2] https://syzkaller.appspot.com/x/log.txt?x=10242227980000

Reported-by: syzbot+8901c4560b7ab5c2f9df@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8901c4560b7ab5c2f9df
Tested-by: syzbot+8901c4560b7ab5c2f9df@syzkaller.appspotmail.com
Suggested-by: Jan Kara <jack@suse.com>
Signed-off-by: Gianfranco Trad <gianf.trad@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240925074613.8475-3-gianf.trad@gmail.com
2024-10-02 14:32:37 +02:00
Zhao Mengmeng
c226964ec7 udf: refactor inode_bmap() to handle error
Refactor inode_bmap() to handle error since udf_next_aext() can return
error now. On situations like ftruncate, udf_extend_file() can now
detect errors and bail out early without resorting to checking for
particular offsets and assuming internal behavior of these functions.

Reported-by: syzbot+7a4842f0b1801230a989@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7a4842f0b1801230a989
Tested-by: syzbot+7a4842f0b1801230a989@syzkaller.appspotmail.com
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241001115425.266556-4-zhaomzhao@126.com
2024-10-02 14:32:29 +02:00
Zhao Mengmeng
b405c1e58b udf: refactor udf_next_aext() to handle error
Since udf_current_aext() has error handling, udf_next_aext() should have
error handling too. Besides, when too many indirect extents found in one
inode, return -EFSCORRUPTED; when reading block failed, return -EIO.

Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241001115425.266556-3-zhaomzhao@126.com
2024-10-02 14:10:50 +02:00
Zhao Mengmeng
ee703a7068 udf: refactor udf_current_aext() to handle error
As Jan suggested in links below, refactor udf_current_aext() to
differentiate between error, hit EOF and success, it now takes pointer to
etype to store the extent type, return 1 when getting etype success,
return 0 when hitting EOF and return -errno when err.

Link: https://lore.kernel.org/all/20240912111235.6nr3wuqvktecy3vh@quack3/
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241001115425.266556-2-zhaomzhao@126.com
2024-10-02 12:37:01 +02:00
Al Viro
0d0b8646a6 ufs_rename(): fix bogus argument of folio_release_kmap()
new_dir does *NOT* point into dir_folio - it's an inode, not a pointer
to ufs directory entry.

Fixes: 516b97cf03 "ufs: Convert directory handling to kmap_local"
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-02 00:05:09 -04:00
wangrong
a421e3fe0e smb: client: use actual path when queryfs
Due to server permission control, the client does not have access to
the shared root directory, but can access subdirectories normally, so
users usually mount the shared subdirectories directly. In this case,
queryfs should use the actual path instead of the root directory to
avoid the call returning an error (EACCES).

Signed-off-by: wangrong <wangrong@uniontech.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-01 21:47:16 -05:00
Kent Overstreet
e764e68103 bcachefs: Fix bad shift in bch2_read_flag_list()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-01 17:20:24 -04:00
Thorsten Blum
9c38339636 ksmbd: Use struct_size() to improve smb_direct_rdma_xmit()
Use struct_size() to calculate the number of bytes to allocate for a
new message.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-01 14:50:51 -05:00
Thorsten Blum
0801c1374a ksmbd: Annotate struct copychunk_ioctl_req with __counted_by_le()
Add the __counted_by_le compiler attribute to the flexible array member
Chunks to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

Change the data type of the flexible array member Chunks from __u8[] to
struct srv_copychunk[] for ChunkCount to match the number of elements in
the Chunks array. (With __u8[], each srv_copychunk would occupy 24 array
entries and the __counted_by compiler attribute wouldn't be applicable.)

Use struct_size() to calculate the size of the copychunk_ioctl_req.

Read Chunks[0] after checking that ChunkCount is not 0.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-01 14:50:51 -05:00
Thorsten Blum
04afb0a3c3 ksmbd: Use struct_size() to improve get_file_alternate_info()
Use struct_size() to calculate the output buffer length.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-01 14:50:51 -05:00
Leo Martins
d6e7ac65d4 btrfs: disable rate limiting when debug enabled
Disable ratelimiting for btrfs_printk when CONFIG_BTRFS_DEBUG is
enabled. This allows for more verbose output which is often needed by
functions like btrfs_dump_space_info().

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:29:41 +02:00
Filipe Manana
41fd1e9406 btrfs: wait for fixup workers before stopping cleaner kthread during umount
During unmount, at close_ctree(), we have the following steps in this order:

1) Park the cleaner kthread - this doesn't destroy the kthread, it basically
   halts its execution (wake ups against it work but do nothing);

2) We stop the cleaner kthread - this results in freeing the respective
   struct task_struct;

3) We call btrfs_stop_all_workers() which waits for any jobs running in all
   the work queues and then free the work queues.

Syzbot reported a case where a fixup worker resulted in a crash when doing
a delayed iput on its inode while attempting to wake up the cleaner at
btrfs_add_delayed_iput(), because the task_struct of the cleaner kthread
was already freed. This can happen during unmount because we don't wait
for any fixup workers still running before we call kthread_stop() against
the cleaner kthread, which stops and free all its resources.

Fix this by waiting for any fixup workers at close_ctree() before we call
kthread_stop() against the cleaner and run pending delayed iputs.

The stack traces reported by syzbot were the following:

  BUG: KASAN: slab-use-after-free in __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
  Read of size 8 at addr ffff8880272a8a18 by task kworker/u8:3/52

  CPU: 1 UID: 0 PID: 52 Comm: kworker/u8:3 Not tainted 6.12.0-rc1-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
  Workqueue: btrfs-fixup btrfs_work_helper
  Call Trace:
   <TASK>
   __dump_stack lib/dump_stack.c:94 [inline]
   dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
   print_address_description mm/kasan/report.c:377 [inline]
   print_report+0x169/0x550 mm/kasan/report.c:488
   kasan_report+0x143/0x180 mm/kasan/report.c:601
   __lock_acquire+0x77/0x2050 kernel/locking/lockdep.c:5065
   lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5825
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0xd5/0x120 kernel/locking/spinlock.c:162
   class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
   try_to_wake_up+0xb0/0x1480 kernel/sched/core.c:4154
   btrfs_writepage_fixup_worker+0xc16/0xdf0 fs/btrfs/inode.c:2842
   btrfs_work_helper+0x390/0xc50 fs/btrfs/async-thread.c:314
   process_one_work kernel/workqueue.c:3229 [inline]
   process_scheduled_works+0xa63/0x1850 kernel/workqueue.c:3310
   worker_thread+0x870/0xd30 kernel/workqueue.c:3391
   kthread+0x2f0/0x390 kernel/kthread.c:389
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
   </TASK>

  Allocated by task 2:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   unpoison_slab_object mm/kasan/common.c:319 [inline]
   __kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:345
   kasan_slab_alloc include/linux/kasan.h:247 [inline]
   slab_post_alloc_hook mm/slub.c:4086 [inline]
   slab_alloc_node mm/slub.c:4135 [inline]
   kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4187
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1107
   copy_process+0x5d1/0x3d50 kernel/fork.c:2206
   kernel_clone+0x223/0x880 kernel/fork.c:2787
   kernel_thread+0x1bc/0x240 kernel/fork.c:2849
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:765
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

  Freed by task 61:
   kasan_save_stack mm/kasan/common.c:47 [inline]
   kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
   kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
   poison_slab_object mm/kasan/common.c:247 [inline]
   __kasan_slab_free+0x59/0x70 mm/kasan/common.c:264
   kasan_slab_free include/linux/kasan.h:230 [inline]
   slab_free_hook mm/slub.c:2343 [inline]
   slab_free mm/slub.c:4580 [inline]
   kmem_cache_free+0x1a2/0x420 mm/slub.c:4682
   put_task_struct include/linux/sched/task.h:144 [inline]
   delayed_put_task_struct+0x125/0x300 kernel/exit.c:228
   rcu_do_batch kernel/rcu/tree.c:2567 [inline]
   rcu_core+0xaaa/0x17a0 kernel/rcu/tree.c:2823
   handle_softirqs+0x2c5/0x980 kernel/softirq.c:554
   __do_softirq kernel/softirq.c:588 [inline]
   invoke_softirq kernel/softirq.c:428 [inline]
   __irq_exit_rcu+0xf4/0x1c0 kernel/softirq.c:637
   irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
   instr_sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1037 [inline]
   sysvec_apic_timer_interrupt+0xa6/0xc0 arch/x86/kernel/apic/apic.c:1037
   asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702

  Last potentially related work creation:
   kasan_save_stack+0x3f/0x60 mm/kasan/common.c:47
   __kasan_record_aux_stack+0xac/0xc0 mm/kasan/generic.c:541
   __call_rcu_common kernel/rcu/tree.c:3086 [inline]
   call_rcu+0x167/0xa70 kernel/rcu/tree.c:3190
   context_switch kernel/sched/core.c:5318 [inline]
   __schedule+0x184b/0x4ae0 kernel/sched/core.c:6675
   schedule_idle+0x56/0x90 kernel/sched/core.c:6793
   do_idle+0x56a/0x5d0 kernel/sched/idle.c:354
   cpu_startup_entry+0x42/0x60 kernel/sched/idle.c:424
   start_secondary+0x102/0x110 arch/x86/kernel/smpboot.c:314
   common_startup_64+0x13e/0x147

  The buggy address belongs to the object at ffff8880272a8000
   which belongs to the cache task_struct of size 7424
  The buggy address is located 2584 bytes inside of
   freed 7424-byte region [ffff8880272a8000, ffff8880272a9d00)

  The buggy address belongs to the physical page:
  page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x272a8
  head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
  flags: 0xfff00000000040(head|node=0|zone=1|lastcpupid=0x7ff)
  page_type: f5(slab)
  raw: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000
  head: 00fff00000000040 ffff88801bafa500 dead000000000122 0000000000000000
  head: 0000000000000000 0000000080040004 00000001f5000000 0000000000000000
  head: 00fff00000000003 ffffea00009caa01 ffffffffffffffff 0000000000000000
  head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: kasan: bad access detected
  page_owner tracks the page as allocated
  page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 2, tgid 2 (kthreadd), ts 71247381401, free_ts 71214998153
   set_page_owner include/linux/page_owner.h:32 [inline]
   post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537
   prep_new_page mm/page_alloc.c:1545 [inline]
   get_page_from_freelist+0x3039/0x3180 mm/page_alloc.c:3457
   __alloc_pages_noprof+0x256/0x6c0 mm/page_alloc.c:4733
   alloc_pages_mpol_noprof+0x3e8/0x680 mm/mempolicy.c:2265
   alloc_slab_page+0x6a/0x120 mm/slub.c:2413
   allocate_slab+0x5a/0x2f0 mm/slub.c:2579
   new_slab mm/slub.c:2632 [inline]
   ___slab_alloc+0xcd1/0x14b0 mm/slub.c:3819
   __slab_alloc+0x58/0xa0 mm/slub.c:3909
   __slab_alloc_node mm/slub.c:3962 [inline]
   slab_alloc_node mm/slub.c:4123 [inline]
   kmem_cache_alloc_node_noprof+0x1fe/0x320 mm/slub.c:4187
   alloc_task_struct_node kernel/fork.c:180 [inline]
   dup_task_struct+0x57/0x8c0 kernel/fork.c:1107
   copy_process+0x5d1/0x3d50 kernel/fork.c:2206
   kernel_clone+0x223/0x880 kernel/fork.c:2787
   kernel_thread+0x1bc/0x240 kernel/fork.c:2849
   create_kthread kernel/kthread.c:412 [inline]
   kthreadd+0x60d/0x810 kernel/kthread.c:765
   ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
   ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
  page last free pid 5230 tgid 5230 stack trace:
   reset_page_owner include/linux/page_owner.h:25 [inline]
   free_pages_prepare mm/page_alloc.c:1108 [inline]
   free_unref_page+0xcd0/0xf00 mm/page_alloc.c:2638
   discard_slab mm/slub.c:2678 [inline]
   __put_partials+0xeb/0x130 mm/slub.c:3146
   put_cpu_partial+0x17c/0x250 mm/slub.c:3221
   __slab_free+0x2ea/0x3d0 mm/slub.c:4450
   qlink_free mm/kasan/quarantine.c:163 [inline]
   qlist_free_all+0x9a/0x140 mm/kasan/quarantine.c:179
   kasan_quarantine_reduce+0x14f/0x170 mm/kasan/quarantine.c:286
   __kasan_slab_alloc+0x23/0x80 mm/kasan/common.c:329
   kasan_slab_alloc include/linux/kasan.h:247 [inline]
   slab_post_alloc_hook mm/slub.c:4086 [inline]
   slab_alloc_node mm/slub.c:4135 [inline]
   kmem_cache_alloc_noprof+0x135/0x2a0 mm/slub.c:4142
   getname_flags+0xb7/0x540 fs/namei.c:139
   do_sys_openat2+0xd2/0x1d0 fs/open.c:1409
   do_sys_open fs/open.c:1430 [inline]
   __do_sys_openat fs/open.c:1446 [inline]
   __se_sys_openat fs/open.c:1441 [inline]
   __x64_sys_openat+0x247/0x2a0 fs/open.c:1441
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

  Memory state around the buggy address:
   ffff8880272a8900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880272a8980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  >ffff8880272a8a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
   ffff8880272a8a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
   ffff8880272a8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ==================================================================

Reported-by: syzbot+8aaf2df2ef0164ffe1fb@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/66fb36b1.050a0220.aab67.003b.GAE@google.com/
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:29:33 +02:00
Qu Wenruo
c3b47f49e8 btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
[BUG]
Syzbot reported a NULL pointer dereference with the following crash:

  FAULT_INJECTION: forcing a failure.
   start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676
   prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642
   relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678
  ...
  BTRFS info (device loop0): balance: ended with status: -12
  Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667]
  RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926
  Call Trace:
   <TASK>
   commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496
   btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430
   del_balance_item fs/btrfs/volumes.c:3678 [inline]
   reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742
   btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574
   btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:907 [inline]
   __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

[CAUSE]
The allocation failure happens at the start_transaction() inside
prepare_to_relocate(), and during the error handling we call
unset_reloc_control(), which makes fs_info->balance_ctl to be NULL.

Then we continue the error path cleanup in btrfs_balance() by calling
reset_balance_state() which will call del_balance_item() to fully delete
the balance item in the root tree.

However during the small window between set_reloc_contrl() and
unset_reloc_control(), we can have a subvolume tree update and created a
reloc_root for that subvolume.

Then we go into the final btrfs_commit_transaction() of
del_balance_item(), and into btrfs_update_reloc_root() inside
commit_fs_roots().

That function checks if fs_info->reloc_ctl is in the merge_reloc_tree
stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer
dereference.

[FIX]
Just add extra check on fs_info->reloc_ctl inside
btrfs_update_reloc_root(), before checking
fs_info->reloc_ctl->merge_reloc_tree.

That DEAD_RELOC_TREE handling is to prevent further modification to the
reloc tree during merge stage, but since there is no reloc_ctl at all,
we do not need to bother that.

Reported-by: syzbot+283673dbc38527ef9f3d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/66f6bfa7.050a0220.38ace9.0019.GAE@google.com/
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:22:37 +02:00
Filipe Manana
fa630df665 btrfs: send: fix invalid clone operation for file that got its size decreased
During an incremental send we may end up sending an invalid clone
operation, for the last extent of a file which ends at an unaligned offset
that matches the final i_size of the file in the send snapshot, in case
the file had its initial size (the size in the parent snapshot) decreased
in the send snapshot. In this case the destination will fail to apply the
clone operation because its end offset is not sector size aligned and it
ends before the current size of the file.

Sending the truncate operation always happens when we finish processing an
inode, after we process all its extents (and xattrs, names, etc). So fix
this by ensuring the file has a valid size before we send a clone
operation for an unaligned extent that ends at the final i_size of the
file. The size we truncate to matches the start offset of the clone range
but it could be any value between that start offset and the final size of
the file since the clone operation will expand the i_size if the current
size is smaller than the end offset. The start offset of the range was
chosen because it's always sector size aligned and avoids a truncation
into the middle of a page, which results in dirtying the page due to
filling part of it with zeroes and then making the clone operation at the
receiver trigger IO.

The following test reproduces the issue:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  # Create a file with a size of 256K + 5 bytes, having two extents, one
  # with a size of 128K and another one with a size of 128K + 5 bytes.
  last_ext_size=$((128 * 1024 + 5))
  xfs_io -f -d -c "pwrite -S 0xab -b 128K 0 128K" \
         -c "pwrite -S 0xcd -b $last_ext_size 128K $last_ext_size" \
         $MNT/foo

  # Another file which we will later clone foo into, but initially with
  # a larger size than foo.
  xfs_io -f -c "pwrite -S 0xef 0 1M" $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap1

  # Now resize bar and clone foo into it.
  xfs_io -c "truncate 0" \
         -c "reflink $MNT/foo" $MNT/bar

  btrfs subvolume snapshot -r $MNT/ $MNT/snap2

  rm -f /tmp/send-full /tmp/send-inc
  btrfs send -f /tmp/send-full $MNT/snap1
  btrfs send -p $MNT/snap1 -f /tmp/send-inc $MNT/snap2

  umount $MNT
  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  btrfs receive -f /tmp/send-full $MNT
  btrfs receive -f /tmp/send-inc $MNT

  umount $MNT

Running it before this patch:

  $ ./test.sh
  (...)
  At subvol snap1
  At snapshot snap2
  ERROR: failed to clone extents to bar: Invalid argument

A test case for fstests will be sent soon.

Reported-by: Ben Millwood <thebenmachine@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAJhrHS2z+WViO2h=ojYvBPDLsATwLbg+7JaNCyYomv0fUxEpQQ@mail.gmail.com/
Fixes: 46a6e10a1a ("btrfs: send: allow cloning non-aligned extent if it ends at i_size")
CC: stable@vger.kernel.org # 6.11
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:15:12 +02:00
Josef Bacik
db7e68b522 btrfs: drop the backref cache during relocation if we commit
Since the inception of relocation we have maintained the backref cache
across transaction commits, updating the backref cache with the new
bytenr whenever we COWed blocks that were in the cache, and then
updating their bytenr once we detected a transaction id change.

This works as long as we're only ever modifying blocks, not changing the
structure of the tree.

However relocation does in fact change the structure of the tree.  For
example, if we are relocating a data extent, we will look up all the
leaves that point to this data extent.  We will then call
do_relocation() on each of these leaves, which will COW down to the leaf
and then update the file extent location.

But, a key feature of do_relocation() is the pending list.  This is all
the pending nodes that we modified when we updated the file extent item.
We will then process all of these blocks via finish_pending_nodes, which
calls do_relocation() on all of the nodes that led up to that leaf.

The purpose of this is to make sure we don't break sharing unless we
absolutely have to.  Consider the case that we have 3 snapshots that all
point to this leaf through the same nodes, the initial COW would have
created a whole new path.  If we did this for all 3 snapshots we would
end up with 3x the number of nodes we had originally.  To avoid this we
will cycle through each of the snapshots that point to each of these
nodes and update their pointers to point at the new nodes.

Once we update the pointer to the new node we will drop the node we
removed the link for and all of its children via btrfs_drop_subtree().
This is essentially just btrfs_drop_snapshot(), but for an arbitrary
point in the snapshot.

The problem with this is that we will never reflect this in the backref
cache.  If we do this btrfs_drop_snapshot() for a node that is in the
backref tree, we will leave the node in the backref tree.  This becomes
a problem when we change the transid, as now the backref cache has
entire subtrees that no longer exist, but exist as if they still are
pointed to by the same roots.

In the best case scenario you end up with "adding refs to an existing
tree ref" errors from insert_inline_extent_backref(), where we attempt
to link in nodes on roots that are no longer valid.

Worst case you will double free some random block and re-use it when
there's still references to the block.

This is extremely subtle, and the consequences are quite bad.  There
isn't a way to make sure our backref cache is consistent between
transid's.

In order to fix this we need to simply evict the entire backref cache
anytime we cross transid's.  This reduces performance in that we have to
rebuild this backref cache every time we change transid's, but fixes the
bug.

This has existed since relocation was added, and is a pretty critical
bug.  There's a lot more cleanup that can be done now that this
functionality is going away, but this patch is as small as possible in
order to fix the problem and make it easy for us to backport it to all
the kernels it needs to be backported to.

Followup series will dismantle more of this code and simplify relocation
drastically to remove this functionality.

We have a reproducer that reproduced the corruption within a few minutes
of running.  With this patch it survives several iterations/hours of
running the reproducer.

Fixes: 3fd0a5585e ("Btrfs: Metadata ENOSPC handling for balance")
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:10:26 +02:00
Johannes Thumshirn
97f9782276 btrfs: also add stripe entries for NOCOW writes
NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also do NOCOW writes for RAID stripe-tree filesystems.

Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:09:04 +02:00
Filipe Manana
96c6ca7157 btrfs: send: fix buffer overflow detection when copying path to cache entry
Starting with commit c0247d289e ("btrfs: send: annotate struct
name_cache_entry with __counted_by()") we annotated the variable length
array "name" from the name_cache_entry structure with __counted_by() to
improve overflow detection. However that alone was not correct, because
the length of that array does not match the "name_len" field - it matches
that plus 1 to include the NUL string terminator, so that makes a
fortified kernel think there's an overflow and report a splat like this:

  strcpy: detected buffer overflow: 20 byte write of buffer size 19
  WARNING: CPU: 3 PID: 3310 at __fortify_report+0x45/0x50
  CPU: 3 UID: 0 PID: 3310 Comm: btrfs Not tainted 6.11.0-prnet #1
  Hardware name: CompuLab Ltd.  sbc-ihsw/Intense-PC2 (IPC2), BIOS IPC2_3.330.7 X64 03/15/2018
  RIP: 0010:__fortify_report+0x45/0x50
  Code: 48 8b 34 (...)
  RSP: 0018:ffff97ebc0d6f650 EFLAGS: 00010246
  RAX: 7749924ef60fa600 RBX: ffff8bf5446a521a RCX: 0000000000000027
  RDX: 00000000ffffdfff RSI: ffff97ebc0d6f548 RDI: ffff8bf84e7a1cc8
  RBP: ffff8bf548574080 R08: ffffffffa8c40e10 R09: 0000000000005ffd
  R10: 0000000000000004 R11: ffffffffa8c70e10 R12: ffff8bf551eef400
  R13: 0000000000000000 R14: 0000000000000013 R15: 00000000000003a8
  FS:  00007fae144de8c0(0000) GS:ffff8bf84e780000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fae14691690 CR3: 00000001027a2003 CR4: 00000000001706f0
  Call Trace:
   <TASK>
   ? __warn+0x12a/0x1d0
   ? __fortify_report+0x45/0x50
   ? report_bug+0x154/0x1c0
   ? handle_bug+0x42/0x70
   ? exc_invalid_op+0x1a/0x50
   ? asm_exc_invalid_op+0x1a/0x20
   ? __fortify_report+0x45/0x50
   __fortify_panic+0x9/0x10
  __get_cur_name_and_parent+0x3bc/0x3c0
   get_cur_path+0x207/0x3b0
   send_extent_data+0x709/0x10d0
   ? find_parent_nodes+0x22df/0x25d0
   ? mas_nomem+0x13/0x90
   ? mtree_insert_range+0xa5/0x110
   ? btrfs_lru_cache_store+0x5f/0x1e0
   ? iterate_extent_inodes+0x52d/0x5a0
   process_extent+0xa96/0x11a0
   ? __pfx_lookup_backref_cache+0x10/0x10
   ? __pfx_store_backref_cache+0x10/0x10
   ? __pfx_iterate_backrefs+0x10/0x10
   ? __pfx_check_extent_item+0x10/0x10
   changed_cb+0x6fa/0x930
   ? tree_advance+0x362/0x390
   ? memcmp_extent_buffer+0xd7/0x160
   send_subvol+0xf0a/0x1520
   btrfs_ioctl_send+0x106b/0x11d0
   ? __pfx___clone_root_cmp_sort+0x10/0x10
   _btrfs_ioctl_send+0x1ac/0x240
   btrfs_ioctl+0x75b/0x850
   __se_sys_ioctl+0xca/0x150
   do_syscall_64+0x85/0x160
   ? __count_memcg_events+0x69/0x100
   ? handle_mm_fault+0x1327/0x15c0
   ? __se_sys_rt_sigprocmask+0xf1/0x180
   ? syscall_exit_to_user_mode+0x75/0xa0
   ? do_syscall_64+0x91/0x160
   ? do_user_addr_fault+0x21d/0x630
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fae145eeb4f
  Code: 00 48 89 (...)
  RSP: 002b:00007ffdf1cb09b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fae145eeb4f
  RDX: 00007ffdf1cb0ad0 RSI: 0000000040489426 RDI: 0000000000000004
  RBP: 00000000000078fe R08: 00007fae144006c0 R09: 00007ffdf1cb0927
  R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffdf1cb1ce8
  R13: 0000000000000003 R14: 000055c499fab2e0 R15: 0000000000000004
   </TASK>

Fix this by not storing the NUL string terminator since we don't actually
need it for name cache entries, this way "name_len" corresponds to the
actual size of the "name" array. This requires marking the "name" array
field with __nonstring and using memcpy() instead of strcpy() as
recommended by the guidelines at:

   https://github.com/KSPP/linux/issues/90

Reported-by: David Arendt <admin@prnet.org>
Link: https://lore.kernel.org/linux-btrfs/cee4591a-3088-49ba-99b8-d86b4242b8bd@prnet.org/
Fixes: c0247d289e ("btrfs: send: annotate struct name_cache_entry with __counted_by()")
CC: stable@vger.kernel.org # 6.11
Tested-by: David Arendt <admin@prnet.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-01 19:06:30 +02:00
Chang Yu
f6023535b5
netfs: Fix a KMSAN uninit-value error in netfs_clear_buffer
Use folioq_count instead of folioq_nr_slots to fix a KMSAN uninit-value
error in netfs_clear_buffer

Signed-off-by: Chang Yu <marcus.yu.56@gmail.com>
Link: https://lore.kernel.org/r/ZvuXWC2bYpvQsWgS@gmail.com
Fixes: cd0277ed0c ("netfs: Use new folio_queue data type and iterator instead of xarray iter")
Acked-by: David Howells <dhowells@redhat.com>
Reported-by: syzbot+921873345a95f4dae7e9@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=921873345a95f4dae7e9
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-01 16:02:42 +02:00
Konstantin Komarov
48dbc12783
fs/ntfs3: Format output messages like others fs in kernel
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-10-01 12:19:09 +03:00
Konstantin Komarov
031d6f6082
fs/ntfs3: Additional check in ntfs_file_release
Reported-by: syzbot+8c652f14a0fde76ff11d@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-10-01 12:19:08 +03:00
Konstantin Komarov
a33fb016e4
fs/ntfs3: Fix general protection fault in run_is_mapped_full
Fixed deleating of a non-resident attribute in ntfs_create_inode()
rollback.

Reported-by: syzbot+9af29acd8f27fbce94bc@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-10-01 12:19:08 +03:00
Konstantin Komarov
090f612756
fs/ntfs3: Sequential field availability check in mi_enum_attr()
The code is slightly reformatted to consistently check field availability
without duplication.

Fixes: 556bdf27c2 ("ntfs3: Add bounds checking to mi_enum_attr()")
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-10-01 12:19:07 +03:00
Konstantin Komarov
d178944db3
fs/ntfs3: Additional check in ni_clear()
Checking of NTFS_FLAGS_LOG_REPLAYING added to prevent access to
uninitialized bitmap during replay process.

Reported-by: syzbot+3bfd2cc059ab93efcdb4@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-10-01 12:19:07 +03:00
Konstantin Komarov
03b097099e
fs/ntfs3: Fix possible deadlock in mi_read
Mutex lock with another subclass used in ni_lock_dir().

Reported-by: syzbot+bc7ca0ae4591cb2550f9@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-10-01 12:19:06 +03:00
Diogo Jahchan Koike
589996bf8c
ntfs3: Change to non-blocking allocation in ntfs_d_hash
d_hash is done while under "rcu-walk" and should not sleep.
__get_name() allocates using GFP_KERNEL, having the possibility
to sleep when under memory pressure. Change the allocation to
GFP_NOWAIT.

Reported-by: syzbot+7f71f79bbfb4427b00e1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7f71f79bbfb4427b00e1
Fixes: d392e85fd1 ("fs/ntfs3: Fix the format of the "nocase" mount option")
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
2024-10-01 12:19:06 +03:00
Dr. David Alan Gilbert
bdd6baf740
fs/ntfs3: Remove unused al_delete_le
'al_delete_le' was added by:
Commit be71b5cba2 ("fs/ntfs3: Add attrib operations")

but has remained unused; there is an al_remove_le which seems
to be being used instead.

Remove 'al_delete_le'.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
2024-10-01 12:19:05 +03:00
Pali Rohár
c9432ad5e3 cifs: Remove intermediate object of failed create reparse call
If CREATE was successful but SMB2_OP_SET_REPARSE failed then remove the
intermediate object created by CREATE. Otherwise empty object stay on the
server when reparse call failed.

This ensures that if the creating of special files is unsupported by the
server then no empty file stay on the server as a result of unsupported
operation.

Fixes: 102466f303 ("smb: client: allow creating special files via reparse points")
Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-30 22:14:05 -05:00
Steve French
1eecd880a3 Revert "smb: client: make SHA-512 TFM ephemeral"
The original patch causes a crash with signed mounts when using
the SMB2.1 dialect

RIP: 0010:smb2_calc_signature+0x10e/0x460 [cifs]
Code: 46 30 00 00 00 00 49 c7 46 38 00 00 00 00 0f 85 3e 01 00 00 48 8b 83 a8 02 00 00 48 89 85 68 ff ff ff 49 8b b4 24 58 01 00 00 <48> 8b 38 ba 10 00 00 00 e8 55 0f 0c e0 41 89 c7 85 c0 0f 85 44 01
RSP: 0018:ffffb349422fb5c8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff98028765b800 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff980200f2b100 RDI: 0000000000000000
RBP: ffffb349422fb680 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff980235e37800
R13: ffffb349422fb900 R14: ffff98027c160700 R15: ffff98028765b820
FS:  000074139b98f780(0000) GS:ffff98097b980000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000011cb78006 CR4: 00000000003726f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ? show_regs+0x6c/0x80
 ? __die+0x24/0x80
 ? page_fault_oops+0x175/0x5c0
 ? hrtimer_try_to_cancel.part.0+0x55/0xf0
 ? do_user_addr_fault+0x4b2/0x870
 ? exc_page_fault+0x85/0x1c0
 ? asm_exc_page_fault+0x27/0x30
 ? smb2_calc_signature+0x10e/0x460 [cifs]
 ? smb2_calc_signature+0xa7/0x460 [cifs]
 ? kmem_cache_alloc_noprof+0x101/0x300
 smb2_sign_rqst+0xa2/0xe0 [cifs]
 smb2_setup_request+0x12d/0x240 [cifs]
 compound_send_recv+0x304/0x1220 [cifs]
 cifs_send_recv+0x22/0x40 [cifs]
 SMB2_tcon+0x2d9/0x8c0 [cifs]
 cifs_get_smb_ses+0x910/0xef0 [cifs]
 ? cifs_get_smb_ses+0x910/0xef0 [cifs]
 cifs_mount_get_session+0x6a/0x250 [cifs]

Reported-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Suggested-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

This reverts commit 220d83b52c.
2024-09-30 22:07:45 -05:00
Yang Li
37578c682c nfs: Remove duplicated include in localio.c
The header files linux/module.h is included twice in localio.c,
so one inclusion of each can be removed.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=11073
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-09-30 15:05:52 -04:00
Linus Torvalds
a5f24c7955 vfs-6.12-rc2.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZvqoUgAKCRCRxhvAZXjc
 oveLAQC952BB+giATJvX5lPz400g+MoDVlWiPdqAVbF0PCGRRwEA+/7TedOfOkQx
 Df3iAzDXVbX/WvWYdZ/DyLp/r44WHAw=
 =tGkE
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "afs:

   - Fix setting of the server responding flag

   - Remove unused struct afs_address_list and afs_put_address_list()
     function

   - Fix infinite loop because of unresponsive servers

   - Ensure that afs_retry_request() function is correctly added to the
     afs_req_ops netfs operations table

  netfs:

   - Fix netfs_folio tracepoint handling to handle NULL mappings

   - Add a missing folio_queue API documentation

   - Ensure that netfs_write_folio() correctly advances the iterator via
     iov_iter_advance()

   - Fix a dentry leak during concurrent cull and cookie lookup
     operations in cachefiles

  pidfs:

   - Correctly handle accessing another task's pid namespace"

* tag 'vfs-6.12-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  netfs: Fix the netfs_folio tracepoint to handle NULL mapping
  netfs: Add folio_queue API documentation
  netfs: Advance iterator correctly rather than jumping it
  afs: Fix the setting of the server responding flag
  afs: Remove unused struct and function prototype
  afs: Fix possible infinite loop with unresponsive servers
  pidfs: check for valid pid namespace
  afs: Fix missing wire-up of afs_retry_request()
  cachefiles: fix dentry leak in cachefiles_open_file()
2024-09-30 10:59:44 -07:00
Guenter Roeck
2007d28ec0 bcachefs: rename version -> bversion for big endian builds
Builds on big endian systems fail as follows.

fs/bcachefs/bkey.h: In function 'bch2_bkey_format_add_key':
fs/bcachefs/bkey.h:557:41: error:
	'const struct bkey' has no member named 'bversion'

The original commit only renamed the variable for little endian builds.
Rename it for big endian builds as well to fix the problem.

Fixes: cf49f8a8c2 ("bcachefs: rename version -> bversion")
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-29 23:55:52 -04:00
Al Viro
678379e1d4 close_range(): fix the logics in descriptor table trimming
Cloning a descriptor table picks the size that would cover all currently
opened files.  That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
	close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.

Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().

The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.

That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.

It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size().  There we are under ->files_lock,
so the race is trivially avoided.

So we do the following:
	* switch close_files() from calling unshare_fd() to calling
dup_fd().
	* undo the calling convention change done to unshare_fd() in
60997c3d45 "close_range: add CLOSE_RANGE_UNSHARE"
	* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting.  NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
	* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
	* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.

Fixes: 60997c3d45 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-09-29 21:52:29 -04:00
Pali Rohár
dd0d7edb86 smb: Update comments about some reparse point tags
NFS-style reparse points are recognized only by the Windows NFS server 2012
and new. Windows 8 does not contain Windows NFS server, so these reparse
points are not used on Windows 8.

Reparse points with IO_REPARSE_TAG_AF_UNIX tag were primarily introduced for
native Win32 AF_UNIX sockets and later were re-used by also by WSL:
https://devblogs.microsoft.com/commandline/af_unix-comes-to-windows/
https://devblogs.microsoft.com/commandline/windowswsl-interop-with-af_unix/

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-29 17:28:40 -05:00
Pali Rohár
a9023656bc cifs: Check for UTF-16 null codepoint in SFU symlink target location
Check that read buffer of SFU symlink target location does not contain
UTF-16 null codepoint (via UniStrnlen() call) because Linux cannot process
symlink with null byte, it truncates everything in buffer after null byte.

Fixes: cf2ce67345 ("cifs: Add support for reading SFU symlink location")
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-29 17:28:40 -05:00
Linus Torvalds
9f9a534724 bcachefs fixes for 6.11-rc1
Assorted minor syzbot fixes, and for bigger stuff:
 
 - Fix two disk accounting rewrite bugs
  - Disk accounting keys use the version field of bkey so that journal
    replay can tell which updates have been applied to the btree. This is
    set in the transaction commit path, after we've gotten our journal
    reservation (and our time ordering), but the
    BCH_TRANS_COMMIT_skip_accounting_apply flag that journal replay uses
    was incorrectly skipping this for new updates generated prior to
    journal replay.
 
    This fixes the underlying cause of an assertion pop in
    disk_accounting_read.
 
  - A couple fixes for disk accounting + device removal. Checking if
    acocunting replicas entries were marked in the superblock was being
    done at the wrong point, when deltas in the journal could still zero
    them out, and then additionally we'd try to add a missing replicas
    entry to the superblock without checking if it referred to an invalid
    (removed) device.
 
 - A whole slew of repair fixes
  - fix infinite loop in propagate_key_to_snapshot_leaves(), this fixes
    an infinite loop when repairing a filesystem with many snapshots
  - fix incorrect transaction restart handling leading to occasional
    "fsck counted ..." warnings"
  - fix warning in __bch2_fsck_err() for bkey fsck errors
  - check_inode() in fsck now correctly checks if the filesystem was
    clean
  - there shouldn't be pending logged ops if the fs was clean, we now
    check for this
  - remove_backpointer() doesn't remove a dirent that doesn't actually
    point to the inode
  - many more fsck errors are AUTOFIX
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmb4QtsACgkQE6szbY3K
 bnYx4A//bhGgZYgP55FxduuxUH8XjX2eOnXwuPv/MmYO/4oCok5VBa9bRDTVXhIK
 PtY4pP2IJZ3+u963mwbwJAawsPA01AEEty9tE+AdXbltDRQ03I33OEuIy0HFIso2
 s8VBkVPbru6yU4RCCvYNIVvRG/9GOL+J0GgrR1t05zHVyKXe1FuS00Yq5+z3niNP
 HtuGTsD273Nnhikz47bqyD+M6VizU+uzSUFLgnB3zrzpb+gPSGETSwgc4ggajlM4
 2P10Vc4L/Nb3KYV9RW+C3WpRfUR/o8BZA3wjJfNo0JeA4iDaUbltSjpCA07EcAnA
 3D6Omzqkm4aobL2WlvioT0UhZx4t8X/8x5t5F9HyX52i1k+g87oMT9/KIKec1Dzd
 8vQCwCdXFfWaLSZoOJsHyIljip7BuRLKhWwKosdzzLIAnRQy5StxAhsG99fNStu6
 JOWICPNCn1b6SkktnoKou1unL+K5RczeNfAxMAjcJjTD7IIAmytLe4mdRbP9q+Oa
 x8no7pttbb4JnoRvfo42GVz8KWQR07oN/Zy7mH3K4Y0Ix+xDOrLqlfLIDLGpxMNv
 HZz+UPchdlfpYJO+nTLoAOGXZWnKDqg70SAEcWKDc82Ri4vNOhraYDZvXrzl9qE+
 63RPzqDbg3uXGxLYMvujjPe610QkPxS9zKKyDvUZZx0ZiUX4CjI=
 =cdrz
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-09-28' of git://evilpiepirate.org/bcachefs

Pull more bcachefs updates from Kent Overstreet:
 "Assorted minor syzbot fixes, and for bigger stuff:

  Fix two disk accounting rewrite bugs:

   - Disk accounting keys use the version field of bkey so that journal
     replay can tell which updates have been applied to the btree.

     This is set in the transaction commit path, after we've gotten our
     journal reservation (and our time ordering), but the
     BCH_TRANS_COMMIT_skip_accounting_apply flag that journal replay
     uses was incorrectly skipping this for new updates generated prior
     to journal replay.

     This fixes the underlying cause of an assertion pop in
     disk_accounting_read.

   - A couple of fixes for disk accounting + device removal.

     Checking if acocunting replicas entries were marked in the
     superblock was being done at the wrong point, when deltas in the
     journal could still zero them out, and then additionally we'd try
     to add a missing replicas entry to the superblock without checking
     if it referred to an invalid (removed) device.

  A whole slew of repair fixes:

   - fix infinite loop in propagate_key_to_snapshot_leaves(), this fixes
     an infinite loop when repairing a filesystem with many snapshots

   - fix incorrect transaction restart handling leading to occasional
     "fsck counted ..." warnings

   - fix warning in __bch2_fsck_err() for bkey fsck errors

   - check_inode() in fsck now correctly checks if the filesystem was
     clean

   - there shouldn't be pending logged ops if the fs was clean, we now
     check for this

   - remove_backpointer() doesn't remove a dirent that doesn't actually
     point to the inode

   - many more fsck errors are AUTOFIX"

* tag 'bcachefs-2024-09-28' of git://evilpiepirate.org/bcachefs: (35 commits)
  bcachefs: check_subvol_path() now prints subvol root inode
  bcachefs: remove_backpointer() now checks if dirent points to inode
  bcachefs: dirent_points_to_inode() now warns on mismatch
  bcachefs: Fix lost wake up
  bcachefs: Check for logged ops when clean
  bcachefs: BCH_FS_clean_recovery
  bcachefs: Convert disk accounting BUG_ON() to WARN_ON()
  bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply
  bcachefs: Check for accounting keys with bversion=0
  bcachefs: rename version -> bversion
  bcachefs: Don't delete unlinked inodes before logged op resume
  bcachefs: Fix BCH_SB_ERRS() so we can reorder
  bcachefs: Fix fsck warnings from bkey validation
  bcachefs: Move transaction commit path validation to as late as possible
  bcachefs: Fix disk accounting attempting to mark invalid replicas entry
  bcachefs: Fix unlocked access to c->disk_sb.sb in bch2_replicas_entry_validate()
  bcachefs: Fix accounting read + device removal
  bcachefs: bch_accounting_mode
  bcachefs: fix transaction restart handling in check_extents(), check_dirents()
  bcachefs: kill inode_walker_entry.seen_this_pos
  ...
2024-09-29 09:17:44 -07:00
Linus Torvalds
894b3c35d1 Three CephFS fixes from Xiubo and Luis and a bunch of assorted
cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmb3JroTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizDiB/0elHQQaFxXMjuJRY1IzohozAHi0cHK
 gwgE4nEbECE8vRYK/QvyvZ3S+ep+N+r6jOIiIDyqhjtlY3//oSyyxL7RjMJlVFBq
 Ie37w8r4q1aL1mn9QDQ4iQxcRYyU+JxcUcPR1UUUvLiKgWaRixmq27zby/WQSrkA
 ke2ScBRDtEAYVtdxvxmUJK/DrPr3skwJAGY52KesjwgVhXSL8KG9X1zMUbWdJYDV
 THbQzLZsu4NVh7LlAsS/mh+z0EIZsXxQYU5IY3dIVEYcuLK93lXRGZb+7whtmUef
 wsDtYIe/w30QVxFdrN28qAQp8daUJhp+3t0EZSyecRcq5OPey6ICx1P4
 =+bdB
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Three CephFS fixes from Xiubo and Luis and a bunch of assorted
  cleanups"

* tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client:
  ceph: remove the incorrect Fw reference check when dirtying pages
  ceph: Remove empty definition in header file
  ceph: Fix typo in the comment
  ceph: fix a memory leak on cap_auths in MDS client
  ceph: flush all caps releases when syncing the whole filesystem
  ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
  libceph: use min() to simplify code in ceph_dns_resolve_name()
  ceph: Convert to use jiffies macro
  ceph: Remove unused declarations
2024-09-28 08:40:36 -07:00
Linus Torvalds
9717d53438 5 smb3 server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmb2y3gACgkQiiy9cAdy
 T1EATQwAwQTCQSd916VK4HPzphM/lJLaxNCNNV0AY8QNXsUxhBxQny2FJ+jilFZU
 y4G5/zSi8+PX0YtyrPJqtofbFX+eeD6eKRCFT/1YEEkwEYp53mjsCIHWidSPGh6X
 S2du6tAebSCQSqlHv5zlTpL24UVhi6amse7aJyXs8v7JZO9ZjtEE0D+a1xqSV4kt
 0+6/W3RM49HTEAql7TavduNR3UcesYg2KS48qNVvGhHhY3wcGe92mZ0Sr4NUStfg
 IjtpfsxxBJWKiXDJhGBN8M/O6jqBtE++O/CyDknYGOs6M7QtPJ1xtXpESlq+OgWV
 JEqNorZI4qvl/5PbY/1+6wJDY3ogv2DhwyRaOdhtVc5CgF1JGLKW4lVBGBIrQz2B
 dyHbiGAXEA+Rm7/8UkyFZRmvbmLXDqRM7AEyLrXoeS5Vw51RxS6CT0oDesVGVsdX
 +koQ1OQ55AiR1TXhasDj6XvmFAyYKuEPh/qhBz1jEBX8unyhIUUVrG3CnMULl3rY
 FWbdmtDB
 =4ons
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

 - fix querying dentry for char/block special files

 - small cleanup patches

* tag 'v6.12-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
  ksmbd: Correct typos in multiple comments across various files
  ksmbd: fix open failure from block and char device file
  ksmbd: remove unsafe_memcpy use in session setup
  ksmbd: Replace one-element arrays with flexible-array members
  ksmbd: fix warning: comparison of distinct pointer types lacks a cast
2024-09-28 08:35:21 -07:00
Linus Torvalds
f04ff5a02b five smb3 client fixes, and an immportant netfs fix for cifs write regression
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmb3FfgACgkQiiy9cAdy
 T1HLMQv/ZXkiNPvMmCHJE3rWdSBZQdGLDV1qOhxuW9y4CvenIhukwDmwOq7wjWOn
 3dQHDaqFkVTVurosozFOJK9Aw94iz2Ad9dlryMcNN+Gb4vY3d9l3AvsbqmgbZSsg
 DmdqOg1SA9NgDaHl6RFsFQQY9O5BsjkEeHBX71gZZUYMw1d6CTpFUT+wTD43L0LQ
 g8r0Ksil7edw9f5WGvu8YzB4rclR45QiTVG1OMgXmr43cvoJz3GrIPXeHNm7j5D7
 hzx6ELNviY77DPKnxSd55UGPngVms6c1qqWCOJMefsJRY5bhh3lEc+TQyX11HGft
 Kta1TQI2gI1xgueqpR2Dh/bUuprWcc6vCbNjxezXpFOiSMt6qNfGzQDE9gZAALAj
 568lRpwcMPgS9laqK4Sh9v5+Vw6E8T+FUAJKvtLRidv5iBoqI0+50mhHTBlBgiy6
 XdyiAdUGSzejcu/OiOGc9C4PvVRgf0qw9+hqqZeZAsRydKgw3QwPtJ6yW+RYu86p
 xQ5CE6zs
 =mpUs
 -----END PGP SIGNATURE-----

Merge tag '6.12rc-more-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull xmb client fixes from Steve French:

 - Noisy log message cleanup

 - Important netfs fix for cifs crash in generic/074

 - Three minor improvements to use of hashing (multichannel and mount
   improvements)

 - Fix decryption crash for large read with small esize

* tag '6.12rc-more-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: make SHA-512 TFM ephemeral
  smb: client: make HMAC-MD5 TFM ephemeral
  smb: client: stop flooding dmesg in smb2_calc_signature()
  smb: client: allocate crypto only for primary server
  smb: client: fix UAF in async decryption
  netfs: Fix write oops in generic/346 (9p) and generic/074 (cifs)
2024-09-28 08:30:27 -07:00
Kent Overstreet
3a5895e3ac bcachefs: check_subvol_path() now prints subvol root inode
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:23 -04:00
Kent Overstreet
0b0f0ad93c bcachefs: remove_backpointer() now checks if dirent points to inode
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:23 -04:00
Kent Overstreet
a6508079b1 bcachefs: dirent_points_to_inode() now warns on mismatch
if an inode backpointer points to a dirent that doesn't point back,
that's an error we should warn about.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:23 -04:00
Alan Huang
e057a290ef bcachefs: Fix lost wake up
If the reader acquires the read lock and then the writer enters the slow
path, while the reader proceeds to the unlock path, the following scenario
can occur without the change:

writer: pcpu_read_count(lock) return 1 (so __do_six_trylock will return 0)
reader: this_cpu_dec(*lock->readers)
reader: smp_mb()
reader: state = atomic_read(&lock->state) (there is no waiting flag set)
writer: six_set_bitmask()

then the writer will sleep forever.

Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:23 -04:00
Kent Overstreet
d50d7a5fa4 bcachefs: Check for logged ops when clean
If we shut down successfully, there shouldn't be any logged ops to
resume.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:22 -04:00
Kent Overstreet
1c0ee43b2c bcachefs: BCH_FS_clean_recovery
Add a filesystem flag to indicate whether we did a clean recovery -
using c->sb.clean after we've got rw is incorrect, since c->sb is
updated whenever we write the superblock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:22 -04:00
Kent Overstreet
9773547b16 bcachefs: Convert disk accounting BUG_ON() to WARN_ON()
We had a bug where disk accounting keys didn't always have their version
field set in journal replay; change the BUG_ON() to a WARN(), and
exclude this case since it's now checked for elsewhere (in the bkey
validate function).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:22 -04:00
Kent Overstreet
a3581ca35d bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply
This was added to avoid double-counting accounting keys in journal
replay. But applied incorrectly (easily done since it applies to the
transaction commit, not a particular update), it leads to skipping
in-mem accounting for real accounting updates, and failure to give them
a version number - which leads to journal replay becoming very confused
the next time around.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 22:32:20 -04:00
Kent Overstreet
f8911ad88d bcachefs: Check for accounting keys with bversion=0
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
cf49f8a8c2 bcachefs: rename version -> bversion
give bversions a more distinct name, to aid in grepping

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
fd65378db9 bcachefs: Don't delete unlinked inodes before logged op resume
Previously, check_inode() would delete unlinked inodes if they weren't
on the deleted list - this code dating from before there was a deleted
list.

But, if we crash during a logged op (truncate or finsert/fcollapse) of
an unlinked file, logged op resume will get confused if the inode has
already been deleted - instead, just add it to the deleted list if it
needs to be there; delete_dead_inodes runs after logged op resume.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
8d65b15f8d bcachefs: Fix BCH_SB_ERRS() so we can reorder
BCH_SB_ERRS() has a field for the actual enum val so that we can reorder
to reorganize, but the way BCH_SB_ERR_MAX was defined didn't allow for
this.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
5612daafb7 bcachefs: Fix fsck warnings from bkey validation
__bch2_fsck_err() warns if the current task has a btree_trans object and
it wasn't passed in, because if it has to prompt for user input it has
to be able to unlock it.

But plumbing the btree_trans through bkey_validate(), as well as
transaction restarts, is problematic - so instead make bkey fsck errors
FSCK_AUTOFIX, which doesn't need to warn.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
7c980a43e9 bcachefs: Move transaction commit path validation to as late as possible
In order to check for accounting keys with version=0, we need to run
validation after they've been assigned version numbers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
431312b59c bcachefs: Fix disk accounting attempting to mark invalid replicas entry
This fixes the following bug, where a disk accounting key has an invalid
replicas entry, and we attempt to add it to the superblock:

bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): starting version 1.12: rebalance_work_acct_fix opts=metadata_replicas=2,data_replicas=2,foreground_target=ssd,background_target=hdd,nopromote_whole_extents,verbose,fsck,fix_errors=yes
bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): recovering from clean shutdown, journal seq 15211644
bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): accounting_read...
accounting not marked in superblock replicas
  replicas cached: 1/1 [0], fixing
bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 0 in entry cached: 1/1 [0]
replicas_v0 (size 88):
user: 2 [3 5] user: 2 [1 4] cached: 1 [2] btree: 2 [1 2] user: 2 [2 5] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 2 [1 2] user: 2 [2 3] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 3] user: 2 [1 5] user: 2 [2 4]

bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): inconsistency detected - emergency read only at journal seq 15211644
accounting not marked in superblock replicas
  replicas user: 1/1 [3], fixing
bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 0 in entry cached: 1/1 [0]
replicas_v0 (size 96):
user: 2 [3 5] user: 2 [1 3] cached: 1 [2] btree: 2 [1 2] user: 2 [2 4] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 1 [3] user: 2 [1 5] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 2] user: 2 [1 4] user: 2 [2 3] user: 2 [2 5]

accounting not marked in superblock replicas
  replicas user: 1/2 [3 7], fixing
bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 7 in entry user: 1/2 [3 7]
replicas_v0 (size 96):
user: 2 [3 7] user: 2 [1 3] cached: 1 [2] btree: 2 [1 2] user: 2 [2 4] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 1 [3] user: 2 [1 5] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 2] user: 2 [1 4] user: 2 [2 3] user: 2 [2 5] user: 2 [3 5]

 done
bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): alloc_read... done
bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): stripes_read... done
bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): snapshots_read... done

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
49fd90b2cc bcachefs: Fix unlocked access to c->disk_sb.sb in bch2_replicas_entry_validate()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
9104fc1928 bcachefs: Fix accounting read + device removal
accounting read was checking if accounting replicas entries were marked
in the superblock prior to applying accounting from the journal,
which meant that a recently removed device could spuriously trigger a
"not marked in superblocked" error (when journal entries zero out the
offending counter).

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
1e0272ef47 bcachefs: bch_accounting_mode
Minor refactoring - replace multiple bool arguments with an enum; prep
work for fixing a bug in accounting read.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
3672bda8f5 bcachefs: fix transaction restart handling in check_extents(), check_dirents()
Dealing with outside state within a btree transaction is always tricky.

check_extents() and check_dirents() have to accumulate counters for
i_sectors and i_nlink (for subdirectories). There were two bugs:

- transaction commit may return a restart; therefore we have to commit
  before accumulating to those counters
- get_inode_all_snapshots() may return a transaction restart, before
  updating w->last_pos; then, on the restart,
  check_i_sectors()/check_subdir_count() would see inodes that were not
  for w->last_pos

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:35 -04:00
Kent Overstreet
22a507d68e bcachefs: kill inode_walker_entry.seen_this_pos
dead code

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Kent Overstreet
b29c30ab48 bcachefs: Fix incorrect IS_ERR_OR_NULL usage
Returning a positive integer instead of an error code causes error paths
to become very confused.

Closes: syzbot+c0360e8367d6d8d04a66@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Hongbo Li
dc5bfdf8ea bcachefs: fix the memory leak in exception case
The pointer clean points the memory allocated by kmemdup, when the
return value of bch2_sb_clean_validate_late is not zero. The memory
pointed by clean is leaked. So we should free it in this case.

Fixes: a37ad1a3ab ("bcachefs: sb-clean.c")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Hongbo Li
3125c95ea6 bcachefs: fast exit when darray_make_room failed
In downgrade_table_extra, the return value is needed. When it
return failed, we should exit immediately.

Fixes: 7773df19c3 ("bcachefs: metadata version bucket_stripe_sectors")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Kent Overstreet
951dd86e7c bcachefs: Fix iterator leak in check_subvol()
A couple small error handling fixes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Kent Overstreet
2a1df87346 bcachefs: Add snapshot to bch_inode_unpacked
this allows for various cleanups in fsck

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Diogo Jahchan Koike
40d40c6bea bcachefs: assign return error when iterating through layout
syzbot reported a null ptr deref in __copy_user [0]

In __bch2_read_super, when a corrupt backup superblock matches the
default opts offset, no error is assigned to ret and the freed superblock
gets through, possibly being assigned as the best sb in bch2_fs_open and
being later dereferenced, causing a fault. Assign EINVALID to ret when
iterating through layout.

[0]: https://syzkaller.appspot.com/bug?extid=18a5c5e8a9c856944876

Reported-by: syzbot+18a5c5e8a9c856944876@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=18a5c5e8a9c856944876
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Kent Overstreet
c6040447c5 bcachefs: Fix srcu warning in check_topology
check_topology doesn't need the srcu lock and doesn't use normal btree
transactions - we can just drop the srcu lock.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Kent Overstreet
18c520f408 bcachefs: Fix error path in check_dirent_inode_dirent()
fsck_err() jumps to the fsck_err label when bailing out; need to make
sure bp_iter was initialized...

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Piotr Zalewski
0696a18a8c bcachefs: memset bounce buffer portion to 0 after key_sort_fix_overlapping
Zero-initialize part of allocated bounce buffer which wasn't touched by
subsequent bch2_key_sort_fix_overlapping to mitigate later uinit-value
use KMSAN bug[1].

After applying the patch reproducer still triggers stack overflow[2] but
it seems unrelated to the uninit-value use warning. After further
investigation it was found that stack overflow occurs because KMSAN adds
too many function calls[3]. Backtrace of where the stack magic number gets
smashed was added as a reply to syzkaller thread[3].

It was confirmed that task's stack magic number gets smashed after the code
path where KSMAN detects uninit-value use is executed, so it can be assumed
that it doesn't contribute in any way to uninit-value use detection.

[1] https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718
[2] https://lore.kernel.org/lkml/66e57e46.050a0220.115905.0002.GAE@google.com
[3] https://lore.kernel.org/all/rVaWgPULej8K7HqMPNIu8kVNyXNjjCiTB-QBtItLFBmk0alH6fV2tk4joVPk97Evnuv4ZRDd8HB5uDCkiFG6u81xKdzDj-KrtIMJSlF6Kt8=@proton.me

Reported-by: syzbot+6f655a60d3244d0c6718@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718
Fixes: ec4edd7b9d ("bcachefs: Prep work for variable size btree node buffers")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Kent Overstreet
51b7cc7c0f bcachefs: Improve bch2_is_inode_open() warning message
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Kent Overstreet
4a8f8fafbd bcachefs: Add extra padding in bkey_make_mut_noupdate()
This fixes a kasan splat in propagate_key_to_snapshot_leaves() -
varint_decode_fast() does reads (that it never uses) up to 7 bytes past
the end of the integer.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Kent Overstreet
f890c8513f bcachefs: Mark inode errors as autofix
Most or all errors will be autofix in the future, we're currently just
doing the ones that we know are well tested.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-09-27 21:46:34 -04:00
Amir Goldstein
0c33037c82 ovl: fix file leak in ovl_real_fdget_meta()
ovl_open_realfile() is wrongly called twice after conversion to
new struct fd.

Fixes: 88a2f6468d ("struct fd: representation change")
Reported-by: syzbot+d9efec94dcbfa0de1c07@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-27 12:38:47 -07:00
Linus Torvalds
eee280841e 19 hotfixes. 13 are cc:stable.
There's a focus on fixes for the memfd_pin_folios() work which was added
 into 6.11.  Apart from that, the usual shower of singleton fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZvbhSAAKCRDdBJ7gKXxA
 jp8CAP47txk2c+tBLggog2MkQamADY5l5MT6E3fYq3ghSiKtVQEAnqX3LiQJ02tB
 o9LcPcVrM90QntpKrLP1CpWCVdR+zA8=
 =e0QC
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull  misc fixes from Andrew Morton:
 "19 hotfixes.  13 are cc:stable.

  There's a focus on fixes for the memfd_pin_folios() work which was
  added into 6.11. Apart from that, the usual shower of singleton fixes"

* tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  ocfs2: fix uninit-value in ocfs2_get_block()
  zram: don't free statically defined names
  memory tiers: use default_dram_perf_ref_source in log message
  Revert "list: test: fix tests for list_cut_position()"
  kselftests: mm: fix wrong __NR_userfaultfd value
  compiler.h: specify correct attribute for .rodata..c_jump_table
  mm/damon/Kconfig: update DAMON doc URL
  mm: kfence: fix elapsed time for allocated/freed track
  ocfs2: fix deadlock in ocfs2_get_system_file_inode
  ocfs2: reserve space for inline xattr before attaching reflink tree
  mm: migrate: annotate data-race in migrate_folio_unmap()
  mm/hugetlb: simplify refs in memfd_alloc_folio
  mm/gup: fix memfd_pin_folios alloc race panic
  mm/gup: fix memfd_pin_folios hugetlb page allocation
  mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
  mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
  mm/filemap: fix filemap_get_folios_contig THP panic
  mm: make SPLIT_PTE_PTLOCKS depend on SMP
  tools: fix shared radix-tree build
2024-09-27 10:27:22 -07:00
David Howells
9fffa4e9b3
netfs: Advance iterator correctly rather than jumping it
In netfs_write_folio(), use iov_iter_advance() to advance the folio as we
split bits of it off to subrequests rather than manually jumping the
->iov_offset value around.  This becomes more problematic when we use a
bounce buffer made out of single-page folios to cover a multipage pagecache
folio.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/2238548.1727424522@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-27 18:29:20 +02:00
David Howells
ff98751bae
afs: Fix the setting of the server responding flag
In afs_wait_for_operation(), we set transcribe the call responded flag to
the server record that we used after doing the fileserver iteration loop -
but it's possible to exit the loop having had a response from the server
that we've discarded (e.g. it returned an abort or we started receiving
data, but the call didn't complete).

This means that op->server might be NULL, but we don't check that before
attempting to set the server flag.

Fixes: 98f9fda205 ("afs: Fold the afs_addr_cursor struct in")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240923150756.902363-7-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-27 18:29:20 +02:00
Thorsten Blum
19dcfb9c16
afs: Remove unused struct and function prototype
The struct afs_address_list and the function prototype
afs_put_address_list() are not used anymore and can be removed. Remove
them.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240911095046.3749-2-thorsten.blum@toblux.com/
Link: https://lore.kernel.org/r/20240923150756.902363-5-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-27 18:29:19 +02:00
Marc Dionne
f94d54208f
afs: Fix possible infinite loop with unresponsive servers
A return code of 0 from afs_wait_for_one_fs_probe is an indication
that the endpoint state attached to the operation is stale and has
been superseded.  In that case the iteration needs to be restarted
so that the newer probe result state gets used.

Failure to do so can result in an tight infinite loop around the
iterate_address label, where all addresses are thought to be responsive
and have been tried, with nothing to refresh the endpoint state.

Fixes: 495f2ae9e3 ("afs: Fix fileserver rotation")
Reported-by: Markus Suvanto <markus.suvanto@gmail.com>
Link: https://lists.infradead.org/pipermail/linux-afs/2024-July/008628.html
cc: linux-afs@lists.infradead.org
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240906134019.131553-1-marc.dionne@auristor.com/
Link: https://lore.kernel.org/r/20240923150756.902363-6-dhowells@redhat.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-27 18:29:19 +02:00
Christian Brauner
8a46067783
pidfs: check for valid pid namespace
When we access a no-current task's pid namespace we need check that the
task hasn't been reaped in the meantime and it's pid namespace isn't
accessible anymore.

The user namespace is fine because it is only released when the last
reference to struct task_struct is put and exit_creds() is called.

Link: https://lore.kernel.org/r/20240926-klebt-altgedienten-0415ad4d273c@brauner
Fixes: 5b08bd4085 ("pidfs: allow retrieval of namespace file descriptors")
CC: stable@vger.kernel.org # v6.11
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-27 18:29:19 +02:00
David Howells
2cf36327ee
afs: Fix missing wire-up of afs_retry_request()
afs_retry_request() is supposed to be pointed to by the afs_req_ops netfs
operations table, but the pointer got lost somewhere.  The function is used
during writeback to rotate through the authentication keys that were in
force when the file was modified locally.

Fix this by adding the pointer to the function.

Fixes: 1ecb146f7c ("netfs, afs: Use writeback retry to deal with alternate keys")
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1690847.1726346402@warthog.procyon.org.uk
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-27 18:29:19 +02:00
Baokun Li
da6ef2dffe
cachefiles: fix dentry leak in cachefiles_open_file()
A dentry leak may be caused when a lookup cookie and a cull are concurrent:

            P1             |             P2
-----------------------------------------------------------
cachefiles_lookup_cookie
  cachefiles_look_up_object
    lookup_one_positive_unlocked
     // get dentry
                            cachefiles_cull
                              inode->i_flags |= S_KERNEL_FILE;
    cachefiles_open_file
      cachefiles_mark_inode_in_use
        __cachefiles_mark_inode_in_use
          can_use = false
          if (!(inode->i_flags & S_KERNEL_FILE))
            can_use = true
	  return false
        return false
        // Returns an error but doesn't put dentry

After that the following WARNING will be triggered when the backend folder
is umounted:

==================================================================
BUG: Dentry 000000008ad87947{i=7a,n=Dx_1_1.img}  still in use (1) [unmount of ext4 sda]
WARNING: CPU: 4 PID: 359261 at fs/dcache.c:1767 umount_check+0x5d/0x70
CPU: 4 PID: 359261 Comm: umount Not tainted 6.6.0-dirty #25
RIP: 0010:umount_check+0x5d/0x70
Call Trace:
 <TASK>
 d_walk+0xda/0x2b0
 do_one_tree+0x20/0x40
 shrink_dcache_for_umount+0x2c/0x90
 generic_shutdown_super+0x20/0x160
 kill_block_super+0x1a/0x40
 ext4_kill_sb+0x22/0x40
 deactivate_locked_super+0x35/0x80
 cleanup_mnt+0x104/0x160
==================================================================

Whether cachefiles_open_file() returns true or false, the reference count
obtained by lookup_positive_unlocked() in cachefiles_look_up_object()
should be released.

Therefore release that reference count in cachefiles_look_up_object() to
fix the above issue and simplify the code.

Fixes: 1f08c925e7 ("cachefiles: Implement backing file wrangling")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240829083409.3788142-1-libaokun@huaweicloud.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-09-27 18:29:19 +02:00
Al Viro
cb787f4ac0 [tree-wide] finally take no_llseek out
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")

To quote that commit,

  At -rc1 we'll need do a mechanical removal of no_llseek -

  git grep -l -w no_llseek | grep -v porting.rst | while read i; do
	sed -i '/\<no_llseek\>/d' $i
  done

  would do it.

Unfortunately, that hadn't been done.  Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
	.llseek = no_llseek,
so it's obviously safe.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-09-27 08:18:43 -07:00
Enzo Matsumiya
220d83b52c smb: client: make SHA-512 TFM ephemeral
The SHA-512 shash TFM is used only briefly during Session Setup stage,
when computing SMB 3.1.1 preauth hash.

There's no need to keep it allocated in servers' secmech the whole time,
so keep its lifetime inside smb311_update_preauth_hash().

This also makes smb311_crypto_shash_allocate() redundant, so expose
smb3_crypto_shash_allocate() and use that.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-26 18:15:19 -05:00
Enzo Matsumiya
db44ca9f7b smb: client: make HMAC-MD5 TFM ephemeral
The HMAC-MD5 shash TFM is used only briefly during Session Setup stage,
when computing NTLMv2 hashes.

There's no need to keep it allocated in servers' secmech the whole time,
so keep its lifetime inside setup_ntlmv2_rsp().

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-26 18:15:17 -05:00
Paulo Alcantara
a13ca780af smb: client: stop flooding dmesg in smb2_calc_signature()
When having several mounts that share same credential and the client
couldn't re-establish an SMB session due to an expired kerberos ticket
or rotated password, smb2_calc_signature() will end up flooding dmesg
when not finding SMB sessions to calculate signatures.

Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-26 18:15:06 -05:00
Enzo Matsumiya
f7025d8616 smb: client: allocate crypto only for primary server
For extra channels, point ->secmech.{enc,dec} to the primary
server ones.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-26 18:15:02 -05:00
Enzo Matsumiya
b0abcd65ec smb: client: fix UAF in async decryption
Doing an async decryption (large read) crashes with a
slab-use-after-free way down in the crypto API.

Reproducer:
    # mount.cifs -o ...,seal,esize=1 //srv/share /mnt
    # dd if=/mnt/largefile of=/dev/null
    ...
    [  194.196391] ==================================================================
    [  194.196844] BUG: KASAN: slab-use-after-free in gf128mul_4k_lle+0xc1/0x110
    [  194.197269] Read of size 8 at addr ffff888112bd0448 by task kworker/u77:2/899
    [  194.197707]
    [  194.197818] CPU: 12 UID: 0 PID: 899 Comm: kworker/u77:2 Not tainted 6.11.0-lku-00028-gfca3ca14a17a-dirty #43
    [  194.198400] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-prebuilt.qemu.org 04/01/2014
    [  194.199046] Workqueue: smb3decryptd smb2_decrypt_offload [cifs]
    [  194.200032] Call Trace:
    [  194.200191]  <TASK>
    [  194.200327]  dump_stack_lvl+0x4e/0x70
    [  194.200558]  ? gf128mul_4k_lle+0xc1/0x110
    [  194.200809]  print_report+0x174/0x505
    [  194.201040]  ? __pfx__raw_spin_lock_irqsave+0x10/0x10
    [  194.201352]  ? srso_return_thunk+0x5/0x5f
    [  194.201604]  ? __virt_addr_valid+0xdf/0x1c0
    [  194.201868]  ? gf128mul_4k_lle+0xc1/0x110
    [  194.202128]  kasan_report+0xc8/0x150
    [  194.202361]  ? gf128mul_4k_lle+0xc1/0x110
    [  194.202616]  gf128mul_4k_lle+0xc1/0x110
    [  194.202863]  ghash_update+0x184/0x210
    [  194.203103]  shash_ahash_update+0x184/0x2a0
    [  194.203377]  ? __pfx_shash_ahash_update+0x10/0x10
    [  194.203651]  ? srso_return_thunk+0x5/0x5f
    [  194.203877]  ? crypto_gcm_init_common+0x1ba/0x340
    [  194.204142]  gcm_hash_assoc_remain_continue+0x10a/0x140
    [  194.204434]  crypt_message+0xec1/0x10a0 [cifs]
    [  194.206489]  ? __pfx_crypt_message+0x10/0x10 [cifs]
    [  194.208507]  ? srso_return_thunk+0x5/0x5f
    [  194.209205]  ? srso_return_thunk+0x5/0x5f
    [  194.209925]  ? srso_return_thunk+0x5/0x5f
    [  194.210443]  ? srso_return_thunk+0x5/0x5f
    [  194.211037]  decrypt_raw_data+0x15f/0x250 [cifs]
    [  194.212906]  ? __pfx_decrypt_raw_data+0x10/0x10 [cifs]
    [  194.214670]  ? srso_return_thunk+0x5/0x5f
    [  194.215193]  smb2_decrypt_offload+0x12a/0x6c0 [cifs]

This is because TFM is being used in parallel.

Fix this by allocating a new AEAD TFM for async decryption, but keep
the existing one for synchronous READ cases (similar to what is done
in smb3_calc_signature()).

Also remove the calls to aead_request_set_callback() and
crypto_wait_req() since it's always going to be a synchronous operation.

Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-26 18:14:48 -05:00
David Howells
df9b455633 netfs: Fix write oops in generic/346 (9p) and generic/074 (cifs)
In netfslib, a buffered writeback operation has a 'write queue' of folios
that are being written, held in a linear sequence of folio_queue structs.
The 'issuer' adds new folio_queues on the leading edge of the queue and
populates each one progressively; the 'collector' pops them off the
trailing edge and discards them and the folios they point to as they are
consumed.

The queue is required to always retain at least one folio_queue structure.
This allows the queue to be accessed without locking and with just a bit of
barriering.

When a new subrequest is prepared, its ->io_iter iterator is pointed at the
current end of the write queue and then the iterator is extended as more
data is added to the queue until the subrequest is committed.

Now, the problem is that the folio_queue at the leading edge of the write
queue when a subrequest is prepared might have been entirely consumed - but
not yet removed from the queue as it is the only remaining one and is
preventing the queue from collapsing.

So, what happens is that subreq->io_iter is pointed at the spent
folio_queue, then a new folio_queue is added, and, at that point, the
collector is at entirely at liberty to immediately delete the spent
folio_queue.

This leaves the subreq->io_iter pointing at a freed object.  If the system
is lucky, iterate_folioq() sees ->io_iter, sees the as-yet uncorrupted
freed object and advances to the next folio_queue in the queue.

In the case seen, however, the freed object gets recycled and put back onto
the queue at the tail and filled to the end.  This confuses
iterate_folioq() and it tries to step ->next, which may be NULL - resulting
in an oops.

Fix this by the following means:

 (1) When preparing a write subrequest, make sure there's a folio_queue
     struct with space in it at the leading edge of the queue.  A function
     to make space is split out of the function to append a folio so that
     it can be called for this purpose.

 (2) If the request struct iterator is pointing to a completely spent
     folio_queue when we make space, then advance the iterator to the newly
     allocated folio_queue.  The subrequest's iterator will then be set
     from this.

The oops could be triggered using the generic/346 xfstest with a filesystem
on9P over TCP with cache=loose.  The oops looked something like:

 BUG: kernel NULL pointer dereference, address: 0000000000000008
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 ...
 RIP: 0010:_copy_from_iter+0x2db/0x530
 ...
 Call Trace:
  <TASK>
 ...
  p9pdu_vwritef+0x3d8/0x5d0
  p9_client_prepare_req+0xa8/0x140
  p9_client_rpc+0x81/0x280
  p9_client_write+0xcf/0x1c0
  v9fs_issue_write+0x87/0xc0
  netfs_advance_write+0xa0/0xb0
  netfs_write_folio.isra.0+0x42d/0x500
  netfs_writepages+0x15a/0x1f0
  do_writepages+0xd1/0x220
  filemap_fdatawrite_wbc+0x5c/0x80
  v9fs_mmap_vm_close+0x7d/0xb0
  remove_vma+0x35/0x70
  vms_complete_munmap_vmas+0x11a/0x170
  do_vmi_align_munmap+0x17d/0x1c0
  do_vmi_munmap+0x13e/0x150
  __vm_munmap+0x92/0xd0
  __x64_sys_munmap+0x17/0x20
  do_syscall_64+0x80/0xe0
  entry_SYSCALL_64_after_hwframe+0x71/0x79

This also fixed a similar-looking issue with cifs and generic/074.

Fixes: cd0277ed0c ("netfs: Use new folio_queue data type and iterator instead of xarray iter")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202409180928.f20b5a08-oliver.sang@intel.com
Closes: https://lore.kernel.org/oe-lkp/202409131438.3f225fbf-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-09-26 17:45:20 -05:00
Joseph Qi
2af148ef85 ocfs2: fix uninit-value in ocfs2_get_block()
syzbot reported an uninit-value BUG:

BUG: KMSAN: uninit-value in ocfs2_get_block+0xed2/0x2710 fs/ocfs2/aops.c:159
ocfs2_get_block+0xed2/0x2710 fs/ocfs2/aops.c:159
do_mpage_readpage+0xc45/0x2780 fs/mpage.c:225
mpage_readahead+0x43f/0x840 fs/mpage.c:374
ocfs2_readahead+0x269/0x320 fs/ocfs2/aops.c:381
read_pages+0x193/0x1110 mm/readahead.c:160
page_cache_ra_unbounded+0x901/0x9f0 mm/readahead.c:273
do_page_cache_ra mm/readahead.c:303 [inline]
force_page_cache_ra+0x3b1/0x4b0 mm/readahead.c:332
force_page_cache_readahead mm/internal.h:347 [inline]
generic_fadvise+0x6b0/0xa90 mm/fadvise.c:106
vfs_fadvise mm/fadvise.c:185 [inline]
ksys_fadvise64_64 mm/fadvise.c:199 [inline]
__do_sys_fadvise64 mm/fadvise.c:214 [inline]
__se_sys_fadvise64 mm/fadvise.c:212 [inline]
__x64_sys_fadvise64+0x1fb/0x3a0 mm/fadvise.c:212
x64_sys_call+0xe11/0x3ba0
arch/x86/include/generated/asm/syscalls_64.h:222
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

This is because when ocfs2_extent_map_get_blocks() fails, p_blkno is
uninitialized.  So the error log will trigger the above uninit-value
access.

The error log is out-of-date since get_blocks() was removed long time ago.
And the error code will be logged in ocfs2_extent_map_get_blocks() once
ocfs2_get_cluster() fails, so fix this by only logging inode and block.

Link: https://syzkaller.appspot.com/bug?extid=9709e73bae885b05314b
Link: https://lkml.kernel.org/r/20240925090600.3643376-1-joseph.qi@linux.alibaba.com
Fixes: ccd979bdbc ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: syzbot+9709e73bae885b05314b@syzkaller.appspotmail.com
Tested-by: syzbot+9709e73bae885b05314b@syzkaller.appspotmail.com
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26 14:01:45 -07:00
Mohammed Anees
7bf1823e01 ocfs2: fix deadlock in ocfs2_get_system_file_inode
syzbot has found a possible deadlock in ocfs2_get_system_file_inode [1].

The scenario is depicted here,

	CPU0					CPU1
lock(&ocfs2_file_ip_alloc_sem_key);
                               lock(&osb->system_file_mutex);
                               lock(&ocfs2_file_ip_alloc_sem_key);
lock(&osb->system_file_mutex);

The function calls which could lead to this are:

CPU0
ocfs2_mknod - lock(&ocfs2_file_ip_alloc_sem_key);
.
.
.
ocfs2_get_system_file_inode - lock(&osb->system_file_mutex);

CPU1 -
ocfs2_fill_super - lock(&osb->system_file_mutex);
.
.
.
ocfs2_read_virt_blocks - lock(&ocfs2_file_ip_alloc_sem_key);

This issue can be resolved by making the down_read -> down_read_try
in the ocfs2_read_virt_blocks.

[1] https://syzkaller.appspot.com/bug?extid=e0055ea09f1f5e6fabdd

Link: https://lkml.kernel.org/r/20240924093257.7181-1-pvmohammedanees2003@gmail.com
Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: <syzbot+e0055ea09f1f5e6fabdd@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=e0055ea09f1f5e6fabdd
Tested-by: syzbot+e0055ea09f1f5e6fabdd@syzkaller.appspotmail.com
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc:  <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26 14:01:44 -07:00
Gautham Ananthakrishna
5ca60b86f5 ocfs2: reserve space for inline xattr before attaching reflink tree
One of our customers reported a crash and a corrupted ocfs2 filesystem. 
The crash was due to the detection of corruption.  Upon troubleshooting,
the fsck -fn output showed the below corruption

[EXTENT_LIST_FREE] Extent list in owner 33080590 claims 230 as the next free chain record,
but fsck believes the largest valid value is 227.  Clamp the next record value? n

The stat output from the debugfs.ocfs2 showed the following corruption
where the "Next Free Rec:" had overshot the "Count:" in the root metadata
block.

        Inode: 33080590   Mode: 0640   Generation: 2619713622 (0x9c25a856)
        FS Generation: 904309833 (0x35e6ac49)
        CRC32: 00000000   ECC: 0000
        Type: Regular   Attr: 0x0   Flags: Valid
        Dynamic Features: (0x16) HasXattr InlineXattr Refcounted
        Extended Attributes Block: 0  Extended Attributes Inline Size: 256
        User: 0 (root)   Group: 0 (root)   Size: 281320357888
        Links: 1   Clusters: 141738
        ctime: 0x66911b56 0x316edcb8 -- Fri Jul 12 06:02:30.829349048 2024
        atime: 0x66911d6b 0x7f7a28d -- Fri Jul 12 06:11:23.133669517 2024
        mtime: 0x66911b56 0x12ed75d7 -- Fri Jul 12 06:02:30.317552087 2024
        dtime: 0x0 -- Wed Dec 31 17:00:00 1969
        Refcount Block: 2777346
        Last Extblk: 2886943   Orphan Slot: 0
        Sub Alloc Slot: 0   Sub Alloc Bit: 14
        Tree Depth: 1   Count: 227   Next Free Rec: 230
        ## Offset        Clusters       Block#
        0  0             2310           2776351
        1  2310          2139           2777375
        2  4449          1221           2778399
        3  5670          731            2779423
        4  6401          566            2780447
        .......          ....           .......
        .......          ....           .......

The issue was in the reflink workfow while reserving space for inline
xattr.  The problematic function is ocfs2_reflink_xattr_inline().  By the
time this function is called the reflink tree is already recreated at the
destination inode from the source inode.  At this point, this function
reserves space for inline xattrs at the destination inode without even
checking if there is space at the root metadata block.  It simply reduces
the l_count from 243 to 227 thereby making space of 256 bytes for inline
xattr whereas the inode already has extents beyond this index (in this
case up to 230), thereby causing corruption.

The fix for this is to reserve space for inline metadata at the destination
inode before the reflink tree gets recreated. The customer has verified the
fix.

Link: https://lkml.kernel.org/r/20240918063844.1830332-1-gautham.ananthakrishna@oracle.com
Fixes: ef962df057 ("ocfs2: xattr: fix inlined xattr reflink")
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-26 14:01:44 -07:00