Even though system user has a supplementary group, It gets
NT_STATUS_ACCESS_DENIED when attempting to create file or directory.
This patch add KSMBD_EVENT_LOGIN_REQUEST_EXT/RESPONSE_EXT netlink events
to get supplementary groups list. The new netlink event doesn't break
backward compatibility when using old ksmbd-tools.
Co-developed-by: Atte Heikkilä <atteh.mailbox@gmail.com>
Signed-off-by: Atte Heikkilä <atteh.mailbox@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
This fixes a regression which prevents parallel DIO reads.
Fixes: 0cac51185e ("f2fs: fix to avoid racing in between read and OPU dio write")
Reviewed-by: Daeho Jeong <daehojeong@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The variable declaration in this function predates the merge of the
nrext64 (aka 64-bit extent counters) feature, which means that the
variable declaration type is insufficient to avoid an integer overflow.
Fix that by redeclaring the variable to be xfs_extnum_t.
Coverity-id: 1630958
Fixes: 8f71bede8e ("xfs: repair inode fork block mapping data structures")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcH4qAACgkQxWXV+ddt
WDtDig//czntfO+iRvDERZWTIB6vdVExfLd3r3ZNYlO1pIvgCuvqx3iYva+0ZhGW
8A+gcRax7cz0jCaxDp/+5lIRfdNxZH6/LwjZsDgU8Ly7himeRmwhtn2fCgNeiH/K
bUl92+ZMo2vwqTKXYa3xF1g3Hz6cRXVW7gJrMwNhb1hpPTGx+lgYJU02m/Io/vjK
1jcrZ84OEPIOY5uiAoDyO2hgsT/zVEeuuOiSTpKSzrghPbo0vmjLiYJ5T+CE5Uw3
u3w7/Fqnw49NwucqtncvyFFDXY9EWNuQhowi3hqJgOYTInqwwJigIpQV0hDDwYxb
ohGUGjazGfAEf/cy1jZXMbwCVgg8/Nj9x0eDKKhfs19VYUbMkEYQ8LKRTUlCeBwS
H/2AmqpqHEEO+tPY3P+w6MVwkNho8JNpWPdP5OzJs7XrD067IViOjD06HPM/k5ci
TU3zp9NYvgHVtmfZK1Aqsg9OYVhI1klVXejmlAzOLxejRPWXK/1hBw3kXbC6I+k1
50l0Yh1dgEnclMI3yWsKoj8IYUAkh2eudt0pNsot4a5vICMY++NVS2eukdz5UcEz
ix7hcpYcCcmzoOaelyEgmdAncWVGJT5w2Nzy85YaOp+Z1C65Ywb41utU+sSY+swB
kZfwl9vrsfu754vX7UKBherCvvYo+Lnj3GeX8Oe+1LoT2BP0TPk=
=lTqc
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- update fstrim loop and add more cancellation points, fix reported
delayed or blocked suspend if there's a huge chunk queued
- fix error handling in recent qgroup xarray conversion
- in zoned mode, fix warning printing device path without RCU
protection
- again fix invalid extent xarray state (6252690f7e), lost due to
refactoring
* tag 'for-6.12-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix clear_dirty and writeback ordering in submit_one_sector()
btrfs: zoned: fix missing RCU locking in error message when loading zone info
btrfs: fix missing error handling when adding delayed ref with qgroups enabled
btrfs: add cancellation points to trim loops
btrfs: split remaining space to discard in chunks
- Fix NFSD bring-up / shutdown
- Fix a UAF when releasing a stateid
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcHx6IACgkQM2qzM29m
f5ecExAAheSpSP9D+1n3hKeOlNfhY8FUzd41Arn4NYV+jIBJbtx94/FlSNOxA0mp
Ovm1I8uAxy4TR8TLt7tsxfT7JDOStKwXFl3QlOUZT/+uyyJr7/q5R959R3oMiccR
Rfrpj6j2yYWrI8qGDGHca4Vv2bDSxr4mzztWwDe0SHSsjwf4OAcv+XF5vcfZ/CJN
Bxulb9WfNU8XvdFcRDHokMfk6jiY6/+FCTwX8ckvbVEG6gHT8+CRYSUJ05j0LJGo
xKZV913NgzcuV7PH0vq6vExJE6+rEPt/ejDAT5FM5yeNe+WJ4RTDgsYyIr9iLbHF
mWB9M4NnP+EZhejtOCbZ9RZjjKro09ilEPpqILuuGQPtcHSeWmhNbFz0kwLe+zYZ
CdtjnPZhjB0ITWgZ1HCtoJ8k/ZcMa7iiM/kApMLGr9fVj8/BHHFzS95PK7K/Fqur
FLdhvo6CzZCnRd16e2kqWsG7wO2lPWcz4NWTf9wxIG5GCunXoVCEnK1VfHvnldbH
BIFXZ+ib5qnL2i3Qmz7bQxmfIp5ryZnNx1mF0OM8imR9K/rsnARd7JfQ99lpMy8D
mD4coZVTMMk/Zg9zuH8k5GBzB2zXXqgngp4IJIxqrKR7/AsuSU3R7r+O9CWN91GQ
GKpRtMn/rVUg81jxDr3qoKquyxONoyVrVXAKsj1PgUSQdjUJgqU=
=Rud7
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- Fix NFSD bring-up / shutdown
- Fix a UAF when releasing a stateid
* tag 'nfsd-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: fix possible badness in FREE_STATEID
nfsd: nfsd_destroy_serv() must call svc_destroy() even if nfsd_startup_net() failed
NFSD: Mark filecache "down" if init fails
* A few small typo fixes
* fstests xfs/538 DEBUG-only fix
* Performance fix on blockgc on COW'ed files,
by skipping trims on cowblock inodes currently
opened for write
* Prevent cowblocks to be freed under dirty pagecache
during unshare
* Update MAINTAINERS file to quote the new maintainer
Signed-off-by: Carlos Maiolino <cem@kernel.org>
-----BEGIN PGP SIGNATURE-----
iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZwY6mgAKCRBGdaER5Qtf
poE8AYCZzMJr9wMrs2RsWRnaEhMRJNZIPQmSKXgHAK3mV5AbXtdHRc8yGVNHf+mW
Nh0fwAkBf1Ix0VJWkXOSFHZI9O2lLRsCogbNjFhwYF0MHZch2/mq1Wa4Tj1SDlfg
Ny2PJBNHyA==
=OkRo
-----END PGP SIGNATURE-----
Merge tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- A few small typo fixes
- fstests xfs/538 DEBUG-only fix
- Performance fix on blockgc on COW'ed files, by skipping trims on
cowblock inodes currently opened for write
- Prevent cowblocks to be freed under dirty pagecache during unshare
- Update MAINTAINERS file to quote the new maintainer
* tag 'xfs-6.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: fix a typo
xfs: don't free cowblocks from under dirty pagecache on unshare
xfs: skip background cowblock trims on inodes open for write
xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_alloc
xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btalloc
xfs: don't ifdef around the exact minlen allocations
xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocate
xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addname
xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_split
xfs: return bool from xfs_attr3_leaf_add
xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addname
xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()
xfs: scrub: convert comma to semicolon
xfs: Remove empty declartion in header file
MAINTAINERS: add Carlos Maiolino as XFS release manager
While we do currently return -EFAULT in this case, it seems prudent to
follow the behaviour of other syscalls like clone3. It seems quite
unlikely that anyone depends on this error code being EFAULT, but we can
always revert this if it turns out to be an issue.
Cc: stable@vger.kernel.org # v5.6+
Fixes: fddb5d430a ("open: introduce openat2(2) syscall")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20241010-extensible-structs-check_fields-v3-3-d2833dfe6edd@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
There is racy issue between smb2 session log off and smb2 session setup.
It will cause user-after-free from session log off.
This add session_lock when setting SMB2_SESSION_EXPIRED and referece
count to session struct not to free session while it is being used.
Cc: stable@vger.kernel.org # v5.15+
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25282
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
are MM.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZwcILgAKCRDdBJ7gKXxA
jjnMAQDRl+UfscRUeMippi7wnL3ee6MKyhhZVOhoxP24uB7yBwD/Ulq4oE+mLHml
YTlK/wj5qTZIsdxGaBzM1yifqp3L7gU=
=lFmJ
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"12 hotfixes, 5 of which are c:stable. All singletons, about half of
which are MM"
* tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: zswap: delete comments for "value" member of 'struct zswap_entry'.
CREDITS: sort alphabetically by name
secretmem: disable memfd_secret() if arch cannot set direct map
.mailmap: update Fangrui's email
mm/huge_memory: check pmd_special() only after pmd_present()
resource, kunit: fix user-after-free in resource_test_region_intersects()
fs/proc/kcore.c: allow translation of physical memory addresses
selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test
device-dax: correct pgoff align in dax_set_mapping()
kthread: unpark only parked kthread
Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
bcachefs: do not use PF_MEMALLOC_NORECLAIM
Like how we already do when the allocator seems to be stuck, check if
we're waiting too long for a journal reservation and print some debug
info.
This is specifically to track down
https://github.com/koverstreet/bcachefs/issues/656
which is showing up in userspace where we don't have sysfs/debugfs to
get the journal debug info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a bounds check to the bch2_opt_to_text function to prevent
NULL pointer dereferences when accessing the opt->choices array. This
ensures that the index used is within valid bounds before dereferencing.
The new version enhances the readability.
Reported-and-tested-by: syzbot+37186860aa7812b331d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=37186860aa7812b331d5
Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We will get this if we wake up first:
Kernel panic - not syncing: btree_node_write_done leaked btree_trans
since there are still transactions waiting for cycle detectors after
BTREE_NODE_write_in_flight is cleared.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Fix failure to validate that accounting replicas entries point to
valid devices: this wasn't a real bug since they'd be cleaned up by
GC, but is still something we should know about
- Fix failure to validate that dev_data_type entries point to valid
devices: this does fix a real bug, since bch2_accounting_read() would
then try to copy the counters to that device and pop an inconsistent
error when the device didn't exist
- Remove accounting entries that are zeroed or invalid: if we're not
validating them we need to get rid of them: they might not exist in
the superblock, so we need the to trigger the superblock mark path
when they're readded.
This fixes the replication.ktest rereplicate test, which was failing
with "superblock not marked for replicas..."
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck can now correctly check if inodes in interior snapshot nodes are
open/in use.
- Tweak the vfs inode rhashtable so that the subvolume ID isn't hashed,
meaning inums in different subvolumes will hash to the same slot. Note
that this is a hack, and will cause problems if anyone ever has the
same file in many different snapshots open all at the same time.
- Then check if any of those subvolumes is a descendent of the snapshot
ID being checked
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's an inherent race in taking a snapshot while an unlinked file is
open, and then reattaching it in the child snapshot.
In the interior snapshot node the file will appear unlinked, as though
it should be deleted - it's not referenced by anything in that snapshot
- but we can't delete it, because the file data is referenced by the
child snapshot.
This was being handled incorrectly with
propagate_key_to_snapshot_leaves() - but that doesn't resolve the
fundamental inconsistency of "this file looks like it should be deleted
according to normal rules, but - ".
To fix this, we need to fix the rule for when an inode is deleted. The
previous rule, ignoring snapshots (there was no well-defined rule
for with snapshots) was:
Unlinked, non open files are deleted, either at recovery time or
during online fsck
The new rule is:
Unlinked, non open files, that do not exist in child snapshots, are
deleted.
To make this work transactionally, we add a new inode flag,
BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when
considering whether to delete an inode, or put it on the deleted list.
For transactional consistency, clearing it handled by the inode trigger:
when deleting an inode we check if there are parent inodes which can now
have the BCH_INODE_has_child_snapshot flag cleared.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When /proc/kcore is read an attempt to read the first two pages results in
HW-specific page swap on s390 and another (so called prefix) pages are
accessed instead. That leads to a wrong read.
Allow architecture-specific translation of memory addresses using
kc_xlate_dev_mem_ptr() and kc_unxlate_dev_mem_ptr() callbacks similarily
to /dev/mem xlate_dev_mem_ptr() and unxlate_dev_mem_ptr() callbacks. That
way an architecture can deal with specific physical memory ranges.
Re-use the existing /dev/mem callback implementation on s390, which
handles the described prefix pages swapping correctly.
For other architectures the default callback is basically NOP. It is
expected the condition (vaddr == __va(__pa(vaddr))) always holds true for
KCORE_RAM memory type.
Link: https://lkml.kernel.org/r/20240930122119.1651546-1-agordeev@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "remove PF_MEMALLOC_NORECLAIM" v3.
This patch (of 2):
bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to allocate a new
inode to achieve GFP_NOWAIT semantic while holding locks. If this
allocation fails it will drop locks and use GFP_NOFS allocation context.
We would like to drop PF_MEMALLOC_NORECLAIM because it is really
dangerous to use if the caller doesn't control the full call chain with
this flag set. E.g. if any of the function down the chain needed
GFP_NOFAIL request the PF_MEMALLOC_NORECLAIM would override this and
cause unexpected failure.
While this is not the case in this particular case using the scoped gfp
semantic is not really needed bacause we can easily pus the allocation
context down the chain without too much clutter.
[akpm@linux-foundation.org: fix kerneldoc warnings]
Link: https://lkml.kernel.org/r/20240926172940.167084-1-mhocko@kernel.org
Link: https://lkml.kernel.org/r/20240926172940.167084-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz> # For vfs changes
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: James Morris <jmorris@namei.org>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
After the delegation is returned to the NFS server remove it
from the server's delegations list to reduce the time it takes
to scan this list.
Network trace captured while running the below script shows the
time taken to service the CB_RECALL increases gradually due to
the overhead of traversing the delegation list in
nfs_delegation_find_inode_server.
The NFS server in this test is a Solaris server which issues
CB_RECALL when receiving the all-zero stateid in the SETATTR.
mount=/mnt/data
for i in $(seq 1 20)
do
echo $i
mkdir $mount/testtarfile$i
time tar -C $mount/testtarfile$i -xf 5000_files.tar
done
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
* Patch to handle code-points with the Ignorable property as regular
character instead of treating them as an empty string. (Me)
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQS3XO7QfvpFoONBhH1OwQgI3t8RJgUCZwbNtQAKCRBOwQgI3t8R
JrlrAP4yCrZCp4YPlXO6oQGfS9RIeYpmcMzGmp1IAeqlzpB5qwD/YS53kiAzF4qV
+eD2fl/O4qNhZcWqBZKSH4shZBbXJAg=
=XCsY
-----END PGP SIGNATURE-----
Merge tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode
Pull unicode fix from Gabriel Krisman Bertazi:
- Handle code-points with the Ignorable property as regular character
instead of treating them as an empty string (me)
* tag 'unicode-fixes-6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode:
unicode: Don't special case ignorable code points
We don't need to handle them separately. Instead, just let them
decompose/casefold to themselves.
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
This commit is a replay of commit 6252690f7e ("btrfs: fix invalid
mapping of extent xarray state"). We need to call
btrfs_folio_clear_dirty() before btrfs_set_range_writeback(), so that
xarray DIRTY tag is cleared.
With a refactoring commit 8189197425 ("btrfs: refactor
__extent_writepage_io() to do sector-by-sector submission"), it screwed
up and the order is reversed and causing the same hang. Fix the ordering
now in submit_one_sector().
Fixes: 8189197425 ("btrfs: refactor __extent_writepage_io() to do sector-by-sector submission")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_load_zone_info() we have an error path that is dereferencing
the name of a device which is a RCU string but we are not holding a RCU
read lock, which is incorrect.
Fix this by using btrfs_err_in_rcu() instead of btrfs_err().
The problem is there since commit 08e11a3db0 ("btrfs: zoned: load zone's
allocation offset"), back then at btrfs_load_block_group_zone_info() but
then later on that code was factored out into the helper
btrfs_load_zone_info() by commit 09a46725cc ("btrfs: zoned: factor out
per-zone logic from btrfs_load_block_group_zone_info").
Fixes: 08e11a3db0 ("btrfs: zoned: load zone's allocation offset")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix a typo in comments.
Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
fallocate unshare mode explicitly breaks extent sharing. When a
command completes, it checks the data fork for any remaining shared
extents to determine whether the reflink inode flag and COW fork
preallocation can be removed. This logic doesn't consider in-core
pagecache and I/O state, however, which means we can unsafely remove
COW fork blocks that are still needed under certain conditions.
For example, consider the following command sequence:
xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \
-c "pwrite 0 32k" -c "funshare 0 1k" <file>
This allocates a data block at offset 0, shares it, and then
overwrites it with a larger buffered write. The overwrite triggers
COW fork preallocation, 32 blocks by default, which maps the entire
32k write to delalloc in the COW fork. All but the shared block at
offset 0 remains hole mapped in the data fork. The unshare command
redirties and flushes the folio at offset 0, removing the only
shared extent from the inode. Since the inode no longer maps shared
extents, unshare purges the COW fork before the remaining 28k may
have written back.
This leaves dirty pagecache backed by holes, which writeback quietly
skips, thus leaving clean, non-zeroed pagecache over holes in the
file. To verify, fiemap shows holes in the first 32k of the file and
reads return different data across a remount:
$ xfs_io -c "fiemap -v" <file>
<file>:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
...
1: [8..511]: hole 504
...
$ xfs_io -c "pread -v 4k 8" <file>
00001000: cd cd cd cd cd cd cd cd ........
$ umount <mnt>; mount <dev> <mnt>
$ xfs_io -c "pread -v 4k 8" <file>
00001000: 00 00 00 00 00 00 00 00 ........
To avoid this problem, make unshare follow the same rules used for
background cowblock scanning and never purge the COW fork for inodes
with dirty pagecache or in-flight I/O.
Fixes: 46afb0628b ("xfs: only flush the unshared range in xfs_reflink_unshare")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
New:
implement fallocate for compressed files;
add support for the compression attribute;
optimize large writes to sparse files.
Fixed:
fix several potential deadlock scenarios;
fix various internal bugs detected by syzbot;
add checks before accessing NTFS structures during parsing;
correct the format of output messages.
Refactored:
replace fsparam_flag_no with fsparam_flag in options parser;
remove unused functions and macros.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEh0DEKNP0I9IjwfWEqbAzH4MkB7YFAmcFRkgACgkQqbAzH4Mk
B7aQdQ//VT6lzBaxSZ0dm3xs/ygQYE3ter1OczAMQxh70O5px+Z0li5WCXQQe31W
I74MmAy2G7C+JdzMkSc4F7+pFF21DO+1WQjODl9rP7yoHiLteiLFkMI+P5tHuFFT
VPupAiCDQgm6QCtW2wofp/WR0ooxpLzC/yQsILrhUXJHdEY21oHLy/MQSOhVnv2k
VeFdWA70+YS+JEQ+YyZBSeGeMVMQG9VROA57dZ+eAy1vgBUJGcKhYp9v4eiWG1uT
A8rnV81FiHu+49OrE+ZwStguHkxu30dvlHuOqKW0pLW9/XoKCFf9w0WqLOGrAxLs
lKznVgQZ521+QW+lNURMHBZt/bqQxQQInuOOH7jwLyBh0tmKoZlIU9Vw4V3yrm3t
9NELb0FR/FeH02O89nMHl0SeaxDJvgFHDnfoOShuHaffur61d4RjChxqB4QcibOl
Ibvtns71Uo89ngPIXdeTJK+7XZQVmAZHBXXYL+Rwr84Yv0tyn2q8DbrqzmphiYlR
wbLyVBGqtx0ZcN/geuFLiVt2W/HSMPocXqrxv791ALX6cryL+CZdolQP7vWjt4Vh
rKoaRZYN8fdcfT+BIYy95Kz0k12T7ColOoMfxVwugxK9ypF6tH6BJBwYXjCW4MOv
AhkszJcsISxMKidsbzZA9xESM6M3sLHcW0swjAwWWRhc5zbSe6E=
=j71V
-----END PGP SIGNATURE-----
Merge tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 updates from Konstantin Komarov:
"New:
- implement fallocate for compressed files
- add support for the compression attribute
- optimize large writes to sparse files
Fixes:
- fix several potential deadlock scenarios
- fix various internal bugs detected by syzbot
- add checks before accessing NTFS structures during parsing
- correct the format of output messages
Refactoring:
- replace fsparam_flag_no with fsparam_flag in options parser
- remove unused functions and macros"
* tag 'ntfs3_for_6.12' of https://github.com/Paragon-Software-Group/linux-ntfs3: (25 commits)
fs/ntfs3: Format output messages like others fs in kernel
fs/ntfs3: Additional check in ntfs_file_release
fs/ntfs3: Fix general protection fault in run_is_mapped_full
fs/ntfs3: Sequential field availability check in mi_enum_attr()
fs/ntfs3: Additional check in ni_clear()
fs/ntfs3: Fix possible deadlock in mi_read
ntfs3: Change to non-blocking allocation in ntfs_d_hash
fs/ntfs3: Remove unused al_delete_le
fs/ntfs3: Rename ntfs3_setattr into ntfs_setattr
fs/ntfs3: Replace fsparam_flag_no -> fsparam_flag
fs/ntfs3: Add support for the compression attribute
fs/ntfs3: Implement fallocate for compressed files
fs/ntfs3: Make checks in run_unpack more clear
fs/ntfs3: Add rough attr alloc_size check
fs/ntfs3: Stale inode instead of bad
fs/ntfs3: Refactor enum_rstbl to suppress static checker
fs/ntfs3: Fix sparse warning in ni_fiemap
fs/ntfs3: Fix warning possible deadlock in ntfs_set_state
fs/ntfs3: Fix sparse warning for bigendian
fs/ntfs3: Separete common code for file_read/write iter/splice
...
When adding a delayed ref head, at delayed-ref.c:add_delayed_ref_head(),
if we fail to insert the qgroup record we don't error out, we ignore it.
In fact we treat it as if there was no error and there was already an
existing record - we don't distinguish between the cases where
btrfs_qgroup_trace_extent_nolock() returns 1, meaning a record already
existed and we can free the given record, and the case where it returns
a negative error value, meaning the insertion into the xarray that is
used to track records failed.
Effectively we end up ignoring that we are lacking qgroup record in the
dirty extents xarray, resulting in incorrect qgroup accounting.
Fix this by checking for errors and return them to the callers.
Fixes: 3cce39a8ca ("btrfs: qgroup: use xarray to track dirty extents in transaction")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are reports that system cannot suspend due to running trim because
the task responsible for trimming the device isn't able to finish in
time, especially since we have a free extent discarding phase, which can
trim a lot of unallocated space. There are no limits on the trim size
(unlike the block group part).
Since trime isn't a critical call it can be interrupted at any time,
in such cases we stop the trim, report the amount of discarded bytes and
return an error.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Per Qu Wenruo in case we have a very large disk, e.g. 8TiB device,
mostly empty although we will do the split according to our super block
locations, the last super block ends at 256G, we can submit a huge
discard for the range [256G, 8T), causing a large delay.
Split the space left to discard based on BTRFS_MAX_DISCARD_CHUNK_SIZE in
preparation of introduction of cancellation points to trim. The value
of the chunk size is arbitrary, it can be higher or derived from actual
device capabilities but we can't easily read that using
bio_discard_limit().
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219180
Link: https://bugzilla.suse.com/show_bug.cgi?id=1229737
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Luca Stefani <luca.stefani.ge1@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code that copies data from srcmap to iomap in dax_unshare_iter is
very very broken, which bfoster's recent fsx changes have exposed.
If the pos and len passed to dax_file_unshare are not aligned to an
fsblock boundary, the iter pos and length in the _iter function will
reflect this unalignment.
dax_iomap_direct_access always returns a pointer to the start of the
kmapped fsdax page, even if its pos argument is in the middle of that
page. This is catastrophic for data integrity when iter->pos is not
aligned to a page, because daddr/saddr do not point to the same byte in
the file as iter->pos. Hence we corrupt user data by copying it to the
wrong place.
If iter->pos + iomap_length() in the _iter function not aligned to a
page, then we fail to copy a full block, and only partially populate the
destination block. This is catastrophic for data confidentiality
because we expose stale pmem contents.
Fix both of these issues by aligning copy_pos/copy_len to a page
boundary (remember, this is fsdax so 1 fsblock == 1 base page) so that
we always copy full blocks.
We're not done yet -- there's no call to invalidate_inode_pages2_range,
so programs that have the file range mmap'd will continue accessing the
old memory mapping after the file metadata updates have completed.
Be careful with the return value -- if the unshare succeeds, we still
need to return the number of bytes that the iomap iter thinks we're
operating on.
Cc: ruansy.fnst@fujitsu.com
Fixes: d984648e42 ("fsdax,xfs: port unshare to fsdax")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813328.1131942.16777025316348797355.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Remove the code in dax_unshare_iter that zeroes the destination memory
because it's not necessary.
If srcmap is unwritten, we don't have to do anything because that
unwritten extent came from the regular file mapping, and unwritten
extents cannot be shared. The same applies to holes.
Furthermore, zeroing to unshare a mapping is just plain wrong because
unsharing means copy on write, and we should be copying data.
This is effectively a revert of commit 13dd4e0462 ("fsdax: unshare:
zero destination if srcmap is HOLE or UNWRITTEN")
Cc: ruansy.fnst@fujitsu.com
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813311.1131942.16033376284752798632.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The predicate code that iomap_unshare_iter uses to decide if it's really
needs to unshare a file range mapping should be shared with the fsdax
version, because right now they're opencoded and inconsistent.
Note that we simplify the predicate logic a bit -- we no longer allow
unsharing of inline data mappings, but there aren't any filesystems that
allow shared inline data currently.
This is a fix in the sense that it should have been ported to fsdax.
Fixes: b53fdb215d ("iomap: improve shared block detection in iomap_unshare_iter")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813294.1131942.15762084021076932620.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
It doesn't make sense to allocate a COW extent when unsharing a hole
because holes cannot be shared.
Fixes: 1f1397b721 ("xfs: don't allocate into the data fork for an unshare request")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/172796813277.1131942.5486112889531210260.stgit@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
netfslib currently defers dropping the ref on the folios it obtains during
readahead to after it has started I/O on the basis that we can do it whilst
we wait for the I/O to complete, but this runs the risk of the I/O
collection racing with this in future.
Furthermore, Matthew Wilcox strongly suggests that the refs should be
dropped immediately, as readahead_folio() does (netfslib is using
__readahead_batch() which doesn't drop the refs).
Fixes: ee4cdf7ba8 ("netfs: Speed up buffered reading")
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3771538.1728052438@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The background blockgc scanner runs on a 5m interval by default and
trims preallocation (post-eof and cow fork) from inodes that are
otherwise idle. Idle effectively means that iolock can be acquired
without blocking and that the inode has no dirty pagecache or I/O in
flight.
This simple mechanism and heuristic has worked fairly well for
post-eof speculative preallocations. Support for reflink and COW
fork preallocations came sometime later and plugged into the same
mechanism, with similar heuristics. Some recent testing has shown
that COW fork preallocation may be notably more sensitive to blockgc
processing than post-eof preallocation, however.
For example, consider an 8GB reflinked file with a COW extent size
hint of 1MB. A worst case fully randomized overwrite of this file
results in ~8k extents of an average size of ~1MB. If the same
workload is interrupted a couple times for blockgc processing
(assuming the file goes idle), the resulting extent count explodes
to over 100k extents with an average size <100kB. This is
significantly worse than ideal and essentially defeats the COW
extent size hint mechanism.
While this particular test is instrumented, it reflects a fairly
reasonable pattern in practice where random I/Os might spread out
over a large period of time with varying periods of (in)activity.
For example, consider a cloned disk image file for a VM or container
with long uptime and variable and bursty usage. A background blockgc
scan that races and processes the image file when it happens to be
clean and idle can have a significant effect on the future
fragmentation level of the file, even when still in use.
To help combat this, update the heuristic to skip cowblocks inodes
that are currently opened for write access during non-sync blockgc
scans. This allows COW fork preallocations to persist for as long as
possible unless otherwise needed for functional purposes (i.e. a
sync scan), the file is idle and closed, or the inode is being
evicted from cache. While here, update the comments to help
distinguish performance oriented heuristics from the logic that
exists to maintain functional correctness.
Suggested-by: Darrick Wong <djwong@kernel.org>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation
variant fails to drop into the lowmode last resort allocator, and
thus can sometimes fail allocations for which the caller has a
transaction block reservation.
Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_bmap_exact_minlen_extent_alloc duplicates the args setup in
xfs_bmap_btalloc. Switch to call it from xfs_bmap_btalloc after
doing the basic setup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Exact minlen allocations only exist as an error injection tool for debug
builds. Currently this is implemented using ifdefs, which means the code
isn't even compiled for non-XFS_DEBUG builds. Enhance the compile test
coverage by always building the code and use the compilers' dead code
elimination to remove it from the generated binary instead.
The only downside is that the alloc_minlen_only field is unconditionally
added to struct xfs_alloc_args now, but by moving it around and packing
it tightly this doesn't actually increase the size of the structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Userdata and metadata allocations end up in the same allocation helpers.
Remove the separate xfs_bmap_alloc_userdata function to make this more
clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return
-ENOSPC both for an actual failure to allocate a disk block, but also
to signal the caller to convert the format of the attr fork. Use magic
1 to ask for the conversion here as well.
Note that unlike the similar issue in xfs_attr3_leaf_split, this one was
only found by code review.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_attr3_leaf_split propagates the need for an extra btree split as
-ENOSPC to it's only caller, but the same return value can also be
returned from xfs_da_grow_inode when it fails to find free space.
Distinguish the two cases by returning 1 for the extra split case instead
of overloading -ENOSPC.
This can be triggered relatively easily with the pending realtime group
support and a file system with a lot of small zones that use metadata
space on the main device. In this case every about 5-10th run of
xfs/538 runs into the following assert:
ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC);
in xfs_attr3_leaf_split caused by an allocation failure. Note that
the allocation failure is caused by another bug that will be fixed
subsequently, but this commit at least sorts out the error handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_attr3_leaf_add only has two potential return values, indicating if the
entry could be added or not. Replace the errno return with a bool so that
ENOSPC from it can't easily be confused with a real ENOSPC.
Remove the return value from the xfs_attr3_leaf_add_work helper entirely,
as it always return 0.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
xfs_attr_leaf_try_add is only called by xfs_attr_leaf_addname, and
merging the two will simplify a following error handling fix.
To facilitate this move the remote block state save/restore helpers up in
the file so that they don't need forward declarations now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Use !try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in
xlog_cil_insert_pcp_aggregate(). x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg.
Also, try_cmpxchg implicitly assigns old *ptr value to "old" when
cmpxchg fails. There is no need to re-read the value in the loop.
Note that the value from *ptr should be read using READ_ONCE to
prevent the compiler from merging, refetching or reordering the read.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
Cc: Chandan Babu R <chandan.babu@oracle.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Replace a comma between expression statements by a semicolon.
Signed-off-by: Yan Zhen <yanzhen@vivo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
The definition of xfs_attr_use_log_assist() has been removed since
commit d9c61ccb3b ("xfs: move xfs_attr_use_log_assist out of xfs_log.c").
So, Remove the empty declartion in header files.
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
Calling 'ln -s . symlink' or 'ln -s .. symlink' creates symlink pointing to
some object name which ends with U+F029 unicode codepoint. This is because
trailing dot in the object name is replaced by non-ASCII unicode codepoint.
So Linux SMB client currently is not able to create native symlink pointing
to current or parent directory on Windows SMB server which can be read by
either on local Windows server or by any other SMB client which does not
implement compatible-reverse character replacement.
Fix this problem in cifsConvertToUTF16() function which is doing that
character replacement. Function comment already says that it does not need
to handle special cases '.' and '..', but after introduction of native
symlinks in reparse point form, this handling is needed.
Note that this change depends on the previous change
"cifs: Improve creating native symlinks pointing to directory".
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
SMB protocol for native symlinks distinguish between symlink to directory
and symlink to file. These two symlink types cannot be exchanged, which
means that symlink of file type pointing to directory cannot be resolved at
all (and vice-versa).
Windows follows this rule for local filesystems (NTFS) and also for SMB.
Linux SMB client currenly creates all native symlinks of file type. Which
means that Windows (and some other SMB clients) cannot resolve symlinks
pointing to directory created by Linux SMB client.
As Linux system does not distinguish between directory and file symlinks,
its API does not provide enough information for Linux SMB client during
creating of native symlinks.
Add some heuristic into the Linux SMB client for choosing the correct
symlink type during symlink creation. Check if the symlink target location
ends with slash, or last path component is dot or dot-dot, and check if the
target location on SMB share exists and is a directory. If at least one
condition is truth then create a new SMB symlink of directory type.
Otherwise create it as file type symlink.
This change improves interoperability with Windows systems. Windows systems
would be able to resolve more SMB symlinks created by Linux SMB client
which points to existing directory.
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
BCH_INODE_i_size_dirty dates from before we had logged operations for
truncate (as well as finsert) - it hasn't been needed since before
bcachefs was mainlined.
BCH_INODE_i_sectors_dirty hasn't been needed since we started always
updating i_sectors transactionally - it's been unused for even longer.
BCH_INODE_backptr_untrusted also hasn't been used since prior to
mainlining; when unlinking a hardling, we zero out the backpointer
fields if they're for the dirent being removed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we find an unreachable inode, we now reattach it in the oldest
version that needs to be reattached (thus avoiding redundant work
reattaching every single version), and we now fix up inode -> dirent
backpointers in newer versions as needed - or white out the reattaching
dirent in newer versions, if the newer version isn't supposed to be
reattached.
This results in the second verify fsck now passing cleanly after
repairing on a user-provided filesystem image with thousands of
different snapshots.
Reported-by: Christopher Snowhill <chris@kode54.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With inode backpointers, we can write a very simple
check_unreachable_inodes() pass that only looks for non-unlinked inodes
that are missing backpointers, and reattaches them.
This simplifies check_directory_structure() so that it's now only
checking for directory structure loops,
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
A lot of little fixes, bigger ones include:
- bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
changes, now fixed along with another lost wakeup
- fragmentation LRU fixes; fsck now repairs successfully (this is the
data structure copygc uses); along with some nice simplification.
- Rework logged op error handling, so that if logged op replay errors
(due to another filesystem error) we delete the logged op instead of
going into an infinite loop)
- Various small filesystem connectivitity repair fixes
The final part of this patch series, fixing snapshots + unlinked file
handling, is now out on the list - I'm giving that part of the series
more time for user testing.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcBhkIACgkQE6szbY3K
bnYt8RAAqZo6RcN91sgz6xGsJkUvE6DS4Rtj1J4vlVAmuiIa5NUhRhqFnS6j8V9A
AWZw63JwTizrglbLk4Z4knfiViT4GOeiKX4sttaJk7cLW7bxwCUddlho1G5Q7q0I
PFurYevqG1ltcl5oZpD6LhZiqEhndQI3XnkpEvKsmoXy9TSB4KEqaU8Y+cewjq4q
KCFuxTBhbmatxP9eTGuDhd6uWw5h0EVDGQyMitEcSutIaernGlSsBQ8gZ5n9dWSd
lP91qFT5iypmCMo9Arf8Fq1YBvOpV6P91eq8YPa4A3sKDfzHn3CCzsSyjUiGK0RM
Wcl+kNwqYJa7Fwtb7aGgTVhaMkqLzPTI+XYye3FXrXjJ6B0JKpl2QvvDoFhDxop9
ZPb57QyRgRBtOvofvFz8fWQOr67n+HNvaMbeG1iwGvqm6/MrgdSLsN6OaRh80uAE
5P0qX7rwTTOfJj5T6dKLxr3KuXKXNrM5AAIG0MjOMsha232+XUAZvofYNmqx7BMi
juJvqZc9/GXrcXqdPTYDyBs4UXDkwHsKdr744ooZ64VNiIYFs6eTvXp7V0XuajYH
ExLrEEjhO2UGPM5N9R9jw9AMsEhJstexgylHQsiiADtdi+jY4LKa/NZAJSJQQC+C
QQyE3Q7ZCpzRPiGPkkpIY/D7IRoIHL2H+LhbXV/K3oMGdbA7hS4=
=XnG4
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"A lot of little fixes, bigger ones include:
- bcachefs's __wait_on_freeing_inode() was broken in rc1 due to vfs
changes, now fixed along with another lost wakeup
- fragmentation LRU fixes; fsck now repairs successfully (this is the
data structure copygc uses); along with some nice simplification.
- Rework logged op error handling, so that if logged op replay errors
(due to another filesystem error) we delete the logged op instead
of going into an infinite loop)
- Various small filesystem connectivitity repair fixes"
* tag 'bcachefs-2024-10-05' of git://evilpiepirate.org/bcachefs:
bcachefs: Rework logged op error handling
bcachefs: Add warn param to subvol_get_snapshot, peek_inode
bcachefs: Kill snapshot arg to fsck_write_inode()
bcachefs: Check for unlinked, non-empty dirs in check_inode()
bcachefs: Check for unlinked inodes with dirents
bcachefs: Check for directories with no backpointers
bcachefs: Kill alloc_v4.fragmentation_lru
bcachefs: minor lru fsck fixes
bcachefs: Mark more errors AUTOFIX
bcachefs: Make sure we print error that causes fsck to bail out
bcachefs: bkey errors are only AUTOFIX during read
bcachefs: Create lost+found in correct snapshot
bcachefs: Fix reattach_inode()
bcachefs: Add missing wakeup to bch2_inode_hash_remove()
bcachefs: Fix trans_commit disk accounting revert
bcachefs: Fix bch2_inode_is_open() check
bcachefs: Fix return type of dirent_points_to_inode_nowarn()
bcachefs: Fix bad shift in bch2_read_flag_list()
When multiple FREE_STATEIDs are sent for the same delegation stateid,
it can lead to a possible either use-after-free or counter refcount
underflow errors.
In nfsd4_free_stateid() under the client lock we find a delegation
stateid, however the code drops the lock before calling nfs4_put_stid(),
that allows another FREE_STATE to find the stateid again. The first one
will proceed to then free the stateid which leads to either
use-after-free or decrementing already zeroed counter.
Fixes: 3f29cc82a8 ("nfsd: split sc_status out of sc_type")
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
fast commits.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmcAYaIACgkQ8vlZVpUN
gaO8fQf+IOHRpyqVBXd2WwDJ05vBJAbbaKdhaI1r6gFPF1AzyIZ+lpCnEV9f9B2u
IrDZaigb/CiJxO9mLRLrp0z/O9s6Z/+NjNdVFkoynJo2Crfr5mi07DXR9EQ7XkvT
WYcT0IXg7xVswOzVosgBOUA76yMLJwtuTPp7YsAmCmp5EksMk1LSk1VzUEJcKRgJ
BHSWSn2BqHxwMStB5L0eBnYtdaW8l1RLeWlTPdcPeCrD7UrYCXThvxfyz2KdztBz
YfjcQEZhY6OD/uKdNFE6oQc54kL8RpqGDpF3YGMBixiZ8h99PpGUUz3bB5VLTfco
WlxQwOD9dofPQ5Yh+s5icY8q67XXow==
=Cnp+
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Fix some ext4 bugs and regressions relating to oneline resize and fast
commits"
* tag 'ext4_for_linus-5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix off by one issue in alloc_flex_gd()
ext4: mark fc as ineligible using an handle in ext4_xattr_set()
ext4: use handle to mark fc as ineligible in __track_dentry_update()
Initially it was thought that we just wanted to ignore errors from
logged op replay, but it turns out we do need to catch -EROFS, or we'll
go into an infinite loop.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These shouldn't always be fatal errors - logged op resume, in
particular, and we want it as a parameter there.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It was initially believed that it would be better to be explicit about
the snapshot we're updating when writing inodes in fsck; however, it
turns out that passing around the snapshot separately is more error
prone and we're usually updating the inode in the same snapshow we read
it from.
This is different from normal filesystem paths, where we do the update
in the snapshot of the subvolume we're in.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We want to check for this early so it can be reattached if necessary in
check_unreachable_inodes(); better than letting it be deleted and having
the children reattached, losing their filenames.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
link count works differently in bcachefs - it's only nonzero for files
with multiple hardlinks, which means we can also avoid checking it
except for files that are known to have hardlinks.
That means we need a few different checks instead; in particular, we
don't want fsck to delet a file that has a dirent pointing to it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It's legal for regular files to have missing backpointers (due to
hardlinks), and fsck should automatically add them, but for directories
this is an error that should be flagged.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The fragmentation_lru field hasn't been needed since we reworked the LRU
btrees to use the btree write buffer; previously it was used to resolve
collisions, but the revised LRU btree uses the backpointer (the bucket)
as part of the key.
It should have been deleted at the time of the LRU rework; since it
wasn't, that left places for bugs to hide, in check/repair.
This fixes LRU fsck on a filesystem image helpfully provided by a user
who disappeared before I could get his name for the reported-by.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_lru_key() wasn't using write buffer updates for deleting bad lru
entries - dating from before the lru btree used the btree write buffer.
And when possibly flushing the btree write buffer (to make sure we're
seeing a real inconsistency), we need to be using the modern
bch2_btree_write_buffer_maybe_flush().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Newly generated keys, in the transaction commit path or write path,
should not be AUTOFIX; those indicate bugs that we need to fail fast
for.
Fixes: 5612daafb7 ("bcachefs: Fix fsck warnings from bkey validation")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ensure a copy of the lost+found inode exists in the snapshot that we're
reattaching, so that we don't trigger warnings in
lookup_inode_for_snapshot() later.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes two different bugs:
- Looser locking with the rhashtable means we need to recheck if the
inode is still hashed after prepare_to_wait(), and add a corresponding
wakeup after removing from the hash table.
- da18ecbf0f ("fs: add i_state helpers") changed the bit waitqueues
used for inodes, and bcachefs wasn't updated and thus broke; this
updates bcachefs to the new helper.
Fixes: 112d21fd1a ("bcachefs: switch to rhashtable for vfs inodes hash")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch moves the call to this function so that an handle
can be used. If a transaction fails to start, then there's not point in
trying to mark the filesystem as ineligible, and an error will eventually be
returned to user-space.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240923104909.18342-3-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Calling ext4_fc_mark_ineligible() with a NULL handle is racy and may result
in a fast-commit being done before the filesystem is effectively marked as
ineligible. This patch fixes the calls to this function in
__track_dentry_update() by adding an extra parameter to the callback used in
ext4_fc_track_template().
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Luis Henriques (SUSE) <luis.henriques@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240923104909.18342-2-luis.henriques@linux.dev
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Otherwise nfsd_file_acquire, nfsd_file_insert_err, and
nfsd_file_cons_err will hit a NULL pointer when they are enabled and
LOCALIO used.
Example trace output (note xid is 0x0 and LOCALIO flag set):
nfsd_file_acquire: xid=0x0 inode=0000000069a1b2e7
may_flags=WRITE|LOCALIO ref=1 nf_flags=HASHED|GC nf_may=WRITE
nf_file=0000000070123234 status=0
Fixes: c63f0e48fe ("nfsd: add nfsd_file_acquire_local()")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmb//Q8ACgkQnJ2qBz9k
QNlBoQf8CwTW4/BTbt8Pq3RIKHYH8HFyFurwxrlnKXoB/4E4QNAfd/iDbwxqkILJ
mbKHGVyk96St83PH5H9b2MYnwQ6PP6WEHIe7RZjxOrfrrLTSwuuOwdNR9RZbRWPd
rLQ1/nbnLiTWPR29RLZVbhFAhDiw1hyag8YKPklEw+7SQ/wSxFUTJ8DTQ8TZYq8f
O8SoTTJACulAP1CfoNwPlhm472JBe5jJP8WygaYeKg5wky6RPQYn8J+JSUmHs+rZ
qNY+RuxLZ/HauXA+vBgae34G1GAdWc3sqUEU2TrGBuy/x28K/Di84+eEdQXA74sr
+8+ACt6RRlm/w+FrjvgxJ0JUV/OAMA==
=CoCV
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify fixes from Jan Kara:
"Fixes for an inotify deadlock and a data race in fsnotify"
* tag 'fsnotify_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
inotify: Fix possible deadlock in fsnotify_destroy_mark
fsnotify: Avoid data race between fsnotify_recalc_mask() and fsnotify_object_watched()
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmb//cYACgkQnJ2qBz9k
QNmrxAgApRJqSQ0Koilm4Mtz+6k7CAM6MR/+JKRQLoJwl27cA4syw9REBPPdcNJP
9Gbvlcenf8ISrtZStZehw/eyOuNvE9uBzjjmRcMc3+0esDfgFqfteDCanJ/MQDdZ
NX3OB8pSMxwTtcf9iXJUyD/hvFIq58H378r8eOKAwI9S673r2zDZ5qGmske3S6DM
gyMwfWepzPigKOYhwt1Frih4GmjOaD3637bSoGBfRdhT1PvAH9bJaKCFtwucNP7Y
MG+on/JLrpPgegzbWQe2u4WTzQihkRDnEfhC+52Erdhb+Z4rv3IJUarVO2iIKzwF
3CS7q3NyArbYN3QbvOiIKR+rORNnZg==
=gbY2
-----END PGP SIGNATURE-----
Merge tag 'fs_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull UDF fixes from Jan Kara:
"A couple of UDF error handling fixes for issues spotted by syzbot"
* tag 'fs_for_v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: fix uninit-value use in udf_get_fileshortad
udf: refactor inode_bmap() to handle error
udf: refactor udf_next_aext() to handle error
udf: refactor udf_current_aext() to handle error
a regression in cap handling which sneaked in through the netfs helper
library in 5.18 (marked for stable) and an unrelated one-line cleanup.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmcACf8THGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi3DiCACDP9ZubGLZ2GKv3fEyXEURnOyrSBAd
M4Ci3cc31W0Q1rWtQndGxW9QF/9smcp4kN/xgelWfSZc6FJ1HLR1RHUPjMJpBPx2
dQs+RaaBTIe0coVaMmiDnBQ1wnfuNhEZH8opnLonjHygANnoR2aYZpyIxpgYD7eZ
KHfk1F/2rmE+QzYGB+tE7sCpRs3bhbtL2xa/2yQJyw+EWfgyJUjPW1cZe0liiWHi
dXTMxJqfZLLUW2otLdGnaaZS+7yF3ItcoAWh6cUoiZ9cpqDu43B0RGXOV7Kegui1
2e8W0sK3QIIYbUKoHo/4VPqrmN4puSjeNteumDvpZE4SgG8BfKWzQnmh
=3OKU
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A fix from Patrick for a variety of CephFS lockup scenarios caused by
a regression in cap handling which sneaked in through the netfs helper
library in 5.18 (marked for stable) and an unrelated one-line cleanup"
* tag 'ceph-for-6.12-rc2' of https://github.com/ceph/ceph-client:
ceph: fix cap ref leak via netfs init_request
ceph: use struct_size() helper in __ceph_pool_perm_get()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmb/9agACgkQxWXV+ddt
WDtKFA/5AW88osk/1k/NVhOvOa0xPr5XyDLq1n8Gaxfy8uHlHAc8wdsvJzCDMS0M
qUOD/tOPhRI0HGXPiKD767erwbyXiZAcCTkSd8x5jlXy1hVjUHQSKO//JxD0vtAZ
jOscoUA1wJJutopCXcppnoUUFE2753edEg0w2EtUXMpfqivqOmMCR+1ZtKkfaNJo
oRuZCq3Oi8hu7Wsvmh4Etq/9MvGM+xovXAMAji6Op8nsP1jJlzWztpEUogLOQH2S
IhDFFxP9shBV9JjV+HSyXcAYr8VArH6HtYjapR9oajCH0pvSjLYQQPq/qQ0//8Hb
SHr4YP6RBbpEIvvaQeA7vwsckBHNBOrSAbEgRQ2+zdmiwha6SRIEVyXYh5LUwcYT
WnVozbk7ZX9rD1jOVhvgouFG6vUz8A6/qt3BD028bVcyMvBXW4gsEduCMVlFGmlN
D6+hNY6J08j4HUEGnPk7fAYi/lk5OEK1p5yarUgsOQ3GqWWS0ywkrfbmbWwyC+Ff
AxggFTl9YodU5RMs7EU2GeHkLU6LnXgevk6FFm0JzsLtT/BbEP7pj6tOot4Msl1e
2ovqFiSbuPNg5Wr70ZBRO9LDIAYtTZy1UrVR/YCSLzm+wZsXMelDIHMQfE7+Yodp
O5Cud23AanRTwjErvNl3X4rhut8rrI1FsR89gDyKK1EPG+mwofo=
=yslr
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- in incremental send, fix invalid clone operation for file that got
its size decreased
- fix __counted_by() annotation of send path cache entries, we do not
store the terminating NUL
- fix a longstanding bug in relocation (and quite hard to hit by
chance), drop back reference cache that can get out of sync after
transaction commit
- wait for fixup worker kthread before finishing umount
- add missing raid-stripe-tree extent for NOCOW files, zoned mode
cannot have NOCOW files but RST is meant to be a standalone feature
- handle transaction start error during relocation, avoid potential
NULL pointer dereference of relocation control structure (reported by
syzbot)
- disable module-wide rate limiting of debug level messages
- minor fix to tracepoint definition (reported by checkpatch.pl)
* tag 'for-6.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: disable rate limiting when debug enabled
btrfs: wait for fixup workers before stopping cleaner kthread during umount
btrfs: fix a NULL pointer dereference when failed to start a new trasacntion
btrfs: send: fix invalid clone operation for file that got its size decreased
btrfs: tracepoints: end assignment with semicolon at btrfs_qgroup_extent event class
btrfs: drop the backref cache during relocation if we commit
btrfs: also add stripe entries for NOCOW writes
btrfs: send: fix buffer overflow detection when copying path to cache entry
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmb+z50ACgkQiiy9cAdy
T1FS9Av+N4943ciID42gNZEL/33t+NuNymFHDeC1H4txMNEi6MZXb+H4HFQMCF22
ZWdb1IIUJ7dUjfX68hD+sIs7o+QsCUIGriyNLZvlg7xo7NIBXED2UVmWfLeiQZyh
DVhCC62vVCYCvvyrlLMBKuTIM/mNzRe7JUTpFBN+wiiakLAgf5G/9ifoFyC6cmA3
854V3z5644WM2mPOsLxWr0CkV+pELRWwgvWeMcHXnjjrljjIi6jpCX2jthkEJgkR
6rcFYwfnS74VqfzjZl7sMD4Oc/blaTuNjj0iwZz5ThJMUIN6p/RzUvSN+2qySPqJ
7ENhcElJwr9lslDfL7X412bLRAhma+vduofHW9IPNvD3Q3okXQDBZ/FRHx5q+mZR
ziBFCFT/OrotSSy8MZGOl7tD1bs29B1R98hO91qGUzo/rhzNzM93DiGAgYDxVLRE
/cw08lQzkFn468MIKHrXV3qKLqAN4Wp8h2tZ80oNAMjc3MwELoKzE5pljQc8vhVI
Oh3qN8IE
=Ky9w
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- statfs fix (e.g. when limited access to root directory of share)
- special file handling fixes: fix packet validation to avoid buffer
overflow for reparse points, fixes for symlink path parsing (one for
reparse points, and one for SFU use case), and fix for cleanup after
failed SET_REPARSE operation.
- fix for SMB2.1 signing bug introduced by recent patch to NFS symlink
path, and NFS reparse point validation
- comment cleanup
* tag 'v6.12-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Do not convert delimiter when parsing NFS-style symlinks
cifs: Validate content of NFS reparse point buffer
cifs: Fix buffer overflow when parsing NFS reparse points
smb: client: Correct typos in multiple comments across various files
smb: client: use actual path when queryfs
cifs: Remove intermediate object of failed create reparse call
Revert "smb: client: make SHA-512 TFM ephemeral"
smb: Update comments about some reparse point tags
cifs: Check for UTF-16 null codepoint in SFU symlink target location
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv90qwAKCRBZ7Krx/gZQ
629DAP9NT9/ndByazKrP03nQ/ITNhYVd0cby2iPcXZiHGj/fsQD7BF4yUUiBy7EO
seIY8rGxc8S5TDJAG4HYa7t1Nuksnw4=
=oji5
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull close_range() fix from Al Viro:
"Fix the logic in descriptor table trimming"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
close_range(): fix the logics in descriptor table trimming
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv8vaAAKCRBZ7Krx/gZQ
61tVAP9IpmZX+sJ0sVI6kesjlZsTBjMfw6VuWB7sJuFm4KOWagD+NHYbma8NSIGX
cyGKXaePWYUFww7Y0nh3Q+Dz0VXk9Q8=
=ccvV
-----END PGP SIGNATURE-----
Merge tag 'pull-fixes.ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ufs fix from Al Viro:
"Fix ufs_rename() braino introduced this cycle.
The 'folio_release_kmap(dir_folio, new_dir)' in ufs_rename() part of
folio conversion should've been getting a pointer to ufs directory
entry within the page, rather than a pointer to directory struct
inode..."
* tag 'pull-fixes.ufs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs_rename(): fix bogus argument of folio_release_kmap()
The 'default n' that was in NFS_COMMON_LOCALIO_SUPPORT caused these
extra defaults to be missed:
default y if NFSD=y || NFS_FS=y
default m if NFSD=m && NFS_FS=m
Remove the 'default n' for NFS_COMMON_LOCALIO_SUPPORT so that the
correct tristate is selected based on how NFSD and NFS_FS are
configured. This fixes the reported case where NFS_FS=y but
NFS_COMMON_LOCALIO_SUPPORT=m, it is now correctly set to =y.
In addition, add extra 'depends on NFS_LOCALIO' to
NFS_COMMON_LOCALIO_SUPPORT so that if NFS_LOCALIO isn't set then
NFS_COMMON_LOCALIO_SUPPORT will not be either.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202410031944.hMCFY9BO-lkp@intel.com/
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
Add nfs_to_nfsd_file_put_local() interface to fix race with nfsd
module unload. Similarly, use RCU around nfs_open_local_fh()'s error
path call to nfs_to->nfsd_serv_put(). Holding RCU ensures that NFS
will safely _call and return_ from its nfs_to calls into the NFSD
functions nfsd_file_put_local() and nfsd_serv_put().
Otherwise, if RCU isn't used then there is a narrow window when NFS's
reference for the nfsd_file and nfsd_serv are dropped and the NFSD
module could be unloaded, which could result in a crash from the
return instruction for either nfs_to->nfsd_file_put_local() or
nfs_to->nfsd_serv_put().
Reported-by: NeilBrown <neilb@suse.de>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
The math in "rc_list->rcl_nrefcalls * 2 * sizeof(uint32_t)" could have an
integer overflow. Add bounds checking on rc_list->rcl_nrefcalls to fix
that.
Fixes: 4aece6a19c ("nfs41: cb_sequence xdr implementation")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
NFS-style symlinks have target location always stored in NFS/UNIX form
where backslash means the real UNIX backslash and not the SMB path
separator.
So do not mangle slash and backslash content of NFS-style symlink during
readlink() syscall as it is already in the correct Linux form.
This fixes interoperability of NFS-style symlinks with backslashes created
by Linux NFS3 client throw Windows NFS server and retrieved by Linux SMB
client throw Windows SMB server, where both Windows servers exports the
same directory.
Fixes: d5ecebc490 ("smb3: Allow query of symlinks stored as reparse points")
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Symlink target location stored in DataBuffer is encoded in UTF-16. So check
that symlink DataBuffer length is non-zero and even number. And check that
DataBuffer does not contain UTF-16 null codepoint because Linux cannot
process symlink with null byte.
DataBuffer for char and block devices is 8 bytes long as it contains two
32-bit numbers (major and minor). Add check for this.
DataBuffer buffer for sockets and fifos zero-length. Add checks for this.
Signed-off-by: Pali Rohár <pali@kernel.org>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
ReparseDataLength is sum of the InodeType size and DataBuffer size.
So to get DataBuffer size it is needed to subtract InodeType's size from
ReparseDataLength.
Function cifs_strndup_from_utf16() is currentlly accessing buf->DataBuffer
at position after the end of the buffer because it does not subtract
InodeType size from the length. Fix this problem and correctly subtract
variable len.
Member InodeType is present only when reparse buffer is large enough. Check
for ReparseDataLength before accessing InodeType to prevent another invalid
memory access.
Major and minor rdev values are present also only when reparse buffer is
large enough. Check for reparse buffer size before calling reparse_mkdev().
Fixes: d5ecebc490 ("smb3: Allow query of symlinks stored as reparse points")
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmb+HqAACgkQiiy9cAdy
T1EQbgv/aybFhucbglNe1QIjQ12DqBUUJoRRbV0xLX2TmvbpxxBuehbD11pkTqeH
c7zvCQE+Ank3PfSGvFjM77iY++AuhHtvDg5ugMtdZEUzqNtEdT6a1fnVcAsqmuhM
5ROER0IheSwSbIha6FJwgodwKAeJuPmmEmbU9e0PZ4ZZLqetAnuhpKNOEurMMxoa
G0K7hknyuG9/gOiyBfmVTysuorA9jP1IgWjnwBOANKJo+IbQdifaLd535XWaY/7+
sabRy+0QAmMejcrP6XHT5KUUjw63YODmhnFKo0MRaG3GODg4RO/7JRJLdD9FMDCY
DyL5at0Ro33zhzif7i0vFUn7VhvkWuivfXQBLL+ALk2xhHw+5Yk/zqce84fTbzQj
KOeFeevG5B2P0uxGbShjxVqxbaPUgIKD7f1N6SmwkAnCE3+zXcGwRTENKFR4C5mF
iMFd22hYPUMD3ED/yR6+1fEtLpGtHof9erHH99x1bRU4fL+Am+C6fHTy+klaFMPP
K3xXe/1i
=xo5T
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc1-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- small cleanup patches leveraging struct size to improve access bounds checking
* tag 'v6.12-rc1-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: Use struct_size() to improve smb_direct_rdma_xmit()
ksmbd: Annotate struct copychunk_ioctl_req with __counted_by_le()
ksmbd: Use struct_size() to improve get_file_alternate_info()
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZv5Y3gAKCRCRxhvAZXjc
ojFPAP45kz5JgVKFn8iZmwfjPa7qbCa11gEzmx0SbUt3zZ3mJAD/fL9k9KaNU+qA
LIcZW5BJn/p5fumUAw8/fKoz4ajCWQk=
=LIz1
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12-rc2.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"vfs:
- Ensure that iter_folioq_get_pages() advances to the next slot
otherwise it will end up using the same folio with an out-of-bound
offset.
iomap:
- Dont unshare delalloc extents which can't be reflinked, and thus
can't be shared.
- Constrain the file range passed to iomap_file_unshare() directly in
iomap instead of requiring the callers to do it.
netfs:
- Use folioq_count instead of folioq_nr_slot to prevent an
unitialized value warning in netfs_clear_buffer().
- Fix missing wakeup after issuing writes by scheduling the write
collector only if all the subrequest queues are empty and thus no
writes are pending.
- Fix two minor documentation bugs"
* tag 'vfs-6.12-rc2.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
iomap: constrain the file range passed to iomap_file_unshare
iomap: don't bother unsharing delalloc extents
netfs: Fix missing wakeup after issuing writes
Documentation: add missing folio_queue entry
folio_queue: fix documentation
netfs: Fix a KMSAN uninit-value error in netfs_clear_buffer
iov_iter: fix advancing slot in iter_folioq_get_pages()
File contents can only be shared (i.e. reflinked) below EOF, so it makes
no sense to try to unshare ranges beyond EOF. Constrain the file range
parameters here so that we don't have to do that in the callers.
Fixes: 5f4e5752a8 ("fs: add iomap_file_dirty")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20241002150213.GC21853@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
If unshare encounters a delalloc reservation in the srcmap, that means
that the file range isn't shared because delalloc reservations cannot be
reflinked. Therefore, don't try to unshare them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20241002150040.GB21853@frogsfrogsfrogs
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Log recovered from a user's cluster:
<7>[ 5413.970692] ceph: get_cap_refs 00000000958c114b ret 1 got Fr
<7>[ 5413.970695] ceph: start_read 00000000958c114b, no cache cap
...
<7>[ 5473.934609] ceph: my wanted = Fr, used = Fr, dirty -
<7>[ 5473.934616] ceph: revocation: pAsLsXsFr -> pAsLsXs (revoking Fr)
<7>[ 5473.934632] ceph: __ceph_caps_issued 00000000958c114b cap 00000000f7784259 issued pAsLsXs
<7>[ 5473.934638] ceph: check_caps 10000000e68.fffffffffffffffe file_want - used Fr dirty - flushing - issued pAsLsXs revoking Fr retain pAsLsXsFsr AUTHONLY NOINVAL FLUSH_FORCE
The MDS subsequently complains that the kernel client is late releasing
caps.
Approximately, a series of changes to this code by commits 4987005600
("ceph: convert ceph_readpages to ceph_readahead"), 2de1604173
("netfs: Change ->init_request() to return an error code") and
a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
resulted in subtle resource cleanup to be missed. The main culprit is
the change in error handling in 2de1604173 which meant that a failure
in init_request() would no longer cause cleanup to be called. That
would prevent the ceph_put_cap_refs() call which would cleanup the
leaked cap ref.
Cc: stable@vger.kernel.org
Fixes: a5c9dc4451 ("ceph: Make ceph_init_request() check caps on readahead")
Link: https://tracker.ceph.com/issues/67008
Signed-off-by: Patrick Donnelly <pdonnell@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Use struct_size() to calculate the number of bytes to be allocated.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
We only are applying JSET_ENTRY_TYPE_write_buffer_keys, revert path was
missed.
Fixes: a3581ca35d ("bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
we're returning an error code now, not a bool
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv3NAgAKCRBZ7Krx/gZQ
68kbAP0YzQxUgl0/o7Soda8XwKSPZTM9ls6kRk1UHTTG/i4ZigEA/G+i/mBQctL0
AB911kK8mxfXppfOXzstFBjoJSqiigQ=
=IE7D
-----END PGP SIGNATURE-----
Merge tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull generic unaligned.h cleanups from Al Viro:
"Get rid of architecture-specific <asm/unaligned.h> includes, replacing
them with a single generic <linux/unaligned.h> header file.
It's the second largest (after asm/io.h) class of asm/* includes, and
all but two architectures actually end up using exact same file.
Massage the remaining two (arc and parisc) to do the same and just
move the thing to from asm-generic/unaligned.h to linux/unaligned.h"
[ This is one of those things that we're better off doing outside the
merge window, and would only cause extra conflict noise if it was in
linux-next for the next release due to all the trivial #include line
updates. Rip off the band-aid. - Linus ]
* tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
move asm/unaligned.h to linux/unaligned.h
arc: get rid of private asm/unaligned.h
parisc: get rid of private asm/unaligned.h
asm/unaligned.h is always an include of asm-generic/unaligned.h;
might as well move that thing to linux/unaligned.h and include
that - there's nothing arch-specific in that header.
auto-generated by the following:
for i in `git grep -l -w asm/unaligned.h`; do
sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
done
for i in `git grep -l -w asm-generic/unaligned.h`; do
sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
done
git mv include/asm-generic/unaligned.h include/linux/unaligned.h
git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
- Add support for the FS_IOC_GETFSSYSFSPATH ioctl.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCZvu6SQAKCRDdoc3SxdoY
dk/OAQD0wLQqxdrRFe3sbcgVnqKrGxBa59Pj2jQwy70bM9kN4QEAowFKia7PxiFH
JZ20Z1a6om/ksjYouKuYc3QR+yRFkgE=
=vZIR
-----END PGP SIGNATURE-----
Merge tag 'zonefs-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs update from Damien Le Moal:
- Add support for the FS_IOC_GETFSSYSFSPATH ioctl
* tag 'zonefs-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: add support for FS_IOC_GETFSSYSFSPATH
After dividing up a proposed write into subrequests, netfslib sets
NETFS_RREQ_ALL_QUEUED to indicate to the collector that it can move on to
the final cleanup once it has emptied the subrequest queues.
Now, whilst the collector will normally end up running at least once after
this bit is set just because it takes a while to process all the write
subrequests before the collector runs out of subrequests, there exists the
possibility that the issuing thread will be forced to sleep and the
collector thread will clean up all the subrequests before ALL_QUEUED gets
set.
In such a case, the collector thread will not get triggered again and will
never clear NETFS_RREQ_IN_PROGRESS thus leaving a request uncompleted and
causing a potential futute hang.
Fix this by scheduling the write collector if all the subrequest queues are
empty (and thus no writes pending issuance).
Note that we'd do this ideally before queuing the subrequest, but in the
case of buffered writeback, at least, we can't find out that we've run out
of folios until after we've called writeback_iter() and it has returned
NULL - at which point we might not actually have any subrequests still
under construction.
Fixes: 288ace2f57 ("netfs: New writeback implementation")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/3317784.1727880350@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
[Syzbot reported]
WARNING: possible circular locking dependency detected
6.11.0-rc4-syzkaller-00019-gb311c1b497e5 #0 Not tainted
------------------------------------------------------
kswapd0/78 is trying to acquire lock:
ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline]
ffff88801b8d8930 (&group->mark_mutex){+.+.}-{3:3}, at: fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578
but task is already holding lock:
ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat mm/vmscan.c:6841 [inline]
ffffffff8ea2fd60 (fs_reclaim){+.+.}-{0:0}, at: kswapd+0xbb4/0x35a0 mm/vmscan.c:7223
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (fs_reclaim){+.+.}-{0:0}:
...
kmem_cache_alloc_noprof+0x3d/0x2a0 mm/slub.c:4044
inotify_new_watch fs/notify/inotify/inotify_user.c:599 [inline]
inotify_update_watch fs/notify/inotify/inotify_user.c:647 [inline]
__do_sys_inotify_add_watch fs/notify/inotify/inotify_user.c:786 [inline]
__se_sys_inotify_add_watch+0x72e/0x1070 fs/notify/inotify/inotify_user.c:729
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (&group->mark_mutex){+.+.}-{3:3}:
...
__mutex_lock+0x136/0xd70 kernel/locking/mutex.c:752
fsnotify_group_lock include/linux/fsnotify_backend.h:270 [inline]
fsnotify_destroy_mark+0x38/0x3c0 fs/notify/mark.c:578
fsnotify_destroy_marks+0x14a/0x660 fs/notify/mark.c:934
fsnotify_inoderemove include/linux/fsnotify.h:264 [inline]
dentry_unlink_inode+0x2e0/0x430 fs/dcache.c:403
__dentry_kill+0x20d/0x630 fs/dcache.c:610
shrink_kill+0xa9/0x2c0 fs/dcache.c:1055
shrink_dentry_list+0x2c0/0x5b0 fs/dcache.c:1082
prune_dcache_sb+0x10f/0x180 fs/dcache.c:1163
super_cache_scan+0x34f/0x4b0 fs/super.c:221
do_shrink_slab+0x701/0x1160 mm/shrinker.c:435
shrink_slab+0x1093/0x14d0 mm/shrinker.c:662
shrink_one+0x43b/0x850 mm/vmscan.c:4815
shrink_many mm/vmscan.c:4876 [inline]
lru_gen_shrink_node mm/vmscan.c:4954 [inline]
shrink_node+0x3799/0x3de0 mm/vmscan.c:5934
kswapd_shrink_node mm/vmscan.c:6762 [inline]
balance_pgdat mm/vmscan.c:6954 [inline]
kswapd+0x1bcd/0x35a0 mm/vmscan.c:7223
[Analysis]
The problem is that inotify_new_watch() is using GFP_KERNEL to allocate
new watches under group->mark_mutex, however if dentry reclaim races
with unlinking of an inode, it can end up dropping the last dentry reference
for an unlinked inode resulting in removal of fsnotify mark from reclaim
context which wants to acquire group->mark_mutex as well.
This scenario shows that all notification groups are in principle prone
to this kind of a deadlock (previously, we considered only fanotify and
dnotify to be problematic for other reasons) so make sure all
allocations under group->mark_mutex happen with GFP_NOFS.
Reported-and-tested-by: syzbot+c679f13773f295d2da53@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=c679f13773f295d2da53
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20240927143642.2369508-1-lizhi.xu@windriver.com
Since udf_current_aext() has error handling, udf_next_aext() should have
error handling too. Besides, when too many indirect extents found in one
inode, return -EFSCORRUPTED; when reading block failed, return -EIO.
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241001115425.266556-3-zhaomzhao@126.com
As Jan suggested in links below, refactor udf_current_aext() to
differentiate between error, hit EOF and success, it now takes pointer to
etype to store the extent type, return 1 when getting etype success,
return 0 when hitting EOF and return -errno when err.
Link: https://lore.kernel.org/all/20240912111235.6nr3wuqvktecy3vh@quack3/
Signed-off-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20241001115425.266556-2-zhaomzhao@126.com
new_dir does *NOT* point into dir_folio - it's an inode, not a pointer
to ufs directory entry.
Fixes: 516b97cf03 "ufs: Convert directory handling to kmap_local"
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Due to server permission control, the client does not have access to
the shared root directory, but can access subdirectories normally, so
users usually mount the shared subdirectories directly. In this case,
queryfs should use the actual path instead of the root directory to
avoid the call returning an error (EACCES).
Signed-off-by: wangrong <wangrong@uniontech.com>
Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Use struct_size() to calculate the number of bytes to allocate for a
new message.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Add the __counted_by_le compiler attribute to the flexible array member
Chunks to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Change the data type of the flexible array member Chunks from __u8[] to
struct srv_copychunk[] for ChunkCount to match the number of elements in
the Chunks array. (With __u8[], each srv_copychunk would occupy 24 array
entries and the __counted_by compiler attribute wouldn't be applicable.)
Use struct_size() to calculate the size of the copychunk_ioctl_req.
Read Chunks[0] after checking that ChunkCount is not 0.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Use struct_size() to calculate the output buffer length.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Disable ratelimiting for btrfs_printk when CONFIG_BTRFS_DEBUG is
enabled. This allows for more verbose output which is often needed by
functions like btrfs_dump_space_info().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Syzbot reported a NULL pointer dereference with the following crash:
FAULT_INJECTION: forcing a failure.
start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676
prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642
relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678
...
BTRFS info (device loop0): balance: ended with status: -12
Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667]
RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926
Call Trace:
<TASK>
commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496
btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430
del_balance_item fs/btrfs/volumes.c:3678 [inline]
reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742
btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
[CAUSE]
The allocation failure happens at the start_transaction() inside
prepare_to_relocate(), and during the error handling we call
unset_reloc_control(), which makes fs_info->balance_ctl to be NULL.
Then we continue the error path cleanup in btrfs_balance() by calling
reset_balance_state() which will call del_balance_item() to fully delete
the balance item in the root tree.
However during the small window between set_reloc_contrl() and
unset_reloc_control(), we can have a subvolume tree update and created a
reloc_root for that subvolume.
Then we go into the final btrfs_commit_transaction() of
del_balance_item(), and into btrfs_update_reloc_root() inside
commit_fs_roots().
That function checks if fs_info->reloc_ctl is in the merge_reloc_tree
stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer
dereference.
[FIX]
Just add extra check on fs_info->reloc_ctl inside
btrfs_update_reloc_root(), before checking
fs_info->reloc_ctl->merge_reloc_tree.
That DEAD_RELOC_TREE handling is to prevent further modification to the
reloc tree during merge stage, but since there is no reloc_ctl at all,
we do not need to bother that.
Reported-by: syzbot+283673dbc38527ef9f3d@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/66f6bfa7.050a0220.38ace9.0019.GAE@google.com/
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During an incremental send we may end up sending an invalid clone
operation, for the last extent of a file which ends at an unaligned offset
that matches the final i_size of the file in the send snapshot, in case
the file had its initial size (the size in the parent snapshot) decreased
in the send snapshot. In this case the destination will fail to apply the
clone operation because its end offset is not sector size aligned and it
ends before the current size of the file.
Sending the truncate operation always happens when we finish processing an
inode, after we process all its extents (and xattrs, names, etc). So fix
this by ensuring the file has a valid size before we send a clone
operation for an unaligned extent that ends at the final i_size of the
file. The size we truncate to matches the start offset of the clone range
but it could be any value between that start offset and the final size of
the file since the clone operation will expand the i_size if the current
size is smaller than the end offset. The start offset of the range was
chosen because it's always sector size aligned and avoids a truncation
into the middle of a page, which results in dirtying the page due to
filling part of it with zeroes and then making the clone operation at the
receiver trigger IO.
The following test reproduces the issue:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a file with a size of 256K + 5 bytes, having two extents, one
# with a size of 128K and another one with a size of 128K + 5 bytes.
last_ext_size=$((128 * 1024 + 5))
xfs_io -f -d -c "pwrite -S 0xab -b 128K 0 128K" \
-c "pwrite -S 0xcd -b $last_ext_size 128K $last_ext_size" \
$MNT/foo
# Another file which we will later clone foo into, but initially with
# a larger size than foo.
xfs_io -f -c "pwrite -S 0xef 0 1M" $MNT/bar
btrfs subvolume snapshot -r $MNT/ $MNT/snap1
# Now resize bar and clone foo into it.
xfs_io -c "truncate 0" \
-c "reflink $MNT/foo" $MNT/bar
btrfs subvolume snapshot -r $MNT/ $MNT/snap2
rm -f /tmp/send-full /tmp/send-inc
btrfs send -f /tmp/send-full $MNT/snap1
btrfs send -p $MNT/snap1 -f /tmp/send-inc $MNT/snap2
umount $MNT
mkfs.btrfs -f $DEV
mount $DEV $MNT
btrfs receive -f /tmp/send-full $MNT
btrfs receive -f /tmp/send-inc $MNT
umount $MNT
Running it before this patch:
$ ./test.sh
(...)
At subvol snap1
At snapshot snap2
ERROR: failed to clone extents to bar: Invalid argument
A test case for fstests will be sent soon.
Reported-by: Ben Millwood <thebenmachine@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAJhrHS2z+WViO2h=ojYvBPDLsATwLbg+7JaNCyYomv0fUxEpQQ@mail.gmail.com/
Fixes: 46a6e10a1a ("btrfs: send: allow cloning non-aligned extent if it ends at i_size")
CC: stable@vger.kernel.org # 6.11
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the inception of relocation we have maintained the backref cache
across transaction commits, updating the backref cache with the new
bytenr whenever we COWed blocks that were in the cache, and then
updating their bytenr once we detected a transaction id change.
This works as long as we're only ever modifying blocks, not changing the
structure of the tree.
However relocation does in fact change the structure of the tree. For
example, if we are relocating a data extent, we will look up all the
leaves that point to this data extent. We will then call
do_relocation() on each of these leaves, which will COW down to the leaf
and then update the file extent location.
But, a key feature of do_relocation() is the pending list. This is all
the pending nodes that we modified when we updated the file extent item.
We will then process all of these blocks via finish_pending_nodes, which
calls do_relocation() on all of the nodes that led up to that leaf.
The purpose of this is to make sure we don't break sharing unless we
absolutely have to. Consider the case that we have 3 snapshots that all
point to this leaf through the same nodes, the initial COW would have
created a whole new path. If we did this for all 3 snapshots we would
end up with 3x the number of nodes we had originally. To avoid this we
will cycle through each of the snapshots that point to each of these
nodes and update their pointers to point at the new nodes.
Once we update the pointer to the new node we will drop the node we
removed the link for and all of its children via btrfs_drop_subtree().
This is essentially just btrfs_drop_snapshot(), but for an arbitrary
point in the snapshot.
The problem with this is that we will never reflect this in the backref
cache. If we do this btrfs_drop_snapshot() for a node that is in the
backref tree, we will leave the node in the backref tree. This becomes
a problem when we change the transid, as now the backref cache has
entire subtrees that no longer exist, but exist as if they still are
pointed to by the same roots.
In the best case scenario you end up with "adding refs to an existing
tree ref" errors from insert_inline_extent_backref(), where we attempt
to link in nodes on roots that are no longer valid.
Worst case you will double free some random block and re-use it when
there's still references to the block.
This is extremely subtle, and the consequences are quite bad. There
isn't a way to make sure our backref cache is consistent between
transid's.
In order to fix this we need to simply evict the entire backref cache
anytime we cross transid's. This reduces performance in that we have to
rebuild this backref cache every time we change transid's, but fixes the
bug.
This has existed since relocation was added, and is a pretty critical
bug. There's a lot more cleanup that can be done now that this
functionality is going away, but this patch is as small as possible in
order to fix the problem and make it easy for us to backport it to all
the kernels it needs to be backported to.
Followup series will dismantle more of this code and simplify relocation
drastically to remove this functionality.
We have a reproducer that reproduced the corruption within a few minutes
of running. With this patch it survives several iterations/hours of
running the reproducer.
Fixes: 3fd0a5585e ("Btrfs: Metadata ENOSPC handling for balance")
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
NOCOW writes do not generate stripe_extent entries in the RAID stripe
tree, as the RAID stripe-tree feature initially was designed with a
zoned filesystem in mind and on a zoned filesystem, we do not allow NOCOW
writes. But the RAID stripe-tree feature is independent from the zoned
feature, so we must also do NOCOW writes for RAID stripe-tree filesystems.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fixed deleating of a non-resident attribute in ntfs_create_inode()
rollback.
Reported-by: syzbot+9af29acd8f27fbce94bc@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
The code is slightly reformatted to consistently check field availability
without duplication.
Fixes: 556bdf27c2 ("ntfs3: Add bounds checking to mi_enum_attr()")
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Checking of NTFS_FLAGS_LOG_REPLAYING added to prevent access to
uninitialized bitmap during replay process.
Reported-by: syzbot+3bfd2cc059ab93efcdb4@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
d_hash is done while under "rcu-walk" and should not sleep.
__get_name() allocates using GFP_KERNEL, having the possibility
to sleep when under memory pressure. Change the allocation to
GFP_NOWAIT.
Reported-by: syzbot+7f71f79bbfb4427b00e1@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=7f71f79bbfb4427b00e1
Fixes: d392e85fd1 ("fs/ntfs3: Fix the format of the "nocase" mount option")
Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
'al_delete_le' was added by:
Commit be71b5cba2 ("fs/ntfs3: Add attrib operations")
but has remained unused; there is an al_remove_le which seems
to be being used instead.
Remove 'al_delete_le'.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
If CREATE was successful but SMB2_OP_SET_REPARSE failed then remove the
intermediate object created by CREATE. Otherwise empty object stay on the
server when reparse call failed.
This ensures that if the creating of special files is unsupported by the
server then no empty file stay on the server as a result of unsupported
operation.
Fixes: 102466f303 ("smb: client: allow creating special files via reparse points")
Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The header files linux/module.h is included twice in localio.c,
so one inclusion of each can be removed.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=11073
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZvqoUgAKCRCRxhvAZXjc
oveLAQC952BB+giATJvX5lPz400g+MoDVlWiPdqAVbF0PCGRRwEA+/7TedOfOkQx
Df3iAzDXVbX/WvWYdZ/DyLp/r44WHAw=
=tGkE
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.12-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
"afs:
- Fix setting of the server responding flag
- Remove unused struct afs_address_list and afs_put_address_list()
function
- Fix infinite loop because of unresponsive servers
- Ensure that afs_retry_request() function is correctly added to the
afs_req_ops netfs operations table
netfs:
- Fix netfs_folio tracepoint handling to handle NULL mappings
- Add a missing folio_queue API documentation
- Ensure that netfs_write_folio() correctly advances the iterator via
iov_iter_advance()
- Fix a dentry leak during concurrent cull and cookie lookup
operations in cachefiles
pidfs:
- Correctly handle accessing another task's pid namespace"
* tag 'vfs-6.12-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
netfs: Fix the netfs_folio tracepoint to handle NULL mapping
netfs: Add folio_queue API documentation
netfs: Advance iterator correctly rather than jumping it
afs: Fix the setting of the server responding flag
afs: Remove unused struct and function prototype
afs: Fix possible infinite loop with unresponsive servers
pidfs: check for valid pid namespace
afs: Fix missing wire-up of afs_retry_request()
cachefiles: fix dentry leak in cachefiles_open_file()
Builds on big endian systems fail as follows.
fs/bcachefs/bkey.h: In function 'bch2_bkey_format_add_key':
fs/bcachefs/bkey.h:557:41: error:
'const struct bkey' has no member named 'bversion'
The original commit only renamed the variable for little endian builds.
Rename it for big endian builds as well to fix the problem.
Fixes: cf49f8a8c2 ("bcachefs: rename version -> bversion")
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cloning a descriptor table picks the size that would cover all currently
opened files. That's fine for clone() and unshare(), but for close_range()
there's an additional twist - we clone before we close, and it would be
a shame to have
close_range(3, ~0U, CLOSE_RANGE_UNSHARE)
leave us with a huge descriptor table when we are not going to keep
anything past stderr, just because some large file descriptor used to
be open before our call has taken it out.
Unfortunately, it had been dealt with in an inherently racy way -
sane_fdtable_size() gets a "don't copy anything past that" argument
(passed via unshare_fd() and dup_fd()), close_range() decides how much
should be trimmed and passes that to unshare_fd().
The problem is, a range that used to extend to the end of descriptor
table back when close_range() had looked at it might very well have stuff
grown after it by the time dup_fd() has allocated a new files_struct
and started to figure out the capacity of fdtable to be attached to that.
That leads to interesting pathological cases; at the very least it's a
QoI issue, since unshare(CLONE_FILES) is atomic in a sense that it takes
a snapshot of descriptor table one might have observed at some point.
Since CLOSE_RANGE_UNSHARE close_range() is supposed to be a combination
of unshare(CLONE_FILES) with plain close_range(), ending up with a
weird state that would never occur with unshare(2) is confusing, to put
it mildly.
It's not hard to get rid of - all it takes is passing both ends of the
range down to sane_fdtable_size(). There we are under ->files_lock,
so the race is trivially avoided.
So we do the following:
* switch close_files() from calling unshare_fd() to calling
dup_fd().
* undo the calling convention change done to unshare_fd() in
60997c3d45 "close_range: add CLOSE_RANGE_UNSHARE"
* introduce struct fd_range, pass a pointer to that to dup_fd()
and sane_fdtable_size() instead of "trim everything past that point"
they are currently getting. NULL means "we are not going to be punching
any holes"; NR_OPEN_MAX is gone.
* make sane_fdtable_size() use find_last_bit() instead of
open-coding it; it's easier to follow that way.
* while we are at it, have dup_fd() report errors by returning
ERR_PTR(), no need to use a separate int *errorp argument.
Fixes: 60997c3d45 "close_range: add CLOSE_RANGE_UNSHARE"
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Check that read buffer of SFU symlink target location does not contain
UTF-16 null codepoint (via UniStrnlen() call) because Linux cannot process
symlink with null byte, it truncates everything in buffer after null byte.
Fixes: cf2ce67345 ("cifs: Add support for reading SFU symlink location")
Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Assorted minor syzbot fixes, and for bigger stuff:
- Fix two disk accounting rewrite bugs
- Disk accounting keys use the version field of bkey so that journal
replay can tell which updates have been applied to the btree. This is
set in the transaction commit path, after we've gotten our journal
reservation (and our time ordering), but the
BCH_TRANS_COMMIT_skip_accounting_apply flag that journal replay uses
was incorrectly skipping this for new updates generated prior to
journal replay.
This fixes the underlying cause of an assertion pop in
disk_accounting_read.
- A couple fixes for disk accounting + device removal. Checking if
acocunting replicas entries were marked in the superblock was being
done at the wrong point, when deltas in the journal could still zero
them out, and then additionally we'd try to add a missing replicas
entry to the superblock without checking if it referred to an invalid
(removed) device.
- A whole slew of repair fixes
- fix infinite loop in propagate_key_to_snapshot_leaves(), this fixes
an infinite loop when repairing a filesystem with many snapshots
- fix incorrect transaction restart handling leading to occasional
"fsck counted ..." warnings"
- fix warning in __bch2_fsck_err() for bkey fsck errors
- check_inode() in fsck now correctly checks if the filesystem was
clean
- there shouldn't be pending logged ops if the fs was clean, we now
check for this
- remove_backpointer() doesn't remove a dirent that doesn't actually
point to the inode
- many more fsck errors are AUTOFIX
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmb4QtsACgkQE6szbY3K
bnYx4A//bhGgZYgP55FxduuxUH8XjX2eOnXwuPv/MmYO/4oCok5VBa9bRDTVXhIK
PtY4pP2IJZ3+u963mwbwJAawsPA01AEEty9tE+AdXbltDRQ03I33OEuIy0HFIso2
s8VBkVPbru6yU4RCCvYNIVvRG/9GOL+J0GgrR1t05zHVyKXe1FuS00Yq5+z3niNP
HtuGTsD273Nnhikz47bqyD+M6VizU+uzSUFLgnB3zrzpb+gPSGETSwgc4ggajlM4
2P10Vc4L/Nb3KYV9RW+C3WpRfUR/o8BZA3wjJfNo0JeA4iDaUbltSjpCA07EcAnA
3D6Omzqkm4aobL2WlvioT0UhZx4t8X/8x5t5F9HyX52i1k+g87oMT9/KIKec1Dzd
8vQCwCdXFfWaLSZoOJsHyIljip7BuRLKhWwKosdzzLIAnRQy5StxAhsG99fNStu6
JOWICPNCn1b6SkktnoKou1unL+K5RczeNfAxMAjcJjTD7IIAmytLe4mdRbP9q+Oa
x8no7pttbb4JnoRvfo42GVz8KWQR07oN/Zy7mH3K4Y0Ix+xDOrLqlfLIDLGpxMNv
HZz+UPchdlfpYJO+nTLoAOGXZWnKDqg70SAEcWKDc82Ri4vNOhraYDZvXrzl9qE+
63RPzqDbg3uXGxLYMvujjPe610QkPxS9zKKyDvUZZx0ZiUX4CjI=
=cdrz
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-09-28' of git://evilpiepirate.org/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"Assorted minor syzbot fixes, and for bigger stuff:
Fix two disk accounting rewrite bugs:
- Disk accounting keys use the version field of bkey so that journal
replay can tell which updates have been applied to the btree.
This is set in the transaction commit path, after we've gotten our
journal reservation (and our time ordering), but the
BCH_TRANS_COMMIT_skip_accounting_apply flag that journal replay
uses was incorrectly skipping this for new updates generated prior
to journal replay.
This fixes the underlying cause of an assertion pop in
disk_accounting_read.
- A couple of fixes for disk accounting + device removal.
Checking if acocunting replicas entries were marked in the
superblock was being done at the wrong point, when deltas in the
journal could still zero them out, and then additionally we'd try
to add a missing replicas entry to the superblock without checking
if it referred to an invalid (removed) device.
A whole slew of repair fixes:
- fix infinite loop in propagate_key_to_snapshot_leaves(), this fixes
an infinite loop when repairing a filesystem with many snapshots
- fix incorrect transaction restart handling leading to occasional
"fsck counted ..." warnings
- fix warning in __bch2_fsck_err() for bkey fsck errors
- check_inode() in fsck now correctly checks if the filesystem was
clean
- there shouldn't be pending logged ops if the fs was clean, we now
check for this
- remove_backpointer() doesn't remove a dirent that doesn't actually
point to the inode
- many more fsck errors are AUTOFIX"
* tag 'bcachefs-2024-09-28' of git://evilpiepirate.org/bcachefs: (35 commits)
bcachefs: check_subvol_path() now prints subvol root inode
bcachefs: remove_backpointer() now checks if dirent points to inode
bcachefs: dirent_points_to_inode() now warns on mismatch
bcachefs: Fix lost wake up
bcachefs: Check for logged ops when clean
bcachefs: BCH_FS_clean_recovery
bcachefs: Convert disk accounting BUG_ON() to WARN_ON()
bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply
bcachefs: Check for accounting keys with bversion=0
bcachefs: rename version -> bversion
bcachefs: Don't delete unlinked inodes before logged op resume
bcachefs: Fix BCH_SB_ERRS() so we can reorder
bcachefs: Fix fsck warnings from bkey validation
bcachefs: Move transaction commit path validation to as late as possible
bcachefs: Fix disk accounting attempting to mark invalid replicas entry
bcachefs: Fix unlocked access to c->disk_sb.sb in bch2_replicas_entry_validate()
bcachefs: Fix accounting read + device removal
bcachefs: bch_accounting_mode
bcachefs: fix transaction restart handling in check_extents(), check_dirents()
bcachefs: kill inode_walker_entry.seen_this_pos
...
cleanups.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmb3JroTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzizDiB/0elHQQaFxXMjuJRY1IzohozAHi0cHK
gwgE4nEbECE8vRYK/QvyvZ3S+ep+N+r6jOIiIDyqhjtlY3//oSyyxL7RjMJlVFBq
Ie37w8r4q1aL1mn9QDQ4iQxcRYyU+JxcUcPR1UUUvLiKgWaRixmq27zby/WQSrkA
ke2ScBRDtEAYVtdxvxmUJK/DrPr3skwJAGY52KesjwgVhXSL8KG9X1zMUbWdJYDV
THbQzLZsu4NVh7LlAsS/mh+z0EIZsXxQYU5IY3dIVEYcuLK93lXRGZb+7whtmUef
wsDtYIe/w30QVxFdrN28qAQp8daUJhp+3t0EZSyecRcq5OPey6ICx1P4
=+bdB
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"Three CephFS fixes from Xiubo and Luis and a bunch of assorted
cleanups"
* tag 'ceph-for-6.12-rc1' of https://github.com/ceph/ceph-client:
ceph: remove the incorrect Fw reference check when dirtying pages
ceph: Remove empty definition in header file
ceph: Fix typo in the comment
ceph: fix a memory leak on cap_auths in MDS client
ceph: flush all caps releases when syncing the whole filesystem
ceph: rename ceph_flush_cap_releases() to ceph_flush_session_cap_releases()
libceph: use min() to simplify code in ceph_dns_resolve_name()
ceph: Convert to use jiffies macro
ceph: Remove unused declarations
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmb2y3gACgkQiiy9cAdy
T1EATQwAwQTCQSd916VK4HPzphM/lJLaxNCNNV0AY8QNXsUxhBxQny2FJ+jilFZU
y4G5/zSi8+PX0YtyrPJqtofbFX+eeD6eKRCFT/1YEEkwEYp53mjsCIHWidSPGh6X
S2du6tAebSCQSqlHv5zlTpL24UVhi6amse7aJyXs8v7JZO9ZjtEE0D+a1xqSV4kt
0+6/W3RM49HTEAql7TavduNR3UcesYg2KS48qNVvGhHhY3wcGe92mZ0Sr4NUStfg
IjtpfsxxBJWKiXDJhGBN8M/O6jqBtE++O/CyDknYGOs6M7QtPJ1xtXpESlq+OgWV
JEqNorZI4qvl/5PbY/1+6wJDY3ogv2DhwyRaOdhtVc5CgF1JGLKW4lVBGBIrQz2B
dyHbiGAXEA+Rm7/8UkyFZRmvbmLXDqRM7AEyLrXoeS5Vw51RxS6CT0oDesVGVsdX
+koQ1OQ55AiR1TXhasDj6XvmFAyYKuEPh/qhBz1jEBX8unyhIUUVrG3CnMULl3rY
FWbdmtDB
=4ons
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix querying dentry for char/block special files
- small cleanup patches
* tag 'v6.12-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: Correct typos in multiple comments across various files
ksmbd: fix open failure from block and char device file
ksmbd: remove unsafe_memcpy use in session setup
ksmbd: Replace one-element arrays with flexible-array members
ksmbd: fix warning: comparison of distinct pointer types lacks a cast
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmb3FfgACgkQiiy9cAdy
T1HLMQv/ZXkiNPvMmCHJE3rWdSBZQdGLDV1qOhxuW9y4CvenIhukwDmwOq7wjWOn
3dQHDaqFkVTVurosozFOJK9Aw94iz2Ad9dlryMcNN+Gb4vY3d9l3AvsbqmgbZSsg
DmdqOg1SA9NgDaHl6RFsFQQY9O5BsjkEeHBX71gZZUYMw1d6CTpFUT+wTD43L0LQ
g8r0Ksil7edw9f5WGvu8YzB4rclR45QiTVG1OMgXmr43cvoJz3GrIPXeHNm7j5D7
hzx6ELNviY77DPKnxSd55UGPngVms6c1qqWCOJMefsJRY5bhh3lEc+TQyX11HGft
Kta1TQI2gI1xgueqpR2Dh/bUuprWcc6vCbNjxezXpFOiSMt6qNfGzQDE9gZAALAj
568lRpwcMPgS9laqK4Sh9v5+Vw6E8T+FUAJKvtLRidv5iBoqI0+50mhHTBlBgiy6
XdyiAdUGSzejcu/OiOGc9C4PvVRgf0qw9+hqqZeZAsRydKgw3QwPtJ6yW+RYu86p
xQ5CE6zs
=mpUs
-----END PGP SIGNATURE-----
Merge tag '6.12rc-more-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull xmb client fixes from Steve French:
- Noisy log message cleanup
- Important netfs fix for cifs crash in generic/074
- Three minor improvements to use of hashing (multichannel and mount
improvements)
- Fix decryption crash for large read with small esize
* tag '6.12rc-more-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: make SHA-512 TFM ephemeral
smb: client: make HMAC-MD5 TFM ephemeral
smb: client: stop flooding dmesg in smb2_calc_signature()
smb: client: allocate crypto only for primary server
smb: client: fix UAF in async decryption
netfs: Fix write oops in generic/346 (9p) and generic/074 (cifs)
if an inode backpointer points to a dirent that doesn't point back,
that's an error we should warn about.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If the reader acquires the read lock and then the writer enters the slow
path, while the reader proceeds to the unlock path, the following scenario
can occur without the change:
writer: pcpu_read_count(lock) return 1 (so __do_six_trylock will return 0)
reader: this_cpu_dec(*lock->readers)
reader: smp_mb()
reader: state = atomic_read(&lock->state) (there is no waiting flag set)
writer: six_set_bitmask()
then the writer will sleep forever.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a filesystem flag to indicate whether we did a clean recovery -
using c->sb.clean after we've got rw is incorrect, since c->sb is
updated whenever we write the superblock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We had a bug where disk accounting keys didn't always have their version
field set in journal replay; change the BUG_ON() to a WARN(), and
exclude this case since it's now checked for elsewhere (in the bkey
validate function).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This was added to avoid double-counting accounting keys in journal
replay. But applied incorrectly (easily done since it applies to the
transaction commit, not a particular update), it leads to skipping
in-mem accounting for real accounting updates, and failure to give them
a version number - which leads to journal replay becoming very confused
the next time around.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Previously, check_inode() would delete unlinked inodes if they weren't
on the deleted list - this code dating from before there was a deleted
list.
But, if we crash during a logged op (truncate or finsert/fcollapse) of
an unlinked file, logged op resume will get confused if the inode has
already been deleted - instead, just add it to the deleted list if it
needs to be there; delete_dead_inodes runs after logged op resume.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_SB_ERRS() has a field for the actual enum val so that we can reorder
to reorganize, but the way BCH_SB_ERR_MAX was defined didn't allow for
this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
__bch2_fsck_err() warns if the current task has a btree_trans object and
it wasn't passed in, because if it has to prompt for user input it has
to be able to unlock it.
But plumbing the btree_trans through bkey_validate(), as well as
transaction restarts, is problematic - so instead make bkey fsck errors
FSCK_AUTOFIX, which doesn't need to warn.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In order to check for accounting keys with version=0, we need to run
validation after they've been assigned version numbers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
accounting read was checking if accounting replicas entries were marked
in the superblock prior to applying accounting from the journal,
which meant that a recently removed device could spuriously trigger a
"not marked in superblocked" error (when journal entries zero out the
offending counter).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Minor refactoring - replace multiple bool arguments with an enum; prep
work for fixing a bug in accounting read.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Dealing with outside state within a btree transaction is always tricky.
check_extents() and check_dirents() have to accumulate counters for
i_sectors and i_nlink (for subdirectories). There were two bugs:
- transaction commit may return a restart; therefore we have to commit
before accumulating to those counters
- get_inode_all_snapshots() may return a transaction restart, before
updating w->last_pos; then, on the restart,
check_i_sectors()/check_subdir_count() would see inodes that were not
for w->last_pos
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Returning a positive integer instead of an error code causes error paths
to become very confused.
Closes: syzbot+c0360e8367d6d8d04a66@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The pointer clean points the memory allocated by kmemdup, when the
return value of bch2_sb_clean_validate_late is not zero. The memory
pointed by clean is leaked. So we should free it in this case.
Fixes: a37ad1a3ab ("bcachefs: sb-clean.c")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
In downgrade_table_extra, the return value is needed. When it
return failed, we should exit immediately.
Fixes: 7773df19c3 ("bcachefs: metadata version bucket_stripe_sectors")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
check_topology doesn't need the srcu lock and doesn't use normal btree
transactions - we can just drop the srcu lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck_err() jumps to the fsck_err label when bailing out; need to make
sure bp_iter was initialized...
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This fixes a kasan splat in propagate_key_to_snapshot_leaves() -
varint_decode_fast() does reads (that it never uses) up to 7 bytes past
the end of the integer.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Most or all errors will be autofix in the future, we're currently just
doing the ones that we know are well tested.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
ovl_open_realfile() is wrongly called twice after conversion to
new struct fd.
Fixes: 88a2f6468d ("struct fd: representation change")
Reported-by: syzbot+d9efec94dcbfa0de1c07@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a focus on fixes for the memfd_pin_folios() work which was added
into 6.11. Apart from that, the usual shower of singleton fixes.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZvbhSAAKCRDdBJ7gKXxA
jp8CAP47txk2c+tBLggog2MkQamADY5l5MT6E3fYq3ghSiKtVQEAnqX3LiQJ02tB
o9LcPcVrM90QntpKrLP1CpWCVdR+zA8=
=e0QC
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"19 hotfixes. 13 are cc:stable.
There's a focus on fixes for the memfd_pin_folios() work which was
added into 6.11. Apart from that, the usual shower of singleton fixes"
* tag 'mm-hotfixes-stable-2024-09-27-09-45' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
ocfs2: fix uninit-value in ocfs2_get_block()
zram: don't free statically defined names
memory tiers: use default_dram_perf_ref_source in log message
Revert "list: test: fix tests for list_cut_position()"
kselftests: mm: fix wrong __NR_userfaultfd value
compiler.h: specify correct attribute for .rodata..c_jump_table
mm/damon/Kconfig: update DAMON doc URL
mm: kfence: fix elapsed time for allocated/freed track
ocfs2: fix deadlock in ocfs2_get_system_file_inode
ocfs2: reserve space for inline xattr before attaching reflink tree
mm: migrate: annotate data-race in migrate_folio_unmap()
mm/hugetlb: simplify refs in memfd_alloc_folio
mm/gup: fix memfd_pin_folios alloc race panic
mm/gup: fix memfd_pin_folios hugetlb page allocation
mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak
mm/hugetlb: fix memfd_pin_folios free_huge_pages leak
mm/filemap: fix filemap_get_folios_contig THP panic
mm: make SPLIT_PTE_PTLOCKS depend on SMP
tools: fix shared radix-tree build
In netfs_write_folio(), use iov_iter_advance() to advance the folio as we
split bits of it off to subrequests rather than manually jumping the
->iov_offset value around. This becomes more problematic when we use a
bounce buffer made out of single-page folios to cover a multipage pagecache
folio.
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/2238548.1727424522@warthog.procyon.org.uk
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
In afs_wait_for_operation(), we set transcribe the call responded flag to
the server record that we used after doing the fileserver iteration loop -
but it's possible to exit the loop having had a response from the server
that we've discarded (e.g. it returned an abort or we started receiving
data, but the call didn't complete).
This means that op->server might be NULL, but we don't check that before
attempting to set the server flag.
Fixes: 98f9fda205 ("afs: Fold the afs_addr_cursor struct in")
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20240923150756.902363-7-dhowells@redhat.com
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
When we access a no-current task's pid namespace we need check that the
task hasn't been reaped in the meantime and it's pid namespace isn't
accessible anymore.
The user namespace is fine because it is only released when the last
reference to struct task_struct is put and exit_creds() is called.
Link: https://lore.kernel.org/r/20240926-klebt-altgedienten-0415ad4d273c@brauner
Fixes: 5b08bd4085 ("pidfs: allow retrieval of namespace file descriptors")
CC: stable@vger.kernel.org # v6.11
Signed-off-by: Christian Brauner <brauner@kernel.org>
afs_retry_request() is supposed to be pointed to by the afs_req_ops netfs
operations table, but the pointer got lost somewhere. The function is used
during writeback to rotate through the authentication keys that were in
force when the file was modified locally.
Fix this by adding the pointer to the function.
Fixes: 1ecb146f7c ("netfs, afs: Use writeback retry to deal with alternate keys")
Reported-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1690847.1726346402@warthog.procyon.org.uk
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
A dentry leak may be caused when a lookup cookie and a cull are concurrent:
P1 | P2
-----------------------------------------------------------
cachefiles_lookup_cookie
cachefiles_look_up_object
lookup_one_positive_unlocked
// get dentry
cachefiles_cull
inode->i_flags |= S_KERNEL_FILE;
cachefiles_open_file
cachefiles_mark_inode_in_use
__cachefiles_mark_inode_in_use
can_use = false
if (!(inode->i_flags & S_KERNEL_FILE))
can_use = true
return false
return false
// Returns an error but doesn't put dentry
After that the following WARNING will be triggered when the backend folder
is umounted:
==================================================================
BUG: Dentry 000000008ad87947{i=7a,n=Dx_1_1.img} still in use (1) [unmount of ext4 sda]
WARNING: CPU: 4 PID: 359261 at fs/dcache.c:1767 umount_check+0x5d/0x70
CPU: 4 PID: 359261 Comm: umount Not tainted 6.6.0-dirty #25
RIP: 0010:umount_check+0x5d/0x70
Call Trace:
<TASK>
d_walk+0xda/0x2b0
do_one_tree+0x20/0x40
shrink_dcache_for_umount+0x2c/0x90
generic_shutdown_super+0x20/0x160
kill_block_super+0x1a/0x40
ext4_kill_sb+0x22/0x40
deactivate_locked_super+0x35/0x80
cleanup_mnt+0x104/0x160
==================================================================
Whether cachefiles_open_file() returns true or false, the reference count
obtained by lookup_positive_unlocked() in cachefiles_look_up_object()
should be released.
Therefore release that reference count in cachefiles_look_up_object() to
fix the above issue and simplify the code.
Fixes: 1f08c925e7 ("cachefiles: Implement backing file wrangling")
Cc: stable@kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20240829083409.3788142-1-libaokun@huaweicloud.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The SHA-512 shash TFM is used only briefly during Session Setup stage,
when computing SMB 3.1.1 preauth hash.
There's no need to keep it allocated in servers' secmech the whole time,
so keep its lifetime inside smb311_update_preauth_hash().
This also makes smb311_crypto_shash_allocate() redundant, so expose
smb3_crypto_shash_allocate() and use that.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
The HMAC-MD5 shash TFM is used only briefly during Session Setup stage,
when computing NTLMv2 hashes.
There's no need to keep it allocated in servers' secmech the whole time,
so keep its lifetime inside setup_ntlmv2_rsp().
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
When having several mounts that share same credential and the client
couldn't re-establish an SMB session due to an expired kerberos ticket
or rotated password, smb2_calc_signature() will end up flooding dmesg
when not finding SMB sessions to calculate signatures.
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
For extra channels, point ->secmech.{enc,dec} to the primary
server ones.
Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
In netfslib, a buffered writeback operation has a 'write queue' of folios
that are being written, held in a linear sequence of folio_queue structs.
The 'issuer' adds new folio_queues on the leading edge of the queue and
populates each one progressively; the 'collector' pops them off the
trailing edge and discards them and the folios they point to as they are
consumed.
The queue is required to always retain at least one folio_queue structure.
This allows the queue to be accessed without locking and with just a bit of
barriering.
When a new subrequest is prepared, its ->io_iter iterator is pointed at the
current end of the write queue and then the iterator is extended as more
data is added to the queue until the subrequest is committed.
Now, the problem is that the folio_queue at the leading edge of the write
queue when a subrequest is prepared might have been entirely consumed - but
not yet removed from the queue as it is the only remaining one and is
preventing the queue from collapsing.
So, what happens is that subreq->io_iter is pointed at the spent
folio_queue, then a new folio_queue is added, and, at that point, the
collector is at entirely at liberty to immediately delete the spent
folio_queue.
This leaves the subreq->io_iter pointing at a freed object. If the system
is lucky, iterate_folioq() sees ->io_iter, sees the as-yet uncorrupted
freed object and advances to the next folio_queue in the queue.
In the case seen, however, the freed object gets recycled and put back onto
the queue at the tail and filled to the end. This confuses
iterate_folioq() and it tries to step ->next, which may be NULL - resulting
in an oops.
Fix this by the following means:
(1) When preparing a write subrequest, make sure there's a folio_queue
struct with space in it at the leading edge of the queue. A function
to make space is split out of the function to append a folio so that
it can be called for this purpose.
(2) If the request struct iterator is pointing to a completely spent
folio_queue when we make space, then advance the iterator to the newly
allocated folio_queue. The subrequest's iterator will then be set
from this.
The oops could be triggered using the generic/346 xfstest with a filesystem
on9P over TCP with cache=loose. The oops looked something like:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
...
RIP: 0010:_copy_from_iter+0x2db/0x530
...
Call Trace:
<TASK>
...
p9pdu_vwritef+0x3d8/0x5d0
p9_client_prepare_req+0xa8/0x140
p9_client_rpc+0x81/0x280
p9_client_write+0xcf/0x1c0
v9fs_issue_write+0x87/0xc0
netfs_advance_write+0xa0/0xb0
netfs_write_folio.isra.0+0x42d/0x500
netfs_writepages+0x15a/0x1f0
do_writepages+0xd1/0x220
filemap_fdatawrite_wbc+0x5c/0x80
v9fs_mmap_vm_close+0x7d/0xb0
remove_vma+0x35/0x70
vms_complete_munmap_vmas+0x11a/0x170
do_vmi_align_munmap+0x17d/0x1c0
do_vmi_munmap+0x13e/0x150
__vm_munmap+0x92/0xd0
__x64_sys_munmap+0x17/0x20
do_syscall_64+0x80/0xe0
entry_SYSCALL_64_after_hwframe+0x71/0x79
This also fixed a similar-looking issue with cifs and generic/074.
Fixes: cd0277ed0c ("netfs: Use new folio_queue data type and iterator instead of xarray iter")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202409180928.f20b5a08-oliver.sang@intel.com
Closes: https://lore.kernel.org/oe-lkp/202409131438.3f225fbf-oliver.sang@intel.com
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: kernel test robot <oliver.sang@intel.com>
cc: Eric Van Hensbergen <ericvh@kernel.org>
cc: Latchesar Ionkov <lucho@ionkov.net>
cc: Dominique Martinet <asmadeus@codewreck.org>
cc: Christian Schoenebeck <linux_oss@crudebyte.com>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: v9fs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
syzbot reported an uninit-value BUG:
BUG: KMSAN: uninit-value in ocfs2_get_block+0xed2/0x2710 fs/ocfs2/aops.c:159
ocfs2_get_block+0xed2/0x2710 fs/ocfs2/aops.c:159
do_mpage_readpage+0xc45/0x2780 fs/mpage.c:225
mpage_readahead+0x43f/0x840 fs/mpage.c:374
ocfs2_readahead+0x269/0x320 fs/ocfs2/aops.c:381
read_pages+0x193/0x1110 mm/readahead.c:160
page_cache_ra_unbounded+0x901/0x9f0 mm/readahead.c:273
do_page_cache_ra mm/readahead.c:303 [inline]
force_page_cache_ra+0x3b1/0x4b0 mm/readahead.c:332
force_page_cache_readahead mm/internal.h:347 [inline]
generic_fadvise+0x6b0/0xa90 mm/fadvise.c:106
vfs_fadvise mm/fadvise.c:185 [inline]
ksys_fadvise64_64 mm/fadvise.c:199 [inline]
__do_sys_fadvise64 mm/fadvise.c:214 [inline]
__se_sys_fadvise64 mm/fadvise.c:212 [inline]
__x64_sys_fadvise64+0x1fb/0x3a0 mm/fadvise.c:212
x64_sys_call+0xe11/0x3ba0
arch/x86/include/generated/asm/syscalls_64.h:222
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcd/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
This is because when ocfs2_extent_map_get_blocks() fails, p_blkno is
uninitialized. So the error log will trigger the above uninit-value
access.
The error log is out-of-date since get_blocks() was removed long time ago.
And the error code will be logged in ocfs2_extent_map_get_blocks() once
ocfs2_get_cluster() fails, so fix this by only logging inode and block.
Link: https://syzkaller.appspot.com/bug?extid=9709e73bae885b05314b
Link: https://lkml.kernel.org/r/20240925090600.3643376-1-joseph.qi@linux.alibaba.com
Fixes: ccd979bdbc ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: syzbot+9709e73bae885b05314b@syzkaller.appspotmail.com
Tested-by: syzbot+9709e73bae885b05314b@syzkaller.appspotmail.com
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
syzbot has found a possible deadlock in ocfs2_get_system_file_inode [1].
The scenario is depicted here,
CPU0 CPU1
lock(&ocfs2_file_ip_alloc_sem_key);
lock(&osb->system_file_mutex);
lock(&ocfs2_file_ip_alloc_sem_key);
lock(&osb->system_file_mutex);
The function calls which could lead to this are:
CPU0
ocfs2_mknod - lock(&ocfs2_file_ip_alloc_sem_key);
.
.
.
ocfs2_get_system_file_inode - lock(&osb->system_file_mutex);
CPU1 -
ocfs2_fill_super - lock(&osb->system_file_mutex);
.
.
.
ocfs2_read_virt_blocks - lock(&ocfs2_file_ip_alloc_sem_key);
This issue can be resolved by making the down_read -> down_read_try
in the ocfs2_read_virt_blocks.
[1] https://syzkaller.appspot.com/bug?extid=e0055ea09f1f5e6fabdd
Link: https://lkml.kernel.org/r/20240924093257.7181-1-pvmohammedanees2003@gmail.com
Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reported-by: <syzbot+e0055ea09f1f5e6fabdd@syzkaller.appspotmail.com>
Closes: https://syzkaller.appspot.com/bug?extid=e0055ea09f1f5e6fabdd
Tested-by: syzbot+e0055ea09f1f5e6fabdd@syzkaller.appspotmail.com
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
One of our customers reported a crash and a corrupted ocfs2 filesystem.
The crash was due to the detection of corruption. Upon troubleshooting,
the fsck -fn output showed the below corruption
[EXTENT_LIST_FREE] Extent list in owner 33080590 claims 230 as the next free chain record,
but fsck believes the largest valid value is 227. Clamp the next record value? n
The stat output from the debugfs.ocfs2 showed the following corruption
where the "Next Free Rec:" had overshot the "Count:" in the root metadata
block.
Inode: 33080590 Mode: 0640 Generation: 2619713622 (0x9c25a856)
FS Generation: 904309833 (0x35e6ac49)
CRC32: 00000000 ECC: 0000
Type: Regular Attr: 0x0 Flags: Valid
Dynamic Features: (0x16) HasXattr InlineXattr Refcounted
Extended Attributes Block: 0 Extended Attributes Inline Size: 256
User: 0 (root) Group: 0 (root) Size: 281320357888
Links: 1 Clusters: 141738
ctime: 0x66911b56 0x316edcb8 -- Fri Jul 12 06:02:30.829349048 2024
atime: 0x66911d6b 0x7f7a28d -- Fri Jul 12 06:11:23.133669517 2024
mtime: 0x66911b56 0x12ed75d7 -- Fri Jul 12 06:02:30.317552087 2024
dtime: 0x0 -- Wed Dec 31 17:00:00 1969
Refcount Block: 2777346
Last Extblk: 2886943 Orphan Slot: 0
Sub Alloc Slot: 0 Sub Alloc Bit: 14
Tree Depth: 1 Count: 227 Next Free Rec: 230
## Offset Clusters Block#
0 0 2310 2776351
1 2310 2139 2777375
2 4449 1221 2778399
3 5670 731 2779423
4 6401 566 2780447
....... .... .......
....... .... .......
The issue was in the reflink workfow while reserving space for inline
xattr. The problematic function is ocfs2_reflink_xattr_inline(). By the
time this function is called the reflink tree is already recreated at the
destination inode from the source inode. At this point, this function
reserves space for inline xattrs at the destination inode without even
checking if there is space at the root metadata block. It simply reduces
the l_count from 243 to 227 thereby making space of 256 bytes for inline
xattr whereas the inode already has extents beyond this index (in this
case up to 230), thereby causing corruption.
The fix for this is to reserve space for inline metadata at the destination
inode before the reflink tree gets recreated. The customer has verified the
fix.
Link: https://lkml.kernel.org/r/20240918063844.1830332-1-gautham.ananthakrishna@oracle.com
Fixes: ef962df057 ("ocfs2: xattr: fix inlined xattr reflink")
Signed-off-by: Gautham Ananthakrishna <gautham.ananthakrishna@oracle.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>