linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-01 16:41:39 +00:00

Author	SHA1	Message	Date
Olga Kornievskaia	a204501e17	nfsd: prevent panic for nfsv4.0 closed files in nfs4_show_open Prior to commit `3f29cc82a8` ("nfsd: split sc_status out of sc_type") states_show() relied on sc_type field to be of valid type before calling into a subfunction to show content of a particular stateid. From that commit, we split the validity of the stateid into sc_status and no longer changed sc_type to 0 while unhashing the stateid. This resulted in kernel oopsing for nfsv4.0 opens that stay around and in nfs4_show_open() would derefence sc_file which was NULL. Instead, for closed open stateids forgo displaying information that relies of having a valid sc_file. To reproduce: mount the server with 4.0, read and close a file and then on the server cat /proc/fs/nfsd/clients/2/states [ 513.590804] Call trace: [ 513.590925] _raw_spin_lock+0xcc/0x160 [ 513.591119] nfs4_show_open+0x78/0x2c0 [nfsd] [ 513.591412] states_show+0x44c/0x488 [nfsd] [ 513.591681] seq_read_iter+0x5d8/0x760 [ 513.591896] seq_read+0x188/0x208 [ 513.592075] vfs_read+0x148/0x470 [ 513.592241] ksys_read+0xcc/0x178 Fixes: `3f29cc82a8` ("nfsd: split sc_status out of sc_type") Signed-off-by: Olga Kornievskaia <okorniev@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-08-23 12:25:46 -04:00
Ed Tsai	996b37da1e	backing-file: convert to using fops->splice_write Filesystems may define their own splice write. Therefore, use the file fops instead of invoking iter_file_splice_write() directly. Signed-off-by: Ed Tsai <ed.tsai@mediatek.com> Link: https://lore.kernel.org/r/20240708072208.25244-1-ed.tsai@mediatek.com Fixes: `5ca7346861` ("fuse: implement splice read/write passthrough") Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-23 13:08:31 +02:00
Trond Myklebust	f92214e4c3	NFS: Avoid unnecessary rescanning of the per-server delegation list If the call to nfs_delegation_grab_inode() fails, we will not have dropped any locks that require us to rescan the list. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:10 -04:00
Trond Myklebust	d72b796311	NFSv4: Fix clearing of layout segments in layoutreturn Make sure that we clear the layout segments in cases where we see a fatal error, and also in the case where the layout is invalid. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:10 -04:00
Trond Myklebust	a017ad1313	NFSv4: Add missing rescheduling points in nfs_client_return_marked_delegations We're seeing reports of soft lockups when iterating through the loops, so let's add rescheduling points. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:10 -04:00
Jeff Layton	95832998fb	nfs: fix bitmap decoder to handle a 3rd word It only decodes the first two words at this point. Have it decode the third word as well. Without this, the client doesn't send delegated timestamps in the CB_GETATTR response. With this change we also need to expand the on-stack bitmap in decode_recallany_args to 3 elements, in case the server sends a larger bitmap than expected. Fixes: `43df7110f4` ("NFSv4: Add CB_GETATTR support for delegated attributes") Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:10 -04:00
Jeff Layton	cb78f9b7d0	nfs: fix the fetch of FATTR4_OPEN_ARGUMENTS The client doesn't properly request FATTR4_OPEN_ARGUMENTS in the initial SERVER_CAPS getattr. Add FATTR4_WORD2_OPEN_ARGUMENTS to the initial request. Fixes: `707f13b3d0` (NFSv4: Add support for the FATTR4_OPEN_ARGUMENTS attribute) Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2024-08-22 17:01:09 -04:00
ChenXiaoSong	5e51224d2a	smb/client: fix typo: GlobalMid_Sem -> GlobalMid_Lock The comments have typos, fix that to not confuse readers. Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 15:44:19 -05:00
Jeff Layton	f58bab6fd4	nfsd: ensure that nfsd4_fattr_args.context is zeroed out If nfsd4_encode_fattr4 ends up doing a "goto out" before we get to checking for the security label, then args.context will be set to uninitialized junk on the stack, which we'll then try to free. Initialize it early. Fixes: `f59388a579` ("NFSD: Add nfsd4_encode_fattr4_sec_label()") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2024-08-22 14:49:10 -04:00
Paulo Alcantara	ec68680411	smb: client: ignore unhandled reparse tags Just ignore reparse points that the client can't parse rather than bailing out and not opening the file or directory. Reported-by: Marc <1marc1@gmail.com> Closes: https://lore.kernel.org/r/CAMHwNVv-B+Q6wa0FEXrAuzdchzcJRsPKDDRrNaYZJd6X-+iJzw@mail.gmail.com Fixes: `539aad7f14` ("smb: client: introduce ->parse_reparse_point()") Tested-by: Anthony Nandaa (Microsoft) <profnandaa@gmail.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 12:37:16 -05:00
Steve French	15179cf280	smb3: fix problem unloading module due to leaked refcount on shutdown The shutdown ioctl can leak a refcount on the tlink which can prevent rmmod (unloading the cifs.ko) module from working. Found while debugging xfstest generic/043 Fixes: `69ca1f5755` ("smb3: add dynamic tracepoints for shutdown ioctl") Reviewed-by: Meetakshi Setiya <msetiya@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 12:36:57 -05:00
ChenXiaoSong	2b7e0573a4	smb/server: update misguided comment of smb2_allocate_rsp_buf() smb2_allocate_rsp_buf() will return other error code except -ENOMEM. Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
ChenXiaoSong	0dd771b7d6	smb/server: remove useless assignment of 'file_present' in smb2_open() The variable is already true here. Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
ChenXiaoSong	4e8771a366	smb/server: fix potential null-ptr-deref of lease_ctx_info in smb2_open() null-ptr-deref will occur when (req_op_level == SMB2_OPLOCK_LEVEL_LEASE) and parse_lease_state() return NULL. Fix this by check if 'lease_ctx_info' is NULL. Additionally, remove the redundant parentheses in parse_durable_handle_context(). Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
ChenXiaoSong	2186a11653	smb/server: fix return value of smb2_open() In most error cases, error code is not returned in smb2_open(), __process_request() will not print error message. Fix this by returning the correct value at the end of smb2_open(). Signed-off-by: ChenXiaoSong <chenxiaosong@kylinos.cn> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
Namjae Jeon	ce61b605a0	ksmbd: the buffer of smb2 query dir response has at least 1 byte When STATUS_NO_MORE_FILES status is set to smb2 query dir response, ->StructureSize is set to 9, which mean buffer has 1 byte. This issue occurs because ->Buffer[1] in smb2_query_directory_rsp to flex-array. Fixes: `eb3e28c1e8` ("smb3: Replace smb2pdu 1-element arrays with flex-arrays") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-22 09:52:00 -05:00
Kent Overstreet	a592cdf516	bcachefs: don't use rht_bucket() in btree_key_cache_scan() rht_bucket() does strange complicated things when a rehash is in progress. Instead, just skip scanning when a rehash is in progress: scanning is going to be more expensive (many more empty slots to cover), and some sort of infinite loop is being observed Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 10:04:41 -04:00
Kent Overstreet	3e878fe5a0	bcachefs: add missing inode_walker_exit() fix a small leak Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 10:04:41 -04:00
Kent Overstreet	87313ac1f1	bcachefs: clear path->should_be_locked in bch2_btree_key_cache_drop() bch2_btree_key_cache_drop() evicts the key cache entry - it's used when we're doing an update that bypasses the key cache, because for cache coherency reasons a key can't be in the key cache unless it also exists in the btree - i.e. creates have to bypass the cache. After evicting, the path no longer points to a key cache key, and relock() will always fail if should_be_locked is true. Prep for improving path->should_be_locked assertions Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 03:12:57 -04:00
Yuesong Li	dedb2fe375	bcachefs: Fix double assignment in check_dirent_to_subvol() ret was assigned twice in check_dirent_to_subvol(). Reported by cocci. Signed-off-by: Yuesong Li <liyuesong@vivo.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:40:52 -04:00
Kent Overstreet	0b50b7313e	bcachefs: Fix refcounting in discard path bch_dev->io_ref does not protect against the filesystem going away; bch_fs->writes does. Thus the filesystem write ref needs to be the last ref we release. Reported-by: syzbot+9e0404b505e604f67e41@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	8ed823b192	bcachefs: Fix compat issue with old alloc_v4 keys we allow new fields to be added to existing key types, and new versions should treat them as being zeroed; this was not handled in alloc_v4_validate. Reported-by: syzbot+3b2968fa4953885dd66a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	7f2de6947f	bcachefs: Fix warning in bch2_fs_journal_stop() j->last_empty_seq needs to match j->seq when the journal is empty Reported-by: syzbot+4093905737cf289b6b38@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	06f67437ab	fs/super.c: improve get_tree() error message seeing an odd bug where we fail to correctly return an error from .get_tree(): https://syzkaller.appspot.com/bug?extid=c0360e8367d6d8d04a66 we need to be able to distinguish between accidently returning a positive error (as implied by the log) and no error. Cc: David Howells <dhowells@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	bdbdd4759f	bcachefs: Fix missing validation in bch2_sb_journal_v2_validate() Reported-by: syzbot+47ecc948aadfb2ab3efc@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	cab18be695	bcachefs: Fix replay_now_at() assert Journal replay, in the slowpath where we insert keys in journal order, was inserting keys in the wrong order; keys from early repair come last. Reported-by: syzbot+2c4fcb257ce2b6a29d0e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	6575b8c987	bcachefs: Fix locking in bch2_ioc_setlabel() Fixes: `7a254053a5` ("bcachefs: support FS_IOC_SETFSLABEL") Reported-by: syzbot+7e9efdfec27fbde0141d@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	5dbfc4ef72	bcachefs: fix failure to relock in btree_node_fill() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	3c5d0b72a8	bcachefs: fix failure to relock in bch2_btree_node_mem_alloc() We weren't always so strict about trans->locked state - but now we are, and new assertions are shaking some bugs out. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	1dceae4cc1	bcachefs: unlock_long() before resort in journal replay Fix another SRCU splat - this one pretty harmless. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	cecc328240	bcachefs: fix missing bch2_err_str() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:23 -04:00
Kent Overstreet	b8db1bd802	bcachefs: fix time_stats_to_text() Fixes: `7423330e30` ("bcachefs: prt_printf() now respects \r\n\t") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	c2a503f3e9	bcachefs: Fix bch2_bucket_gens_init() Comparing the wrong bpos - this was missed because normally bucket_gens_init() runs on brand new filesystems, but this bug caused it to overwrite bucket_gens keys with 0s when upgrading ancient filesystems. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	e150a7e89c	bcachefs: Fix bch2_trigger_alloc assert On testing on an old mangled filesystem, we missed a case. Fixes: `bd864bc2d9` ("bcachefs: Fix bch2_trigger_alloc when upgrading from old versions") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	49203a6b9d	bcachefs: Fix failure to relock in btree_node_get() discovered by new trans->locked asserts Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	548e7f5167	bcachefs: setting bcachefs_effective.* xattrs is a noop bcachefs_effective.* xattrs show the options inherited from parent directories (as well as explicitly set); this namespace is not for setting bcachefs options. Change the .set() handler to a noop so that if e.g. rsync is copying xattrs it'll do the right thing, and only copy xattrs in the bcachefs.* namespace. We don't want to return an error, because that will cause rsync to bail out or get spammy. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	8cc0e50614	bcachefs: Fix "trying to move an extent, but nr_replicas=0" data_update_init() does a bunch of complicated stuff to decide how many replicas to add, since we only want to increase an extent's durability on an explicit rereplicate, but extent pointers may be on devices with different durability settings. There was a corner case when evacuating a device that had been set to durability=0 after data had been written to it, and extents on that device had already been rereplicated - then evacuate only needs to drop pointers on that device, not move them. So the assert for !m->op.nr_replicas was spurious; this was a perfectly legitimate case that needed to be handled. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Kent Overstreet	3f53d05041	bcachefs: bch2_data_update_init() cleanup Factor out some helpers - this function has gotten much too big. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-22 02:07:22 -04:00
Linus Torvalds	5c6154ffd4	Changes since last update: - Allow large folios on compressed inodes; - Fix invalid memory accesses if z_erofs_gbuf_growsize() partially fails; - Two minor cleanups. -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmbF1s8RHHhpYW5nQGtl cm5lbC5vcmcACgkQUXZn5Zlu5qqzghAAjSi/OITIcpjYe3GOxDNEbIxTFf+5d7ge 66KRMBm5FG6xBaXOK4xhit2EdFsDVMAna6FjYyRJYpDDQX5MAUsifDeZeGZCKOV5 KpUk2CfhIp1zi9CeVA5Tl6rfdz0b0SKz77uRK9BN8PWmCdbeenOKgK6HU06S5enC uPud9QegQELXad0bIe+uo+4zGQnqxwGMfkNEW+cfbp200m6kpig/SgttoXCtz1mm CEZpv87MlrQLbLtox+KcesFhYofxWdx/My5ckJVbaHGZdM29HKehRLIsBFX3di5N 4OkGnFbh16lhx6s4a/b1pCeDNH1NcmP7uqCoGr+dCVKQ79glSZQvfiV4myQk7vRN 4Orj+zq4aBSmYOvizIFMH++IGDioPBxkhKv3ReeCe4t/L8LHPkGEaeT+iM9b+b4n uK0faj6qREygf2rHIRr6ciq7GdxlMwdlF7AeNUIyCHNO84J39+GkAID3Nqs+vzuF wSJAuqp959WgJf2Co5ugb4vjbrpq8TKfZG0yk7KnIOWJkhAKDcIwta+ooloIYtD8 5gHhMRodIgKjYKPOi9vx9cQPlISTMsqVWpRuyzH9vCvqNvkzDriAez1kFq01tZTO gcK5QGekSgft/5h7PLFrbVV1kVfVdjNapsIY+XAEmBGGCJCScCrlIxjbsRwsseFW JIpQXilIU3c= =gFZW -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.11-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs fixes from Gao Xiang: "As I mentioned in the merge window pull request, there is a regression which could cause system hang due to page migration. The corresponding fix landed upstream through MM tree last week (commit `2e6506e1c4`: "mm/migrate: fix deadlock in migrate_pages_batch() on large folios"), therefore large folios can be safely allowed for compressed inodes and stress tests have been running on my fleet for over 20 days without any regression. Users have explicitly requested this for months, so let's allow large folios for EROFS full cases now for wider testing. Additionally, there is a fix which addresses invalid memory accesses on a failure path triggered by fault injection and two minor cleanups to simplify the codebase. Summary: - Allow large folios on compressed inodes - Fix invalid memory accesses if z_erofs_gbuf_growsize() partially fails - Two minor cleanups" * tag 'erofs-for-6.11-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: fix out-of-bound access when z_erofs_gbuf_growsize() partially fails erofs: allow large folios for compressed files erofs: get rid of check_layout_compatibility() erofs: simplify readdir operation	2024-08-22 06:06:09 +08:00
Christian Brauner	524b2c6dc8	romfs: fix romfs_read_folio() Add the correct offset to folio_zero_tail(). Fixes: `d86f2de026` ("romfs: Convert romfs_read_folio() to use a folio") Reported-by: Greg Ungerer <gregungerer@westnet.com.au> Link: https://lore.kernel.org/r/Zr0GTnPHfeA0P8nb@casper.infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-21 22:32:58 +02:00
David Howells	92764e8822	netfs, ceph: Partially revert "netfs: Replace PG_fscache by setting folio->private and marking dirty" This partially reverts commit `2ff1e97587`. In addition to reverting the removal of PG_private_2 wrangling from the buffered read code[1][2], the removal of the waits for PG_private_2 from netfs_release_folio() and netfs_invalidate_folio() need reverting too. It also adds a wait into ceph_evict_inode() to wait for netfs read and copy-to-cache ops to complete. Fixes: `2ff1e97587` ("netfs: Replace PG_fscache by setting folio->private and marking dirty") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk [1] Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e5ced7804cb9184c4a23f8054551240562a8eda [2] Link: https://lore.kernel.org/r/20240814203850.2240469-2-dhowells@redhat.com cc: Max Kellermann <max.kellermann@ionos.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-21 22:32:58 +02:00
Gao Xiang	0005e01e1e	erofs: fix out-of-bound access when z_erofs_gbuf_growsize() partially fails If z_erofs_gbuf_growsize() partially fails on a global buffer due to memory allocation failure or fault injection (as reported by syzbot [1]), new pages need to be freed by comparing to the existing pages to avoid memory leaks. However, the old gbuf->pages[] array may not be large enough, which can lead to null-ptr-deref or out-of-bound access. Fix this by checking against gbuf->nrpages in advance. [1] https://lore.kernel.org/r/000000000000f7b96e062018c6e3@google.com Reported-by: syzbot+242ee56aaa9585553766@syzkaller.appspotmail.com Fixes: `d6db47e571` ("erofs: do not use pagepool in z_erofs_gbuf_growsize()") Cc: <stable@vger.kernel.org> # 6.10+ Reviewed-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Sandeep Dhavale <dhavale@google.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240820085619.1375963-1-hsiangkao@linux.alibaba.com	2024-08-21 08:12:05 +08:00
Kent Overstreet	2102bdac67	bcachefs: Extra debug for data move path We don't have sufficient information to debug: https://github.com/koverstreet/bcachefs/issues/726 - print out durability of extent ptrs, when non default - print the number of replicas we need in data_update_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-19 21:42:08 -04:00
Gao Xiang	e080a26725	erofs: allow large folios for compressed files As commit `2e6506e1c4` ("mm/migrate: fix deadlock in migrate_pages_batch() on large folios") has landed upstream, large folios can be safely enabled for compressed inodes since all prerequisites have already landed in 6.11-rc1. Stress tests has been running on my fleet for over 20 days without any regression. Additionally, users [1] have requested it for months. Let's allow large folios for EROFS full cases upstream now for wider testing. [1] https://lore.kernel.org/r/CAGsJ_4wtE8OcpinuqVwG4jtdx6Qh5f+TON6wz+4HMCq=A2qFcA@mail.gmail.com Cc: Barry Song <21cnbao@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> [ Gao Xiang: minor commit typo fixes. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240819025207.3808649-1-hsiangkao@linux.alibaba.com	2024-08-19 16:10:04 +08:00
Hongzhen Luo	2c534624ae	erofs: get rid of check_layout_compatibility() Simple enough to just open-code it. Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240806112208.150323-1-hongzhen@linux.alibaba.com Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-08-19 11:06:20 +08:00
Hongzhen Luo	5b5c96c63d	erofs: simplify readdir operation - Use i_size instead of i_size_read() due to immutable fses; - Get rid of an unneeded goto since erofs_fill_dentries() also works; - Remove unnecessary lines. Signed-off-by: Hongzhen Luo <hongzhen@linux.alibaba.com> Link: https://lore.kernel.org/r/20240801112622.2164029-1-hongzhen@linux.alibaba.com Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-08-19 11:06:20 +08:00
Kent Overstreet	47cdc7b144	bcachefs: Fix incorrect gfp flags fixes: 00488 WARNING: CPU: 9 PID: 194 at mm/page_alloc.c:4410 __alloc_pages_noprof+0x1818/0x1888 00488 Modules linked in: 00488 CPU: 9 UID: 0 PID: 194 Comm: kworker/u66:1 Not tainted 6.11.0-rc1-ktest-g18fa10d6495f #2931 00488 Hardware name: linux,dummy-virt (DT) 00488 Workqueue: writeback wb_workfn (flush-bcachefs-2) 00488 pstate: 20001005 (nzCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 00488 pc : __alloc_pages_noprof+0x1818/0x1888 00488 lr : __alloc_pages_noprof+0x5f4/0x1888 00488 sp : ffffff80ccd8ed00 00488 x29: ffffff80ccd8ed00 x28: 0000000000000000 x27: dfffffc000000000 00488 x26: 0000000000000010 x25: 0000000000000002 x24: 0000000000000000 00488 x23: 0000000000000000 x22: 1ffffff0199b1dbe x21: ffffff80cc680900 00488 x20: 0000000000000000 x19: ffffff80ccd8eed0 x18: 0000000000000000 00488 x17: ffffff80cc58a010 x16: dfffffc000000000 x15: 1ffffff00474e518 00488 x14: 1ffffff00474e518 x13: 1ffffff00474e518 x12: ffffffb8104701b9 00488 x11: 1ffffff8104701b8 x10: ffffffb8104701b8 x9 : ffffffc08043cde8 00488 x8 : 00000047efb8fe48 x7 : ffffff80ccd8ee20 x6 : 0000000000048000 00488 x5 : 1ffffff810470138 x4 : 0000000000000050 x3 : 1ffffff0199b1d94 00488 x2 : ffffffb0199b1d94 x1 : 0000000000000001 x0 : ffffffc082387448 00488 Call trace: 00488 __alloc_pages_noprof+0x1818/0x1888 00488 new_slab+0x284/0x2f0 00488 ___slab_alloc+0x208/0x8e0 00488 __kmalloc_noprof+0x328/0x340 00488 __bch2_writepage+0x106c/0x1830 00488 write_cache_pages+0xa0/0xe8 due to __GFP_NOFAIL without allowing reclaim Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-18 20:42:05 -04:00
Kent Overstreet	d9f49c3106	bcachefs: fix field-spanning write warning attempts to retrofit memory safety onto C are increasingly annoying ------------[ cut here ]------------ memcpy: detected field-spanning write (size 4) of single field "&k.replicas" at fs/bcachefs/replicas.c:454 (size 3) WARNING: CPU: 5 PID: 6525 at fs/bcachefs/replicas.c:454 bch2_replicas_gc2+0x2cb/0x400 [bcachefs] bch2_replicas_gc2+0x2cb/0x400: bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3) Modules linked in: dm_mod tun nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay msr sctp bcachefs lz4hc_compress lz4_compress libcrc32c xor raid6_pq lz4_decompress pps_ldisc pps_core wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel curve25519_x86_64 libcurve25519_generic libchacha sit tunnel4 ip_tunnel af_packet bridge stp llc ip6table_nat ip6table_filter ip6_tables xt_MASQUERADE xt_conntrack iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter ip_tables x_tables tcp_bbr sch_fq_codel efivarfs nls_iso8859_1 nls_cp437 vfat fat cdc_mbim cdc_wdm cdc_ncm cdc_ether usbnet r8152 input_leds joydev mii amdgpu mousedev hid_generic usbhid hid ath10k_pci amd_atl edac_mce_amd ath10k_core kvm_amd ath kvm mac80211 bfq crc32_pclmul crc32c_intel polyval_clmulni polyval_generic sha512_ssse3 sha256_ssse3 sha1_ssse3 snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg i2c_algo_bit drm_exec snd_hda_codec r8169 drm_suballoc_helper aesni_intel gf128mul crypto_simd amdxcp realtek mfd_core tpm_crb drm_buddy snd_hwdep mdio_devres libarc4 cryptd tpm_tis wmi_bmof cfg80211 evdev libphy snd_hda_core tpm_tis_core gpu_sched rapl xhci_pci xhci_hcd snd_pcm drm_display_helper snd_timer tpm sp5100_tco rfkill efi_pstore mpt3sas drm_ttm_helper ahci usbcore libaescfb ccp snd ttm 8250 libahci watchdog soundcore raid_class sha1_generic acpi_cpufreq k10temp 8250_base usb_common scsi_transport_sas i2c_piix4 hwmon video serial_mctrl_gpio serial_base ecdh_generic wmi rtc_cmos backlight ecc gpio_amdpt rng_core gpio_generic button CPU: 5 UID: 0 PID: 6525 Comm: bcachefs Tainted: G W 6.11.0-rc1-ojab-00058-g224bc118aec9 #6 6d5debde398d2a84851f42ab300dae32c2992027 Tainted: [W]=WARN RIP: 0010:bch2_replicas_gc2+0x2cb/0x400 [bcachefs] Code: c7 c2 60 91 d1 c1 48 89 c6 48 c7 c7 98 91 d1 c1 4c 89 14 24 44 89 5c 24 08 48 89 44 24 20 c6 05 fa 68 04 00 01 e8 05 a3 40 e4 <0f> 0b 4c 8b 14 24 44 8b 5c 24 08 48 8b 44 24 20 e9 55 fe ff ff 8b RSP: 0018:ffffb434c9263d60 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff9a8efa79cc00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffb434c9263de0 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005 R13: ffff9a8efa73c300 R14: ffff9a8d9e880000 R15: ffff9a8d9e8806f8 FS: 0000000000000000(0000) GS:ffff9a9410c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000565423373090 CR3: 0000000164e30000 CR4: 00000000003506f0 Call Trace: <TASK> ? __warn+0x97/0x150 ? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3] bch2_replicas_gc2+0x2cb/0x400: bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3) ? report_bug+0x196/0x1c0 ? handle_bug+0x3c/0x70 ? exc_invalid_op+0x17/0x80 ? __wake_up_klogd.part.0+0x4c/0x80 ? asm_exc_invalid_op+0x16/0x20 ? bch2_replicas_gc2+0x2cb/0x400 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3] bch2_replicas_gc2+0x2cb/0x400: bch2_replicas_gc2 at /home/ojab/src/bcachefs/fs/bcachefs/replicas.c:454 (discriminator 3) ? bch2_dev_usage_read+0xa0/0xa0 [bcachefs 9803eca5e131ef28f26250ede34072d5b50d98b3] bch2_dev_usage_read+0xa0/0xa0: discard_in_flight_remove at /home/ojab/src/bcachefs/fs/bcachefs/alloc_background.c:1712 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-18 20:42:05 -04:00
Kent Overstreet	d6d539c9a7	bcachefs: Reallocate table when we're increasing size Fixes: `c2f6e16a67` ("bcachefs: Increase size of cuckoo hash table on too many rehashes") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-18 20:41:50 -04:00
Thorsten Blum	7c525dddbe	ksmbd: Replace one-element arrays with flexible-array members Replace the deprecated one-element arrays with flexible-array members in the structs filesystem_attribute_info and filesystem_device_info. There are no binary differences after this conversion. Link: https://github.com/KSPP/linux/issues/79 Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-18 17:02:36 -05:00
Namjae Jeon	76e98a158b	ksmbd: fix race condition between destroy_previous_session() and smb2 operations() If there is ->PreviousSessionId field in the session setup request, The session of the previous connection should be destroyed. During this, if the smb2 operation requests in the previous session are being processed, a racy issue could happen with ksmbd_destroy_file_table(). This patch sets conn->status to KSMBD_SESS_NEED_RECONNECT to block incoming operations and waits until on-going operations are complete (i.e. idle) before desctorying the previous session. Fixes: `c8efcc7861` ("ksmbd: add support for durable handles v1/v2") Cc: stable@vger.kernel.org # v6.6+ Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-25040 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-18 17:02:36 -05:00
Namjae Jeon	dfd046d0ce	ksmbd: Use unsafe_memcpy() for ntlm_negotiate rsp buffer is allocated larger than spnego_blob from smb2_allocate_rsp_buf(). Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-18 17:02:36 -05:00
Steve French	e4be320eec	smb3: fix broken cached reads when posix locks Mandatory locking is enforced for cached reads, which violates default posix semantics, and also it is enforced inconsistently. This affected recent versions of libreoffice, and can be demonstrated by opening a file twice from the same client, locking it from handle one and trying to read from it from handle two (which fails, returning EACCES). There is already a mount option "forcemandatorylock" (which defaults to off), so with this change only when the user intentionally specifies "forcemandatorylock" on mount will we break posix semantics on read to a locked range (ie we will only fail in this case, if the user mounts with "forcemandatorylock"). An earlier patch fixed the write path. Fixes: `85160e03a7` ("CIFS: Implement caching mechanism for mandatory brlocks") Cc: stable@vger.kernel.org Cc: Pavel Shilovsky <piastryyy@gmail.com> Reviewed-by: David Howells <dhowells@redhat.com> Reported-by: abartlet@samba.org Reported-by: Kevin Ottens <kevin.ottens@enioka.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-18 17:01:06 -05:00
Linus Torvalds	57b14823ea	for-6.11-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmbB6dwACgkQxWXV+ddt WDu/Eg/9EXEoSPqRYxsRa2vLQjSbBbBCBDW1G75F5oUsvRKtfNhZ+w02JAqubirF wdBNaXoGQ9zJq/E0JtHHDqv9M6FV+g3aO0xM+ntmp9cZdFBVXRkrB3TlewMesKfI lXZW5kn35q6aeNi2MaJjk2G5Pr0MYjGGRezBuloc7TcIlgVijjLBlcnKEz263C1/ rXvENxowxPA20LiWviA4ZjlqlRQLBrgxqpSXLGg7mZs93XdbtPa2ZvzS7ffuZTI+ PUCYGEwI4E2Dpv+mswFb21SUdUPPmAycubERJvABqnZxCWupkevgvv+6MeWC5A2p 7OjoTmINRDDsNWYSvyHQ04U+0XkPmHCBKEkAuy0ZIajHJU4G6rpeDySX04/Cwzht mJZ37FzMGZ9LjEpL1uoPifWKcH0nUW9sWw4Tw9tgeuBG9RfI/BxZqRT9WLEkiXUI 7Bdq2Ir6fzv8IVKWkuqO6No+LDa//qF2ci0nWvCwbgdNquGFa9DmNmVDiKDqKW3R RlP+6laXEPCfKSycTpp94gASVeEEcKNYYC0B/FCBLJmXVcCJQ41qsMn/fYUn8Yn3 vVdmAuGKThYjO4RNjqy4FgNVLdY9280OGazH0B9t3HhL+U+9JLN8fnhD5H4hEpR/ dDC1quKmo6WIKAem6O6r3GRnCj8lbLaXpmp+MpyEUI7M5FyjsAc= =YBfJ -----END PGP SIGNATURE----- Merge tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs fixes from David Sterba: "A more fixes. We got reports that shrinker added in 6.10 still causes latency spikes and the fixes don't handle all corner cases. Due to summer holidays we're taking a shortcut to disable it for release builds and will fix it in the near future. - only enable extent map shrinker for DEBUG builds, temporary quick fix to avoid latency spikes for regular builds - update target inode's ctime on unlink, mandated by POSIX - properly take lock to read/update block group's zoned variables - add counted_by() annotations" * tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: only enable extent map shrinker for DEBUG builds btrfs: zoned: properly take lock to read/update block group's zoned variables btrfs: tree-checker: add dev extent item checks btrfs: update target inode's ctime on unlink btrfs: send: annotate struct name_cache_entry with __counted_by()	2024-08-18 08:50:36 -07:00
Jann Horn	3c0da3d163	fuse: Initialize beyond-EOF page contents before setting uptodate fuse_notify_store(), unlike fuse_do_readpage(), does not enable page zeroing (because it can be used to change partial page contents). So fuse_notify_store() must be more careful to fully initialize page contents (including parts of the page that are beyond end-of-file) before marking the page uptodate. The current code can leave beyond-EOF page contents uninitialized, which makes these uninitialized page contents visible to userspace via mmap(). This is an information leak, but only affects systems which do not enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the corresponding kernel command line parameter). Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574 Cc: stable@kernel.org Fixes: `a1d75f2582` ("fuse: add store request") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-08-18 08:45:39 -07:00
Linus Torvalds	e0fac5fc8b	three client fixes, including two for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmbBGkoACgkQiiy9cAdy T1HAJAv9G2efGXOuLHuDKM4IkoUBoeAsC/o5g5sVbZfINON1Ra0vQBLmRLunhAlW xIY2Ln92jMdvM6wNwFcsAI5bIWTiIrjdqP/HY9kiKRU5O5NvqNWeyPEDOB3aM41O UXq8jNKyyyyFD1P4QJNYMeZucTZatLJVb7WRZHGDEDcVMrCWdDVcnPwnMfyNeD0w GndMPAAxiQxV+AoL+RgE6+nfVr4EwHI3VFG/h3FyNcaMp2ZSzYHDu/TIwmGBHq6P DCJyxjKMJoXKzKO+3hVp3tKzKZ9EuE3ljb8liBbZ8g6J4quCHbQWC3Mh8Jhmgav6 1KhDRKI6vjHZwu8tWjBEgadhwcRBHMuz/YZL+zrx3QHjA/AgV20Y7oyvyXKusj9t G5C1bTExusdhLnEOGN4+udxjAHrMkW36R6Vux5D85WYmhR3k2AbIdZevA+mLADKU veTye1VAX5vy9h0atyV69Zta9aBU6q3Mhcpgrcbj0u3C/Iuu1DafrEmb5hGgW7Dw xnGynYax =af3x -----END PGP SIGNATURE----- Merge tag 'v6.11-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - fix for clang warning - additional null check - fix for cached write with posix locks - flexible structure fix * tag 'v6.11-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: smb: smb2pdu.h: Use static_assert() to check struct sizes smb3: fix lock breakage for cached writes smb/client: avoid possible NULL dereference in cifs_free_subrequest()	2024-08-17 16:31:12 -07:00
Linus Torvalds	d09840f8b3	Bug fixes for 6.11-rc4: * Check for presence of only 'attr' feature before scrubbing an inode's attribute fork. * Restore the behaviour of setting AIL thread to TASK_INTERRUPTIBLE for long (i.e. 50ms) sleep durations to prevent high load averages. * Do not allow users to change the realtime flag of a file unless the datadev and rtdev both support fsdax access modes. Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZr1wqwAKCRAH7y4RirJu 9MYxAQCgHoAK8rqxb4obrrGmqVcHJdnHDYqSFRqbbvytRHybZgEA2hfaNbNpuQYT JOV5pGOUJf1LiSc5D6MBepg2BAFRNwo= =7Ibh -----END PGP SIGNATURE----- Merge tag 'xfs-6.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Chandan Babu: - Check for presence of only 'attr' feature before scrubbing an inode's attribute fork. - Restore the behaviour of setting AIL thread to TASK_INTERRUPTIBLE for long (i.e. 50ms) sleep durations to prevent high load averages. - Do not allow users to change the realtime flag of a file unless the datadev and rtdev both support fsdax access modes. * tag 'xfs-6.11-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set xfs: revert AIL TASK_KILLABLE threshold xfs: attr forks require attr, not attr2	2024-08-17 09:51:28 -07:00
Linus Torvalds	b718175853	bcachefs fixes for 6.11-rc4 - New on disk format version, bcachefs_metadata_version_disk_accounting_inum This adds one more disk accounting counter, which counts disk usage and number of extents per inode number. This lets us track fragmentation, for implementing defragmentation later, and it also counts disk usage per inode in all snapshots, which will be a useful thing to expose to users. - One performance issue we've observed is threads spinning when they should be waiting for dirty keys in the key cache to be flushed by journal reclaim, so we now have hysteresis for the waiting thread, as well as improving the tracepoint and a new time_stat, for tracking time blocked waiting on key cache flushing. And, various assorted smaller fixes. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma/9QkACgkQE6szbY3K bnYcBw/+LBSZ415gWSjPktdecf5rc6K4KxETxAxV0f0KesYzxqAtQzN0SCDvKt65 3aALU03wM8vWITiLS38/ckT+j6S2BpXcOxdu/OC0nRYQEUg9ZLvqEG5lQ3a/LliV Q64N33qsSr6QaKszFllLYcN4tGduKg8HoMlHn6+vJ7HNPjdfv0HHERSUsc7K84/w jkRtDE2NxsRJZKMEvIFp8hd5KXUR5zyBz/kc4P0WliLXpSyJLITzhKw1JV7ikKVD 0mO2bJ/0i7wPIabAD2HJahvbC7fl+2fkYFxUJ2XnvMTgU/+QyeGHEufbcbVrVSp0 BpzBTmSMFbGXBkbQBruFX5rJetzXeBqdYf0Yfavd4KDhGvYlSfDZQUapXT1QKC2q aHSB/s+2r7Crr/MBJyjbeFgXFTNGvI5yerlbdp2yj1kxjYJHHaKrp6h7n6XXk21W /mGF5tkIMkFTv98rQnIaky4neJzOPsLTTgxeR8zEudCgMaVUqEcaMdIFvARDjY/3 n52VR0zl3olV3vu7LgHaHfgH6lfaMV0sHPaGNYGL0YL+bCJD+lYM8a6l9aaks8vk md7+mFcOS4FUdDdS8MEKIN/k/gkEOC/EpmI864i9rIl0SiNXNy7FPTDKON8b+Ury 5omBMUQMEe9Q/pgKGXfpJWFynhSPEVf4y1DIOsrXk/jeBqenFyo= =BPGT -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-08-16' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent OverstreetL - New on disk format version, bcachefs_metadata_version_disk_accounting_inum This adds one more disk accounting counter, which counts disk usage and number of extents per inode number. This lets us track fragmentation, for implementing defragmentation later, and it also counts disk usage per inode in all snapshots, which will be a useful thing to expose to users. - One performance issue we've observed is threads spinning when they should be waiting for dirty keys in the key cache to be flushed by journal reclaim, so we now have hysteresis for the waiting thread, as well as improving the tracepoint and a new time_stat, for tracking time blocked waiting on key cache flushing. ... and various assorted smaller fixes. * tag 'bcachefs-2024-08-16' of git://evilpiepirate.org/bcachefs: bcachefs: Fix locking in __bch2_trans_mark_dev_sb() bcachefs: fix incorrect i_state usage bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru bcachefs: Fix forgetting to pass trans to fsck_err() bcachefs: Increase size of cuckoo hash table on too many rehashes bcachefs: bcachefs_metadata_version_disk_accounting_inum bcachefs: Kill __bch2_accounting_mem_mod() bcachefs: Make bkey_fsck_err() a wrapper around fsck_err() bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in bcachefs: Add a time_stat for blocked on key cache flush bcachefs: Improve trans_blocked_journal_reclaim tracepoint bcachefs: Add hysteresis to waiting on btree key cache flush lib/generic-radix-tree.c: Fix rare race in __genradix_ptr_alloc() bcachefs: Convert for_each_btree_node() to lockrestart_do() bcachefs: Add missing downgrade table entry bcachefs: disk accounting: ignore unknown types bcachefs: bch2_accounting_invalid() fixup bcachefs: Fix bch2_trigger_alloc when upgrading from old versions bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached()	2024-08-17 09:46:10 -07:00
Kent Overstreet	0e49d3ff12	bcachefs: Fix locking in __bch2_trans_mark_dev_sb() We run this in full RW mode now, so we have to guard against the superblock buffer being reallocated. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 20:45:15 -04:00
Qu Wenruo	534f7eff92	btrfs: only enable extent map shrinker for DEBUG builds Although there are several patches improving the extent map shrinker, there are still reports of too frequent shrinker behavior, taking too much CPU for the kswapd process. So let's only enable extent shrinker for now, until we got more comprehensive understanding and a better solution. Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/ Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/ Fixes: `956a17d9d0` ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-16 21:22:39 +02:00
Kent Overstreet	99c87fe0f5	bcachefs: fix incorrect i_state usage Reported-by: syzbot+95e40eae71609e40d851@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 12:46:40 -04:00
Kent Overstreet	9482f3b053	bcachefs: avoid overflowing LRU_TIME_BITS for cached data lru Reported-by: syzbot+510b0b28f8e6de64d307@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 12:46:40 -04:00
Kent Overstreet	075cabf324	bcachefs: Fix forgetting to pass trans to fsck_err() Reported-by: syzbot+e3938cd6d761b78750e6@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 12:46:40 -04:00
Kent Overstreet	c2f6e16a67	bcachefs: Increase size of cuckoo hash table on too many rehashes Also, improve the calculation of the new table size, so that it can shrink when needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-16 12:46:40 -04:00
Gustavo A. R. Silva	5b4f3af39b	smb: smb2pdu.h: Use static_assert() to check struct sizes Commit `9f9bef9bc5` ("smb: smb2pdu.h: Avoid -Wflex-array-member-not-at-end warnings") introduced tagged `struct create_context_hdr`. We want to ensure that when new members need to be added to the flexible structure, they are always included within this tagged struct. So, we use `static_assert()` to ensure that the memory layout for both the flexible structure and the tagged struct is the same after any changes. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-15 16:06:55 -05:00
Steve French	836bb3268d	smb3: fix lock breakage for cached writes Mandatory locking is enforced for cached writes, which violates default posix semantics, and also it is enforced inconsistently. This apparently breaks recent versions of libreoffice, but can also be demonstrated by opening a file twice from the same client, locking it from handle one and writing to it from handle two (which fails, returning EACCES). Since there was already a mount option "forcemandatorylock" (which defaults to off), with this change only when the user intentionally specifies "forcemandatorylock" on mount will we break posix semantics on write to a locked range (ie we will only fail the write in this case, if the user mounts with "forcemandatorylock"). Fixes: `85160e03a7` ("CIFS: Implement caching mechanism for mandatory brlocks") Cc: stable@vger.kernel.org Cc: Pavel Shilovsky <piastryyy@gmail.com> Reported-by: abartlet@samba.org Reported-by: Kevin Ottens <kevin.ottens@enioka.com> Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-15 16:04:47 -05:00
Su Hui	74c2ab6d65	smb/client: avoid possible NULL dereference in cifs_free_subrequest() Clang static checker (scan-build) warning: cifsglob.h:line 890, column 3 Access to field 'ops' results in a dereference of a null pointer. Commit `519be98971` ("cifs: Add a tracepoint to track credits involved in R/W requests") adds a check for 'rdata->server', and let clang throw this warning about NULL dereference. When 'rdata->credits.value != 0 && rdata->server == NULL' happens, add_credits_and_wake_if() will call rdata->server->ops->add_credits(). This will cause NULL dereference problem. Add a check for 'rdata->server' to avoid NULL dereference. Cc: stable@vger.kernel.org Fixes: `69c3c023af` ("cifs: Implement netfslib hooks") Reviewed-by: David Howells <dhowells@redhat.com> Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-15 15:32:30 -05:00
Naohiro Aota	e30729d4bd	btrfs: zoned: properly take lock to read/update block group's zoned variables __btrfs_add_free_space_zoned() references and modifies bg's alloc_offset, ro, and zone_unusable, but without taking the lock. It is mostly safe because they monotonically increase (at least for now) and this function is mostly called by a transaction commit, which is serialized by itself. Still, taking the lock is a safer and correct option and I'm going to add a change to reset zone_unusable while a block group is still alive. So, add locking around the operations. Fixes: `169e0da91a` ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-15 20:35:56 +02:00
Qu Wenruo	008e2512dc	btrfs: tree-checker: add dev extent item checks [REPORT] There is a corruption report that btrfs refused to mount a fs that has overlapping dev extents: BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272 BTRFS error (device sdc): failed to verify dev extents against chunks: -117 BTRFS error (device sdc): open_ctree failed [CAUSE] The direct cause is very obvious, there is a bad dev extent item with incorrect length. With btrfs check reporting two overlapping extents, the second one shows some clue on the cause: ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272 ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144 ERROR: errors found in extent allocation tree or chunk allocation The second one looks like a bitflip happened during new chunk allocation: hex(2257707008000) = 0x20da9d30000 hex(2257707270144) = 0x20da9d70000 diff = 0x00000040000 So it looks like a bitflip happened during new dev extent allocation, resulting the second overlap. Currently we only do the dev-extent verification at mount time, but if the corruption is caused by memory bitflip, we really want to catch it before writing the corruption to the storage. Furthermore the dev extent items has the following key definition: (<device id> DEV_EXTENT <physical offset>) Thus we can not just rely on the generic key order check to make sure there is no overlapping. [ENHANCEMENT] Introduce dedicated dev extent checks, including: - Fixed member checks * chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3) * chunk_objectid should always be BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256) - Alignment checks * chunk_offset should be aligned to sectorsize * length should be aligned to sectorsize * key.offset should be aligned to sectorsize - Overlap checks If the previous key is also a dev-extent item, with the same device id, make sure we do not overlap with the previous dev extent. Reported: Stefan N <stefannnau@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/ CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-15 20:35:52 +02:00
Jeff Layton	3bc2ac2f8f	btrfs: update target inode's ctime on unlink Unlink changes the link count on the target inode. POSIX mandates that the ctime must also change when this occurs. According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html: "Upon successful completion, unlink() shall mark for update the last data modification and last file status change timestamps of the parent directory. Also, if the file's link count is not 0, the last file status change timestamp of the file shall be marked for update." Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ add link to the opengroup docs ] Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-15 20:35:44 +02:00
Thorsten Blum	c0247d289e	btrfs: send: annotate struct name_cache_entry with __counted_by() Add the __counted_by compiler attribute to the flexible array member name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-15 20:35:32 +02:00
Linus Torvalds	1fb918967b	for-6.11-rc3-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAma9KksACgkQxWXV+ddt WDvMYw/8Ds6A3IcMd1AByPDryhHZnOpqU/YQS/HhneisTg08MYHD0TMZLX02GXw0 vzqUyBHQ9yOYEt18SVtX67Dzapy0SWZWTK8Er/tJCn+14SLWVMdRiJCO2Y3Rdm8T J3S/b610iEVl5z6S6SpFD+zc56liCyVHfpK6obwSFBCzAyN6vm2p0ls5vq+hGsjb s/dOPJfOMnFTOFXVIumJ5KJRCubGuwG+PhZO9engwxiFIr1O4xxedzhKocXNSiiE +jt96gnKwW/K/Wh59YFGLbKk7h/jsIzM2CqE+JrEZlwIN+oNODFvP+Z37DbSC6jN 0x7G8gqY8vXgxOUk1rub+IqP+/wIjMmCTzkxpO2uy80hB0h3YzqzZnUNHk6DlIu/ zOhgcMnp5SMuvtJIXBpP2HRzzG/UbxfrjaPKDmUvwKvCUydrw0xL+XWdDMhi3bSn NPDW/Ixl1XGMS131YYla1v/KXTKJwZ2hK045svx7A8Aok0WaXAcYfwValQ9GMO0j gk89DRN0tMikBebaVD3aakE1FqyC/3dEZ80D3LSs7cgDGQA27wjdESHcSTtY373/ +fCBXDH/N9ubanZwsu+gUEzNp3DukEoaw2r73IicxcbsiskDpEqvddcKQmrZQ1xW 03UVziw1LpzkNSsDMZrlBwIoo5SnqbEbsEkTCMtjivJYOzhfyYk= =din2 -----END PGP SIGNATURE----- Merge tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - extend tree-checker verification of directory item type - fix regression in page/folio and extent state tracking in xarray, the dirty status can get out of sync and can cause problems e.g. a hang - in send, detect last extent and allow to clone it instead of sending it as write, reduces amount of data transferred in the stream - fix checking extent references when cleaning deleted subvolumes - fix one more case in the extent map shrinker, let it run only in the kswapd context so it does not cause latency spikes during other operations * tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix invalid mapping of extent xarray state btrfs: send: allow cloning non-aligned extent if it ends at i_size btrfs: only run the extent map shrinker from kswapd tasks btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type btrfs: check delayed refs when we're checking if a ref exists	2024-08-14 17:56:15 -07:00
Linus Torvalds	4ac0f08f44	vfs-6.11-rc4.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZrym4AAKCRCRxhvAZXjc oqT3AP9ydoUNavaZcRayH8r3ybvz9+aJGJ6Q7NznFVCk71vn0gD/buLzmq96Muns M5DWHbft2AFwK0Rz2nx8j5OXUeHwrQg= =HZBL -----END PGP SIGNATURE----- Merge tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - Fix the name of file lease slab cache. When file leases were split out of file locks the name of the file lock slab cache was used for the file leases slab cache as well. - Fix a type in take_fd() helper. - Fix infinite directory iteration for stable offsets in tmpfs. - When the icache is pruned all reclaimable inodes are marked with I_FREEING and other processes that try to lookup such inodes will block. But some filesystems like ext4 can trigger lookups in their inode evict callback causing deadlocks. Ext4 does such lookups if the ea_inode feature is used whereby a separate inode may be used to store xattrs. Introduce I_LRU_ISOLATING which pins the inode while its pages are reclaimed. This avoids inode deletion during inode_lru_isolate() avoiding the deadlock and evict is made to wait until I_LRU_ISOLATING is done. netfs: - Fault in smaller chunks for non-large folio mappings for filesystems that haven't been converted to large folios yet. - Fix the CONFIG_NETFS_DEBUG config option. The config option was renamed a short while ago and that introduced two minor issues. First, it depended on CONFIG_NETFS whereas it wants to depend on CONFIG_NETFS_SUPPORT. The former doesn't exist, while the latter does. Second, the documentation for the config option wasn't fixed up. - Revert the removal of the PG_private_2 writeback flag as ceph is using it and fix how that flag is handled in netfs. - Fix DIO reads on 9p. A program watching a file on a 9p mount wouldn't see any changes in the size of the file being exported by the server if the file was changed directly in the source filesystem. Fix this by attempting to read the full size specified when a DIO read is requested. - Fix a NULL pointer dereference bug due to a data race where a cachefiles cookies was retired even though it was still in use. Check the cookie's n_accesses counter before discarding it. nsfs: - Fix ioctl declaration for NS_GET_MNTNS_ID from _IO() to _IOR() as the kernel is writing to userspace. pidfs: - Prevent the creation of pidfds for kthreads until we have a use-case for it and we know the semantics we want. It also confuses userspace why they can get pidfds for kthreads. squashfs: - Fix an unitialized value bug reported by KMSAN caused by a corrupted symbolic link size read from disk. Check that the symbolic link size is not larger than expected" * tag 'vfs-6.11-rc4.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Squashfs: sanity check symbolic link size 9p: Fix DIO read through netfs vfs: Don't evict inode under the inode lru traversing context netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag" file: fix typo in take_fd() comment pidfd: prevent creation of pidfds for kthreads netfs: clean up after renaming FSCACHE_DEBUG config libfs: fix infinite directory reads for offset dir nsfs: fix ioctl declaration fs/netfs/fscache_cookie: add missing "n_accesses" check filelock: fix name of file_lease slab cache netfs: Fault in smaller chunks for non-large folio mappings	2024-08-14 09:06:28 -07:00
Darrick J. Wong	8d16762047	xfs: conditionally allow FS_XFLAG_REALTIME changes if S_DAX is set If a file has the S_DAX flag (aka fsdax access mode) set, we cannot allow users to change the realtime flag unless the datadev and rtdev both support fsdax access modes. Even if there are no extents allocated to the file, the setattr thread could be racing with another thread that has already started down the write code paths. Fixes: `ba23cba9b3` ("fs: allow per-device dax status checking for filesystems") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:20:24 +05:30
Darrick J. Wong	04d6dbb553	xfs: revert AIL TASK_KILLABLE threshold In commit `9adf40249e`, we changed the behavior of the AIL thread to set its own task state to KILLABLE whenever the timeout value is nonzero. Unfortunately, this missed the fact that xfsaild_push will return 50ms (aka a longish sleep) when we reach the push target or the AIL becomes empty, so xfsaild goes to sleep for a long period of time in uninterruptible D state. This results in artificially high load averages because KILLABLE processes are UNINTERRUPTIBLE, which contributes to load average even though the AIL is asleep waiting for someone to interrupt it. It's not blocked on IOs or anything, but people scrap ps for processes that look like they're stuck in D state, so restore the previous threshold. Fixes: `9adf40249e` ("xfs: AIL doesn't need manual pushing") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:19:34 +05:30
Darrick J. Wong	73c34b0b85	xfs: attr forks require attr, not attr2 It turns out that I misunderstood the difference between the attr and attr2 feature bits. "attr" means that at some point an attr fork was created somewhere in the filesystem. "attr2" means that inodes have variable-sized forks, but says nothing about whether or not there actually /are/ attr forks in the system. If we have an attr fork, we only need to check that attr is set. Fixes: `99d9d8d05d` ("xfs: scrub inode block mappings") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-08-14 21:19:14 +05:30
Kent Overstreet	58474f76a7	bcachefs: bcachefs_metadata_version_disk_accounting_inum This adds another disk accounting counter to track usage per inode number (any snapshot ID). This will be used for a couple things: - It'll give us a way to tell the user how much space a given file ista consuming in all snapshots; i.e. how much extra space it's consuming due to snapshot versioning. - It counts number of extents and total size of extents (both in btree keyspace sectors and actual disk usage), meaning it gives us average extent size: that is, it'll let us cheaply find fragmented files that should be defragmented. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	5132b99bb6	bcachefs: Kill __bch2_accounting_mem_mod() The next patch will be adding a disk accounting counter type which is not kept in the in-memory eytzinger tree. As prep, fold __bch2_accounting_mem_mod() into bch2_accounting_mem_mod_locked() so that we can check for that counter type and bail out without calling bpos_to_disk_accounting_pos() twice. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	d97de0d017	bcachefs: Make bkey_fsck_err() a wrapper around fsck_err() bkey_fsck_err() was added as an interface that looks like fsck_err(), but previously all it did was ensure that the appropriate error counter was incremented in the superblock. This is a cleanup and bugfix patch that converts it to a wrapper around fsck_err(). This is needed to fix an issue with the upgrade path to disk_accounting_v3, where the "silent fix" error list now includes bkey_fsck errors; fsck_err() handles this in a unified way, and since we need to change printing of bkey fsck errors from the caller to the inner bkey_fsck_err() calls, this ends up being a pretty big change. Als,, rename .invalid() methods to .validate(), for clarity, while we're changing the function signature anyways (to drop the printbuf argument). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	c99471024f	bcachefs: Fix warning in __bch2_fsck_err() for trans not passed in Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	06a8693b89	bcachefs: Add a time_stat for blocked on key cache flush Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:50 -04:00
Kent Overstreet	790666c8ac	bcachefs: Improve trans_blocked_journal_reclaim tracepoint include information about the state of the btree key cache Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:34 -04:00
Kent Overstreet	7254555c44	bcachefs: Add hysteresis to waiting on btree key cache flush This helps ensure key cache reclaim isn't contending with threads waiting for the key cache to be helped, and fixes a severe performance bug. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 23:00:34 -04:00
Kent Overstreet	968feb854a	bcachefs: Convert for_each_btree_node() to lockrestart_do() for_each_btree_node() now works similarly to for_each_btree_key(), where the loop body is passed as an argument to be passed to lockrestart_do(). This now calls trans_begin() on every loop iteration - which fixes an SRCU warning in backpointers fsck. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	48d6cc1b48	bcachefs: Add missing downgrade table entry Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	486d920735	bcachefs: disk accounting: ignore unknown types forward compat fix Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	d9e615762b	bcachefs: bch2_accounting_invalid() fixup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	bd864bc2d9	bcachefs: Fix bch2_trigger_alloc when upgrading from old versions bch2_trigger_alloc was assuming that the new key would always be newly created and thus always an alloc_v4 key, but - not when called from btree_gc. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Kent Overstreet	a24e6e7146	bcachefs: delete faulty fastpath in bch2_btree_path_traverse_cached() bch2_btree_path_traverse_cached() was previously checking if it could just relock the path, which is a common idiom in path traversal. However, it was using btree_node_relock(), not btree_path_relock(); btree_path_relock() only succeeds if the path was in state BTREE_ITER_NEED_RELOCK. If the path was in state BTREE_ITER_NEED_TRAVERSE a full traversal is needed; this led to a null ptr deref in bch2_btree_path_traverse_cached(). And the short circuit check here isn't needed, since it was already done in the main bch2_btree_path_traverse_one(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-13 22:56:50 -04:00
Linus Torvalds	6b0f8db921	execve fixes for v6.11-rc4 - binfmt_flat: Fix corruption when not offsetting data start - exec: Fix ToCToU between perm check and set-uid/gid usage -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCZrvBmwAKCRA2KwveOeQk u9Z2AQDgtypF4Kficiwn9BwZL5OxCxr9XnFuCYUue8Ufzm2WdQEAwhl1tcbs+Auf VAqpr6gZTRlpBjbl55LHeyMdxIbwhg4= =L28M -----END PGP SIGNATURE----- Merge tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve fixes from Kees Cook: - binfmt_flat: Fix corruption when not offsetting data start - exec: Fix ToCToU between perm check and set-uid/gid usage * tag 'execve-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: exec: Fix ToCToU between perm check and set-uid/gid usage binfmt_flat: Fix corruption when not offsetting data start	2024-08-13 16:10:32 -07:00
Kees Cook	f50733b45d	exec: Fix ToCToU between perm check and set-uid/gid usage When opening a file for exec via do_filp_open(), permission checking is done against the file's metadata at that moment, and on success, a file pointer is passed back. Much later in the execve() code path, the file metadata (specifically mode, uid, and gid) is used to determine if/how to set the uid and gid. However, those values may have changed since the permissions check, meaning the execution may gain unintended privileges. For example, if a file could change permissions from executable and not set-id: ---------x 1 root root 16048 Aug 7 13:16 target to set-id and non-executable: ---S------ 1 root root 16048 Aug 7 13:16 target it is possible to gain root privileges when execution should have been disallowed. While this race condition is rare in real-world scenarios, it has been observed (and proven exploitable) when package managers are updating the setuid bits of installed programs. Such files start with being world-executable but then are adjusted to be group-exec with a set-uid bit. For example, "chmod o-x,u+s target" makes "target" executable only by uid "root" and gid "cdrom", while also becoming setuid-root: -rwxr-xr-x 1 root cdrom 16048 Aug 7 13:16 target becomes: -rwsr-xr-- 1 root cdrom 16048 Aug 7 13:16 target But racing the chmod means users without group "cdrom" membership can get the permission to execute "target" just before the chmod, and when the chmod finishes, the exec reaches brpm_fill_uid(), and performs the setuid to root, violating the expressed authorization of "only cdrom group members can setuid to root". Re-check that we still have execute permissions in case the metadata has changed. It would be better to keep a copy from the perm-check time, but until we can do that refactoring, the least-bad option is to do a full inode_permission() call (under inode lock). It is understood that this is safe against dead-locks, but hardly optimal. Reported-by: Marco Vanotti <mvanotti@google.com> Tested-by: Marco Vanotti <mvanotti@google.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Cc: Eric Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Kees Cook <kees@kernel.org>	2024-08-13 13:24:29 -07:00
Linus Torvalds	6b4aa469f0	2 smb3 server fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAma6liAACgkQiiy9cAdy T1Eh4wwAuTQDHjehfvCDspMn6lG8IXAtb3oio2cntkII3warxxQ/dRiIyG1JcG5Z 38e+dokvRkaUF6ntrmudUbHOerw+NRl2ozYF5pQv0+ECyJLXHDqVGnuxNvNPAsD7 RtHfFf50PdgzGKmXjmUg0GbXMgA6eLSHe9r+wwDkqmIwZHMxaJ2nGuwVjHoO/+uJ oynxpYHIUROa2DeQiQKZAz/KHwpdSAGR4+KJRutvVCjInlb9bmSGp//BG34W4vva nyQIpnqskmlFg4elV/ktOgCp1rbHc4lgQwsWoCDYrNOyKX83HEIRRWHUEIi7fi+Y PBcFgTblrnuhYbUL4Z+rSmHB3YuUkvMLeKkSWSJm2M2qAZzoZWTUNLpzOcAOAcIF uhkt1+GUuLsZu3ZoDbolMZl477DtBsbBOKsM0DZ5IMji3MRu8GpvhmOfGOAdVRpT msTWfUoWvrc2CM09v3HBtnsAfjDXb/4ebztZxGTGVFk0uYJA1Zg655bHbYbw3tWr jXKVa805 =Q9Qj -----END PGP SIGNATURE----- Merge tag '6.11-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd Pull smb server fixes from Steve French: "Two smb3 server fixes for access denied problem on share path checks" * tag '6.11-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: override fsids for smb2_query_info() ksmbd: override fsids for share path check	2024-08-13 09:03:23 -07:00
Naohiro Aota	6252690f7e	btrfs: fix invalid mapping of extent xarray state In __extent_writepage_io(), we call btrfs_set_range_writeback() -> folio_start_writeback(), which clears PAGECACHE_TAG_DIRTY mark from the mapping xarray if the folio is not dirty. This worked fine before commit `97713b1a2c` ("btrfs: do not clear page dirty inside extent_write_locked_range()"). After the commit, however, the folio is still dirty at this point, so the mapping DIRTY tag is not cleared anymore. Then, __extent_writepage_io() calls btrfs_folio_clear_dirty() to clear the folio's dirty flag. That results in the page being unlocked with a "strange" state. The page is not PageDirty, but the mapping tag is set as PAGECACHE_TAG_DIRTY. This strange state looks like causing a hang with a call trace below when running fstests generic/091 on a null_blk device. It is waiting for a folio lock. While I don't have an exact relation between this hang and the strange state, fixing the state also fixes the hang. And, that state is worth fixing anyway. This commit reorders btrfs_folio_clear_dirty() and btrfs_set_range_writeback() in __extent_writepage_io(), so that the PAGECACHE_TAG_DIRTY tag is properly removed from the xarray. [464.274] task:fsx state:D stack:0 pid:3034 tgid:3034 ppid:2853 flags:0x00004002 [464.286] Call Trace: [464.291] <TASK> [464.295] __schedule+0x10ed/0x6260 [464.301] ? __pfx___blk_flush_plug+0x10/0x10 [464.308] ? __submit_bio+0x37c/0x450 [464.314] ? __pfx___schedule+0x10/0x10 [464.321] ? lock_release+0x567/0x790 [464.327] ? __pfx_lock_acquire+0x10/0x10 [464.334] ? __pfx_lock_release+0x10/0x10 [464.340] ? __pfx_lock_acquire+0x10/0x10 [464.347] ? __pfx_lock_release+0x10/0x10 [464.353] ? do_raw_spin_lock+0x12e/0x270 [464.360] schedule+0xdf/0x3b0 [464.365] io_schedule+0x8f/0xf0 [464.371] folio_wait_bit_common+0x2ca/0x6d0 [464.378] ? folio_wait_bit_common+0x1cc/0x6d0 [464.385] ? __pfx_folio_wait_bit_common+0x10/0x10 [464.392] ? __pfx_filemap_get_folios_tag+0x10/0x10 [464.400] ? __pfx_wake_page_function+0x10/0x10 [464.407] ? __pfx___might_resched+0x10/0x10 [464.414] ? do_raw_spin_unlock+0x58/0x1f0 [464.420] extent_write_cache_pages+0xe49/0x1620 [btrfs] [464.428] ? lock_acquire+0x435/0x500 [464.435] ? __pfx_extent_write_cache_pages+0x10/0x10 [btrfs] [464.443] ? btrfs_do_write_iter+0x493/0x640 [btrfs] [464.451] ? orc_find.part.0+0x1d4/0x380 [464.457] ? __pfx_lock_release+0x10/0x10 [464.464] ? __pfx_lock_release+0x10/0x10 [464.471] ? btrfs_do_write_iter+0x493/0x640 [btrfs] [464.478] btrfs_writepages+0x1cc/0x460 [btrfs] [464.485] ? __pfx_btrfs_writepages+0x10/0x10 [btrfs] [464.493] ? is_bpf_text_address+0x6e/0x100 [464.500] ? kernel_text_address+0x145/0x160 [464.507] ? unwind_get_return_address+0x5e/0xa0 [464.514] ? arch_stack_walk+0xac/0x100 [464.521] do_writepages+0x176/0x780 [464.527] ? lock_release+0x567/0x790 [464.533] ? __pfx_do_writepages+0x10/0x10 [464.540] ? __pfx_lock_acquire+0x10/0x10 [464.546] ? __pfx_stack_trace_save+0x10/0x10 [464.553] ? do_raw_spin_lock+0x12e/0x270 [464.560] ? do_raw_spin_unlock+0x58/0x1f0 [464.566] ? _raw_spin_unlock+0x23/0x40 [464.573] ? wbc_attach_and_unlock_inode+0x3da/0x7d0 [464.580] filemap_fdatawrite_wbc+0x113/0x180 [464.587] ? prepare_pages.constprop.0+0x13c/0x5c0 [btrfs] [464.596] __filemap_fdatawrite_range+0xaf/0xf0 [464.603] ? __pfx___filemap_fdatawrite_range+0x10/0x10 [464.611] ? trace_irq_enable.constprop.0+0xce/0x110 [464.618] ? kasan_quarantine_put+0xd7/0x1e0 [464.625] btrfs_start_ordered_extent+0x46f/0x570 [btrfs] [464.633] ? __pfx_btrfs_start_ordered_extent+0x10/0x10 [btrfs] [464.642] ? __clear_extent_bit+0x2c0/0x9d0 [btrfs] [464.650] btrfs_lock_and_flush_ordered_range+0xc6/0x180 [btrfs] [464.659] ? __pfx_btrfs_lock_and_flush_ordered_range+0x10/0x10 [btrfs] [464.669] btrfs_read_folio+0x12a/0x1d0 [btrfs] [464.676] ? __pfx_btrfs_read_folio+0x10/0x10 [btrfs] [464.684] ? __pfx_filemap_add_folio+0x10/0x10 [464.691] ? __pfx___might_resched+0x10/0x10 [464.698] ? __filemap_get_folio+0x1c5/0x450 [464.705] prepare_uptodate_page+0x12e/0x4d0 [btrfs] [464.713] prepare_pages.constprop.0+0x13c/0x5c0 [btrfs] [464.721] ? fault_in_iov_iter_readable+0xd2/0x240 [464.729] btrfs_buffered_write+0x5bd/0x12f0 [btrfs] [464.737] ? __pfx_btrfs_buffered_write+0x10/0x10 [btrfs] [464.745] ? __pfx_lock_release+0x10/0x10 [464.752] ? generic_write_checks+0x275/0x400 [464.759] ? down_write+0x118/0x1f0 [464.765] ? up_write+0x19b/0x500 [464.770] btrfs_direct_write+0x731/0xba0 [btrfs] [464.778] ? __pfx_btrfs_direct_write+0x10/0x10 [btrfs] [464.785] ? __pfx___might_resched+0x10/0x10 [464.792] ? lock_acquire+0x435/0x500 [464.798] ? lock_acquire+0x435/0x500 [464.804] btrfs_do_write_iter+0x494/0x640 [btrfs] [464.811] ? __pfx_btrfs_do_write_iter+0x10/0x10 [btrfs] [464.819] ? __pfx___might_resched+0x10/0x10 [464.825] ? rw_verify_area+0x6d/0x590 [464.831] vfs_write+0x5d7/0xf50 [464.837] ? __might_fault+0x9d/0x120 [464.843] ? __pfx_vfs_write+0x10/0x10 [464.849] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs] [464.856] ? lock_release+0x567/0x790 [464.862] ksys_write+0xfb/0x1d0 [464.867] ? __pfx_ksys_write+0x10/0x10 [464.873] ? _raw_spin_unlock+0x23/0x40 [464.879] ? btrfs_getattr+0x4af/0x670 [btrfs] [464.886] ? vfs_getattr_nosec+0x79/0x340 [464.892] do_syscall_64+0x95/0x180 [464.898] ? __do_sys_newfstat+0xde/0xf0 [464.904] ? __pfx___do_sys_newfstat+0x10/0x10 [464.911] ? trace_irq_enable.constprop.0+0xce/0x110 [464.918] ? syscall_exit_to_user_mode+0xac/0x2a0 [464.925] ? do_syscall_64+0xa1/0x180 [464.931] ? trace_irq_enable.constprop.0+0xce/0x110 [464.939] ? trace_irq_enable.constprop.0+0xce/0x110 [464.946] ? syscall_exit_to_user_mode+0xac/0x2a0 [464.953] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs] [464.960] ? do_syscall_64+0xa1/0x180 [464.966] ? btrfs_file_llseek+0xb1/0xfb0 [btrfs] [464.973] ? trace_irq_enable.constprop.0+0xce/0x110 [464.980] ? syscall_exit_to_user_mode+0xac/0x2a0 [464.987] ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs] [464.995] ? trace_irq_enable.constprop.0+0xce/0x110 [465.002] ? __pfx_btrfs_file_llseek+0x10/0x10 [btrfs] [465.010] ? do_syscall_64+0xa1/0x180 [465.016] ? lock_release+0x567/0x790 [465.022] ? __pfx_lock_acquire+0x10/0x10 [465.028] ? __pfx_lock_release+0x10/0x10 [465.034] ? trace_irq_enable.constprop.0+0xce/0x110 [465.042] ? syscall_exit_to_user_mode+0xac/0x2a0 [465.049] ? do_syscall_64+0xa1/0x180 [465.055] ? syscall_exit_to_user_mode+0xac/0x2a0 [465.062] ? do_syscall_64+0xa1/0x180 [465.068] ? syscall_exit_to_user_mode+0xac/0x2a0 [465.075] ? do_syscall_64+0xa1/0x180 [465.081] ? clear_bhb_loop+0x25/0x80 [465.087] ? clear_bhb_loop+0x25/0x80 [465.093] ? clear_bhb_loop+0x25/0x80 [465.099] entry_SYSCALL_64_after_hwframe+0x76/0x7e [465.106] RIP: 0033:0x7f093b8ee784 [465.111] RSP: 002b:00007ffc29d31b28 EFLAGS: 00000202 ORIG_RAX: 0000000000000001 [465.122] RAX: ffffffffffffffda RBX: 0000000000006000 RCX: 00007f093b8ee784 [465.131] RDX: 000000000001de00 RSI: 00007f093b6ed200 RDI: 0000000000000003 [465.141] RBP: 000000000001de00 R08: 0000000000006000 R09: 0000000000000000 [465.150] R10: 0000000000023e00 R11: 0000000000000202 R12: 0000000000006000 [465.160] R13: 0000000000023e00 R14: 0000000000023e00 R15: 0000000000000001 [465.170] </TASK> [465.174] INFO: lockdep is turned off. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: `97713b1a2c` ("btrfs: do not clear page dirty inside extent_write_locked_range()") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 15:36:57 +02:00
Phillip Lougher	810ee43d9c	Squashfs: sanity check symbolic link size Syzkiller reports a "KMSAN: uninit-value in pick_link" bug. This is caused by an uninitialised page, which is ultimately caused by a corrupted symbolic link size read from disk. The reason why the corrupted symlink size causes an uninitialised page is due to the following sequence of events: 1. squashfs_read_inode() is called to read the symbolic link from disk. This assigns the corrupted value 3875536935 to inode->i_size. 2. Later squashfs_symlink_read_folio() is called, which assigns this corrupted value to the length variable, which being a signed int, overflows producing a negative number. 3. The following loop that fills in the page contents checks that the copied bytes is less than length, which being negative means the loop is skipped, producing an uninitialised page. This patch adds a sanity check which checks that the symbolic link size is not larger than expected. -- Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Link: https://lore.kernel.org/r/20240811232821.13903-1-phillip@squashfs.org.uk Reported-by: Lizhi Xu <lizhi.xu@windriver.com> Reported-by: syzbot+24ac24ff58dc5b0d26b9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000a90e8c061e86a76b@google.com/ V2: fix spelling mistake. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-13 13:56:46 +02:00
Dominique Martinet	e3786b29c5	9p: Fix DIO read through netfs If a program is watching a file on a 9p mount, it won't see any change in size if the file being exported by the server is changed directly in the source filesystem, presumably because 9p doesn't have change notifications, and because netfs skips the reads if the file is empty. Fix this by attempting to read the full size specified when a DIO read is requested (such as when 9p is operating in unbuffered mode) and dealing with a short read if the EOF was less than the expected read. To make this work, filesystems using netfslib must not set NETFS_SREQ_CLEAR_TAIL if performing a DIO read where that read hit the EOF. I don't want to mandatorily clear this flag in netfslib for DIO because, say, ceph might make a read from an object that is not completely filled, but does not reside at the end of file - and so we need to clear the excess. This can be tested by watching an empty file over 9p within a VM (such as in the ktest framework): while true; do read content; if [ -n "$content" ]; then echo $content; break; fi; done < /host/tmp/foo then writing something into the empty file. The watcher should immediately display the file content and break out of the loop. Without this fix, it remains in the loop indefinitely. Fixes: `80105ed2fd` ("9p: Use netfslib read/write_iter") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218916 Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1229195.1723211769@warthog.procyon.org.uk cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Steve French <sfrench@samba.org> cc: Paulo Alcantara <pc@manguebit.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: ceph-devel@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: linux-nfs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Dominique Martinet <asmadeus@codewreck.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-13 13:53:09 +02:00
Zhihao Cheng	2a0629834c	vfs: Don't evict inode under the inode lru traversing context The inode reclaiming process(See function prune_icache_sb) collects all reclaimable inodes and mark them with I_FREEING flag at first, at that time, other processes will be stuck if they try getting these inodes (See function find_inode_fast), then the reclaiming process destroy the inodes by function dispose_list(). Some filesystems(eg. ext4 with ea_inode feature, ubifs with xattr) may do inode lookup in the inode evicting callback function, if the inode lookup is operated under the inode lru traversing context, deadlock problems may happen. Case 1: In function ext4_evict_inode(), the ea inode lookup could happen if ea_inode feature is enabled, the lookup process will be stuck under the evicting context like this: 1. File A has inode i_reg and an ea inode i_ea 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea 3. Then, following three processes running like this: PA PB echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // i_reg is added into lru, lru->i_ea->i_reg prune_icache_sb list_lru_walk_one inode_lru_isolate i_ea->i_state \|= I_FREEING // set inode state inode_lru_isolate __iget(i_reg) spin_unlock(&i_reg->i_lock) spin_unlock(lru_lock) rm file A i_reg->nlink = 0 iput(i_reg) // i_reg->nlink is 0, do evict ext4_evict_inode ext4_xattr_delete_inode ext4_xattr_inode_dec_ref_all ext4_xattr_inode_iget ext4_iget(i_ea->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(i_ea) ----→ AA deadlock dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&i_ea->i_state) Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file deleting process holds BASEHD's wbuf->io_mutex while getting the xattr inode, which could race with inode reclaiming process(The reclaiming process could try locking BASEHD's wbuf->io_mutex in inode evicting function), then an ABBA deadlock problem would happen as following: 1. File A has inode ia and a xattr(with inode ixa), regular file B has inode ib and a xattr. 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa 3. Then, following three processes running like this: PA PB PC echo 2 > /proc/sys/vm/drop_caches shrink_slab prune_dcache_sb // ib and ia are added into lru, lru->ixa->ib->ia prune_icache_sb list_lru_walk_one inode_lru_isolate ixa->i_state \|= I_FREEING // set inode state inode_lru_isolate __iget(ib) spin_unlock(&ib->i_lock) spin_unlock(lru_lock) rm file B ib->nlink = 0 rm file A iput(ia) ubifs_evict_inode(ia) ubifs_jnl_delete_inode(ia) ubifs_jnl_write_inode(ia) make_reservation(BASEHD) // Lock wbuf->io_mutex ubifs_iget(ixa->i_ino) iget_locked find_inode_fast __wait_on_freeing_inode(ixa) \| iput(ib) // ib->nlink is 0, do evict \| ubifs_evict_inode \| ubifs_jnl_delete_inode(ib) ↓ ubifs_jnl_write_inode ABBA deadlock ←-----make_reservation(BASEHD) dispose_list // cannot be executed by prune_icache_sb wake_up_bit(&ixa->i_state) Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING to pin the inode in memory while inode_lru_isolate() reclaims its pages instead of using ordinary inode reference. This way inode deletion cannot be triggered from inode_lru_isolate() thus avoiding the deadlock. evict() is made to wait for I_LRU_ISOLATING to be cleared before proceeding with inode cleanup. Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/ Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022 Fixes: `e50e5129f3` ("ext4: xattr-in-inode support") Fixes: `7959cf3a75` ("ubifs: journal: Handle xattrs like files") Cc: stable@vger.kernel.org Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-13 13:52:16 +02:00
Filipe Manana	46a6e10a1a	btrfs: send: allow cloning non-aligned extent if it ends at i_size If we a find that an extent is shared but its end offset is not sector size aligned, then we don't clone it and issue write operations instead. This is because the reflink (remap_file_range) operation does not allow to clone unaligned ranges, except if the end offset of the range matches the i_size of the source and destination files (and the start offset is sector size aligned). While this is not incorrect because send can only guarantee that a file has the same data in the source and destination snapshots, it's not optimal and generates confusion and surprising behaviour for users. For example, running this test: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV mount $DEV $MNT # Use a file size not aligned to any possible sector size. file_size=$((1 * 1024 * 1024 + 5)) # 1MB + 5 bytes dd if=/dev/random of=$MNT/foo bs=$file_size count=1 cp --reflink=always $MNT/foo $MNT/bar btrfs subvolume snapshot -r $MNT/ $MNT/snap rm -f /tmp/send-test btrfs send -f /tmp/send-test $MNT/snap umount $MNT mkfs.btrfs -f $DEV mount $DEV $MNT btrfs receive -vv -f /tmp/send-test $MNT xfs_io -r -c "fiemap -v" $MNT/snap/bar umount $MNT Gives the following result: (...) mkfile o258-7-0 rename o258-7-0 -> bar write bar - offset=0 length=49152 write bar - offset=49152 length=49152 write bar - offset=98304 length=49152 write bar - offset=147456 length=49152 write bar - offset=196608 length=49152 write bar - offset=245760 length=49152 write bar - offset=294912 length=49152 write bar - offset=344064 length=49152 write bar - offset=393216 length=49152 write bar - offset=442368 length=49152 write bar - offset=491520 length=49152 write bar - offset=540672 length=49152 write bar - offset=589824 length=49152 write bar - offset=638976 length=49152 write bar - offset=688128 length=49152 write bar - offset=737280 length=49152 write bar - offset=786432 length=49152 write bar - offset=835584 length=49152 write bar - offset=884736 length=49152 write bar - offset=933888 length=49152 write bar - offset=983040 length=49152 write bar - offset=1032192 length=16389 chown bar - uid=0, gid=0 chmod bar - mode=0644 utimes bar utimes BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=06d640da-9ca1-604c-b87c-3375175a8eb3, stransid=7 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2055]: 26624..28679 2056 0x1 There's no clone operation to clone extents from the file foo into file bar and fiemap confirms there's no shared flag (0x2000). So update send_write_or_clone() so that it proceeds with cloning if the source and destination ranges end at the i_size of the respective files. After this changes the result of the test is: (...) mkfile o258-7-0 rename o258-7-0 -> bar clone bar - source=foo source offset=0 offset=0 length=1048581 chown bar - uid=0, gid=0 chmod bar - mode=0644 utimes bar utimes BTRFS_IOC_SET_RECEIVED_SUBVOL uuid=582420f3-ea7d-564e-bbe5-ce440d622190, stransid=7 /mnt/sdi/snap/bar: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..2055]: 26624..28679 2056 0x2001 A test case for fstests will also follow up soon. Link: https://github.com/kdave/btrfs-progs/issues/572#issuecomment-2282841416 CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 13:45:42 +02:00
Filipe Manana	ae1e766f62	btrfs: only run the extent map shrinker from kswapd tasks Currently the extent map shrinker can be run by any task when attempting to allocate memory and there's enough memory pressure to trigger it. To avoid too much latency we stop iterating over extent maps and removing them once the task needs to reschedule. This logic was introduced in commit `b3ebb9b7e9` ("btrfs: stop extent map shrinker if reschedule is needed"). While that solved high latency problems for some use cases, it's still not enough because with a too high number of tasks entering the extent map shrinker code, either due to memory allocations or because they are a kswapd task, we end up having a very high level of contention on some spin locks, namely: 1) The fs_info->fs_roots_radix_lock spin lock, which we need to find roots to iterate over their inodes; 2) The spin lock of the xarray used to track open inodes for a root (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to be a red black tree and the spin lock was root->inode_lock; 3) The fs_info->delayed_iput_lock spin lock since the shrinker adds delayed iputs (calls btrfs_add_delayed_iput()). Instead of allowing the extent map shrinker to be run by any task, make it run only by kswapd tasks. This still solves the problem of running into OOM situations due to an unbounded extent map creation, which is simple to trigger by direct IO writes, as described in the changelog of commit `956a17d9d0` ("btrfs: add a shrinker for extent maps"), and by a similar case when doing buffered IO on files with a very large number of holes (keeping the file open and creating many holes, whose extent maps are only released when the file is closed). Reported-by: kzd <kzd@56709.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121 Reported-by: Octavia Togami <octavia.togami@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/ Fixes: `956a17d9d0` ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Tested-by: kzd <kzd@56709.net> Tested-by: Octavia Togami <octavia.togami@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 13:43:28 +02:00
Qu Wenruo	31723c9542	btrfs: tree-checker: reject BTRFS_FT_UNKNOWN dir type [REPORT] There is a bug report that kernel is rejecting a mismatching inode mode and its dir item: [ 1881.553937] BTRFS critical (device dm-0): inode mode mismatch with dir: inode mode=040700 btrfs type=2 dir type=0 [CAUSE] It looks like the inode mode is correct, while the dir item type 0 is BTRFS_FT_UNKNOWN, which should not be generated by btrfs at all. This may be caused by a memory bit flip. [ENHANCEMENT] Although tree-checker is not able to do any cross-leaf verification, for this particular case we can at least reject any dir type with BTRFS_FT_UNKNOWN. So here we enhance the dir type check from [0, BTRFS_FT_MAX), to (0, BTRFS_FT_MAX). Although the existing corruption can not be fixed just by such enhanced checking, it should prevent the same 0x2->0x0 bitflip for dir type to reach disk in the future. Reported-by: Kota <nospam@kota.moe> Link: https://lore.kernel.org/linux-btrfs/CACsxjPYnQF9ZF-0OhH16dAx50=BXXOcP74MxBc3BG+xae4vTTw@mail.gmail.com/ CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 13:42:26 +02:00
Josef Bacik	42fac187b5	btrfs: check delayed refs when we're checking if a ref exists In the patch `78c52d9eb6` ("btrfs: check for refs on snapshot delete resume") I added some code to handle file systems that had been corrupted by a bug that incorrectly skipped updating the drop progress key while dropping a snapshot. This code would check to see if we had already deleted our reference for a child block, and skip the deletion if we had already. Unfortunately there is a bug, as the check would only check the on-disk references. I made an incorrect assumption that blocks in an already deleted snapshot that was having the deletion resume on mount wouldn't be modified. If we have 2 pending deleted snapshots that share blocks, we can easily modify the rules for a block. Take the following example subvolume a exists, and subvolume b is a snapshot of subvolume a. They share references to block 1. Block 1 will have 2 full references, one for subvolume a and one for subvolume b, and it belongs to subvolume a (btrfs_header_owner(block 1) == subvolume a). When deleting subvolume a, we will drop our full reference for block 1, and because we are the owner we will drop our full reference for all of block 1's children, convert block 1 to FULL BACKREF, and add a shared reference to all of block 1's children. Then we will start the snapshot deletion of subvolume b. We look up the extent info for block 1, which checks delayed refs and tells us that FULL BACKREF is set, so sets parent to the bytenr of block 1. However because this is a resumed snapshot deletion, we call into check_ref_exists(). Because check_ref_exists() only looks at the disk, it doesn't find the shared backref for the child of block 1, and thus returns 0 and we skip deleting the reference for the child of block 1 and continue. This orphans the child of block 1. The fix is to lookup the delayed refs, similar to what we do in btrfs_lookup_extent_info(). However we only care about whether the reference exists or not. If we fail to find our reference on disk, go look up the bytenr in the delayed refs, and if it exists look for an existing ref in the delayed ref head. If that exists then we know we can delete the reference safely and carry on. If it doesn't exist we know we have to skip over this block. This bug has existed since I introduced this fix, however requires having multiple deleted snapshots pending when we unmount. We noticed this in production because our shutdown path stops the container on the system, which deletes a bunch of subvolumes, and then reboots the box. This gives us plenty of opportunities to hit this issue. Looking at the history we've seen this occasionally in production, but we had a big spike recently thanks to faster machines getting jobs with multiple subvolumes in the job. Chris Mason wrote a reproducer which does the following mount /dev/nvme4n1 /btrfs btrfs subvol create /btrfs/s1 simoop -E -f 4k -n 200000 -z /btrfs/s1 while(true) ; do btrfs subvol snap /btrfs/s1 /btrfs/s2 simoop -f 4k -n 200000 -r 10 -z /btrfs/s2 btrfs subvol snap /btrfs/s2 /btrfs/s3 btrfs balance start -dusage=80 /btrfs btrfs subvol del /btrfs/s2 /btrfs/s3 umount /btrfs btrfsck /dev/nvme4n1 \|\| exit 1 mount /dev/nvme4n1 /btrfs done On the second loop this would fail consistently, with my patch it has been running for hours and hasn't failed. I also used dm-log-writes to capture the state of the failure so I could debug the problem. Using the existing failure case to test my patch validated that it fixes the problem. Fixes: `78c52d9eb6` ("btrfs: check for refs on snapshot delete resume") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-13 13:42:26 +02:00
David Howells	7b589a9b45	netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags The NETFS_RREQ_USE_PGPRIV2 and NETFS_RREQ_WRITE_TO_CACHE flags aren't used correctly. The problem is that we try to set them up in the request initialisation, but we the cache may be in the process of setting up still, and so the state may not be correct. Further, we secondarily sample the cache state and make contradictory decisions later. The issue arises because we set up the cache resources, which allows the cache's ->prepare_read() to switch on NETFS_SREQ_COPY_TO_CACHE - which triggers cache writing even if we didn't set the flags when allocating. Fix this in the following way: (1) Drop NETFS_ICTX_USE_PGPRIV2 and instead set NETFS_RREQ_USE_PGPRIV2 in ->init_request() rather than trying to juggle that in netfs_alloc_request(). (2) Repurpose NETFS_RREQ_USE_PGPRIV2 to merely indicate that if caching is to be done, then PG_private_2 is to be used rather than only setting it if we decide to cache and then having netfs_rreq_unlock_folios() set the non-PG_private_2 writeback-to-cache if it wasn't set. (3) Split netfs_rreq_unlock_folios() into two functions, one of which contains the deprecated code for using PG_private_2 to avoid accidentally doing the writeback path - and always use it if USE_PGPRIV2 is set. (4) As NETFS_ICTX_USE_PGPRIV2 is removed, make netfs_write_begin() always wait for PG_private_2. This function is deprecated and only used by ceph anyway, and so label it so. (5) Drop the NETFS_RREQ_WRITE_TO_CACHE flag and use fscache_operation_valid() on the cache_resources instead. This has the advantage of picking up the result of netfs_begin_cache_read() and fscache_begin_write_operation() - which are called after the object is initialised and will wait for the cache to come to a usable state. Just reverting ae678317b95e[1] isn't a sufficient fix, so this need to be applied on top of that. Without this as well, things like: rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { and: WARNING: CPU: 13 PID: 3621 at fs/ceph/caps.c:3386 may happen, along with some UAFs due to PG_private_2 not getting used to wait on writeback completion. Fixes: `2ff1e97587` ("netfs: Replace PG_fscache by setting folio->private and marking dirty") Reported-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Hristo Venev <hristo@venev.name> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk/ [1] Link: https://lore.kernel.org/r/1173209.1723152682@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:27 +02:00
David Howells	8e5ced7804	netfs, ceph: Revert "netfs: Remove deprecated use of PG_private_2 as a second writeback flag" This reverts commit `ae678317b9`. Revert the patch that removes the deprecated use of PG_private_2 in netfslib for the moment as Ceph is actually still using this to track data copied to the cache. Fixes: `ae678317b9` ("netfs: Remove deprecated use of PG_private_2 as a second writeback flag") Reported-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Ilya Dryomov <idryomov@gmail.com> cc: Xiubo Li <xiubli@redhat.com> cc: Jeff Layton <jlayton@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: ceph-devel@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org https: //lore.kernel.org/r/3575457.1722355300@warthog.procyon.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:27 +02:00
Lukas Bulwahn	889ced4c93	netfs: clean up after renaming FSCACHE_DEBUG config Commit 6b8e61472529 ("netfs: Rename CONFIG_FSCACHE_DEBUG to CONFIG_NETFS_DEBUG") renames the config, but introduces two issues: First, NETFS_DEBUG mistakenly depends on the non-existing config NETFS, whereas the actual intended config is called NETFS_SUPPORT. Second, the config renaming misses to adjust the documentation of the functionality of this config. Clean up those two points. Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Link: https://lore.kernel.org/r/20240731073902.69262-1-lukas.bulwahn@redhat.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:26 +02:00
yangerkun	64a7ce76fb	libfs: fix infinite directory reads for offset dir After we switch tmpfs dir operations from simple_dir_operations to simple_offset_dir_operations, every rename happened will fill new dentry to dest dir's maple tree(&SHMEM_I(inode)->dir_offsets->mt) with a free key starting with octx->newx_offset, and then set newx_offset equals to free key + 1. This will lead to infinite readdir combine with rename happened at the same time, which fail generic/736 in xfstests(detail show as below). 1. create 5000 files(1 2 3...) under one dir 2. call readdir(man 3 readdir) once, and get one entry 3. rename(entry, "TEMPFILE"), then rename("TEMPFILE", entry) 4. loop 2~3, until readdir return nothing or we loop too many times(tmpfs break test with the second condition) We choose the same logic what commit `9b378f6ad4` ("btrfs: fix infinite directory reads") to fix it, record the last_index when we open dir, and do not emit the entry which index >= last_index. The file->private_data now used in offset dir can use directly to do this, and we also update the last_index when we llseek the dir file. Fixes: `a2e459555c` ("shmem: stable directory offsets") Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20240731043835.1828697-1-yangerkun@huawei.com Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [brauner: only update last_index after seek when offset is zero like Jan suggested] Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:26 +02:00
Max Kellermann	f71aa06398	fs/netfs/fscache_cookie: add missing "n_accesses" check This fixes a NULL pointer dereference bug due to a data race which looks like this: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 33 PID: 16573 Comm: kworker/u97:799 Not tainted 6.8.7-cm4all1-hp+ #43 Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/17/2018 Workqueue: events_unbound netfs_rreq_write_to_cache_work RIP: 0010:cachefiles_prepare_write+0x30/0xa0 Code: 57 41 56 45 89 ce 41 55 49 89 cd 41 54 49 89 d4 55 53 48 89 fb 48 83 ec 08 48 8b 47 08 48 83 7f 10 00 48 89 34 24 48 8b 68 20 <48> 8b 45 08 4c 8b 38 74 45 49 8b 7f 50 e8 4e a9 b0 ff 48 8b 73 10 RSP: 0018:ffffb4e78113bde0 EFLAGS: 00010286 RAX: ffff976126be6d10 RBX: ffff97615cdb8438 RCX: 0000000000020000 RDX: ffff97605e6c4c68 RSI: ffff97605e6c4c60 RDI: ffff97615cdb8438 RBP: 0000000000000000 R08: 0000000000278333 R09: 0000000000000001 R10: ffff97605e6c4600 R11: 0000000000000001 R12: ffff97605e6c4c68 R13: 0000000000020000 R14: 0000000000000001 R15: ffff976064fe2c00 FS: 0000000000000000(0000) GS:ffff9776dfd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 000000005942c002 CR4: 00000000001706f0 Call Trace: <TASK> ? __die+0x1f/0x70 ? page_fault_oops+0x15d/0x440 ? search_module_extables+0xe/0x40 ? fixup_exception+0x22/0x2f0 ? exc_page_fault+0x5f/0x100 ? asm_exc_page_fault+0x22/0x30 ? cachefiles_prepare_write+0x30/0xa0 netfs_rreq_write_to_cache_work+0x135/0x2e0 process_one_work+0x137/0x2c0 worker_thread+0x2e9/0x400 ? __pfx_worker_thread+0x10/0x10 kthread+0xcc/0x100 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> Modules linked in: CR2: 0000000000000008 ---[ end trace 0000000000000000 ]--- This happened because fscache_cookie_state_machine() was slow and was still running while another process invoked fscache_unuse_cookie(); this led to a fscache_cookie_lru_do_one() call, setting the FSCACHE_COOKIE_DO_LRU_DISCARD flag, which was picked up by fscache_cookie_state_machine(), withdrawing the cookie via cachefiles_withdraw_cookie(), clearing cookie->cache_priv. At the same time, yet another process invoked cachefiles_prepare_write(), which found a NULL pointer in this code line: struct cachefiles_object object = cachefiles_cres_object(cres); The next line crashes, obviously: struct cachefiles_cache cache = object->volume->cache; During cachefiles_prepare_write(), the "n_accesses" counter is non-zero (via fscache_begin_operation()). The cookie must not be withdrawn until it drops to zero. The counter is checked by fscache_cookie_state_machine() before switching to FSCACHE_COOKIE_STATE_RELINQUISHING and FSCACHE_COOKIE_STATE_WITHDRAWING (in "case FSCACHE_COOKIE_STATE_FAILED"), but not for FSCACHE_COOKIE_STATE_LRU_DISCARDING ("case FSCACHE_COOKIE_STATE_ACTIVE"). This patch adds the missing check. With a non-zero access counter, the function returns and the next fscache_end_cookie_access() call will queue another fscache_cookie_state_machine() call to handle the still-pending FSCACHE_COOKIE_DO_LRU_DISCARD. Fixes: `12bb21a29c` ("fscache: Implement cookie user counting and resource pinning") Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/20240729162002.3436763-2-dhowells@redhat.com cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: stable@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:26 +02:00
Omar Sandoval	3f65f3c099	filelock: fix name of file_lease slab cache When struct file_lease was split out from struct file_lock, the name of the file_lock slab cache was copied to the new slab cache for file_lease. This name conflict causes confusion in /proc/slabinfo and /sys/kernel/slab. In particular, it caused failures in drgn's test case for slab cache merging. Link: `9ad29fd864/tests/linux_kernel/helpers/test_slab.py (L81)` Fixes: `c69ff40719` ("filelock: split leases out of struct file_lock") Signed-off-by: Omar Sandoval <osandov@fb.com> Link: https://lore.kernel.org/r/2d1d053da1cafb3e7940c4f25952da4f0af34e38.1722293276.git.osandov@fb.com Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:25 +02:00
Matthew Wilcox (Oracle)	98055bc359	netfs: Fault in smaller chunks for non-large folio mappings As in commit `4e527d5841` ("iomap: fault in smaller chunks for non-large folio mappings"), we can see a performance loss for filesystems which have not yet been converted to large folios. Fixes: `c38f4e96e6` ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: https://lore.kernel.org/r/20240527201735.1898381-1-willy@infradead.org Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-12 22:03:25 +02:00
Linus Torvalds	a1460e457e	fix bitmap corruption on close_range(), take 2 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZrFgeQAKCRBZ7Krx/gZQ 6w4SAP48jL+Vil7ifIXviasoBrQGzf9lbTcOAmWAoaxjSlvlpAEAw4OyPhJUmKHW ykB/yqUMCajsrrTQPN5lmc0W5v0nqQ4= =G0ms -----END PGP SIGNATURE----- Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fd bitmap fix from Al Viro: "Fix bitmap corruption on close_range() by cleaning up copy_fd_bitmaps()" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE	2024-08-12 08:03:28 -07:00
Linus Torvalds	5189dafa4c	nfsd-6.11 fixes: - Two minor fixes for recent changes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAma3pFoACgkQM2qzM29m f5f4dA//SiPOS7v60jQubfV3f031E+W3a6YxvB5tqByA23hOavjYowGREyvhtaS5 lgqRO25BXz7uNOjSWUXaV7SsR1GKElWtk/gGVsfYy03A91Qvafb24oapJRwRXeJ1 5n6X5AScxJeoDGEIUMoGU3O87v4T6UoTj5K6bjUDqr5kTRTL6rPFj3jYvy0YVLe3 EsA1P+BLctB1XiboMYNfZyEJbxNF5gpO7OaeS8VPQIeNYhaF1cwsa7A9wDJ22UpB BQEguTjPqgzkthOdkc3jNcoXtDUEzAXxQqACKR8CU6cCoyr159KupSlci4LqA2o4 8CwcVd01L79JjGcrZqo+4vmbuY6rfhN7U/EYcNNJPZKUIo7Q5h4dq/pCaBADK2Ke UJVPsmdYXKMwn/pPJI9kjoellqSrjOxqFDXXqBb/e8Z/xXuXmiYRl6E+dtGeBlty j4qm12d1BGKs7zRaLLThbFTH6s3kUCjVhEr+uphse6CVF6HV/NNK79jooHrrgCiS e1vgRaYIgzMREOyq8fZwvINDuk/8Pa1Z4wlR65orwVzWlZMmk9bLm+wwplcyj34x una8RrQhxlicY5nAGUa+aRCkKUtO9oSPs6hukoT+DB06+mhQ7u3ro96ojueMfMdQ pKux8zrCdJe7jZUlnIW30/QsrgWhHW9KsrdUA9eKQ1FoZIOdS9I= =YX9E -----END PGP SIGNATURE----- Merge tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Two minor fixes for recent changes * tag 'nfsd-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: nfsd: don't set SVC_SOCK_ANONYMOUS when creating nfsd sockets sunrpc: avoid -Wformat-security warning	2024-08-10 10:44:21 -07:00
Linus Torvalds	31b2444606	bcachefs fixes for 6.11-rc2, more - fix a bug that was causing ACLs to seemingly "disappear" - new on disk format version, bcachefs_metadata_version_disk_accounting_v3 bcachefs_metadata_version_disk_accounting_v2 accidentally included padding in disk_accounting_key; fortunately, 6.11 isn't out yet so we can fix this with another version bump. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma3mGIACgkQE6szbY3K bnagAw//XP8wYgEE6worKtLb8sNx+w+jeKR1FeAsYBP012+58xBvljxsUm2J6Mct bs7nJmhextVEhcIhm43/SEaic1dXWNxlC5hvTPQzJCGAjyVCPH2naXpz284GQzIu HlwBlnWK2HjnSpPHFPEqRm3jq/H9SGEhENDJTrccg7E82towGktQspvmx1Bzyt/E cb31An0Y4lQmZGu1/bxX0IDjcKYdI4Jr+FQwiJV35KwGXSw9gzkfHii0wNY+gTAj OklXiw5oOtIR52Oy7FKLhYWdl6GIe4/hYS+yYJ3sMiSUuYY50otwIGGnCRLICRlf AsE4fmN8wjTPCRryOOTDvc/MCuKuanQa8x1pH0cM3WMKqhl4mrDD2wne+38gfgDN zQwoLHfDdDXtadf0NB/VoVQnHTfYSk6GKpHHduMhZeb45XNm0WNryZcefQMunSMz CTBq22f22iSngF0KCvHQgp7mm8XNk+d12keQ37ldINRkC+mDWPu7Yi/mhwYzN43O +6WrzxyQsQPhLE+nfYRlvAuDCCVsXqjL2U2EVyuaBmUJaMMZbFU+u9zpb48dMjIl dRxBQfOsJjvbvTMTNlG4EDslKi8HjoqHXMo5Y3nhvSAmQRjkltfhiXSRivdJxYuR NwjJYtszbbBIqSsUZ6DqR+1eTG6HSQ1ML0B+Nm8orLLMqIFlAkQ= =f3fR -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-08-10' of git://evilpiepirate.org/bcachefs Pull more bcachefs fixes from Kent Overstreet: "A couple last minute fixes for the new disk accounting - fix a bug that was causing ACLs to seemingly "disappear" - new on disk format version, bcachefs_metadata_version_disk_accounting_v3 bcachefs_metadata_version_disk_accounting_v2 accidentally included padding in disk_accounting_key; fortunately, 6.11 isn't out yet so we can fix this with another version bump" * tag 'bcachefs-2024-08-10' of git://evilpiepirate.org/bcachefs: bcachefs: bcachefs_metadata_version_disk_accounting_v3 bcachefs: improve bch2_dev_usage_to_text() bcachefs: bch2_accounting_invalid() bcachefs: Switch to .get_inode_acl()	2024-08-10 10:06:26 -07:00
Linus Torvalds	34ac1e82e5	Three smb3 client fixes including two for stable -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAma2k9YACgkQiiy9cAdy T1GkqAv7B8wyDp7pVU8qtTxct1gVtDEqvd2Kte6962uwvKWEhhf8SKzagypICudL vXxwJuPsRlFM5jWd8h36/xXC17BYM1mEsz6GVq9viUYUcqtrKrGPsOL7+OLVR3Mp SwPRPLYyLyMBHhc3Su2/6uGr87NMkcgDTzugZrx67ojAaw/xkYsiuz+Wm8ijbdik waDkJ3zosj0Nbhud47sf6rkrymiqC5kP717rhlcXF+TxNscFDQ/eFO8b0lt3g5q5 +Y338a4pLLFWXBm1jP9EOtbx7NSDKaqWVYYtnwEBs6EV3QGpKjbT1zBQ07wWjfT4 FhcFhYYukq7Jc4X39JouBXqYWR2wjB8VpMzCuwNsDNJ7FahNaGTgfXCMx6C0Bi0w XICAMoZfBXGSAs+NTyi38AHyPQ+KdYJeSgA9doL+3FhMGx9KKAVkT6XB3twhxoOS w/iMyzhx/1mwv4CukCKm3lDflN09AseB688QiFsTshTmnTlWFcHdEnORqWbNrd2N evhrKvxK =GIpz -----END PGP SIGNATURE----- Merge tag '6.11-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - DFS fix - fix for security flags for requiring encryption - minor cleanup * tag '6.11-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: cifs_inval_name_dfs_link_error: correct the check for fullpath Fix spelling errors in Server Message Block smb3: fix setting SecurityFlags when encryption is required	2024-08-09 21:33:25 -07:00
Kees Cook	3eb3cd5992	binfmt_flat: Fix corruption when not offsetting data start Commit `04d82a6d08` ("binfmt_flat: allow not offsetting data start") introduced a RISC-V specific variant of the FLAT format which does not allocate any space for the (obsolete) array of shared library pointers. However, it did not disable the code which initializes the array, resulting in the corruption of sizeof(long) bytes before the DATA segment, generally the end of the TEXT segment. Introduce MAX_SHARED_LIBS_UPDATE which depends on the state of CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET to guard the initialization of the shared library pointer region so that it will only be initialized if space is reserved for it. Fixes: `04d82a6d08` ("binfmt_flat: allow not offsetting data start") Co-developed-by: Stefan O'Rear <sorear@fastmail.com> Signed-off-by: Stefan O'Rear <sorear@fastmail.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Acked-by: Greg Ungerer <gerg@linux-m68k.org> Link: https://lore.kernel.org/r/20240807195119.it.782-kees@kernel.org Signed-off-by: Kees Cook <kees@kernel.org>	2024-08-09 20:19:00 -07:00
Kent Overstreet	8a2491db7b	bcachefs: bcachefs_metadata_version_disk_accounting_v3 bcachefs_metadata_version_disk_accounting_v2 erroneously had padding bytes in disk_accounting_key, which is a problem because we have to guarantee that all unused bytes in disk_accounting_key are zeroed. Fortunately 6.11 isn't out yet, so it's cheap to fix this by spinning a new version. Reported-by: Gabriel de Perthuis <g2p.code@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-09 19:21:28 -04:00
Kent Overstreet	1a9e219db1	bcachefs: improve bch2_dev_usage_to_text() Add a line for capacity Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-09 14:40:19 -04:00
Kent Overstreet	077e473723	bcachefs: bch2_accounting_invalid() Implement bch2_accounting_invalid(); check for junk at the end, and replicas accounting entries in particular need to be checked or we'll pop asserts later. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-09 14:40:19 -04:00
Namjae Jeon	f6bd41280a	ksmbd: override fsids for smb2_query_info() Sangsoo reported that a DAC denial error occurred when accessing files through the ksmbd thread. This patch override fsids for smb2_query_info(). Reported-by: Sangsoo Lee <constant.lee@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-08 22:54:09 -05:00
Namjae Jeon	a018c1b636	ksmbd: override fsids for share path check Sangsoo reported that a DAC denial error occurred when accessing files through the ksmbd thread. This patch override fsids for share path check. Reported-by: Sangsoo Lee <constant.lee@samsung.com> Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-08 22:54:09 -05:00
Gleb Korobeynikov	36bb22a08a	cifs: cifs_inval_name_dfs_link_error: correct the check for fullpath Replace the always-true check tcon->origin_fullpath with check of server->leaf_fullpath See https://bugzilla.kernel.org/show_bug.cgi?id=219083 The check of the new @tcon will always be true during mounting, since @tcon->origin_fullpath will only be set after the tree is connected to the latest common resource, as well as checking if the prefix paths from it are fully accessible. Fixes: `3ae872de41` ("smb: client: fix shared DFS root mounts with different prefixes") Reviewed-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Gleb Korobeynikov <gkorobeynikov@astralinux.ru> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-08 20:06:22 -05:00
Linus Torvalds	9466b6ae6b	tracing fixes for v6.11: - Have reading of event format files test if the meta data still exists. When a event is freed, a flag (EVENT_FILE_FL_FREED) in the meta data is set to state that it is to prevent any new references to it from happening while waiting for existing references to close. When the last reference closes, the meta data is freed. But the "format" was missing a check to this flag (along with some other files) that allowed new references to happen, and a use-after-free bug to occur. - Have the trace event meta data use the refcount infrastructure instead of relying on its own atomic counters. - Have tracefs inodes use alloc_inode_sb() for allocation instead of using kmem_cache_alloc() directly. - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the callers expect a real object or an ERR_PTR. - Have release_ei() use call_srcu() and not call_rcu() as all the protection is on SRCU and not RCU. - Fix ftrace_graph_ret_addr() to use the task passed in and not current. - Fix overflow bug in get_free_elt() where the counter can overflow the integer and cause an infinite loop. - Remove unused function ring_buffer_nr_pages() - Have tracefs freeing use the inode RCU infrastructure instead of creating its own. When the kernel had randomize structure fields enabled, the rcu field of the tracefs_inode was overlapping the rcu field of the inode structure, and corrupting it. Instead, use the destroy_inode() callback to do the initial cleanup of the code, and then have free_inode() free it. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZrTvXxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qu39AP9ze6ELpShDrxbXhf0adbNqG2IXMepa MMLqfq8tU8E/vAEAuZXJ6rKXeGvKeONa06ocvWJ0dpb2cy/n4hmx+KtM5gI= =Pkh4 -----END PGP SIGNATURE----- Merge tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Have reading of event format files test if the metadata still exists. When a event is freed, a flag (EVENT_FILE_FL_FREED) in the metadata is set to state that it is to prevent any new references to it from happening while waiting for existing references to close. When the last reference closes, the metadata is freed. But the "format" was missing a check to this flag (along with some other files) that allowed new references to happen, and a use-after-free bug to occur. - Have the trace event meta data use the refcount infrastructure instead of relying on its own atomic counters. - Have tracefs inodes use alloc_inode_sb() for allocation instead of using kmem_cache_alloc() directly. - Have eventfs_create_dir() return an ERR_PTR instead of NULL as the callers expect a real object or an ERR_PTR. - Have release_ei() use call_srcu() and not call_rcu() as all the protection is on SRCU and not RCU. - Fix ftrace_graph_ret_addr() to use the task passed in and not current. - Fix overflow bug in get_free_elt() where the counter can overflow the integer and cause an infinite loop. - Remove unused function ring_buffer_nr_pages() - Have tracefs freeing use the inode RCU infrastructure instead of creating its own. When the kernel had randomize structure fields enabled, the rcu field of the tracefs_inode was overlapping the rcu field of the inode structure, and corrupting it. Instead, use the destroy_inode() callback to do the initial cleanup of the code, and then have free_inode() free it. * tag 'trace-v6.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracefs: Use generic inode RCU for synchronizing freeing ring-buffer: Remove unused function ring_buffer_nr_pages() tracing: Fix overflow in get_free_elt() function_graph: Fix the ret_stack used by ftrace_graph_ret_addr() eventfs: Use SRCU for freeing eventfs_inodes eventfs: Don't return NULL in eventfs_create_dir() tracefs: Fix inode allocation tracing: Use refcount for trace_event_file reference counter tracing: Have format file honor EVENT_FILE_FL_FREED	2024-08-08 13:32:59 -07:00
Linus Torvalds	b3f5620f76	bcachefs fixes for 6.11-rc3 Assorted little stuff: - lockdep fixup for lockdep_set_notrack_class() - we can now remove a device when using erasure coding without deadlocking, though we still hit other issues - the "allocator stuck" timeout is now configurable, and messages are ratelimited; default timeout has been increased from 10 seconds to 30 -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAma05JwACgkQE6szbY3K bnYE+Q//VJ6R/UxDoxjk8zgftRcdgwXod6U+/E0Kj3ZBKLYXkcGaWWmmGMkFafBp eL7Y3wtHSKiMsHYX9KEdFUZFLe1KI4c16RgNIXk9nwkF+3/+8pEDHKPFuoGHJH3O HComHGqwVg8Zx2jRNvEkvQ980iH7OBGhCjMFXhJ3xbMGLdw91TQQi49a+Q/vz7QT y3Cl1dgX5xBl7fqKefsYa+X6mpWi4/6t60vJvatI+bvDfznjI6jN3qGVLlQye7tC 6VbJAjHsPPyNMlWa99UaHqDdaM325zR2ES0bsfHd8Up4iAwO8OgjzYQxpYTgi51i 6DTiGEOV2S8gF+Rnprnbzsnau0hEvrtQY2Ub85TCIGbZJa8b+aDIlq9k8jF36O2E 2CUTleQ/E129RxXpkZGsVRpNmemdCi6rHAcluaFEgezX4FJH8BVOwQQq2Xz7rd7E 3ZP5iAWmX0IgOL0VOCP/ZXl/lEMwSk0VAED3jEbT7f2K7rU9nXDO2bIEx1wXDCm1 b32kvmUi2FBjqLHSqvAPEb52tvvZuliMUY7z9dEx+AX9AVC9kGE+amGexosKb/LY nWzey+D0cKHtgbkMFrCClkpg75Tnt9ISJbad53+5qhN8an/a71djdj8Zk0UQnQjv 6Amv4Ns1lDo3XGC1QtYkF5HqiWaupbUXAftptpS4Av4X1zZEQIc= =q1dD -----END PGP SIGNATURE----- Merge tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs Pull bcachefs fixes from Kent Overstreet: "Assorted little stuff: - lockdep fixup for lockdep_set_notrack_class() - we can now remove a device when using erasure coding without deadlocking, though we still hit other issues - the 'allocator stuck' timeout is now configurable, and messages are ratelimited. The default timeout has been increased from 10 seconds to 30" * tag 'bcachefs-2024-08-08' of git://evilpiepirate.org/bcachefs: bcachefs: Use bch2_wait_on_allocator() in btree node alloc path bcachefs: Make allocator stuck timeout configurable, ratelimit messages bcachefs: Add missing path_traverse() to btree_iter_next_node() bcachefs: ec should not allocate from ro devs bcachefs: Improved allocator debugging for ec bcachefs: Add missing bch2_trans_begin() call bcachefs: Add a comment for bucket helper types bcachefs: Don't rely on implicit unsigned -> signed integer conversion lockdep: Fix lockdep_set_notrack_class() for CONFIG_LOCK_STAT bcachefs: Fix double free of ca->buckets_nouse	2024-08-08 13:27:31 -07:00
Kent Overstreet	f39bae2e02	bcachefs: Switch to .get_inode_acl() .set_acl() requires a dentry, and if one isn't passed it marks the VFS inode as not having an ACL. This has been causing inodes with ACLs to have them "disappear" on bcachefs filesystem, depending on which path those inodes get pulled into the cache from. Switching to .get_inode_acl(), like other local filesystems, fixes this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-08 15:14:02 -04:00
Xiaxi Shen	bdcffe4be7	Fix spelling errors in Server Message Block Fixed typos in various files under fs/smb/client/ Signed-off-by: Xiaxi Shen <shenxiaxi26@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-08 11:15:33 -05:00
Steve French	1b5487aefb	smb3: fix setting SecurityFlags when encryption is required Setting encryption as required in security flags was broken. For example (to require all mounts to be encrypted by setting): "echo 0x400c5 > /proc/fs/cifs/SecurityFlags" Would return "Invalid argument" and log "Unsupported security flags" This patch fixes that (e.g. allowing overriding the default for SecurityFlags 0x00c5, including 0x40000 to require seal, ie SMB3.1.1 encryption) so now that works and forces encryption on subsequent mounts. Acked-by: Bharath SM <bharathsm@microsoft.com> Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-08 11:14:53 -05:00
Kent Overstreet	73dc1656f4	bcachefs: Use bch2_wait_on_allocator() in btree node alloc path If the allocator gets stuck, we need to know why. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-07 21:04:55 -04:00
Kent Overstreet	cecf72798b	bcachefs: Make allocator stuck timeout configurable, ratelimit messages Limit these messages to once every 2 minutes to avoid spamming logs; with multiple devices the output can be quite significant. Also, up the default timeout to 30 seconds from 10 seconds. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-07 21:04:55 -04:00
Kent Overstreet	6d496e02b4	bcachefs: Add missing path_traverse() to btree_iter_next_node() This fixes a bug exposed by the next path - we pop an assert in path_set_should_be_locked(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-07 21:04:55 -04:00
Steven Rostedt	0b6743bd60	tracefs: Use generic inode RCU for synchronizing freeing With structure layout randomization enabled for 'struct inode' we need to avoid overlapping any of the RCU-used / initialized-only-once members, e.g. i_lru or i_sb_list to not corrupt related list traversals when making use of the rcu_head. For an unlucky structure layout of 'struct inode' we may end up with the following splat when running the ftrace selftests: [<...>] list_del corruption, ffff888103ee2cb0->next (tracefs_inode_cache+0x0/0x4e0 [slab object]) is NULL (prev is tracefs_inode_cache+0x78/0x4e0 [slab object]) [<...>] ------------[ cut here ]------------ [<...>] kernel BUG at lib/list_debug.c:54! [<...>] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [<...>] CPU: 3 PID: 2550 Comm: mount Tainted: G N 6.8.12-grsec+ #122 ed2f536ca62f28b087b90e3cc906a8d25b3ddc65 [<...>] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [<...>] RIP: 0010:[<ffffffff84656018>] __list_del_entry_valid_or_report+0x138/0x3e0 [<...>] Code: 48 b8 99 fb 65 f2 ff ff ff ff e9 03 5c d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff e9 33 5a d9 fc cc 48 b8 99 fb 65 f2 ff ff ff ff <0f> 0b 4c 89 e9 48 89 ea 48 89 ee 48 c7 c7 60 8f dd 89 31 c0 e8 2f [<...>] RSP: 0018:fffffe80416afaf0 EFLAGS: 00010283 [<...>] RAX: 0000000000000098 RBX: ffff888103ee2cb0 RCX: 0000000000000000 [<...>] RDX: ffffffff84655fe8 RSI: ffffffff89dd8b60 RDI: 0000000000000001 [<...>] RBP: ffff888103ee2cb0 R08: 0000000000000001 R09: fffffbd0082d5f25 [<...>] R10: fffffe80416af92f R11: 0000000000000001 R12: fdf99c16731d9b6d [<...>] R13: 0000000000000000 R14: ffff88819ad4b8b8 R15: 0000000000000000 [<...>] RBX: tracefs_inode_cache+0x0/0x4e0 [slab object] [<...>] RDX: __list_del_entry_valid_or_report+0x108/0x3e0 [<...>] RSI: __func__.47+0x4340/0x4400 [<...>] RBP: tracefs_inode_cache+0x0/0x4e0 [slab object] [<...>] RSP: process kstack fffffe80416afaf0+0x7af0/0x8000 [mount 2550 2550] [<...>] R09: kasan shadow of process kstack fffffe80416af928+0x7928/0x8000 [mount 2550 2550] [<...>] R10: process kstack fffffe80416af92f+0x792f/0x8000 [mount 2550 2550] [<...>] R14: tracefs_inode_cache+0x78/0x4e0 [slab object] [<...>] FS: 00006dcb380c1840(0000) GS:ffff8881e0600000(0000) knlGS:0000000000000000 [<...>] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [<...>] CR2: 000076ab72b30e84 CR3: 000000000b088004 CR4: 0000000000360ef0 shadow CR4: 0000000000360ef0 [<...>] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [<...>] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [<...>] ASID: 0003 [<...>] Stack: [<...>] ffffffff818a2315 00000000f5c856ee ffffffff896f1840 ffff888103ee2cb0 [<...>] ffff88812b6b9750 0000000079d714b6 fffffbfff1e9280b ffffffff8f49405f [<...>] 0000000000000001 0000000000000000 ffff888104457280 ffffffff8248b392 [<...>] Call Trace: [<...>] <TASK> [<...>] [<ffffffff818a2315>] ? lock_release+0x175/0x380 fffffe80416afaf0 [<...>] [<ffffffff8248b392>] list_lru_del+0x152/0x740 fffffe80416afb48 [<...>] [<ffffffff8248ba93>] list_lru_del_obj+0x113/0x280 fffffe80416afb88 [<...>] [<ffffffff8940fd19>] ? _atomic_dec_and_lock+0x119/0x200 fffffe80416afb90 [<...>] [<ffffffff8295b244>] iput_final+0x1c4/0x9a0 fffffe80416afbb8 [<...>] [<ffffffff8293a52b>] dentry_unlink_inode+0x44b/0xaa0 fffffe80416afbf8 [<...>] [<ffffffff8293fefc>] __dentry_kill+0x23c/0xf00 fffffe80416afc40 [<...>] [<ffffffff8953a85f>] ? __this_cpu_preempt_check+0x1f/0xa0 fffffe80416afc48 [<...>] [<ffffffff82949ce5>] ? shrink_dentry_list+0x1c5/0x760 fffffe80416afc70 [<...>] [<ffffffff82949b71>] ? shrink_dentry_list+0x51/0x760 fffffe80416afc78 [<...>] [<ffffffff82949da8>] shrink_dentry_list+0x288/0x760 fffffe80416afc80 [<...>] [<ffffffff8294ae75>] shrink_dcache_sb+0x155/0x420 fffffe80416afcc8 [<...>] [<ffffffff8953a7c3>] ? debug_smp_processor_id+0x23/0xa0 fffffe80416afce0 [<...>] [<ffffffff8294ad20>] ? do_one_tree+0x140/0x140 fffffe80416afcf8 [<...>] [<ffffffff82997349>] ? do_remount+0x329/0xa00 fffffe80416afd18 [<...>] [<ffffffff83ebf7a1>] ? security_sb_remount+0x81/0x1c0 fffffe80416afd38 [<...>] [<ffffffff82892096>] reconfigure_super+0x856/0x14e0 fffffe80416afd70 [<...>] [<ffffffff815d1327>] ? ns_capable_common+0xe7/0x2a0 fffffe80416afd90 [<...>] [<ffffffff82997436>] do_remount+0x416/0xa00 fffffe80416afdd0 [<...>] [<ffffffff829b2ba4>] path_mount+0x5c4/0x900 fffffe80416afe28 [<...>] [<ffffffff829b25e0>] ? finish_automount+0x13a0/0x13a0 fffffe80416afe60 [<...>] [<ffffffff82903812>] ? user_path_at_empty+0xb2/0x140 fffffe80416afe88 [<...>] [<ffffffff829b2ff5>] do_mount+0x115/0x1c0 fffffe80416afeb8 [<...>] [<ffffffff829b2ee0>] ? path_mount+0x900/0x900 fffffe80416afed8 [<...>] [<ffffffff8272461c>] ? __kasan_check_write+0x1c/0xa0 fffffe80416afee0 [<...>] [<ffffffff829b31cf>] __do_sys_mount+0x12f/0x280 fffffe80416aff30 [<...>] [<ffffffff829b36cd>] __x64_sys_mount+0xcd/0x2e0 fffffe80416aff70 [<...>] [<ffffffff819f8818>] ? syscall_trace_enter+0x218/0x380 fffffe80416aff88 [<...>] [<ffffffff8111655e>] x64_sys_call+0x5d5e/0x6720 fffffe80416affa8 [<...>] [<ffffffff8952756d>] do_syscall_64+0xcd/0x3c0 fffffe80416affb8 [<...>] [<ffffffff8100119b>] entry_SYSCALL_64_safe_stack+0x4c/0x87 fffffe80416affe8 [<...>] </TASK> [<...>] <PTREGS> [<...>] RIP: 0033:[<00006dcb382ff66a>] vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read\|exec\|mayread\|mayexec)]+0x0/0xb8 [userland map] [<...>] Code: 48 8b 0d 29 18 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f6 17 0d 00 f7 d8 64 89 01 48 [<...>] RSP: 002b:0000763d68192558 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [<...>] RAX: ffffffffffffffda RBX: 00006dcb38433264 RCX: 00006dcb382ff66a [<...>] RDX: 000017c3e0d11210 RSI: 000017c3e0d1a5a0 RDI: 000017c3e0d1ae70 [<...>] RBP: 000017c3e0d10fb0 R08: 000017c3e0d11260 R09: 00006dcb383d1be0 [<...>] R10: 000000000020002e R11: 0000000000000246 R12: 0000000000000000 [<...>] R13: 000017c3e0d1ae70 R14: 000017c3e0d11210 R15: 000017c3e0d10fb0 [<...>] RBX: vm_area_struct[mount 2550 2550 file 6dcb38433000-6dcb38434000 5b 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] RCX: vm_area_struct[mount 2550 2550 file 6dcb38225000-6dcb3837e000 22 55(read\|exec\|mayread\|mayexec)]+0x0/0xb8 [userland map] [<...>] RDX: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] RSI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] RDI: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] RBP: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] RSP: vm_area_struct[mount 2550 2550 anon 763d68173000-763d68195000 7ffffffdd 100133(read\|write\|mayread\|maywrite\|growsdown\|account)]+0x0/0xb8 [userland map] [<...>] R08: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] R09: vm_area_struct[mount 2550 2550 file 6dcb383d1000-6dcb383d3000 1cd 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] R13: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] R14: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] R15: vm_area_struct[mount 2550 2550 anon 17c3e0d0f000-17c3e0d31000 17c3e0d0f 100033(read\|write\|mayread\|maywrite\|account)]+0x0/0xb8 [userland map] [<...>] </PTREGS> [<...>] Modules linked in: [<...>] ---[ end trace 0000000000000000 ]--- The list debug message as well as RBX's symbolic value point out that the object in question was allocated from 'tracefs_inode_cache' and that the list's '->next' member is at offset 0. Dumping the layout of the relevant parts of 'struct tracefs_inode' gives the following: struct tracefs_inode { union { struct inode { struct list_head { struct list_head * next; /* 0 8 / struct list_head prev; /* 8 8 / } i_lru; [...] } vfs_inode; struct callback_head { void (func)(struct callback_head ); / 0 8 / struct callback_head next; /* 8 8 / } rcu; }; [...] }; Above shows that 'vfs_inode.i_lru' overlaps with 'rcu' which will destroy the 'i_lru' list as soon as the 'rcu' member gets used, e.g. in call_rcu() or later when calling the RCU callback. This will disturb concurrent list traversals as well as object reuse which assumes these list heads will keep their integrity. For reproduction, the following diff manually overlays 'i_lru' with 'rcu' as, otherwise, one would require some good portion of luck for gambling an unlucky RANDSTRUCT seed: --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -629,6 +629,7 @@ struct inode { umode_t i_mode; unsigned short i_opflags; kuid_t i_uid; + struct list_head i_lru; / inode LRU list / kgid_t i_gid; unsigned int i_flags; @@ -690,7 +691,6 @@ struct inode { u16 i_wb_frn_avg_time; u16 i_wb_frn_history; #endif - struct list_head i_lru; / inode LRU list / struct list_head i_sb_list; struct list_head i_wb_list; / backing dev writeback list */ union { The tracefs inode does not need to supply its own RCU delayed destruction of its inode. The inode code itself offers both a "destroy_inode()" callback that gets called when the last reference of the inode is released, and the "free_inode()" which is called after a RCU synchronization period from the "destroy_inode()". The tracefs code can unlink the inode from its list in the destroy_inode() callback, and the simply free it from the free_inode() callback. This should provide the same protection. Link: https://lore.kernel.org/all/20240807115143.45927-3-minipli@grsecurity.net/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Ilkka =?utf-8?b?TmF1bGFww6TDpA==?= <digirigawa@gmail.com> Link: https://lore.kernel.org/20240807185402.61410544@gandalf.local.home Fixes: `baa23a8d43` ("tracefs: Reset permissions on remount if permissions are options") Reported-by: Mathias Krause <minipli@grsecurity.net> Reported-by: Brad Spengler <spender@grsecurity.net> Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-08-07 20:27:49 -04:00
Mathias Krause	8e55643247	eventfs: Use SRCU for freeing eventfs_inodes To mirror the SRCU lock held in eventfs_iterate() when iterating over eventfs inodes, use call_srcu() to free them too. This was accidentally(?) degraded to RCU in commit `43aa6f97c2` ("eventfs: Get rid of dentry pointers without refcounts"). Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20240723210755.8970-1-minipli@grsecurity.net Fixes: `43aa6f97c2` ("eventfs: Get rid of dentry pointers without refcounts") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-08-07 19:28:48 -04:00
Mathias Krause	12c20c65d0	eventfs: Don't return NULL in eventfs_create_dir() Commit `77a06c33a2` ("eventfs: Test for ei->is_freed when accessing ei->dentry") added another check, testing if the parent was freed after we released the mutex. If so, the function returns NULL. However, all callers expect it to either return a valid pointer or an error pointer, at least since commit `5264a2f4bb` ("tracing: Fix a NULL vs IS_ERR() bug in event_subsystem_dir()"). Returning NULL will therefore fail the error condition check in the caller. Fix this by substituting the NULL return value with a fitting error pointer. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable@vger.kernel.org Fixes: `77a06c33a2` ("eventfs: Test for ei->is_freed when accessing ei->dentry") Link: https://lore.kernel.org/20240723122522.2724-1-minipli@grsecurity.net Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Ajay Kaher <ajay.kaher@broadcom.com> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-08-07 19:28:31 -04:00
Mathias Krause	0df2ac59be	tracefs: Fix inode allocation The leading comment above alloc_inode_sb() is pretty explicit about it: /* * This must be used for allocating filesystems specific inodes to set * up the inode reclaim context correctly. */ Switch tracefs over to alloc_inode_sb() to make sure inodes are properly linked. Cc: Ajay Kaher <ajay.kaher@broadcom.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20240807115143.45927-2-minipli@grsecurity.net Fixes: `ba37ff75e0` ("eventfs: Implement tracefs_inode_cache") Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2024-08-07 18:49:06 -04:00
Linus Torvalds	6a0e382640	for-6.11-rc2-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmazbGYACgkQxWXV+ddt WDsQAw//Z3XjjylTZPuHNk/AiMe5oochxB5T9ZracQOzG0o70gj1w/UQIZBkSzp1 66g8I4YdbwvEXKDg9Oi/GPDSON3GuhAiLXp+0Y/reeD/totgvrhROuJ3mIk5CZ0H B4fIKH3xCKLQan26Opgju4qjum7+AR7ekFveM6GicxXXb3eAYALgoEFt63eZjZVu 7myak78gmBuK5QdGHH+onEhn+HfC57UTGBqu1bJsSOQC7dANkU+WzmgbH6FeOHqx 2T/lN/tu2tBBoF4zMvC472Zjmj4PnubNQnwwv0oJ8Z2Y0yIY95joZV0uXIXO72oz BMQC6s0cltiTn1Tfe4iIWn+ZNjcfGAZO7aoD5NcJb2F/Mz5mZuMhtq3BND0wJ8/+ SuYk7PxX7tcOaFrDAn3Ne7XHsD7r5lLkFICXkzcNG4dqkBUxR3dN4Oi8KKMFrgCP kTpf0/lAkYKYSrU86Yn1zhwRLaH8jm3fulVUY8i/p0tJnpsW8AmeME1Sk97Bhxbb y6zt8+MPscPuQi3jPsBevaYd8Q8BIT34vVtU/jNmGAyqEv1wVYQRRK2Ma4sJJfNk EEQV9i4VqXWLc17dcWVZrG6lEsVRIIBE/2Adhth6Myq73bkunpB9ZNNNZApf1T2j Wn5gPzkuQd6FWrXe8V7oSN9rT1OPn9uHL6BmSUWhHUCuFeATgoc= =03Gf -----END PGP SIGNATURE----- Merge tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix double inode unlock for direct IO sync writes (reported by syzbot) - fix root tree id/name map definitions, don't use fixed size buffers for name (reported by -Werror=unterminated-string-initialization) - fix qgroup reserve leaks in bufferd write path - update scrub status structure more often so it can be reported in user space more accurately and let 'resume' not repeat work - in preparation to remove space cache v1 in the future print a warning if it's detected * tag 'for-6.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: avoid using fixed char array size for tree names btrfs: fix double inode unlock for direct IO sync writes btrfs: emit a warning about space cache v1 being deprecated btrfs: fix qgroup reserve leaks in cow_file_range btrfs: implement launder_folio for clearing dirty page reserve btrfs: scrub: update last_physical after scrubbing one stripe btrfs: factor out stripe length calculation into a helper	2024-08-07 09:53:41 -07:00
Kent Overstreet	2caca9fb16	bcachefs: ec should not allocate from ro devs This fixes a device removal deadlock when using erasure coding. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-07 08:31:10 -04:00
Kent Overstreet	c1e4446247	bcachefs: Improved allocator debugging for ec chasing down a device removal deadlock with erasure coding Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-07 08:31:10 -04:00
Kent Overstreet	02026e8931	bcachefs: Add missing bch2_trans_begin() call Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-07 08:31:10 -04:00
Kent Overstreet	90b211fa2d	bcachefs: Add a comment for bucket helper types We've had bugs in the past with incorrect integer conversions in disk accounting code, which is why bucket helpers now always return s64s; add a comment explaining this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-07 08:31:10 -04:00
Kent Overstreet	7442b5cdf2	bcachefs: Don't rely on implicit unsigned -> signed integer conversion implicit integer conversion is a fertile source of bugs, and we really would rather not have the min()/max() macros doing it implicitly. bcachefs appears to be the only place in the kernel where this happens, so let's fix it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-08-07 08:31:10 -04:00
Al Viro	9a2fa14720	fix bitmap corruption on close_range() with CLOSE_RANGE_UNSHARE copy_fd_bitmaps(new, old, count) is expected to copy the first count/BITS_PER_LONG bits from old->full_fds_bits[] and fill the rest with zeroes. What it does is copying enough words (BITS_TO_LONGS(count/BITS_PER_LONG)), then memsets the rest. That works fine, if all bits past the cutoff point are clear. Otherwise we are risking garbage from the last word we'd copied. For most of the callers that is true - expand_fdtable() has count equal to old->max_fds, so there's no open descriptors past count, let alone fully occupied words in ->open_fds[], which is what bits in ->full_fds_bits[] correspond to. The other caller (dup_fd()) passes sane_fdtable_size(old_fdt, max_fds), which is the smallest multiple of BITS_PER_LONG that covers all opened descriptors below max_fds. In the common case (copying on fork()) max_fds is ~0U, so all opened descriptors will be below it and we are fine, by the same reasons why the call in expand_fdtable() is safe. Unfortunately, there is a case where max_fds is less than that and where we might, indeed, end up with junk in ->full_fds_bits[] - close_range(from, to, CLOSE_RANGE_UNSHARE) with * descriptor table being currently shared * 'to' being above the current capacity of descriptor table * 'from' being just under some chunk of opened descriptors. In that case we end up with observably wrong behaviour - e.g. spawn a child with CLONE_FILES, get all descriptors in range 0..127 open, then close_range(64, ~0U, CLOSE_RANGE_UNSHARE) and watch dup(0) ending up with descriptor #128, despite #64 being observably not open. The minimally invasive fix would be to deal with that in dup_fd(). If this proves to add measurable overhead, we can go that way, but let's try to fix copy_fd_bitmaps() first. * new helper: bitmap_copy_and_expand(to, from, bits_to_copy, size). * make copy_fd_bitmaps() take the bitmap size in words, rather than bits; it's 'count' argument is always a multiple of BITS_PER_LONG, so we are not losing any information, and that way we can use the same helper for all three bitmaps - compiler will see that count is a multiple of BITS_PER_LONG for the large ones, so it'll generate plain memcpy()+memset(). Reproducer added to tools/testing/selftests/core/close_range_test.c Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-08-05 19:23:11 -04:00
Linus Torvalds	3f3f6d6123	'smb3 client fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmauj9AACgkQiiy9cAdy T1G4HgwAvdAPAn2BAFYT/SaXfkN3EX78mcGe85wA8CXQep7q/ik+3xAwvNMKOWmo OYmem3TqdfK4N4wkXCWd7TpKI+DZQAyt9ocIk8MhDWoIxp2A1nX/80SJUjTAJWvg 8Q2HBZu8GfYyw8PW9KfR4hBOixvA8dLXZI7vNSvHP4S7XA10OP/HFTkwi4pPlkLF ZuSZNMU0Enwmzay1pUkVp9r2dq1ZDKtilbFFmN+bnuMoAigp//HDFFxx/zjIXCqb FdhA+bl9Wj1f2r164qDRHoVg2kVX2lyIzhQJtAWdIqxPEAfUgZCu//KN5NgdstYx sQID8DL0MDeDYRhvuoAVinLpLvJFVf0O4K43f5kY1HXA4JFn//lY9zPNE4FLuwrw Ez+WsB70YHhof14n5w1hgcDE5XMeLZLa3SbVNoyhTW4C7xjJj1cqMEfnqZTGUsLx s2sZJnLhoX0aThTp9+Wc4KLy9Z8QjOy3GMmc7tmCtwHfYJTocly8wfWCrR9VYvBP yVIhZCbt =MKOr -----END PGP SIGNATURE----- Merge tag '6.11-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - two reparse point fixes - minor cleanup - additional trace point (to help debug a recent problem) * tag '6.11-rc1-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal version number smb: client: fix FSCTL_GET_REPARSE_POINT against NetApp smb3: add dynamic tracepoints for shutdown ioctl cifs: Remove cifs_aio_ctx smb: client: handle lack of FSCTL_GET_REPARSE_POINT support	2024-08-04 08:18:40 -07:00
Linus Torvalds	d3426a6ed9	Bug fixes for 6.11-rc1: * Fix memory leak when corruption is detected during scrubbing parent pointers. * Allow SECURE namespace xattrs to use reserved block pool to in order to prevent ENOSPC. * Save stack space by passing tracepoint's char array to file_path() instead of another stack variable. * Remove unused parameter in macro XFS_DQUOT_LOGRES. * Replace comma with semicolon in a couple of places. Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQjMC4mbgVeU7MxEIYH7y4RirJu9AUCZqoucAAKCRAH7y4RirJu 9LlvAP9J85bGKmcBcy0SLbuqatg6aut/ev/7+qI0FHdaRp1mYQEA+ryJarLdl8kM 8EqMUwDF3CzK3o88hrTMu6lT5F6Mpw4= =CCxD -----END PGP SIGNATURE----- Merge tag 'xfs-6.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs fixes from Chandan Babu: - Fix memory leak when corruption is detected during scrubbing parent pointers - Allow SECURE namespace xattrs to use reserved block pool to in order to prevent ENOSPC - Save stack space by passing tracepoint's char array to file_path() instead of another stack variable - Remove unused parameter in macro XFS_DQUOT_LOGRES - Replace comma with semicolon in a couple of places * tag 'xfs-6.11-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: convert comma to semicolon xfs: convert comma to semicolon xfs: remove unused parameter in macro XFS_DQUOT_LOGRES xfs: fix file_path handling in tracepoints xfs: allow SECURE namespace xattrs to use reserved block pool xfs: fix a memory leak	2024-08-03 09:09:25 -07:00
Qu Wenruo	12653ec361	btrfs: avoid using fixed char array size for tree names [BUG] There is a bug report that using the latest trunk GCC 15, btrfs would cause unterminated-string-initialization warning: linux-6.6/fs/btrfs/print-tree.c:29:49: error: initializer-string for array of ‘char’ is too long [-Werror=unterminated-string-initialization] 29 \| { BTRFS_BLOCK_GROUP_TREE_OBJECTID, "BLOCK_GROUP_TREE" }, \| ^~~~~~~~~~~~~~~~~~ [CAUSE] To print tree names we have an array of root_name_map structure, which uses "char name[16];" to store the name string of a tree. But the following trees have names exactly at 16 chars length: - "BLOCK_GROUP_TREE" - "RAID_STRIPE_TREE" This means we will have no space for the terminating '\0', and can lead to unexpected access when printing the name. [FIX] Instead of "char name[16];" use "const char *" instead. Since the name strings are all read-only data, and are all NULL terminated by default, there is not much need to bother the length at all. Reported-by: Sam James <sam@gentoo.org> Reported-by: Alejandro Colomar <alx@kernel.org> Fixes: `edde81f1ab` ("btrfs: add raid stripe tree pretty printer") Fixes: `9c54e80ddc` ("btrfs: add code to support the block group root") CC: stable@vger.kernel.org # 6.1+ Suggested-by: Alejandro Colomar <alx@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Alejandro Colomar <alx@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-02 22:44:27 +02:00
Filipe Manana	e0391e92f9	btrfs: fix double inode unlock for direct IO sync writes If we do a direct IO sync write, at btrfs_sync_file(), and we need to skip inode logging or we get an error starting a transaction or an error when flushing delalloc, we end up unlocking the inode when we shouldn't under the 'out_release_extents' label, and then unlock it again at btrfs_direct_write(). Fix that by checking if we have to skip inode unlocking under that label. Reported-by: syzbot+7dbbb74af6291b5a5a8b@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000dfd631061eaeb4bc@google.com/ Fixes: `939b656bc8` ("btrfs: fix corruption after buffer fault in during direct IO append write") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-02 22:32:40 +02:00
Linus Torvalds	1c4246294c	A fix for a potential hang in the MDS when cap revocation races with the client releasing the caps in question, marked for stable. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmatDEgTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzixHdB/9k62+n/hLFTXem1WE421sg6oAwCNA+ KCps+TU4ekeo7ckvExeO1MwL5SyRMUc05nY9zHDiqey0hkpuUsZfjvj9v/q0XHDh YcGLfIe9gTl2QiAeEjl6FRkJas5GA/MULjEKltEFUpRTCH9WQvS2z5/MzJByUobf kHtEokAViEB+OpIabN3Vm4ZcQJ5ZLdbmVzlvf3nRYZ1Yw+OinQn3yVQJ2VKYXiOi l9rk9XxUSISV7EKUq9N0qlSH7M1YsspMTM1lSIC1vlaILyeTA1dJa/85ACEtXOfm Z7mC77H6ivR2o+eXilgvpEAyCCDRfmwUp4GV1LnqFDZKBJsBAcwLyLGY =I++X -----END PGP SIGNATURE----- Merge tag 'ceph-for-6.11-rc2' of https://github.com/ceph/ceph-client Pull ceph fix from Ilya Dryomov: "A fix for a potential hang in the MDS when cap revocation races with the client releasing the caps in question, marked for stable" * tag 'ceph-for-6.11-rc2' of https://github.com/ceph/ceph-client: ceph: force sending a cap update msg back to MDS for revoke op	2024-08-02 10:33:06 -07:00
Steve French	a91bfa6760	cifs: update internal version number To 2.50 Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-02 10:56:14 -05:00
Paulo Alcantara	ddecea00f8	smb: client: fix FSCTL_GET_REPARSE_POINT against NetApp NetApp server requires the file to be open with FILE_READ_EA access in order to support FSCTL_GET_REPARSE_POINT, otherwise it will return STATUS_INVALID_DEVICE_REQUEST. It doesn't make any sense because there's no requirement for FILE_READ_EA bit to be set nor STATUS_INVALID_DEVICE_REQUEST being used for something other than "unsupported reparse points" in MS-FSA. To fix it and improve compatibility, set FILE_READ_EA & SYNCHRONIZE bits to match what Windows client currently does. Tested-by: Sebastian Steinbeisser <Sebastian.Steinbeisser@lrz.de> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-02 10:56:02 -05:00
Steve French	69ca1f5755	smb3: add dynamic tracepoints for shutdown ioctl For debugging an umount failure in xfstests generic/043 generic/044 in some configurations, we needed more information on the shutdown ioctl which was suspected of being related to the cause, so tracepoints are added in this patch e.g. "trace-cmd record -e smb3_shutdown_enter -e smb3_shutdown_done -e smb3_shutdown_err" Sample output: godown-47084 [011] ..... 3313.756965: smb3_shutdown_enter: flags=0x1 tid=0x733b3e75 godown-47084 [011] ..... 3313.756968: smb3_shutdown_done: flags=0x1 tid=0x733b3e75 Tested-by: Anthony Nandaa (Microsoft) <profnandaa@gmail.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-02 10:55:49 -05:00
David Howells	cd93650798	cifs: Remove cifs_aio_ctx Remove struct cifs_aio_ctx and its associated alloc/release functions as it is no longer used, the functions being taken over by netfslib. Signed-off-by: David Howells <dhowells@redhat.com> cc: Steve French <sfrench@samba.org> cc: linux-cifs@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-02 10:55:45 -05:00
Paulo Alcantara	4b96024ef2	smb: client: handle lack of FSCTL_GET_REPARSE_POINT support As per MS-FSA 2.1.5.10.14, support for FSCTL_GET_REPARSE_POINT is optional and if the server doesn't support it, STATUS_INVALID_DEVICE_REQUEST must be returned for the operation. If we find files with reparse points and we can't read them due to lack of client or server support, just ignore it and then treat them as regular files or junctions. Fixes: `5f71ebc412` ("smb: client: parse reparse point flag in create response") Reported-by: Sebastian Steinbeisser <Sebastian.Steinbeisser@lrz.de> Tested-by: Sebastian Steinbeisser <Sebastian.Steinbeisser@lrz.de> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-08-02 10:55:22 -05:00
Linus Torvalds	bbea34e693	do_dup2() out-of-bounds array speculation fix -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZqvoJgAKCRBZ7Krx/gZQ 6/QwAQD7oRzLm3Wg63EQIcbvhpZZlQvA7/FHVUVbIPE6+5ovWQEAsdHqyvyEXKMc osLQJ5hAaYkfEgzjgy9kxR4f7tbdiwI= =sbYX -----END PGP SIGNATURE----- Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fix from Al Viro: "do_dup2() out-of-bounds array speculation fix" * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: protect the fetch of ->fd[fd] in do_dup2() from mispredictions	2024-08-02 08:52:27 -07:00
Al Viro	8aa37bde1a	protect the fetch of ->fd[fd] in do_dup2() from mispredictions both callers have verified that fd is not greater than ->max_fds; however, misprediction might end up with tofree = fdt->fd[fd]; being speculatively executed. That's wrong for the same reasons why it's wrong in close_fd()/file_close_fd_locked(); the same solution applies - array_index_nospec(fd, fdt->max_fds) could differ from fd only in case of speculative execution on mispredicted path. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2024-08-01 15:51:57 -04:00
Josef Bacik	1e7bec1f7d	btrfs: emit a warning about space cache v1 being deprecated We've been wanting to get rid of this for a while, add a message to indicate that this feature is going away and when so we can finally have a date when we're going to remove it. The output looks like this BTRFS warning (device nvme0n1): space cache v1 is being deprecated and will be removed in a future release, please use -o space_cache=v2 Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-01 17:30:50 +02:00
Boris Burkov	30479f31d4	btrfs: fix qgroup reserve leaks in cow_file_range In the buffered write path, the dirty page owns the qgroup reserve until it creates an ordered_extent. Therefore, any errors that occur before the ordered_extent is created must free that reservation, or else the space is leaked. The fstest generic/475 exercises various IO error paths, and is able to trigger errors in cow_file_range where we fail to get to allocating the ordered extent. Note that because we do clear delalloc, we are likely to remove the inode from the delalloc list, so the inodes/pages to not have invalidate/launder called on them in the commit abort path. This results in failures at the unmount stage of the test that look like: BTRFS: error (device dm-8 state EA) in cleanup_transaction:2018: errno=-5 IO failure BTRFS: error (device dm-8 state EA) in btrfs_replace_file_extents:2416: errno=-5 IO failure BTRFS warning (device dm-8 state EA): qgroup 0/5 has unreleased space, type 0 rsv 28672 ------------[ cut here ]------------ WARNING: CPU: 3 PID: 22588 at fs/btrfs/disk-io.c:4333 close_ctree+0x222/0x4d0 [btrfs] Modules linked in: btrfs blake2b_generic libcrc32c xor zstd_compress raid6_pq CPU: 3 PID: 22588 Comm: umount Kdump: loaded Tainted: G W 6.10.0-rc7-gab56fde445b8 #21 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:close_ctree+0x222/0x4d0 [btrfs] RSP: 0018:ffffb4465283be00 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffffa1a1818e1000 RCX: 0000000000000001 RDX: 0000000000000000 RSI: ffffb4465283bbe0 RDI: ffffa1a19374fcb8 RBP: ffffa1a1818e13c0 R08: 0000000100028b16 R09: 0000000000000000 R10: 0000000000000003 R11: 0000000000000003 R12: ffffa1a18ad7972c R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f9168312b80(0000) GS:ffffa1a4afcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f91683c9140 CR3: 000000010acaa000 CR4: 00000000000006f0 Call Trace: <TASK> ? close_ctree+0x222/0x4d0 [btrfs] ? __warn.cold+0x8e/0xea ? close_ctree+0x222/0x4d0 [btrfs] ? report_bug+0xff/0x140 ? handle_bug+0x3b/0x70 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? close_ctree+0x222/0x4d0 [btrfs] generic_shutdown_super+0x70/0x160 kill_anon_super+0x11/0x40 btrfs_kill_super+0x11/0x20 [btrfs] deactivate_locked_super+0x2e/0xa0 cleanup_mnt+0xb5/0x150 task_work_run+0x57/0x80 syscall_exit_to_user_mode+0x121/0x130 do_syscall_64+0xab/0x1a0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f916847a887 ---[ end trace 0000000000000000 ]--- BTRFS error (device dm-8 state EA): qgroup reserved space leaked Cases 2 and 3 in the out_reserve path both pertain to this type of leak and must free the reserved qgroup data. Because it is already an error path, I opted not to handle the possible errors in btrfs_free_qgroup_data. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-01 17:26:52 +02:00
Boris Burkov	872617a089	btrfs: implement launder_folio for clearing dirty page reserve In the buffered write path, dirty pages can be said to "own" the qgroup reservation until they create an ordered_extent. It is possible for there to be outstanding dirty pages when a transaction is aborted, in which case there is no cancellation path for freeing this reservation and it is leaked. We do already walk the list of outstanding delalloc inodes in btrfs_destroy_delalloc_inodes() and call invalidate_inode_pages2() on them. This does not call btrfs_invalidate_folio(), as one might guess, but rather calls launder_folio() and release_folio(). Since this is a reservation associated with dirty pages only, rather than something associated with the private bit (ordered_extent is cancelled separately already in the cleanup transaction path), implementing this release should be done via launder_folio. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-01 17:26:40 +02:00
Qu Wenruo	63447b7dd4	btrfs: scrub: update last_physical after scrubbing one stripe Currently sctx->stat.last_physical only got updated in the following cases: - When the last stripe of a non-RAID56 chunk is scrubbed This implies a pitfall, if the last stripe is at the chunk boundary, and we finished the scrub of the whole chunk, we won't update last_physical at all until the next chunk. - When a P/Q stripe of a RAID56 chunk is scrubbed This leads the following two problems: - sctx->stat.last_physical is not updated for a almost full chunk This is especially bad, affecting scrub resume, as the resume would start from last_physical, causing unnecessary re-scrub. - "btrfs scrub status" will not report any progress for a long time Fix the problem by properly updating @last_physical after each stripe is scrubbed. And since we're here, for the sake of consistency, use spin lock to protect the update of @last_physical, just like all the remaining call sites touching sctx->stat. Reported-by: Michel Palleau <michel.palleau@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAMFk-+igFTv2E8svg=cQ6o3e6CrR5QwgQ3Ok9EyRaEvvthpqCQ@mail.gmail.com/ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-01 17:15:07 +02:00
Qu Wenruo	33eb1e5db3	btrfs: factor out stripe length calculation into a helper Currently there are two locations which need to calculate the real length of a stripe (which can be at the end of a chunk, and the chunk size may not always be 64K aligned). Factor them into a helper as we're going to have a third user soon. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-08-01 17:15:05 +02:00
Xiubo Li	31634d7597	ceph: force sending a cap update msg back to MDS for revoke op If a client sends out a cap update dropping caps with the prior 'seq' just before an incoming cap revoke request, then the client may drop the revoke because it believes it's already released the requested capabilities. This causes the MDS to wait indefinitely for the client to respond to the revoke. It's therefore always a good idea to ack the cap revoke request with the bumped up 'seq'. Currently if the cap->issued equals to the newcaps the check_caps() will do nothing, we should force flush the caps. Cc: stable@vger.kernel.org Link: https://tracker.ceph.com/issues/61782 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Venky Shankar <vshankar@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2024-08-01 13:14:28 +02:00
Linus Torvalds	e4fc196f5b	for-6.11-rc1-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmapmOQACgkQxWXV+ddt WDsXVhAAi4X+xt3o4jcN3IAu08JCQAAyXnFWC3lvn7sqYjSrcccI6ZT4/gAbHss+ qrifakRGoYQ7fAjYBmhw48HqPmHtI2OQjIUDaIqHQOS68aXShBo9HiE460HRY4GT QV/KT0w37E2/R0EDR9gyjLq3ZA3/raxN1n+LNCFhRWmtsAEZrk4XzsADWb05YkIq 1QBa92DzEhVpd04X8YHIYBgRidWbcYST6xhoWdyL9VZ1pzZsISq5LH67D4f/J1KU gXNf+ZnF9DXsQnptJrMsjhx61seJ2F0/vozFZ+l6SjRr0jeysmrJI0dxqQc/hUga gbLmdha6ztKdn03JOIL+lfdZYzICFl/2fekSWI2SNcag+TYszACjlFOyHusOgKsa 3qQwzVB699FheWO5nrOOvOtgq0ZqGsrIvhIXLhA7/bVpNavPnUB7IQCcs8n89ImQ hUIebfX1FZnYXTrB6Hhm92LUb0lyLSlW1we3SSmaAMiy1TiXHG7hO2G/sIbOPAJC 5VzdHf0DEjzEdjmTrGOV7JBfy5JmMK56oN8viZS95p70DYxNGvEOhLs/8n5twpri MWV8GElcOjjC+KnGnUH72spsnEKONpdzyccG9kiZEgkEi4csgHSxrkSmAehYD6i6 MFYk+i7jvZ1VsbOulmdGOLbHS7whxi9pWb/CT3KKF1Ei5/v07bU= =JdOX -----END PGP SIGNATURE----- Merge tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix regression in extent map rework when handling insertion of overlapping compressed extent - fix unexpected file length when appending to a file using direct io and buffer not faulted in - in zoned mode, fix accounting of unusable space when flipping read-only block group back to read-write - fix page locking when COWing an inline range, assertion failure found by syzbot - fix calculation of space info in debugging print - tree-checker, add validation of data reference item - fix a few -Wmaybe-uninitialized build warnings * tag 'for-6.11-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry() btrfs: fix corruption after buffer fault in during direct IO append write btrfs: zoned: fix zone_unusable accounting on making block group read-write again btrfs: do not subtract delalloc from avail bytes btrfs: make cow_file_range_inline() honor locked_page on error btrfs: fix corrupt read due to bad offset of a compressed extent map btrfs: tree-checker: validate dref root and objectid	2024-07-30 19:28:36 -07:00
Kent Overstreet	e61dd67860	bcachefs: Fix double free of ca->buckets_nouse Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: `ffcbec6076` ("bcachefs: Kill opts.buckets_nouse") Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-07-30 20:43:29 -04:00
David Sterba	b8e947e9f6	btrfs: initialize location to fix -Wmaybe-uninitialized in btrfs_lookup_dentry() Some arch + compiler combinations report a potentially unused variable location in btrfs_lookup_dentry(). This is a false alert as the variable is passed by value and always valid or there's an error. The compilers cannot probably reason about that although btrfs_inode_by_name() is in the same file. > + /kisskb/src/fs/btrfs/inode.c: error: 'location.objectid' may be used +uninitialized in this function [-Werror=maybe-uninitialized]: => 5603:9 > + /kisskb/src/fs/btrfs/inode.c: error: 'location.type' may be used +uninitialized in this function [-Werror=maybe-uninitialized]: => 5674:5 m68k-gcc8/m68k-allmodconfig mips-gcc8/mips-allmodconfig powerpc-gcc5/powerpc-all{mod,yes}config powerpc-gcc5/ppc64_defconfig Initialize it to zero, this should fix the warnings and won't change the behaviour as btrfs_inode_by_name() accepts only a root or inode item types, otherwise returns an error. Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Link: https://lore.kernel.org/linux-btrfs/bd4e9928-17b3-9257-8ba7-6b7f9bbb639a@linux-m68k.org/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-30 15:33:06 +02:00
Filipe Manana	939b656bc8	btrfs: fix corruption after buffer fault in during direct IO append write During an append (O_APPEND write flag) direct IO write if the input buffer was not previously faulted in, we can corrupt the file in a way that the final size is unexpected and it includes an unexpected hole. The problem happens like this: 1) We have an empty file, with size 0, for example; 2) We do an O_APPEND direct IO with a length of 4096 bytes and the input buffer is not currently faulted in; 3) We enter btrfs_direct_write(), lock the inode and call generic_write_checks(), which calls generic_write_checks_count(), and that function sets the iocb position to 0 with the following code: if (iocb->ki_flags & IOCB_APPEND) iocb->ki_pos = i_size_read(inode); 4) We call btrfs_dio_write() and enter into iomap, which will end up calling btrfs_dio_iomap_begin() and that calls btrfs_get_blocks_direct_write(), where we update the i_size of the inode to 4096 bytes; 5) After btrfs_dio_iomap_begin() returns, iomap will attempt to access the page of the write input buffer (at iomap_dio_bio_iter(), with a call to bio_iov_iter_get_pages()) and fail with -EFAULT, which gets returned to btrfs at btrfs_direct_write() via btrfs_dio_write(); 6) At btrfs_direct_write() we get the -EFAULT error, unlock the inode, fault in the write buffer and then goto to the label 'relock'; 7) We lock again the inode, do all the necessary checks again and call again generic_write_checks(), which calls generic_write_checks_count() again, and there we set the iocb's position to 4K, which is the current i_size of the inode, with the following code pointed above: if (iocb->ki_flags & IOCB_APPEND) iocb->ki_pos = i_size_read(inode); 8) Then we go again to btrfs_dio_write() and enter iomap and the write succeeds, but it wrote to the file range [4K, 8K), leaving a hole in the [0, 4K) range and an i_size of 8K, which goes against the expectations of having the data written to the range [0, 4K) and get an i_size of 4K. Fix this by not unlocking the inode before faulting in the input buffer, in case we get -EFAULT or an incomplete write, and not jumping to the 'relock' label after faulting in the buffer - instead jump to a location immediately before calling iomap, skipping all the write checks and relocking. This solves this problem and it's fine even in case the input buffer is memory mapped to the same file range, since only holding the range locked in the inode's io tree can cause a deadlock, it's safe to keep the inode lock (VFS lock), as was fixed and described in commit `51bd9563b6` ("btrfs: fix deadlock due to page faults during direct IO reads and writes"). A sample reproducer provided by a reporter is the following: $ cat test.c #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <fcntl.h> #include <stdio.h> #include <sys/mman.h> #include <sys/stat.h> #include <unistd.h> int main(int argc, char argv[]) { if (argc < 2) { fprintf(stderr, "Usage: %s <test file>\n", argv[0]); return 1; } int fd = open(argv[1], O_WRONLY \| O_CREAT \| O_TRUNC \| O_DIRECT \| O_APPEND, 0644); if (fd < 0) { perror("creating test file"); return 1; } char buf = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE \| MAP_ANONYMOUS, -1, 0); ssize_t ret = write(fd, buf, 4096); if (ret < 0) { perror("pwritev2"); return 1; } struct stat stbuf; ret = fstat(fd, &stbuf); if (ret < 0) { perror("stat"); return 1; } printf("size: %llu\n", (unsigned long long)stbuf.st_size); return stbuf.st_size == 4096 ? 0 : 1; } A test case for fstests will be sent soon. Reported-by: Hanna Czenczek <hreitz@redhat.com> Link: https://lore.kernel.org/linux-btrfs/0b841d46-12fe-4e64-9abb-871d8d0de271@redhat.com/ Fixes: `8184620ae2` ("btrfs: fix lost file sync on direct IO write with nowait and dsync iocb") CC: stable@vger.kernel.org # 6.1+ Tested-by: Hanna Czenczek <hreitz@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-29 19:21:22 +02:00
Naohiro Aota	8cd44dd1d1	btrfs: zoned: fix zone_unusable accounting on making block group read-write again When btrfs makes a block group read-only, it adds all free regions in the block group to space_info->bytes_readonly. That free space excludes reserved and pinned regions. OTOH, when btrfs makes the block group read-write again, it moves all the unused regions into the block group's zone_unusable. That unused region includes reserved and pinned regions. As a result, it counts too much zone_unusable bytes. Fortunately (or unfortunately), having erroneous zone_unusable does not affect the calculation of space_info->bytes_readonly, because free space (num_bytes in btrfs_dec_block_group_ro) calculation is done based on the erroneous zone_unusable and it reduces the num_bytes just to cancel the error. This behavior can be easily discovered by adding a WARN_ON to check e.g, "bg->pinned > 0" in btrfs_dec_block_group_ro(), and running fstests test case like btrfs/282. Fix it by properly considering pinned and reserved in btrfs_dec_block_group_ro(). Also, add a WARN_ON and introduce btrfs_space_info_update_bytes_zone_unusable() to catch a similar mistake. Fixes: `169e0da91a` ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-29 19:21:19 +02:00
Naohiro Aota	d89c285d28	btrfs: do not subtract delalloc from avail bytes The block group's avail bytes printed when dumping a space info subtract the delalloc_bytes. However, as shown in btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), it is added or subtracted along with "reserved" for the delalloc case, which means the "delalloc_bytes" is a part of the "reserved" bytes. So, excluding it to calculate the avail space counts delalloc_bytes twice, which can lead to an invalid result. Fixes: `e50b122b83` ("btrfs: print available space for a block group when dumping a space info") CC: stable@vger.kernel.org # 6.6+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-29 19:21:04 +02:00
Boris Burkov	478574370b	btrfs: make cow_file_range_inline() honor locked_page on error The btrfs buffered write path runs through __extent_writepage() which has some tricky return value handling for writepage_delalloc(). Specifically, when that returns 1, we exit, but for other return values we continue and end up calling btrfs_folio_end_all_writers(). If the folio has been unlocked (note that we check the PageLocked bit at the start of __extent_writepage()), this results in an assert panic like this one from syzbot: BTRFS: error (device loop0 state EAL) in free_log_tree:3267: errno=-5 IO failure BTRFS warning (device loop0 state EAL): Skipping commit of aborted transaction. BTRFS: error (device loop0 state EAL) in cleanup_transaction:2018: errno=-5 IO failure assertion failed: folio_test_locked(folio), in fs/btrfs/subpage.c:871 ------------[ cut here ]------------ kernel BUG at fs/btrfs/subpage.c:871! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI CPU: 1 PID: 5090 Comm: syz-executor225 Not tainted 6.10.0-syzkaller-05505-gb1bc554e009e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 06/27/2024 RIP: 0010:btrfs_folio_end_all_writers+0x55b/0x610 fs/btrfs/subpage.c:871 Code: e9 d3 fb ff ff e8 25 22 c2 fd 48 c7 c7 c0 3c 0e 8c 48 c7 c6 80 3d 0e 8c 48 c7 c2 60 3c 0e 8c b9 67 03 00 00 e8 66 47 ad 07 90 <0f> 0b e8 6e 45 b0 07 4c 89 ff be 08 00 00 00 e8 21 12 25 fe 4c 89 RSP: 0018:ffffc900033d72e0 EFLAGS: 00010246 RAX: 0000000000000045 RBX: 00fff0000000402c RCX: 663b7a08c50a0a00 RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000 RBP: ffffc900033d73b0 R08: ffffffff8176b98c R09: 1ffff9200067adfc R10: dffffc0000000000 R11: fffff5200067adfd R12: 0000000000000001 R13: dffffc0000000000 R14: 0000000000000000 R15: ffffea0001cbee80 FS: 0000000000000000(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5f076012f8 CR3: 000000000e134000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> __extent_writepage fs/btrfs/extent_io.c:1597 [inline] extent_write_cache_pages fs/btrfs/extent_io.c:2251 [inline] btrfs_writepages+0x14d7/0x2760 fs/btrfs/extent_io.c:2373 do_writepages+0x359/0x870 mm/page-writeback.c:2656 filemap_fdatawrite_wbc+0x125/0x180 mm/filemap.c:397 __filemap_fdatawrite_range mm/filemap.c:430 [inline] __filemap_fdatawrite mm/filemap.c:436 [inline] filemap_flush+0xdf/0x130 mm/filemap.c:463 btrfs_release_file+0x117/0x130 fs/btrfs/file.c:1547 __fput+0x24a/0x8a0 fs/file_table.c:422 task_work_run+0x24f/0x310 kernel/task_work.c:222 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0xa2f/0x27f0 kernel/exit.c:877 do_group_exit+0x207/0x2c0 kernel/exit.c:1026 __do_sys_exit_group kernel/exit.c:1037 [inline] __se_sys_exit_group kernel/exit.c:1035 [inline] __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1035 x64_sys_call+0x2634/0x2640 arch/x86/include/generated/asm/syscalls_64.h:232 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f5f075b70c9 Code: Unable to access opcode bytes at 0x7f5f075b709f. I was hitting the same issue by doing hundreds of accelerated runs of generic/475, which also hits IO errors by design. I instrumented that reproducer with bpftrace and found that the undesirable folio_unlock was coming from the following callstack: folio_unlock+5 __process_pages_contig+475 cow_file_range_inline.constprop.0+230 cow_file_range+803 btrfs_run_delalloc_range+566 writepage_delalloc+332 __extent_writepage # inlined in my stacktrace, but I added it here extent_write_cache_pages+622 Looking at the bisected-to patch in the syzbot report, Josef realized that the logic of the cow_file_range_inline error path subtly changing. In the past, on error, it jumped to out_unlock in cow_file_range(), which honors the locked_page, so when we ultimately call folio_end_all_writers(), the folio of interest is still locked. After the change, we always unlocked ignoring the locked_page, on both success and error. On the success path, this all results in returning 1 to __extent_writepage(), which skips the folio_end_all_writers() call, which makes it OK to have unlocked. Fix the bug by wiring the locked_page into cow_file_range_inline() and only setting locked_page to NULL on success. Reported-by: syzbot+a14d8ac9af3a2a4fd0c8@syzkaller.appspotmail.com Fixes: `0586d0a89e` ("btrfs: move extent bit and page cleanup into cow_file_range_inline") CC: stable@vger.kernel.org # 6.10+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-29 19:20:51 +02:00
Chen Ni	7bf888fa26	xfs: convert comma to semicolon Replace a comma between expression statements by a semicolon. Fixes: `178b48d588` ("xfs: remove the for_each_xbitmap_ helpers") Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:34:18 +05:30
Chen Ni	8c2263b923	xfs: convert comma to semicolon Replace a comma between expression statements by a semicolon. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Fixes: `8f4b980ee6` ("xfs: pass the attr value to put_listent when possible") Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:32:53 +05:30
Julian Sun	af5d92f2fa	xfs: remove unused parameter in macro XFS_DQUOT_LOGRES In the macro definition of XFS_DQUOT_LOGRES, a parameter is accepted, but it is not used. Hence, it should be removed. This patch has only passed compilation test, but it should be fine. Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:29:31 +05:30
Darrick J. Wong	19ebc8f84e	xfs: fix file_path handling in tracepoints Since file_path() takes the output buffer as one of its arguments, we might as well have it format directly into the tracepoint's char array instead of wasting stack space. Fixes: `3934e8ebb7` ("xfs: create a big array data structure") Fixes: `5076a6040c` ("xfs: support in-memory buffer cache targets") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202403290419.HPcyvqZu-lkp@intel.com/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:27:23 +05:30
Eric Sandeen	39c1ddb064	xfs: allow SECURE namespace xattrs to use reserved block pool We got a report from the podman folks that selinux relabels that happen as part of their process were returning ENOSPC when the filesystem is completely full. This is because xattr changes reserve about 15 blocks for the worst case, but the common case is for selinux contexts to be the sole, in-inode xattr and consume no blocks. We already allow reserved space consumption for XFS_ATTR_ROOT for things such as ACLs, and SECURE namespace attributes are not so very different, so allow them to use the reserved space as well. Code-comment-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> V2: Remove local variable, add comment. V3: Add Dave's preferred comment V4: Spelling and comment beautification	2024-07-29 09:26:20 +05:30
Darrick J. Wong	80d3d33cdf	xfs: fix a memory leak kmemleak reported that we don't free the parent pointer names here if we found corruption. Fixes: `0d29a20fbd` ("xfs: scrub parent pointers") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2024-07-29 09:25:01 +05:30
Linus Torvalds	cb04e8b1d2	minmax: don't use max() in situations that want a C constant expression We only had a couple of array[] declarations, and changing them to just use 'MAX()' instead of 'max()' fixes the issue. This will allow us to simplify our min/max macros enormously, since they can now unconditionally use temporary variables to avoid using the argument values multiple times. Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-07-28 20:23:27 -07:00
Linus Torvalds	7e2d0ba732	This pull request contains updates (actually, just fixes) for UBI and UBIFS: - Many fixes for power-cut issues by Zhihao Cheng - Another ubiblock error path fix - ubiblock section mismatch fix - Misc fixes all over the place -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmamiikWHHJpY2hhcmRA c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wTy6EACEj6VcTRei5Ndt2QXNiKHNgJ3j yV0QmLWUF5brfXbUsoP9oeKL/Y9s+02maDUkOJoHGcHRClLyHkIMMHXYy5gpXyiI h4zTcx3x33+oddYJeqdZYDDpWwk8RsqE1UjB8i7q+DF7vznWqYl3/+G7coH8ZD8M +//3AzWjD/U9Gj3KVuyBgok0OmnwFJ36w3Ckoj7NKxbFQrCmxa7qtLqseuiHumhu HjAMPS1V+Hmw6qw/s+hs6NdBhOvmO56s1wwT26mji+kQVMuh6tG3YNo2tyM7AHSv cGJDVGuDmO3F6iKqom3edjxHetWmCN3gvCkacpE6G/9Q1x3XO7vWhPfl6JHv+KTu zsc0WDhmNKkxUIHgBKvFGCylo6Ui4rIn5iFihXBiGTjB52tBqPU+0STDfcKcqz6U BrgCiKdAuX7tZT00tuFJE9eWbZvPVWQii0aNw5uLi9GwXOFmLP/VAi+FDQ0hsNSF XQxgWAd9b7eLmyWUrn6ORj2+FIzjmF6oxBfUOHgE4J1GPggIDXii9NDJcma3XSTn OcqpCJe5SOc38M7ldwbZup53fl+DFvsbnbDofrGCvhSTCqAyGx5su3m40YsIVRvI fpCk6L/6vGp/ONw90Vq/KKeIkAJaHnqi6KbnfyDtF6LSwW7cb4+Cw69BWvpP0gQb m1hTm4vIaPvD25840g== =0y6/ -----END PGP SIGNATURE----- Merge tag 'ubifs-for-linus-6.11-rc1-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs Pull UBI and UBIFS updates from Richard Weinberger: - Many fixes for power-cut issues by Zhihao Cheng - Another ubiblock error path fix - ubiblock section mismatch fix - Misc fixes all over the place * tag 'ubifs-for-linus-6.11-rc1-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs: ubi: Fix ubi_init() ubiblock_exit() section mismatch ubifs: add check for crypto_shash_tfm_digest ubifs: Fix inconsistent inode size when powercut happens during appendant writing ubi: block: fix null-pointer-dereference in ubiblock_create() ubifs: fix kernel-doc warnings ubifs: correct UBIFS_DFS_DIR_LEN macro definition and improve code clarity mtd: ubi: Restore missing cleanup on ubi_init() failure path ubifs: dbg_orphan_check: Fix missed key type checking ubifs: Fix unattached inode when powercut happens in creating ubifs: Fix space leak when powercut happens in linking tmpfile ubifs: Move ui->data initialization after initializing security ubifs: Fix adding orphan entry twice for the same inode ubifs: Remove insert_dead_orphan from replaying orphan process Revert "ubifs: ubifs_symlink: Fix memleak of inode->i_link in error path" ubifs: Don't add xattr inode into orphan area ubifs: Fix unattached xattr inode if powercut happens after deleting mtd: ubi: avoid expensive do_div() on 32-bit machines mtd: ubi: make ubi_class constant ubi: eba: properly rollback inside self_check_eba	2024-07-28 11:51:51 -07:00
Linus Torvalds	7b5d481889	Two small fixes to silent the compiler and static analyzers tools from Ben Dooks and Jeff Johnson. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRIEdicmeMNCZKVCdo6u2Upsdk6RAUCZqA9+gAKCRA6u2Upsdk6 RJETAQDN9OkX2GJlekEo5NPVD531ekV4G7OZMWTrmPKRINClZQEAj9Spt2zP5v4V 413unRBro9nuKfGgTaquXoHlCuPE+wE= =L/SP -----END PGP SIGNATURE----- Merge tag 'unicode-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode Pull unicode update from Gabriel Krisman Bertazi: "Two small fixes to silence the compiler and static analyzers tools from Ben Dooks and Jeff Johnson" * tag 'unicode-next-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/krisman/unicode: unicode: add MODULE_DESCRIPTION() macros unicode: make utf8 test count static	2024-07-28 09:14:11 -07:00
Linus Torvalds	5437f30d34	six smb3 client fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmalhJwACgkQiiy9cAdy T1GbRgv+NPJ07ZtG7D4EosxCHiBETQS9oezS1Ulbv78YdEBHfP/9T+pYcCh+3qZC Sa2HQlB1y3lLZNrhYQrVtyECtVcsdeUloXf6IIczBMAtCeS7FZ0+U8B07+9vJHGz 9p0paXOkRbOQ2JtYevsRN41Q0HxjvWqHSet/Y2tM8cj0M3yjCPHvJCFv3OC9ZUTV AyZZdYFoDFIYmW75459wq/80IADXhkSIsH/8IStTpshVhJbVdyGpr8FTrtW7G0m7 prYKEzXtgdvzM1CVlfR9boyf5HqUDvcHuV0ZBFjBOx7A3kXiShdRh7PFmDaY1vqX o3qgmmjTntX9aRR3zL9GYuayGD8XsXFPotWbuGniKLraX5WJNXe3o8OKybXgivoY OEXnkmlyp4GcggmWZpPCqq7J5J+YcLQImCKXxfQI7HjToI9cy7aNZ6qh9g0LIQBm 9totZcp5AMGk9Sbdf+MUeJ3cx8+3o26kc8a5MCV6fCPt/x7XNKG33ZRd5lne6rxr WX4neGG4 =nzTc -----END PGP SIGNATURE----- Merge tag '6.11-rc-smb-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 Pull more smb client updates from Steve French: - fix for potential null pointer use in init cifs - additional dynamic trace points to improve debugging of some common scenarios - two SMB1 fixes (one addressing reconnect with POSIX extensions, one a mount parsing error) * tag '6.11-rc-smb-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: smb3: add dynamic trace point for session setup key expired failures smb3: add four dynamic tracepoints for copy_file_range and reflink smb3: add dynamic tracepoint for reflink errors cifs: mount with "unix" mount option for SMB1 incorrectly handled cifs: fix reconnect with SMB1 UNIX Extensions cifs: fix potential null pointer use in destroy_workqueue in init_cifs error path	2024-07-27 20:08:07 -07:00
Linus Torvalds	bc4eee85ca	vfs-6.11-rc1.fixes.3 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZqSoSAAKCRCRxhvAZXjc omGDAP9g7+3hDDFvDfYm3cjSw5CcbYodaS0cjzSZV9FQ8jSLUgEAv9741X4c5og1 u8hjOnkYXVJPKn+DzfwBza0wV8qtVQM= =YIoY -----END PGP SIGNATURE----- Merge tag 'vfs-6.11-rc1.fixes.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains two fixes for this merge window: VFS: - I noticed that it is possible for a privileged user to mount most filesystems with a non-initial user namespace in sb->s_user_ns. When fsopen() is called in a non-init namespace the caller's namespace is recorded in fs_context->user_ns. If the returned file descriptor is then passed to a process privileged in init_user_ns, that process can call fsconfig(fd_fs, FSCONFIG_CMD_CREATE), creating a new superblock with sb->s_user_ns set to the namespace of the process which called fsopen(). This is problematic as only filesystems that raise FS_USERNS_MOUNT are known to be able to support a non-initial s_user_ns. Others may suffer security issues, on-disk corruption or outright crash the kernel. Prevent that by restricting such delegation to filesystems that allow FS_USERNS_MOUNT. Note, that this delegation requires a privileged process to actually create the superblock so either the privileged process is cooperaing or someone must have tricked a privileged process into operating on a fscontext file descriptor whose origin it doesn't know (a stupid idea). The bug dates back to about 5 years afaict. Misc: - Fix hostfs parsing when the mount request comes in via the legacy mount api. In the legacy mount api hostfs allows to specify the host directory mount without any key. Restore that behavior" tag 'vfs-6.11-rc1.fixes.3' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: hostfs: fix the host directory parse when mounting. fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT	2024-07-27 15:11:59 -07:00
Linus Torvalds	7b0acd911c	11 hotfixes, 7 of which are cc:stable. 7 are MM, 4 are other. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZqQWWQAKCRDdBJ7gKXxA jqJVAP9vU9HNzIyKDOOqoNHKMI+VzGn39w1FihWjG6AU5a+9NQD+MZJwr7bBwkpH ii43HLUGvNRQtsldBZSRypsaitCSwAI= =HGce -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2024-07-26-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "11 hotfixes, 7 of which are cc:stable. 7 are MM, 4 are other" * tag 'mm-hotfixes-stable-2024-07-26-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: nilfs2: handle inconsistent state in nilfs_btnode_create_block() selftests/mm: skip test for non-LPA2 and non-LVA systems mm/page_alloc: fix pcp->count race between drain_pages_zone() vs __rmqueue_pcplist() mm: memcg: add cacheline padding after lruvec in mem_cgroup_per_node alloc_tag: outline and export free_reserved_page() decompress_bunzip2: fix rare decompression failure mm/huge_memory: avoid PMD-size page cache if needed mm: huge_memory: use !CONFIG_64BIT to relax huge page alignment on 32 bit machines mm: fix old/young bit handling in the faulting path dt-bindings: arm: update James Clark's email address MAINTAINERS: mailmap: update James Clark's email address	2024-07-27 10:26:41 -07:00
Hongbo Li	ef9ca17ca4	hostfs: fix the host directory parse when mounting. hostfs not keep the host directory when mounting. When the host directory is none (default), fc->source is used as the host root directory, and this is wrong. Here we use `parse_monolithic` to handle the old mount path for parsing the root directory. For new mount path, The `parse_param` is used for the host directory parse. Reported-and-tested-by: Maciej Żenczykowski <maze@google.com> Fixes: `cd140ce9f6` ("hostfs: convert hostfs to use the new mount API") Link: https://lore.kernel.org/all/CANP3RGceNzwdb7w=vPf5=7BCid5HVQDmz1K5kC9JG42+HVAh_g@mail.gmail.com/ Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Link: https://lore.kernel.org/r/20240725065130.1821964-1-lihongbo22@huawei.com [brauner: minor fixes] Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-07-27 09:56:33 +02:00
Seth Forshee (DigitalOcean)	e1c5ae59c0	fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT Christian noticed that it is possible for a privileged user to mount most filesystems with a non-initial user namespace in sb->s_user_ns. When fsopen() is called in a non-init namespace the caller's namespace is recorded in fs_context->user_ns. If the returned file descriptor is then passed to a process priviliged in init_user_ns, that process can call fsconfig(fd_fs, FSCONFIG_CMD_CREATE), creating a new superblock with sb->s_user_ns set to the namespace of the process which called fsopen(). This is problematic. We cannot assume that any filesystem which does not set FS_USERNS_MOUNT has been written with a non-initial s_user_ns in mind, increasing the risk for bugs and security issues. Prevent this by returning EPERM from sget_fc() when FS_USERNS_MOUNT is not set for the filesystem and a non-initial user namespace will be used. sget() does not need to be updated as it always uses the user namespace of the current context, or the initial user namespace if SB_SUBMOUNT is set. Fixes: `cb50b348c7` ("convenience helpers: vfs_get_super() and sget_fc()") Reported-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Link: https://lore.kernel.org/r/20240724-s_user_ns-fix-v1-1-895d07c94701@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-07-27 09:56:33 +02:00
Ryusuke Konishi	4811f7af60	nilfs2: handle inconsistent state in nilfs_btnode_create_block() Syzbot reported that a buffer state inconsistency was detected in nilfs_btnode_create_block(), triggering a kernel bug. It is not appropriate to treat this inconsistency as a bug; it can occur if the argument block address (the buffer index of the newly created block) is a virtual block number and has been reallocated due to corruption of the bitmap used to manage its allocation state. So, modify nilfs_btnode_create_block() and its callers to treat it as a possible filesystem error, rather than triggering a kernel bug. Link: https://lkml.kernel.org/r/20240725052007.4562-1-konishi.ryusuke@gmail.com Fixes: `a60be987d4` ("nilfs2: B-tree node cache") Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+89cc4f2324ed37988b60@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=89cc4f2324ed37988b60 Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-07-26 14:33:10 -07:00
Steve French	b6f6a7aa68	smb3: add dynamic trace point for session setup key expired failures There are cases where services need to remount (or change their credentials files) when keys have expired, but it can be helpful to have a dynamic trace point to make it easier to notify the service to refresh the storage account key. Here is sample output, one from mount with bad password, one from a reconnect where the password has been changed or expired and reconnect fails (requiring remount with new storage account key) TASK-PID CPU# \|\|\|\|\| TIMESTAMP FUNCTION \| \| \| \|\|\|\|\| \| \| mount.cifs-11362 [000] ..... 6000.241620: smb3_key_expired: rc=-13 user=testpassu conn_id=0x2 server=localhost addr=127.0.0.1:445 kworker/4:0-8458 [004] ..... 6044.892283: smb3_key_expired: rc=-13 user=testpassu conn_id=0x3 server=localhost addr=127.0.0.1:445 Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-07-26 12:34:50 -05:00
Linus Torvalds	6467dfdfc9	A small patchset to address bogus I/O errors and ultimately an assertion failure in the face of watch errors with -o exclusive mappings in RBD marked for stable and some assorted CephFS fixes. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmajtIUTHGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi1fwB/4/CsFZLQC+ybWSqoZVYD01qND0muol 44Nr6NyKdW302jQhAXchecB6s6L+N0azhVlAKKB8sO9XCifKA8RzuW75Y0+8z78B pgB6K7ZOzAIuPIG9mmNbutUHEd24CzGXNA28lEQkrNT8D6UZTENXQRJb1dS2GzQ7 T2PyjFFoyF0z1bDZ85URHxeyMetEe/TzWUlG1P2QI98V+RP8nK+mGYmdXNGKhH87 Ltf2pxjsiomtoH4QRm4LX7vwOUY1Ljf4HpSS1p+c5Fova2UTtTDbVfTFbh+ZjuQV hlh1ypNLM+igifu3nVeJ/Ga2f71zVFM66tnmpjcY3DxZAp70e1W2HMFD =Zfcy -----END PGP SIGNATURE----- Merge tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: "A small patchset to address bogus I/O errors and ultimately an assertion failure in the face of watch errors with -o exclusive mappings in RBD marked for stable and some assorted CephFS fixes" * tag 'ceph-for-6.11-rc1' of https://github.com/ceph/ceph-client: rbd: don't assume rbd_is_lock_owner() for exclusive mappings rbd: don't assume RBD_LOCK_STATE_LOCKED for exclusive mappings rbd: rename RBD_LOCK_STATE_RELEASING and releasing_wait ceph: fix incorrect kmalloc size of pagevec mempool ceph: periodically flush the cap releases ceph: convert comma to semicolon in __ceph_dentry_dir_lease_touch() ceph: use cap_wait_list only if debugfs is enabled	2024-07-26 10:34:42 -07:00
Steve French	6629f87b97	smb3: add four dynamic tracepoints for copy_file_range and reflink Add more dynamic tracepoints to help debug copy_file_range (copychunk) and clone_range ("duplicate extents"). These are tracepoints for entering the function and completing without error. For example: "trace-cmd record -e smb3_copychunk_enter -e smb3_copychunk_done" or "trace-cmd record -e smb3_clone_enter -e smb3_clone_done" Here is sample output: TASK-PID CPU# \|\|\|\|\| TIMESTAMP FUNCTION \| \| \| \|\|\|\|\| \| \| cp-5964 [005] ..... 2176.168977: smb3_clone_enter: xid=17 sid=0xeb275be4 tid=0x7ffa7cdb source fid=0x1ed02e15 source offset=0x0 target fid=0x1ed02e15 target offset=0x0 len=0xa0000 cp-5964 [005] ..... 2176.170668: smb3_clone_done: xid=17 sid=0xeb275be4 tid=0x7ffa7cdb source fid=0x1ed02e15 source offset=0x0 target fid=0x1ed02e15 target offset=0x0 len=0xa0000 Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-07-26 12:34:41 -05:00
Steve French	5779d398db	smb3: add dynamic tracepoint for reflink errors There are cases where debugging clone_range ("smb2_duplicate_extents" function) and in the future copy_range ("smb2_copychunk_range") can be helpful. Add dynamic trace points for any errors in clone, and a followon patch will add them for copychunk. "trace-cmd record -e smb3_clone_err" Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2024-07-26 12:34:19 -05:00
Linus Torvalds	732c275394	Changes since last update: - Support STATX_DIOALIGN and FS_IOC_GETFSSYSFSPATH; - Fix a race of LZ4 decompression due to recent refactoring; - Another multi-page folio adaption in erofs_bread(). -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmajmNoRHHhpYW5nQGtl cm5lbC5vcmcACgkQUXZn5Zlu5qra3Q//a4WoV2Tl9LzmENrJDaAHtX4GC84bOr5s ZPZlf5xcJQD9ZeWqfheDArw5KdMiua3ccsci80TMnsz8xg37YrpNlHcLII48mSvk pr6VLYDxo8un6w/NHK7gJilOGpp027H8PfuLHK0lysWypO9z30XgvMicg7tbO+85 RfFatCTiYceH6+GwcRsg2cxyYe00a2Nwu8N8pspXrIs8IEEQ7oZPnkTnzWI+q1EZ rnwt0iwFlXwsIg4IjpMGVjIaky5L2ge7tAG/Z3abkL+7UC5meAIbiA3XWD+9GtJL uAngzWT5y+vnexdY0aBlxmTrAi/J5utxMn/HQ+r8sEXwFnAWeGvC/dv+Cv+LCWw0 +n8a1T/Y/xlj2pa5KJ38xsRes7Uj6BNQ52XLBrEplT9yc3XT8LHyeahYnI80mlSg G57xTFJfVUwrXagTGBABU3jcWTAsXwzYlgPJwtu0aRFGq1RS7yy4V6tS/CtGaFK+ DkAKYR/GUAvaUkYbWnm/Q2355bWPZcyhZErY+qzxe8S/vAw5PStyQyLOPTFxqYUn EsWrOjkY5yXZjWs8Ft/B9HH9SniXhh38GpnkCuA1BfsJUC8qNBq5KaDLZzGqkcDR atyZ317+L7m8Tky1JrYtIAlebyY23a/oXDZlcfZH703ZKvZhKygl44pbB4G1HDaO oN9fN8aHtWI= =GKWx -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.11-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull more erofs updates from Gao Xiang: - Support STATX_DIOALIGN and FS_IOC_GETFSSYSFSPATH - Fix a race of LZ4 decompression due to recent refactoring - Another multi-page folio adaption in erofs_bread() * tag 'erofs-for-6.11-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: convert comma to semicolon erofs: support multi-page folios for erofs_bread() erofs: add support for FS_IOC_GETFSSYSFSPATH erofs: fix race in z_erofs_get_gbuf() erofs: support STATX_DIOALIGN	2024-07-26 10:31:03 -07:00
Chen Ni	14e9283fb2	erofs: convert comma to semicolon Replace a comma between expression statements by a semicolon. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Link: https://lore.kernel.org/r/20240724020721.2389738-1-nichen@iscas.ac.cn Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-07-26 18:48:12 +08:00
Gao Xiang	5d3bb77e5f	erofs: support multi-page folios for erofs_bread() If the requested page is part of the previous multi-page folio, there is no need to call read_mapping_folio() again. Also, get rid of the remaining one of page->index [1] in our codebase. [1] https://lore.kernel.org/r/Zp8fgUSIBGQ1TN0D@casper.infradead.org Cc: Matthew Wilcox <willy@infradead.org> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240723073024.875290-1-hsiangkao@linux.alibaba.com	2024-07-26 18:47:57 +08:00
Huang Xiaojia	684b290abc	erofs: add support for FS_IOC_GETFSSYSFSPATH FS_IOC_GETFSSYSFSPATH ioctl exposes /sys/fs path of a given filesystem, potentially standarizing sysfs reporting. This patch add support for FS_IOC_GETFSSYSFSPATH for erofs, "erofs/<dev>" will be outputted for bdev cases, "erofs/[domain_id,]<fs_id>" will be outputted for fscache cases. Signed-off-by: Huang Xiaojia <huangxiaojia2@huawei.com> Link: https://lore.kernel.org/r/20240720082335.441563-1-huangxiaojia2@huawei.com Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>	2024-07-26 18:47:46 +08:00
Gao Xiang	7dc5537c3f	erofs: fix race in z_erofs_get_gbuf() In z_erofs_get_gbuf(), the current task may be migrated to another CPU between `z_erofs_gbuf_id()` and `spin_lock(&gbuf->lock)`. Therefore, z_erofs_put_gbuf() will trigger the following issue which was found by stress test: <2>[772156.434168] kernel BUG at fs/erofs/zutil.c:58! .. <4>[772156.435007] <4>[772156.439237] CPU: 0 PID: 3078 Comm: stress Kdump: loaded Tainted: G E 6.10.0-rc7+ #2 <4>[772156.439239] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 1.0.0 01/01/2017 <4>[772156.439241] pstate: 83400005 (Nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--) <4>[772156.439243] pc : z_erofs_put_gbuf+0x64/0x70 [erofs] <4>[772156.439252] lr : z_erofs_lz4_decompress+0x600/0x6a0 [erofs] .. <6>[772156.445958] stress (3127): drop_caches: 1 <4>[772156.446120] Call trace: <4>[772156.446121] z_erofs_put_gbuf+0x64/0x70 [erofs] <4>[772156.446761] z_erofs_lz4_decompress+0x600/0x6a0 [erofs] <4>[772156.446897] z_erofs_decompress_queue+0x740/0xa10 [erofs] <4>[772156.447036] z_erofs_runqueue+0x428/0x8c0 [erofs] <4>[772156.447160] z_erofs_readahead+0x224/0x390 [erofs] .. Fixes: `f36f3010f6` ("erofs: rename per-CPU buffers to global buffer pool and make it configurable") Cc: <stable@vger.kernel.org> # 6.10+ Reviewed-by: Chunhai Guo <guochunhai@vivo.com> Reviewed-by: Sandeep Dhavale <dhavale@google.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240722035110.3456740-1-hsiangkao@linux.alibaba.com	2024-07-26 18:47:33 +08:00
Hongbo Li	9c421ef3f6	erofs: support STATX_DIOALIGN Add support for STATX_DIOALIGN to EROFS, so that direct I/O alignment restrictions are exposed to userspace in a generic way. [Before] ``` ./statx_test /mnt/erofs/testfile statx(/mnt/erofs/testfile) = 0 dio mem align:0 dio offset align:0 ``` [After] ``` ./statx_test /mnt/erofs/testfile statx(/mnt/erofs/testfile) = 0 dio mem align:512 dio offset align:512 ``` Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20240718083243.2485437-1-hsiangkao@linux.alibaba.com	2024-07-26 18:47:22 +08:00
Filipe Manana	de9f46cb00	btrfs: fix corrupt read due to bad offset of a compressed extent map If we attempt to insert a compressed extent map that has a range that overlaps another extent map we have in the inode's extent map tree, we can end up with an incorrect offset after adjusting the new extent map at merge_extent_mapping() because we don't update the extent map's offset. For example consider the following scenario: 1) We have a file extent item for a compressed extent covering the file range [108K, 144K) and currently there's no corresponding extent map in the inode's extent map tree; 2) The inode's size is 141K; 3) We have an encoded write (compressed) into the file range [120K, 128K), which overlaps the existing file extent item. The encoded write creates a matching extent map, adds it to the inode's extent map tree and creates an ordered extent for it. Note that the corresponding file extent item is added to the subvolume tree only when the ordered extent completes (when executing btrfs_finish_one_ordered()); 4) We have a write into the file range [160K, 164K). This writes increases the i_size of the file, and there's a hole between the current i_size (141K) and the start offset of this write, and since the old i_size is in the middle of the block [140K, 144K), we have to write zeroes to the range [141K, 144K) (3072 bytes) and therefore dirty that page. We then call btrfs_set_extent_delalloc() with a start offset of 140K. We then end up at btrfs_find_new_delalloc_bytes() which will call btrfs_get_extent() for the range [140K, 144K); 5) The btrfs_get_extent() doesn't find any extent map in the inode's extent map tree covering the range [140K, 144K), so it searches the subvolume tree for any file extent items covering that range. There it finds the file extent item for the range [108K, 144K), creates a compressed extent map for that range and then calls btrfs_add_extent_mapping() with that extent map and passes the range [140K, 144K) via the "start" and "len" parameters; 6) The call to add_extent_mapping() done by btrfs_add_extent_mapping() fails with -EEXIST because there's an extent map, created at step 2 for the [120K, 128K) range, that covers that overlaps with the range of the given extent map ([108K, 144K)). Then it does a lookup for extent map from step 2 add calls merge_extent_mapping() to adjust the input extent map ([108K, 144K)). That adjust the extent map to a start offset of 128K and a length of 16K (starting just after the extent map from step 2), but it does not update the offset field of the extent map, leaving it with a value of zero instead of updating to a value of 20K (128K - 108K = 20K). As a result any read for the range [128K, 144K) can return incorrect data since we read from a wrong section of the extent (unless both the correct and incorrect ranges happen to have the same data). So fix this by changing merge_extent_mapping() to update the extent map's offset even if it's compressed. Also add a test case to the self tests. This didn't happen before the patchset that does big changes in the extent map structure (which includes the commit in the Fixes tag below) because we kept track of the original start offset in the extent map (member "orig_start") so we could always calculate the correct offset by subtracting that offset from the start offset. A test case for fstests that triggered this problem using send/receive with compressed writes will be added soon. Fixes: `3d2ac99224` ("btrfs: introduce new members for extent_map") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-25 23:54:06 +02:00
Qu Wenruo	f333a3c7e8	btrfs: tree-checker: validate dref root and objectid [CORRUPTION] There is a bug report that btrfs flips RO due to a corruption in the extent tree, the involved dumps looks like this: item 188 key (402811572224 168 4096) itemoff 14598 itemsize 79 extent refs 3 gen 3678544 flags 1 ref#0: extent data backref root 13835058055282163977 objectid 281473384125923 offset 81432576 count 1 ref#1: shared data backref parent 1947073626112 count 1 ref#2: shared data backref parent 1156030103552 count 1 BTRFS critical (device vdc1: state EA): unable to find ref byte nr 402811572224 parent 0 root 265 owner 28703026 offset 81432576 slot 189 BTRFS error (device vdc1: state EA): failed to run delayed ref for logical 402811572224 num_bytes 4096 type 178 action 2 ref_mod 1: -2 [CAUSE] The corrupted entry is ref#0 of item 188. The root number 13835058055282163977 is beyond the upper limit for root items (the current limit is 1 << 48), and the objectid also looks suspicious. Only the offset and count is correct. [ENHANCEMENT] Although it's still unknown why we have such many bytes corrupted randomly, we can still enhance the tree-checker for data backrefs by: - Validate the root value For now there should only be 3 types of roots can have data backref: * subvolume trees * data reloc trees * root tree Only for v1 space cache - validate the objectid value The objectid should be a valid inode number. Hopefully we can catch such problem in the future with the new checkers. Reported-by: Kai Krakow <hurikhan77@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAMthOuPjg5RDT-G_LXeBBUUtzt3cq=JywF+D1_h+JYxe=WKp-Q@mail.gmail.com/#t Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2024-07-25 23:54:06 +02:00
Linus Torvalds	b485625078	sysctl: treewide: constify the ctl_table argument of proc_handlers Summary - const qualify struct ctl_table args in proc_handlers: This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEErkcJVyXmMSXOyyeQupfNUreWQU8FAmahVOcACgkQupfNUreW QU8QdgwAiDZMEQ6psvDfLGQlOvHyp21exZFC/i1OCrCq4BmTu83iYcojIKSaZ0q6 pfR9eYTVhNfAGsV2hQJvvUQPSVNESB/A7Z2zGLVQmZcOerAN17L2C7uA1SkbRBel P3yoMX6wruVCnBgQTcjktoBgpDqaOiqLTSh8Ko512qYOClLf22sCbbk+doiOwMF6 zbzh9bRo+qTuTgW30gwzRNml/lA1/XylSsM/cEFCxsc8k22lCneP+iWKDAfWtsr6 0GL+SvbBTyIEMFFvSw+0MUW30QctxaCYA1pBmn8nmc6eJ2yDly75/r8EWRr7zqDW SrRiXYDAzNcW3tiOdC2FW/J95VX62DiSQporVZ/OvahBKpQKGaLLNjx1EE3qGKIk ggnYiDmIO88ZBHoRbDDK9EdGGqrIZfRx9JeFD3bMOrwT3Ffxw30rwcYpzlOyjFz3 TIUeNthGXoQNRxZfTE+iybotKWcXnxCNe4U1Mvc2liY6+Z0jazDWfs+jV7CiuX1G Xk8jpTCR =JMg4 -----END PGP SIGNATURE----- Merge tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl Pull sysctl constification from Joel Granados: "Treewide constification of the ctl_table argument of proc_handlers using a coccinelle script and some manual code formatting fixups. This is a prerequisite to moving the static ctl_table structs into read-only data section which will ensure that proc_handler function pointers cannot be modified" * tag 'constfy-sysctl-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl: sysctl: treewide: constify the ctl_table argument of proc_handlers	2024-07-25 12:58:36 -07:00
Linus Torvalds	f9bcc61ad1	This pull request contains the following changes for UML: - Support for preemption - i386 Rust support - Huge cleanup by Benjamin Berg - UBSAN support - Removal of dead code -----BEGIN PGP SIGNATURE----- iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAmahIpkWHHJpY2hhcmRA c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wW4PD/wN03iDGNTPhegGgXTJTSwA8Gwk i5JTEmhc84ifE9/bJpru8w4mcLMiWLWFIpF4bGcqfKLp67tTi3jn9Vk7ivaYkn2G S875GqjdyqMVMfhJX+1qTxM6q/J5B7XGUpt1Zrot3AY1ANxnlwYscWX8jNvwmf+5 eCK9+xldkNWh1N67EjwsDgH6kkWyx3fcEe4E3gjXY0eSZtIwO/ZXYHSCSKznJOfu iXo1Sx02w8TZp4tf/EwpWR1SMkPL23X8Of+rmiyI5udyLZixTnrFlclu8WUK4ZBO ExYvOrzyYZ3E/mPFZf0E88h8xC3ETLsiHO3++JRAM1uDMp1+a6tPK7Bi6NTytemH PIT++XRiORAbXu3aSTjpFDAhTHIMZ925eJMvQAtVhtAAwbkjSNh9NbusbMiucPNm vvtYrEqYjPJpx+HRxy8kUywe/+jFLYofSDn6YrNRM+3HaM44YgkvbD6AOEMxWq19 YWkflmkDADez6eti03bAbiVuBB1v+Vnuz15ofrx45IUubb3uGVJYwEqQA5u8bAVr H4NeIWDRpXOuYLgSyxRLFFVYhe6eAWbAXeSWBFxcGNDY6OBpMqr7kgV1mBOZtooK 8aBgZ0YcyiTpmiEevskkNWSBnqUMKIdztKkD7Db9HfCgd9yy7Vvfl+iLJTIqFJ5m JxpvTy3it53ghQj40A== =ybqq -----END PGP SIGNATURE----- Merge tag 'uml-for-linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux Pull UML updates from Richard Weinberger: - Support for preemption - i386 Rust support - Huge cleanup by Benjamin Berg - UBSAN support - Removal of dead code * tag 'uml-for-linus-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/uml/linux: (41 commits) um: vector: always reset vp->opened um: vector: remove vp->lock um: register power-off handler um: line: always fill *error_out in setup_one_line() um: remove pcap driver from documentation um: Enable preemption in UML um: refactor TLB update handling um: simplify and consolidate TLB updates um: remove force_flush_all from fork_handler um: Do not flush MM in flush_thread um: Delay flushing syscalls until the thread is restarted um: remove copy_context_skas0 um: remove LDT support um: compress memory related stub syscalls while adding them um: Rework syscall handling um: Add generic stub_syscall6 function um: Create signal stack memory assignment in stub_data um: Remove stub-data.h include from common-offsets.h um: time-travel: fix signal blocking race/hang um: time-travel: remove time_exit() ...	2024-07-25 12:33:08 -07:00
Joel Granados	78eb4ea25c	sysctl: treewide: constify the ctl_table argument of proc_handlers const qualify the struct ctl_table argument in the proc_handler function signatures. This is a prerequisite to moving the static ctl_table structs into .rodata data which will ensure that proc_handler function pointers cannot be modified. This patch has been generated by the following coccinelle script: ``` virtual patch @r1@ identifier ctl, write, buffer, lenp, ppos; identifier func !~ "appldata_(timer\|interval)_handler\|sched_(rt\|rr)_handler\|rds_tcp_skbuf_handler\|proc_sctp_do_(hmac_alg\|rto_min\|rto_max\|udp_port\|alpha_beta\|auth\|probe_interval)"; @@ int func( - struct ctl_table ctl + const struct ctl_table ctl ,int write, void buffer, size_t lenp, loff_t ppos); @r2@ identifier func, ctl, write, buffer, lenp, ppos; @@ int func( - struct ctl_table ctl + const struct ctl_table ctl ,int write, void buffer, size_t lenp, loff_t ppos) { ... } @r3@ identifier func; @@ int func( - struct ctl_table * + const struct ctl_table * ,int , void , size_t , loff_t ); @r4@ identifier func, ctl; @@ int func( - struct ctl_table ctl + const struct ctl_table ctl ,int , void , size_t , loff_t ); @r5@ identifier func, write, buffer, lenp, ppos; @@ int func( - struct ctl_table * + const struct ctl_table * ,int write, void buffer, size_t lenp, loff_t ppos); ``` Code formatting was adjusted in xfs_sysctl.c to comply with code conventions. The xfs_stats_clear_proc_handler, xfs_panic_mask_proc_handler and xfs_deprecated_dointvec_minmax where adjusted. * The ctl_table argument in proc_watchdog_common was const qualified. This is called from a proc_handler itself and is calling back into another proc_handler, making it necessary to change it as part of the proc_handler migration. Co-developed-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Co-developed-by: Joel Granados <j.granados@samsung.com> Signed-off-by: Joel Granados <j.granados@samsung.com>	2024-07-24 20:59:29 +02:00
Linus Torvalds	7a3fad30fd	Random number generator updates for Linux 6.11-rc1. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmaarzgACgkQSfxwEqXe A66ZWBAAlhXx8bve0uKlDRK8fffWHgruho/fOY4lZJ137AKwA9JCtmOyqdfL4Dmk VxFe7pEQJlQhcA/6kH54uO7SBXwfKlKZJth6SYnaCRMUIbFifHjjIQ0QqldjEKi0 rP90Hu4FVsbwQC7u9i9lQj9n2P36zb6pn83BzpZQ/2PtoVCSCrdSJUe0Rxa3H3GN 0+nNkDSXQt5otCByLaeE3x7KJgXLWL9+G2eFSFLTZ8rSVfMx1CdOIAG37WlLGdWm BaFYPDKMyBTVvVJBNgAe9YSqtrsZ5nlmLz+Z9wAe/hTL7RlL03kWUu34/Udcpull zzMDH0WMntiGK3eFQ2gOYSWqypvAjwHgn3BzqNmjUb69+89mZsdU1slcvnxWsUwU D3vphrscaqarF629tfsXti3jc5PoXwUTjROZVcCyeFPBhyAZgzK8xUvPpJO+RT+K EuUABob9cpA6FCpW/QeolDmMDhXlNT8QgsZu1juokZac2xP3Ly3REyEvT7HLbU2W ZJjbEqm1ppp3RmGELUOJbyhwsLrnbt+OMDO7iEWoG8aSFK4diBK/ZM6WvLMkr8Oi 7ioXGIsYkCy3c47wpZKTrAapOPJp5keqNAiHSEbXw8mozp6429QAEZxNOcczgHKC Ea2JzRkctqutcIT+Slw/uUe//i1iSsIHXbE81fp5udcQTJcUByo= =P8aI -----END PGP SIGNATURE----- Merge tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "This adds getrandom() support to the vDSO. First, it adds a new kind of mapping to mmap(2), MAP_DROPPABLE, which lets the kernel zero out pages anytime under memory pressure, which enables allocating memory that never gets swapped to disk but also doesn't count as being mlocked. Then, the vDSO implementation of getrandom() is introduced in a generic manner and hooked into random.c. Next, this is implemented on x86. (Also, though it's not ready for this pull, somebody has begun an arm64 implementation already) Finally, two vDSO selftests are added. There are also two housekeeping cleanup commits" * tag 'random-6.11-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: MAINTAINERS: add random.h headers to RNG subsection random: note that RNDGETPOOL was removed in 2.6.9-rc2 selftests/vDSO: add tests for vgetrandom x86: vdso: Wire up getrandom() vDSO implementation random: introduce generic vDSO getrandom() implementation mm: add MAP_DROPPABLE for designating always lazily freeable mappings	2024-07-24 10:29:50 -07:00
Linus Torvalds	d1e9a63dcd	vfs-6.11-rc1.fixes.2 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZqDFUwAKCRCRxhvAZXjc omD6APwJKlepwDYlu5XZptI6/1kmai6SqaYnifTX1+ELR/rQQAD/Z37aho42v2JZ NYr+KFj02vj7ryKA5OWuSD8cw+6GlwQ= =dfob -----END PGP SIGNATURE----- Merge tag 'vfs-6.11-rc1.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "VFS: - The new 64bit mount ids start after the old mount id, i.e., at the first non-32 bit value. However, we started counting one id too late and thus lost 4294967296 as the first valid id. Fix that. - Update a few comments on some vfs_() creation helpers. - Move copying of the xattr name out from the locks required to start a filesystem write. - Extend the filelock lock UAF fix to the compat code as well. - Now that we added the ability to look up an inode under RCU it's possible that lockless hash lookup can find and lock an inode after it gets I_FREEING set. It then waits until inode teardown in evict() is finished. The flag however is still set after evict() has woken up all waiters. If the inode lock is taken late enough on the waiting side after hash removal and wakeup happened the waiting thread will never be woken. Before RCU based lookup this was synchronized via the inode_hash_lock. But since unhashing requires the inode lock as well we can check whether the inode is unhashed while holding inode lock even without holding inode_hash_lock. pidfd: - The nsproxy structure contains nearly all of the namespaces associated with a task. When a namespace type isn't supported nsproxy might contain a NULL pointer or always point to the initial namespace type. The logic isn't consistent. So when deriving namespace fds we need to ensure that the namespace type is supported. First, so that we don't risk dereferncing NULL pointers. The correct bigger fix would be to change all namespaces to always set a valid namespace pointer in struct nsproxy independent of whether or not it is compiled in. But that requires quite a few changes. Second, so that we don't allow deriving namespace fds when the namespace type doesn't exist and thus when they couldn't also be derived via /proc/self/ns/. - Add missing selftests for the new pidfd ioctls to derive namespace fds. This simply extends the already existing testsuite. netfs: - Fix debug logging and fix kconfig variable name so it actually works. - Fix writeback that goes both to the server and cache. The streams are only activated once a subreq is added. When a server write happens the subreq doesn't need to have finished by the time the cache write is started. If the server write has already finished by the time the cache write is about to start the cache write will operate on a folio that might already have been reused. Fix this by preactivating the cache write. - Limit cachefiles subreq size for cache writes to MAX_RW_COUNT" tag 'vfs-6.11-rc1.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: inode: clarify what's locked vfs: Fix potential circular locking through setxattr() and removexattr() filelock: Fix fcntl/close race recovery compat path fs: use all available ids cachefiles: Set the max subreq size for cache writes to MAX_RW_COUNT netfs: Fix writeback that needs to go to both server and cache pidfs: add selftests for new namespace ioctls pidfs: handle kernels without namespaces cleanly pidfs: when time ns disabled add check for ioctl vfs: correct the comments of vfs_*() helpers vfs: handle __wait_on_freeing_inode() and evict() race netfs: Rename CONFIG_FSCACHE_DEBUG to CONFIG_NETFS_DEBUG netfs: Revert "netfs: Switch debug logging to pr_debug()"	2024-07-24 09:42:51 -07:00
Linus Torvalds	e44be00289	hostfs: fix folio conversion Commit `e3ec0fe944` ("hostfs: Convert hostfs_read_folio() to use a folio") simplified hostfs_read_folio(), but in the process of converting to using folios natively also mis-used the folio_zero_tail() function due to the very confusing API of that function. Very arguably it's folio_zero_tail() API itself that is buggy, since it would make more sense (and the documentation kind of implies) that the third argument would be the pointer to the beginning of the folio buffer. But no, the third argument to folio_zero_tail() is where we should start zeroing the tail (even if we already also pass in the offset separately as the second argument). So fix the hostfs caller, and we can leave any folio_zero_tail() sanity cleanup for later. Reported-and-tested-by: Maciej Żenczykowski <maze@google.com> Fixes: `e3ec0fe944` ("hostfs: Convert hostfs_read_folio() to use a folio") Link: https://lore.kernel.org/all/CANP3RGceNzwdb7w=vPf5=7BCid5HVQDmz1K5kC9JG42+HVAh_g@mail.gmail.com/ Cc: Matthew Wilcox <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-07-24 09:25:15 -07:00
Christian Brauner	f5e5e97c71	inode: clarify what's locked In __wait_on_freeing_inode() we warn in case the inode_hash_lock is held but the inode is unhashed. We then release the inode_lock. So using "locked" as parameter name is confusing. Use is_inode_hash_locked as parameter name instead. Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-07-24 11:11:40 +02:00
David Howells	c3a5e3e872	vfs: Fix potential circular locking through setxattr() and removexattr() When using cachefiles, lockdep may emit something similar to the circular locking dependency notice below. The problem appears to stem from the following: (1) Cachefiles manipulates xattrs on the files in its cache when called from ->writepages(). (2) The setxattr() and removexattr() system call handlers get the name (and value) from userspace after taking the sb_writers lock, putting accesses of the vma->vm_lock and mm->mmap_lock inside of that. (3) The afs filesystem uses a per-inode lock to prevent multiple revalidation RPCs and in writeback vs truncate to prevent parallel operations from deadlocking against the server on one side and local page locks on the other. Fix this by moving the getting of the name and value in {get,remove}xattr() outside of the sb_writers lock. This also has the minor benefits that we don't need to reget these in the event of a retry and we never try to take the sb_writers lock in the event we can't pull the name and value into the kernel. Alternative approaches that might fix this include moving the dispatch of a write to the cache off to a workqueue or trying to do without the validation lock in afs. Note that this might also affect other filesystems that use netfslib and/or cachefiles. ====================================================== WARNING: possible circular locking dependency detected 6.10.0-build2+ #956 Not tainted ------------------------------------------------------ fsstress/6050 is trying to acquire lock: ffff888138fd82f0 (mapping.invalidate_lock#3){++++}-{3:3}, at: filemap_fault+0x26e/0x8b0 but task is already holding lock: ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&vma->vm_lock->lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 down_write+0x3b/0x50 vma_start_write+0x6b/0xa0 vma_link+0xcc/0x140 insert_vm_struct+0xb7/0xf0 alloc_bprm+0x2c1/0x390 kernel_execve+0x65/0x1a0 call_usermodehelper_exec_async+0x14d/0x190 ret_from_fork+0x24/0x40 ret_from_fork_asm+0x1a/0x30 -> #3 (&mm->mmap_lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 __might_fault+0x7c/0xb0 strncpy_from_user+0x25/0x160 removexattr+0x7f/0x100 __do_sys_fremovexattr+0x7e/0xb0 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (sb_writers#14){.+.+}-{0:0}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 percpu_down_read+0x3c/0x90 vfs_iocb_iter_write+0xe9/0x1d0 __cachefiles_write+0x367/0x430 cachefiles_issue_write+0x299/0x2f0 netfs_advance_write+0x117/0x140 netfs_write_folio.isra.0+0x5ca/0x6e0 netfs_writepages+0x230/0x2f0 afs_writepages+0x4d/0x70 do_writepages+0x1e8/0x3e0 filemap_fdatawrite_wbc+0x84/0xa0 __filemap_fdatawrite_range+0xa8/0xf0 file_write_and_wait_range+0x59/0x90 afs_release+0x10f/0x270 __fput+0x25f/0x3d0 __do_sys_close+0x43/0x70 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&vnode->validate_lock){++++}-{3:3}: __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 down_read+0x95/0x200 afs_writepages+0x37/0x70 do_writepages+0x1e8/0x3e0 filemap_fdatawrite_wbc+0x84/0xa0 filemap_invalidate_inode+0x167/0x1e0 netfs_unbuffered_write_iter+0x1bd/0x2d0 vfs_write+0x22e/0x320 ksys_write+0xbc/0x130 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (mapping.invalidate_lock#3){++++}-{3:3}: check_noncircular+0x119/0x160 check_prev_add+0x195/0x430 __lock_acquire+0xaf0/0xd80 lock_acquire.part.0+0x103/0x280 down_read+0x95/0x200 filemap_fault+0x26e/0x8b0 __do_fault+0x57/0xd0 do_pte_missing+0x23b/0x320 __handle_mm_fault+0x2d4/0x320 handle_mm_fault+0x14f/0x260 do_user_addr_fault+0x2a2/0x500 exc_page_fault+0x71/0x90 asm_exc_page_fault+0x22/0x30 other info that might help us debug this: Chain exists of: mapping.invalidate_lock#3 --> &mm->mmap_lock --> &vma->vm_lock->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(&vma->vm_lock->lock); lock(&mm->mmap_lock); lock(&vma->vm_lock->lock); rlock(mapping.invalidate_lock#3); * DEADLOCK * 1 lock held by fsstress/6050: #0: ffff888113f26d18 (&vma->vm_lock->lock){++++}-{3:3}, at: lock_vma_under_rcu+0x165/0x250 stack backtrace: CPU: 0 PID: 6050 Comm: fsstress Not tainted 6.10.0-build2+ #956 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x80 check_noncircular+0x119/0x160 ? queued_spin_lock_slowpath+0x4be/0x510 ? __pfx_check_noncircular+0x10/0x10 ? __pfx_queued_spin_lock_slowpath+0x10/0x10 ? mark_lock+0x47/0x160 ? init_chain_block+0x9c/0xc0 ? add_chain_block+0x84/0xf0 check_prev_add+0x195/0x430 __lock_acquire+0xaf0/0xd80 ? __pfx___lock_acquire+0x10/0x10 ? __lock_release.isra.0+0x13b/0x230 lock_acquire.part.0+0x103/0x280 ? filemap_fault+0x26e/0x8b0 ? __pfx_lock_acquire.part.0+0x10/0x10 ? rcu_is_watching+0x34/0x60 ? lock_acquire+0xd7/0x120 down_read+0x95/0x200 ? filemap_fault+0x26e/0x8b0 ? __pfx_down_read+0x10/0x10 ? __filemap_get_folio+0x25/0x1a0 filemap_fault+0x26e/0x8b0 ? __pfx_filemap_fault+0x10/0x10 ? find_held_lock+0x7c/0x90 ? __pfx___lock_release.isra.0+0x10/0x10 ? __pte_offset_map+0x99/0x110 __do_fault+0x57/0xd0 do_pte_missing+0x23b/0x320 __handle_mm_fault+0x2d4/0x320 ? __pfx___handle_mm_fault+0x10/0x10 handle_mm_fault+0x14f/0x260 do_user_addr_fault+0x2a2/0x500 exc_page_fault+0x71/0x90 asm_exc_page_fault+0x22/0x30 Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/2136178.1721725194@warthog.procyon.org.uk cc: Alexander Viro <viro@zeniv.linux.org.uk> cc: Christian Brauner <brauner@kernel.org> cc: Jan Kara <jack@suse.cz> cc: Jeff Layton <jlayton@kernel.org> cc: Gao Xiang <xiang@kernel.org> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: linux-erofs@lists.ozlabs.org cc: linux-fsdevel@vger.kernel.org [brauner: fix minor issues] Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-07-24 10:53:14 +02:00
Jann Horn	f8138f2ad2	filelock: Fix fcntl/close race recovery compat path When I wrote commit `3cad1bc010` ("filelock: Remove locks reliably when fcntl/close race is detected"), I missed that there are two copies of the code I was patching: The normal version, and the version for 64-bit offsets on 32-bit kernels. Thanks to Greg KH for stumbling over this while doing the stable backport... Apply exactly the same fix to the compat path for 32-bit kernels. Fixes: `c293621bbf` ("[PATCH] stale POSIX lock handling") Cc: stable@kernel.org Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2563 Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20240723-fs-lock-recover-compatfix-v1-1-148096719529@google.com Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-07-24 10:53:14 +02:00
Christian Brauner	8eac5358ad	fs: use all available ids The counter is unconditionally incremented for each mount allocation. If we set it to 1ULL << 32 we're losing 4294967296 as the first valid non-32 bit mount id. Link: https://lore.kernel.org/r/20240719-work-mount-namespace-v1-1-834113cab0d2@kernel.org Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-07-24 10:53:13 +02:00
David Howells	51d37982bb	cachefiles: Set the max subreq size for cache writes to MAX_RW_COUNT Set the maximum size of a subrequest that writes to cachefiles to be MAX_RW_COUNT so that we don't overrun the maximum write we can make to the backing filesystem. Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/1599005.1721398742@warthog.procyon.org.uk cc: Jeff Layton <jlayton@kernel.org> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-07-24 10:53:13 +02:00

... 2 3 4 5 6 ...

93258 Commits