linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-02 00:51:44 +00:00

Author	SHA1	Message	Date
Darrick J. Wong	34389616a9	xfs: require a relatively recent V5 filesystem for LARP mode While reviewing the FIEXCHANGE code in XFS, I realized that the function that enables logged xattrs doesn't actually check that the superblock has a LOG_INCOMPAT feature bit field. Add a check to refuse the operation if we don't have a V5 filesystem... ...but on second though, let's require either reflink or rmap so that we only have to deal with LARP mode on relatively /modern/ kernel. 4.14 is about as far back as I feel like going. Seeing as LARP is a debugging-only option anyway, this isn't likely to affect any real users. Fixes: `d9c61ccb3b` ("xfs: move xfs_attr_use_log_assist out of xfs_log.c") Really-Fixes: `f3f36c893f` ("xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>	2023-09-12 10:31:08 -07:00
Darrick J. Wong	49813a21ed	xfs: make inode unlinked bucket recovery work with quotacheck Teach quotacheck to reload the unlinked inode lists when walking the inode table. This requires extra state handling, since it's possible that a reloaded inode will get inactivated before quotacheck tries to scan it; in this case, we need to ensure that the reloaded inode does not have dquots attached when it is freed. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2023-09-12 10:31:07 -07:00
Darrick J. Wong	68b957f64f	xfs: load uncached unlinked inodes into memory on demand shrikanth hegde reports that filesystems fail shortly after mount with the following failure: WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs] This of course is the WARN_ON_ONCE in xfs_iunlink_lookup: ip = radix_tree_lookup(&pag->pag_ici_root, agino); if (WARN_ON_ONCE(!ip \|\| !ip->i_ino)) { ... } From diagnostic data collected by the bug reporters, it would appear that we cleanly mounted a filesystem that contained unlinked inodes. Unlinked inodes are only processed as a final step of log recovery, which means that clean mounts do not process the unlinked list at all. Prior to the introduction of the incore unlinked lists, this wasn't a problem because the unlink code would (very expensively) traverse the entire ondisk metadata iunlink chain to keep things up to date. However, the incore unlinked list code complains when it realizes that it is out of sync with the ondisk metadata and shuts down the fs, which is bad. Ritesh proposed to solve this problem by unconditionally parsing the unlinked lists at mount time, but this imposes a mount time cost for every filesystem to catch something that should be very infrequent. Instead, let's target the places where we can encounter a next_unlinked pointer that refers to an inode that is not in cache, and load it into cache. Note: This patch does not address the problem of iget loading an inode from the middle of the iunlink list and needing to set i_prev_unlinked correctly. Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com> Triaged-by: Ritesh Harjani <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-12 10:31:07 -07:00
Darrick J. Wong	3c919b0910	xfs: reserve less log space when recovering log intent items Wengang Wang reports that a customer's system was running a number of truncate operations on a filesystem with a very small log. Contention on the reserve heads lead to other threads stalling on smaller updates (e.g. mtime updates) long enough to result in the node being rebooted on account of the lack of responsivenes. The node failed to recover because log recovery of an EFI became stuck waiting for a grant of reserve space. From Wengang's report: "For the file deletion, log bytes are reserved basing on xfs_mount->tr_itruncate which is: tr_logres = 175488, tr_logcount = 2, tr_logflags = XFS_TRANS_PERM_LOG_RES, "You see it's a permanent log reservation with two log operations (two transactions in rolling mode). After calculation (xlog_calc_unit_res() adds space for various log headers), the final log space needed per transaction changes from 175488 to 180208 bytes. So the total log space needed is 360416 bytes (180208 * 2). [That quantity] of log space (360416 bytes) needs to be reserved for both run time inode removing (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover())." In other words, runtime pre-reserves 360K of space in anticipation of running a chain of two transactions in which each transaction gets a 180K reservation. Now that we've allocated the transaction, we delete the bmap mapping, log an EFI to free the space, and roll the transaction as part of finishing the deferops chain. Rolling creates a new xfs_trans which shares its ticket with the old transaction. Next, xfs_trans_roll calls __xfs_trans_commit with regrant == true, which calls xlog_cil_commit with the same regrant parameter. xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and subtracts t_curr_res from the reservation and write heads. If the filesystem is fresh and the first transaction only used (say) 20K, then t_curr_res will be 160K, and we give that much reservation back to the reservation head. Or if the file is really fragmented and the first transaction actually uses 170K, then t_curr_res will be 10K, and that's what we give back to the reservation. Having done that, we're now headed into the second transaction with an EFI and 180K of reservation. Other threads apparently consumed all the reservation for smaller transactions, such as timestamp updates. Now let's say the first transaction gets written to disk and we crash without ever completing the second transaction. Now we remount the fs, log recovery finds the unfinished EFI, and calls xfs_efi_recover to finish the EFI. However, xfs_efi_recover starts a new tr_itruncate tranasction, which asks for 360K log reservation. This is a lot more than the 180K that we had reserved at the time of the crash. If the first EFI to be recovered is also pinning the tail of the log, we will be unable to free any space in the log, and recovery livelocks. Wengang confirmed this: "Now we have the second transaction which has 180208 log bytes reserved too. The second transaction is supposed to process intents including extent freeing. With my hacking patch, I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. "With my test case, other transactions (update timestamps) then happen. As my hacking patch pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available log space (in memory in on disk). And finally the on disk (and in memory) available log space goes down near to 180208 bytes. Those 180208 bytes are reserved by [the] second (extent-free) transaction [in the chain]." Wengang and I noticed that EFI recovery starts a transaction, completes one step of the chain, and commits the transaction without completing any other steps of the chain. Those subsequent steps are completed by xlog_finish_defer_ops, which allocates yet another transaction to finish the rest of the chain. That transaction gets the same tr_logres as the head transaction, but with tr_logcount = 1 to force regranting with every roll to avoid livelocks. In other words, we already figured this out in commit `929b92f640` ("xfs: xfs_defer_capture should absorb remaining transaction reservation"), but should have applied that logic to each intent item's recovery function. For Wengang's case, the xfs_trans_alloc call in the EFI recovery function should only be asking for a single transaction's worth of log reservation -- 180K, not 360K. Quoting Wengang again: "With log recovery, during EFI recovery, we use tr_itruncate again to reserve two transactions that needs 360416 log bytes. Reserving 360416 bytes fails [stalls] because we now only have about 180208 available. "Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have (a bit) more than 180208 available log bytes on disk, so [if we decrease the reservation to 180K] the reservation goes and the recovery [finishes]. That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res xfs_mount->tr_ext_free { tr_logres = 175488, tr_logcount = 0, tr_logflags = 0, } "and use tr_ext_free instead of tr_itruncate in EFI recover." However, I don't think it quite makes sense to create an entirely new transaction reservation type to handle single-stepping during log recovery. Instead, we should copy the transaction reservation information in the xfs_mount, change tr_logcount to 1, and pass that into xfs_trans_alloc. We know this won't risk changing the min log size computation since we always ask for a fraction of the reservation for all known transaction types. This looks like it's been lurking in the codebase since commit `3d3c8b5222`, which changed the xfs_trans_reserve call in xlog_recover_process_efi to use the tr_logcount in tr_itruncate. That changed the EFI recovery transaction from making a non-XFS_TRANS_PERM_LOG_RES request for one transaction's worth of log space to a XFS_TRANS_PERM_LOG_RES request for two transactions worth. Fixes: `3d3c8b5222` ("xfs: refactor xfs_trans_reserve() interface") Complements: `929b92f640` ("xfs: xfs_defer_capture should absorb remaining transaction reservation") Suggested-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Srikanth C S <srikanth.c.s@oracle.com> [djwong: apply the same transformation to all log intent recovery] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-12 10:31:07 -07:00
Darrick J. Wong	74ad4693b6	xfs: fix log recovery when unknown rocompat bits are set Log recovery has always run on read only mounts, even where the primary superblock advertises unknown rocompat bits. Due to a misunderstanding between Eric and Darrick back in 2018, we accidentally changed the superblock write verifier to shutdown the fs over that exact scenario. As a result, the log cleaning that occurs at the end of the mounting process fails if there are unknown rocompat bits set. As we now allow writing of the superblock if there are unknown rocompat bits set on a RO mount, we no longer want to turn off RO state to allow log recovery to succeed on a RO mount. Hence we also remove all the (now unnecessary) RO state toggling from the log recovery path. Fixes: `9e037cb797` ("xfs: check for unknown v5 feature bits in superblock write verifier" Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-12 10:31:07 -07:00
Darrick J. Wong	83771c50e4	xfs: reload entire unlinked bucket lists The previous patch to reload unrecovered unlinked inodes when adding a newly created inode to the unlinked list is missing a key piece of functionality. It doesn't handle the case that someone calls xfs_iget on an inode that is not the last item in the incore list. For example, if at mount time the ondisk iunlink bucket looks like this: AGI -> 7 -> 22 -> 3 -> NULL None of these three inodes are cached in memory. Now let's say that someone tries to open inode 3 by handle. We need to walk the list to make sure that inodes 7 and 22 get loaded cold, and that the i_prev_unlinked of inode 3 gets set to 22. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2023-09-12 10:31:07 -07:00
Darrick J. Wong	76e589013f	xfs: allow inode inactivation during a ro mount log recovery In the next patch, we're going to prohibit log recovery if the primary superblock contains an unrecognized rocompat feature bit even on readonly mounts. This requires removing all the code in the log mounting process that temporarily disables the readonly state. Unfortunately, inode inactivation disables itself on readonly mounts. Clearing the iunlinked lists after log recovery needs inactivation to run to free the unreferenced inodes, which (AFAICT) is the only reason why log mounting plays games with the readonly state in the first place. Therefore, change the inactivation predicates to allow inactivation during log recovery of a readonly mount. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-12 10:31:07 -07:00
Darrick J. Wong	f12b96683d	xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list Alter the definition of i_prev_unlinked slightly to make it more obvious when an inode with 0 link count is not part of the iunlink bucket lists rooted in the AGI. This distinction is necessary because it is not sufficient to check inode.i_nlink to decide if an inode is on the unlinked list. Updates to i_nlink can happen while holding only ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list (which happen after the nlink update) requires both ILOCK_EXCL and the AGI buffer lock. The next few patches will make it possible to reload an entire unlinked bucket list when we're walking the inode table or performing handle operations and need more than the ability to iget the last inode in the chain. The upcoming directory repair code also needs to be able to make this distinction to decide if a zero link count directory should be moved to the orphanage or allowed to inactivate. An upcoming enhancement to the online AGI fsck code will need this distinction to check and rebuild the AGI unlinked buckets. Signed-off-by: Darrick J. Wong <djwong@kernel.org>	2023-09-12 10:31:07 -07:00
Christoph Hellwig	4aa8cdd5e5	iomap: handle error conditions more gracefully in iomap_to_bh iomap_to_bh currently BUG()s when the passed in block number is not in the iomap. For file systems that have proper synchronization this should never happen and so far hasn't in mainline, but for block devices size changes aren't fully synchronized against ongoing I/O. Instead of BUG()ing in this case, return -EIO to the caller, which already has proper error handling. While we're at it, also return -EIO for an unknown iomap state instead of returning garbage. Fixes: `487c607df7` ("block: use iomap for writes to block devices") Reported-by: syzbot+4a08ffdf3667b36650a1@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>	2023-09-12 10:05:48 -07:00
Linus Torvalds	afe03f088f	overlayfs fixes for 6.6-rc2 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmT+s8AACgkQEVvVyTe/ 1Wr3GA/+IiTHYJSqHOjKvAWs8pVX3rcmJxoodPidOFAncPKzw5GlD7Lo0C8Qgmie CEzg8kSfMPg0n7VpywOCiA1n37gBJbmHdQ0108OxJA65IRna4K6qcyBQvfOGXq9u Qx360cTCJFGBITkkRmg8RZQKF+Dj3nd0vHn3feGkPftL113fhZTZ1uWFxyXIGsuu eki8wW3WgZFvS71Pp9nZEVFd/HPJukd03LKHQlSnQQBnCJJjZhvM6O2Y4o4lFznj aWTIHQdowz00Mj1kFEM46cGFDg3SwFtdOizpRrWxL0oOVnElIo8mRAxtf3gz4081 fvvhvYG5QEgSUrq7VfuGxaxlh2tyHgc8gh7lNXC0JCGSjc2lHGosvniFcRo6ecgU 7UwT+rX0odWzbSDH6TrMHPZsBSS/siKWreii63HUlMOo0mSorTzA20EK7f9qoXTA dgvMD7cbFiEOjwlS9SbIjMMuvYM5VykCyWuniBqicA5UzUj2/K5DG6apkMK/MGbn DU/r0HYROqdggk920i/Yyv4GoS6uERfELpkoJr9q7Lx1+wAkRGbNUpUe408Wp411 I66Ynie48oBmlDfU3LiyW9b3OPbFMPKE3WTPIngJurWRoHFXunxdkArfQi+rAUpx cmC5CovFAaVSL3HyhWAXh5lMbe1KjUasQM8ywTyhki2wZzrAI7U= =F/OX -----END PGP SIGNATURE----- Merge tag 'ovl-fixes-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs fixes from Amir Goldstein: "Two fixes for pretty old regressions" * tag 'ovl-fixes-6.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: fix incorrect fdput() on aio completion ovl: fix failed copyup of fileattr on a symlink	2023-09-12 09:00:25 -07:00
NeilBrown	88956eabfd	NFSD: fix possible oops when nfsd/pool_stats is closed. If /proc/fs/nfsd/pool_stats is open when the last nfsd thread exits, then when the file is closed a NULL pointer is dereferenced. This is because nfsd_pool_stats_release() assumes that the pointer to the svc_serv cannot become NULL while a reference is held. This used to be the case but a recent patch split nfsd_last_thread() out from nfsd_put(), and clearing the pointer is done in nfsd_last_thread(). This is easily reproduced by running rpc.nfsd 8 ; ( rpc.nfsd 0;true) < /proc/fs/nfsd/pool_stats Fortunately nfsd_pool_stats_release() has easy access to the svc_serv pointer, and so can call svc_put() on it directly. Fixes: `9f28a971ee` ("nfsd: separate nfsd_last_thread() from nfsd_put()") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-09-12 09:39:35 -04:00
Steven Rostedt (Google)	9243e54309	tracefs/eventfs: Use list_for_each_srcu() in dcache_dir_open_wrapper() The eventfs files list is protected by SRCU. In earlier iterations it was protected with just RCU, but because it needed to also call sleepable code, it had to be switch to SRCU. The dcache_dir_open_wrapper() list_for_each_rcu() was missed and did not get converted over to list_for_each_srcu(). That needs to be fixed. Link: https://lore.kernel.org/linux-trace-kernel/20230911120053.ca82f545e7f46ea753deda18@kernel.org/ Link: https://lore.kernel.org/linux-trace-kernel/20230911200654.71ce927c@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ajay Kaher <akaher@vmware.com> Cc: "Paul E. McKenney" <paulmck@kernel.org> Reported-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: `6394044955` ("eventfs: Implement eventfs lookup, read, open functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-09-11 22:05:02 -04:00
Darrick J. Wong	ef7d959339	xfs: remove CPU hotplug infrastructure There are no users of the cpu hotplug hooks in xfs now, so remove it. This reverts `f1653c2e28` ("xfs: introduce CPU hotplug infrastructure"). Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-11 08:39:04 -07:00
Darrick J. Wong	f5bfa695f0	xfs: remove the all-mounts list Revert commit `0ed17f01c8` ("xfs: introduce all-mounts list for cpu hotplug notifications") because the cpu hotplug hooks are now pointless, so we don't need this list anymore. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-11 08:39:04 -07:00
Darrick J. Wong	62334fab47	xfs: use per-mount cpumask to track nonempty percpu inodegc lists Directly track which CPUs have contributed to the inodegc percpu lists instead of trusting the cpu online mask. This eliminates a theoretical problem where the inodegc flush functions might fail to flush a CPU's inodes if that CPU happened to be dying at exactly the same time. Most likely nobody's noticed this because the CPU dead hook moves the percpu inodegc list to another CPU and schedules that worker immediately. But it's quite possible that this is a subtle race leading to UAF if the inodegc flush were part of an unmount. Further benefits: This reduces the overhead of the inodegc flush code slightly by allowing us to ignore CPUs that have empty lists. Better yet, it reduces our dependence on the cpu online masks, which have been the cause of confusion and drama lately. Fixes: `ab23a77687` ("xfs: per-cpu deferred inode inactivation queues") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-11 08:39:03 -07:00
Darrick J. Wong	cfa2df68b7	xfs: fix an agbno overflow in __xfs_getfsmap_datadev Dave Chinner reported that xfs/273 fails if the AG size happens to be an exact power of two. I traced this to an agbno integer overflow when the current GETFSMAP call is a continuation of a previous GETFSMAP call, and the last record returned was non-shareable space at the end of an AG. __xfs_getfsmap_datadev sets up a data device query by converting the incoming fmr_physical into an xfs_fsblock_t and cracking it into an agno and agbno pair. In the (failing) case of where fmr_blockcount of the low key is nonzero and the record was for a non-shareable extent, it will add fmr_blockcount to start_fsb and info->low.rm_startblock. If the low key was actually the last record for that AG, then this addition causes info->low.rm_startblock to point beyond EOAG. When the rmapbt range query starts, it'll return an empty set, and fsmap moves on to the next AG. Or so I thought. Remember how we added to start_fsb? If agsize < 1<<agblklog, start_fsb points to the same AG as the original fmr_physical from the low key. We run the rmapbt query, which returns nothing, so getfsmap zeroes info->low and moves on to the next AG. If agsize == 1<<agblklog, start_fsb now points to the next AG. We run the rmapbt query on the next AG with the excessively large rm_startblock. If this next AG is actually the last AG, we'll set info->high to EOFS (which is now has a lower rm_startblock than info->low), and the ranged btree query code will return -EINVAL. If it's not the last AG, we ignore all records for the intermediate AGs. Oops. Fix this by decoding start_fsb into agno and agbno only after making adjustments to start_fsb. This means that info->low.rm_startblock will always be set to a valid agbno, and we always start the rmapbt iteration in the correct AG. While we're at it, fix the predicate for determining if an fsmap record represents non-shareable space to include file data on pre-reflink filesystems. Reported-by: Dave Chinner <david@fromorbit.com> Fixes: `63ef7a3591` ("xfs: fix interval filtering in multi-step fsmap queries") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-11 08:39:02 -07:00
Darrick J. Wong	ecd49f7a36	xfs: fix per-cpu CIL structure aggregation racing with dying cpus In commit `7c8ade2121` ("xfs: implement percpu cil space used calculation"), the XFS committed (log) item list code was converted to use per-cpu lists and space tracking to reduce cpu contention when multiple threads are modifying different parts of the filesystem and hence end up contending on the log structures during transaction commit. Each CPU tracks its own commit items and space usage, and these do not have to be merged into the main CIL until either someone wants to push the CIL items, or we run over a soft threshold and switch to slower (but more accurate) accounting with atomics. Unfortunately, the for_each_cpu iteration suffers from the same race with cpu dying problem that was identified in commit `8b57b11cca` ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a result, both CIL percpu structure aggregation functions fail to collect the items and accounted space usage at the correct point in time. If we're lucky, the items that are collected from the online cpus exceed the space given to those cpus, and the log immediately shuts down in xlog_cil_insert_items due to the (apparent) log reservation overrun. This happens periodically with generic/650, which exercises cpu hotplug vs. the filesystem code: smpboot: CPU 3 is now offline XFS (sda3): ctx ticket reservation ran out. Need to up reservation XFS (sda3): ticket reservation summary: XFS (sda3): unit res = 9268 bytes XFS (sda3): current res = -40 bytes XFS (sda3): original count = 1 XFS (sda3): remaining count = 1 XFS (sda3): Filesystem has been shut down due to log error (0x2). Applying the same sort of fix from `8b57b11cca` to the CIL code seems to make the generic/650 problem go away, but I've been told that tglx was not happy when he saw: "...the only thing we actually need to care about is that percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when there are no CPUs dying, it has no addition overhead except for a cpumask_or() operation." The CPU hotplug code is rather complex and difficult to understand and I don't want to try to understand the cpu hotplug locking well enough to use cpu_dying mask. Furthermore, there's a performance improvement that could be had here. Attach a private cpu mask to the CIL structure so that we can track exactly which cpus have accessed the percpu data at all. It doesn't matter if the cpu has since gone offline; log item aggregation will still find the items. Better yet, we skip cpus that have not recently logged anything. Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU hot remove racing with an xfs mount can crash if the cpu_dead notifier tries to access the log but the mount hasn't yet set up the log. Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/ Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/ Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/ Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/ Cc: tglx@linutronix.de Cc: peterz@infradead.org Reported-by: ritesh.list@gmail.com Reported-by: sandeen@sandeen.net Fixes: `af1c2146a5` ("xfs: introduce per-cpu CIL tracking structure") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2023-09-11 08:39:02 -07:00
Trond Myklebust	954998b60c	NFS: Fix error handling for O_DIRECT write scheduling If we fail to schedule a request for transmission, there are 2 possibilities: 1) Either we hit a fatal error, and we just want to drop the remaining requests on the floor. 2) We were asked to try again, in which case we should allow the outstanding RPC calls to complete, so that we can recoalesce requests and try again. Fixes: `d600ad1f2b` ("NFS41: pop some layoutget errors to application") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-09-11 11:29:38 -04:00
Shigeru Yoshida	2251588143	reiserfs: Replace 1-element array with C99 style flex-array UBSAN found the following issue: ================================================================================ UBSAN: array-index-out-of-bounds in fs/reiserfs/journal.c:4166:22 index 1 is out of range for type '__le32 [1]' This is because struct reiserfs_journal_desc uses 1-element array for dynamically sized array member, j_realblock. This patch fixes this issue by replacing the 1-element array member with C99 style flex-array. This patch also fixes the same issue in struct reiserfs_journal_commit as the same manner. Fixes: `f466c6fdb3` ("move private bits of reiserfs_fs.h to fs/reiserfs/reiserfs.h") Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Message-Id: <20230821043312.1444068-1-syoshida@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-09-11 14:07:46 +02:00
Lukas Bulwahn	57c0f4a8ea	xfs: fix select in config XFS_ONLINE_SCRUB_STATS Commit `d7a74cad8f` ("xfs: track usage statistics of online fsck") introduces config XFS_ONLINE_SCRUB_STATS, which selects the non-existing config FS_DEBUG. It is probably intended to select the existing config XFS_DEBUG. Fix the select in config XFS_ONLINE_SCRUB_STATS. Fixes: `d7a74cad8f` ("xfs: track usage statistics of online fsck") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>	2023-09-11 14:52:23 +05:30
Heinrich Schuchardt	79b83606ab	efivarfs: fix statfs() on efivarfs Some firmware (notably U-Boot) provides GetVariable() and GetNextVariableName() but not QueryVariableInfo(). With commit `d86ff3333c` ("efivarfs: expose used and total size") the statfs syscall was broken for such firmware. If QueryVariableInfo() does not exist or returns EFI_UNSUPPORTED, just report the file system size as 0 as statfs_simple() previously did. Fixes: `d86ff3333c` ("efivarfs: expose used and total size") Link: https://lore.kernel.org/all/20230910045445.41632-1-heinrich.schuchardt@canonical.com/ Signed-off-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com> [ardb: log warning on QueryVariableInfo() failure] Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>	2023-09-11 09:10:02 +00:00
Gao Xiang	75a5221630	erofs: fix memory leak of LZMA global compressed deduplication When stressing microLZMA EROFS images with the new global compressed deduplication feature enabled (`-Ededupe`), I found some short-lived temporary pages weren't properly released, which could slowly cause unexpected OOMs hours later. Let's fix it now (LZ4 and DEFLATE don't have this issue.) Fixes: `5c2a64252c` ("erofs: introduce partial-referenced pclusters") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20230907050542.97152-1-hsiangkao@linux.alibaba.com	2023-09-11 08:03:32 +08:00
Linus Torvalds	fd3a5940e6	six smb3 client fixes, one fix for nls Kconfig, one minor spnego registry update -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmT8qQwACgkQiiy9cAdy T1Hjzgv/dCmqowHN9fI1gP/SpkbyZt0/GrlhpRMELQ2THAYuJmWQ9i/vR7VFMMv7 C7vfx3nYTQfQAcsshA+bfOkABr02A9HD5HuBEhWldEFQD/qIPrPMyU53DJ6m4QOb 81/8ox6XJMDNHWJUPJVfhAyHoKBxYyNcqqkITnBcSpqENbm7IZ8UhUJe1l75nkBi IYR28e4Aci5uGPuMB7NAuKmJM0IKi1ipMy4MCXkSD/vQZStz76mhzx200Wm6ObuX jqPzIb32simsVY7q/Hbkxt2MPDUbsn6kBhOSe8oJVxEgrMy56pZCOByPjc0ZFzE6 SziZ0JanrfjlGdBg5qjmcUamEMV44Oi1QQ6PpkHRA2zDyXMpv8ulwy/3Z40O7zbb WBBar0HsJE5osCL0jiwLbNorctEnmtSKWcyXOI924Bli4eesE/ojq2u4H4KrlXzA gDqPwAhhw/PrbuEmgRy9maYE8KwutZh2HCRvxgFB1NgbcNLvKgkQrly4QP29hKQy +HViqKAe =6QhY -----END PGP SIGNATURE----- Merge tag '6.6-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client fixes from Steve French: - six smb3 client fixes including ones to allow controlling smb3 directory caching timeout and limits, and one debugging improvement - one fix for nls Kconfig (don't need to expose NLS_UCS2_UTILS option) - one minor spnego registry update * tag '6.6-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: spnego: add missing OID to oid registry smb3: fix minor typo in SMB2_GLOBAL_CAP_LARGE_MTU cifs: update internal module version number for cifs.ko smb3: allow controlling maximum number of cached directories smb3: add trace point for queryfs (statfs) nls: Hide new NLS_UCS2_UTILS smb3: allow controlling length of time directory entries are cached with dir leases smb: propagate error code of extract_sharename()	2023-09-09 19:56:23 -07:00
Jeff Layton	fdd2630a73	nfsd: fix change_info in NFSv4 RENAME replies nfsd sends the transposed directory change info in the RENAME reply. The source directory is in save_fh and the target is in current_fh. Reported-by: Zhi Li <yieli@redhat.com> Reported-by: Benjamin Coddington <bcodding@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2218844 Signed-off-by: Jeff Layton <jlayton@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-09-09 13:24:52 -04:00
Linus Torvalds	6099776f9f	one patch to remove unneded warning -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmT7eEwACgkQiiy9cAdy T1EY1Qv+OIgraDJ23wl4FX17dEQLKqcVH5wCWhU9IEq9oIhKSMngDE4kAKu/unEJ GOKy9QfY2eshPWHCoZjh9lI1Xg776nC3GeAtekOFHXX9FXRZfWP0+Sp6HRNk1TxD iTrRUDGnGA20QF7tgxNo5EJb2yvxBmEmhiSTwlFUuV+oOjvbQw0eB2EheEEzhPJF 7qrlvC0GvvgUOrVVg6YEMCNFb9bvrns06Rwrl3Hq86gaauXIg4pBIF/AHiwh7jCo +upXo1vgfWp29XMwX79LqCOx9u96q0i5AZonXTN2W/AxsgsWA6RQnGyZmL+Q/2/4 w4jN76NaLspoVE8grbzcSJq8b3Rlx2HUlzBaZgqAbuB8zx/59RXf559Aewt+Y9Pt anGsTTpp6a9nc9YYlAGSmaA97y5fnHO47sHpKHVhR9vKbQgvnqGzvmqgNUeRbDzb wgUjOma6SPfZL6JNqNb+SrktI410MjS11/+mP4NlTrVVomYyKZXslcvpnMuoiHMH U4RpYB84 =9Rd7 -----END PGP SIGNATURE----- Merge tag '6.6-rc-ksmbd' of git://git.samba.org/ksmbd Pull smb server update from Steve French: "After two years, many fixes and much testing, ksmbd is no longer experimental" * tag '6.6-rc-ksmbd' of git://git.samba.org/ksmbd: ksmbd: remove experimental warning	2023-09-08 22:01:55 -07:00
Steven Rostedt (Google)	d508ee2dd5	tracefs/eventfs: Free top level files on removal When an instance is removed, the top level files of the eventfs directory are not cleaned up. Call the eventfs_remove() on each of the entries to free them. This was found via kmemleak: unreferenced object 0xffff8881047c1280 (size 96): comm "mkdir", pid 924, jiffies 4294906489 (age 2013.077s) hex dump (first 32 bytes): 18 31 ed 03 81 88 ff ff 00 31 09 24 81 88 ff ff .1.......1.$.... 00 00 00 00 00 00 00 00 98 19 7c 04 81 88 ff ff ..........\|..... backtrace: [<000000000fa46b4d>] kmalloc_trace+0x2a/0xa0 [<00000000e729cd0c>] eventfs_prepare_ef.constprop.0+0x3a/0x160 [<000000009032e6a8>] eventfs_add_events_file+0xa0/0x160 [<00000000fe968442>] create_event_toplevel_files+0x6f/0x130 [<00000000e364d173>] event_trace_add_tracer+0x14/0x140 [<00000000411840fa>] trace_array_create_dir+0x52/0xf0 [<00000000967804fa>] trace_array_create+0x208/0x370 [<00000000da505565>] instance_mkdir+0x6b/0xb0 [<00000000dc1215af>] tracefs_syscall_mkdir+0x5b/0x90 [<00000000a8aca289>] vfs_mkdir+0x272/0x380 [<000000007709b242>] do_mkdirat+0xfc/0x1d0 [<00000000c0b6d219>] __x64_sys_mkdir+0x78/0xa0 [<0000000097b5dd4b>] do_syscall_64+0x3f/0x90 [<00000000a3f00cfa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 unreferenced object 0xffff888103ed3118 (size 8): comm "mkdir", pid 924, jiffies 4294906489 (age 2013.077s) hex dump (first 8 bytes): 65 6e 61 62 6c 65 00 00 enable.. backtrace: [<0000000010f75127>] __kmalloc_node_track_caller+0x51/0x160 [<000000004b3eca91>] kstrdup+0x34/0x60 [<0000000050074d7a>] eventfs_prepare_ef.constprop.0+0x53/0x160 [<000000009032e6a8>] eventfs_add_events_file+0xa0/0x160 [<00000000fe968442>] create_event_toplevel_files+0x6f/0x130 [<00000000e364d173>] event_trace_add_tracer+0x14/0x140 [<00000000411840fa>] trace_array_create_dir+0x52/0xf0 [<00000000967804fa>] trace_array_create+0x208/0x370 [<00000000da505565>] instance_mkdir+0x6b/0xb0 [<00000000dc1215af>] tracefs_syscall_mkdir+0x5b/0x90 [<00000000a8aca289>] vfs_mkdir+0x272/0x380 [<000000007709b242>] do_mkdirat+0xfc/0x1d0 [<00000000c0b6d219>] __x64_sys_mkdir+0x78/0xa0 [<0000000097b5dd4b>] do_syscall_64+0x3f/0x90 [<00000000a3f00cfa>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Link: https://lore.kernel.org/linux-trace-kernel/20230907175859.6fedbaa2@gandalf.local.home Cc: Mark Rutland <mark.rutland@arm.com> Cc: Ajay Kaher <akaher@vmware.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Fixes: `5bdcd5f533` eventfs: ("Implement removal of meta data from eventfs") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-09-08 23:12:06 -04:00
Steve French	702c390bc8	smb3: fix minor typo in SMB2_GLOBAL_CAP_LARGE_MTU There was a minor typo in the define for SMB2_GLOBAL_CAP_LARGE_MTU 0X00000004 instead of 0x00000004 make it consistent Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-09-08 19:01:16 -05:00
Bhaskar Chowdhury	5facccc940	MAINTAINERS: remove links to obsolete btrfs.wiki.kernel.org The wiki has been archived and is not updated anymore. Remove or replace the links in files that contain it (MAINTAINERS, Kconfig, docs). Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:21:27 +02:00
Filipe Manana	a57c2d4e46	btrfs: assert delayed node locked when removing delayed item When removing a delayed item, or releasing which will remove it as well, we will modify one of the delayed node's rbtrees and item counter if the delayed item is in one of the rbtrees. This require having the delayed node's mutex locked, otherwise we will race with other tasks modifying the rbtrees and the counter. This is motivated by a previous version of another patch actually calling btrfs_release_delayed_item() after unlocking the delayed node's mutex and against a delayed item that is in a rbtree. So assert at __btrfs_remove_delayed_item() that the delayed node's mutex is locked. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:20:40 +02:00
Filipe Manana	2c58c3931e	btrfs: remove BUG() after failure to insert delayed dir index item Instead of calling BUG() when we fail to insert a delayed dir index item into the delayed node's tree, we can just release all the resources we have allocated/acquired before and return the error to the caller. This is fine because all existing call chains undo anything they have done before calling btrfs_insert_delayed_dir_index() or BUG_ON (when creating pending snapshots in the transaction commit path). So remove the BUG() call and do proper error handling. This relates to a syzbot report linked below, but does not fix it because it only prevents hitting a BUG(), it does not fix the issue where somehow we attempt to use twice the same index number for different index items. Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:11:59 +02:00
Filipe Manana	91bfe3104b	btrfs: improve error message after failure to add delayed dir index item If we fail to add a delayed dir index item because there's already another item with the same index number, we print an error message (and then BUG). However that message isn't very helpful to debug anything because we don't know what's the index number and what are the values of index counters in the inode and its delayed inode (index_cnt fields of struct btrfs_inode and struct btrfs_delayed_node). So update the error message to include the index number and counters. We actually had a recent case where this issue was hit by a syzbot report (see the link below). Link: https://lore.kernel.org/linux-btrfs/00000000000036e1290603e097e0@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:11:57 +02:00
Qu Wenruo	5e0e879926	btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio [BUG] After commit `72a69cd030` ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap"), the DEBUG section of btree_dirty_folio() would no longer compile. [CAUSE] If DEBUG is defined, we would do extra checks for btree_dirty_folio(), mostly to make sure the range we marked dirty has an extent buffer and that extent buffer is dirty. For subpage, we need to iterate through all the extent buffers covered by that page range, and make sure they all matches the criteria. However commit `72a69cd030` ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap") changes how we store the bitmap, we pack all the 16 bits bitmaps into a larger bitmap, which would save some space. This means we no longer have btrfs_subpage::dirty_bitmap, instead the dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a length of btrfs_subpage_info::bitmap_nr_bits. [FIX] Although I'm not sure if it still makes sense to maintain such code, at least let it compile. This patch would let us test the bits one by one through the bitmaps. CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:11:04 +02:00
Josef Bacik	4ca8e03cf2	btrfs: check for BTRFS_FS_ERROR in pending ordered assert If we do fast tree logging we increment a counter on the current transaction for every ordered extent we need to wait for. This means we expect the transaction to still be there when we clear pending on the ordered extent. However if we happen to abort the transaction and clean it up, there could be no running transaction, and thus we'll trip the "ASSERT(trans)" check. This is obviously incorrect, and the code properly deals with the case that the transaction doesn't exist. Fix this ASSERT() to only fire if there's no trans and we don't have BTRFS_FS_ERROR() set on the file system. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:10:59 +02:00
Filipe Manana	e110f8911d	btrfs: fix lockdep splat and potential deadlock after failure running delayed items When running delayed items we are holding a delayed node's mutex and then we will attempt to modify a subvolume btree to insert/update/delete the delayed items. However if have an error during the insertions for example, btrfs_insert_delayed_items() may return with a path that has locked extent buffers (a leaf at the very least), and then we attempt to release the delayed node at __btrfs_run_delayed_items(), which requires taking the delayed node's mutex, causing an ABBA type of deadlock. This was reported by syzbot and the lockdep splat is the following: WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Not tainted ------------------------------------------------------ syz-executor.2/13257 is trying to acquire lock: ffff88801835c0c0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 but task is already holding lock: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: __lock_release kernel/locking/lockdep.c:5475 [inline] lock_release+0x36f/0x9d0 kernel/locking/lockdep.c:5781 up_write+0x79/0x580 kernel/locking/rwsem.c:1625 btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline] btrfs_unlock_up_safe+0x179/0x3b0 fs/btrfs/locking.c:239 search_leaf fs/btrfs/ctree.c:1986 [inline] btrfs_search_slot+0x2511/0x2f80 fs/btrfs/ctree.c:2230 btrfs_insert_empty_items+0x9c/0x180 fs/btrfs/ctree.c:4376 btrfs_insert_delayed_item fs/btrfs/delayed-inode.c:746 [inline] btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline] __btrfs_commit_inode_delayed_items+0xd24/0x2410 fs/btrfs/delayed-inode.c:1111 __btrfs_run_delayed_items+0x1db/0x430 fs/btrfs/delayed-inode.c:1153 flush_space+0x269/0xe70 fs/btrfs/space-info.c:723 btrfs_async_reclaim_metadata_space+0x106/0x350 fs/btrfs/space-info.c:1078 process_one_work+0x92c/0x12c0 kernel/workqueue.c:2600 worker_thread+0xa63/0x1210 kernel/workqueue.c:2751 kthread+0x2b8/0x350 kernel/kthread.c:389 ret_from_fork+0x2e/0x60 arch/x86/kernel/process.c:145 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:304 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799 __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156 btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276 btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988 vfs_fsync_range fs/sync.c:188 [inline] vfs_fsync fs/sync.c:202 [inline] do_fsync fs/sync.c:212 [inline] __do_sys_fsync fs/sync.c:220 [inline] __se_sys_fsync fs/sync.c:218 [inline] __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-00); lock(&delayed_node->mutex); lock(btrfs-tree-00); lock(&delayed_node->mutex); * DEADLOCK * 3 locks held by syz-executor.2/13257: #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: spin_unlock include/linux/spinlock.h:391 [inline] #0: ffff88802c1ee370 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0xb87/0xe00 fs/btrfs/transaction.c:287 #1: ffff88802c1ee398 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0xbb2/0xe00 fs/btrfs/transaction.c:288 #2: ffff88802a5ab8e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x3c/0x2a0 fs/btrfs/locking.c:198 stack backtrace: CPU: 0 PID: 13257 Comm: syz-executor.2 Not tainted 6.5.0-rc7-syzkaller-00024-g93f5de5f648d #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 __mutex_lock_common+0x1d8/0x2530 kernel/locking/mutex.c:603 __mutex_lock kernel/locking/mutex.c:747 [inline] mutex_lock_nested+0x1b/0x20 kernel/locking/mutex.c:799 __btrfs_release_delayed_node+0x9a/0xaa0 fs/btrfs/delayed-inode.c:256 btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline] __btrfs_run_delayed_items+0x2b5/0x430 fs/btrfs/delayed-inode.c:1156 btrfs_commit_transaction+0x859/0x2ff0 fs/btrfs/transaction.c:2276 btrfs_sync_file+0xf56/0x1330 fs/btrfs/file.c:1988 vfs_fsync_range fs/sync.c:188 [inline] vfs_fsync fs/sync.c:202 [inline] do_fsync fs/sync.c:212 [inline] __do_sys_fsync fs/sync.c:220 [inline] __se_sys_fsync fs/sync.c:218 [inline] __x64_sys_fsync+0x196/0x1e0 fs/sync.c:218 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f3ad047cae9 Code: 28 00 00 00 75 (...) RSP: 002b:00007f3ad12510c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 00007f3ad059bf80 RCX: 00007f3ad047cae9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 RBP: 00007f3ad04c847a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007f3ad059bf80 R15: 00007ffe56af92f8 </TASK> ------------[ cut here ]------------ Fix this by releasing the path before releasing the delayed node in the error path at __btrfs_run_delayed_items(). Reported-by: syzbot+a379155f07c134ea9879@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000abba27060403b5bd@google.com/ CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:10:53 +02:00
Josef Bacik	77d20c685b	btrfs: do not block starts waiting on previous transaction commit Internally I got a report of very long stalls on normal operations like creating a new file when auto relocation was running. The reporter used the 'bpf offcputime' tracer to show that we would get stuck in start_transaction for 5 to 30 seconds, and were always being woken up by the transaction commit. Using my timing-everything script, which times how long a function takes and what percentage of that total time is taken up by its children, I saw several traces like this 1083 took 32812902424 ns 29929002926 ns 91.2110% wait_for_commit_duration 25568 ns 7.7920e-05% commit_fs_roots_duration 1007751 ns 0.00307% commit_cowonly_roots_duration 446855602 ns 1.36182% btrfs_run_delayed_refs_duration 271980 ns 0.00082% btrfs_run_delayed_items_duration 2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration 9656 ns 2.9427e-05% switch_commit_roots_duration 1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration 4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration Here I was only tracing functions that happen where we are between START_COMMIT and UNBLOCKED in order to see what would be keeping us blocked for so long. The wait_for_commit() we do is where we wait for a previous transaction that hasn't completed it's commit. This can include all of the unpin work and other cleanups, which tends to be the longest part of our transaction commit. There is no reason we should be blocking new things from entering the transaction at this point, it just adds to random latency spikes for no reason. Fix this by adding a PREP stage. This allows us to properly deal with multiple committers coming in at the same time, we retain the behavior that the winner waits on the previous transaction and the losers all wait for this transaction commit to occur. Nothing else is blocked during the PREP stage, and then once the wait is complete we switch to COMMIT_START and all of the same behavior as before is maintained. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:10:49 +02:00
Filipe Manana	ee34a82e89	btrfs: release path before inode lookup during the ino lookup ioctl During the ino lookup ioctl we can end up calling btrfs_iget() to get an inode reference while we are holding on a root's btree. If btrfs_iget() needs to lookup the inode from the root's btree, because it's not currently loaded in memory, then it will need to lock another or the same path in the same root btree. This may result in a deadlock and trigger the following lockdep splat: WARNING: possible circular locking dependency detected 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Not tainted ------------------------------------------------------ syz-executor277/5012 is trying to acquire lock: ffff88802df41710 (btrfs-tree-01){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 but task is already holding lock: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_search_slot+0x13a4/0x2f80 fs/btrfs/ctree.c:2302 btrfs_init_root_free_objectid+0x148/0x320 fs/btrfs/disk-io.c:4955 btrfs_init_fs_root fs/btrfs/disk-io.c:1128 [inline] btrfs_get_root_ref+0x5ae/0xae0 fs/btrfs/disk-io.c:1338 btrfs_get_fs_root fs/btrfs/disk-io.c:1390 [inline] open_ctree+0x29c8/0x3030 fs/btrfs/disk-io.c:3494 btrfs_fill_super+0x1c7/0x2f0 fs/btrfs/super.c:1154 btrfs_mount_root+0x7e0/0x910 fs/btrfs/super.c:1519 legacy_get_tree+0xef/0x190 fs/fs_context.c:611 vfs_get_tree+0x8c/0x270 fs/super.c:1519 fc_mount fs/namespace.c:1112 [inline] vfs_kern_mount+0xbc/0x150 fs/namespace.c:1142 btrfs_mount+0x39f/0xb50 fs/btrfs/super.c:1579 legacy_get_tree+0xef/0x190 fs/fs_context.c:611 vfs_get_tree+0x8c/0x270 fs/super.c:1519 do_new_mount+0x28f/0xae0 fs/namespace.c:3335 do_mount fs/namespace.c:3675 [inline] __do_sys_mount fs/namespace.c:3884 [inline] __se_sys_mount+0x2d9/0x3c0 fs/namespace.c:3861 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd -> #0 (btrfs-tree-01){++++}-{3:3}: check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline] btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281 btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154 btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412 btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline] btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716 btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline] btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105 btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- rlock(btrfs-tree-00); lock(btrfs-tree-01); lock(btrfs-tree-00); rlock(btrfs-tree-01); * DEADLOCK * 1 lock held by syz-executor277/5012: #0: ffff88802df418e8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 stack backtrace: CPU: 1 PID: 5012 Comm: syz-executor277 Not tainted 6.5.0-rc7-syzkaller-00004-gf7757129e3de #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 check_noncircular+0x375/0x4a0 kernel/locking/lockdep.c:2195 check_prev_add kernel/locking/lockdep.c:3142 [inline] check_prevs_add kernel/locking/lockdep.c:3261 [inline] validate_chain kernel/locking/lockdep.c:3876 [inline] __lock_acquire+0x39ff/0x7f70 kernel/locking/lockdep.c:5144 lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5761 down_read_nested+0x49/0x2f0 kernel/locking/rwsem.c:1645 __btrfs_tree_read_lock+0x2f/0x220 fs/btrfs/locking.c:136 btrfs_tree_read_lock fs/btrfs/locking.c:142 [inline] btrfs_read_lock_root_node+0x292/0x3c0 fs/btrfs/locking.c:281 btrfs_search_slot_get_root fs/btrfs/ctree.c:1832 [inline] btrfs_search_slot+0x4ff/0x2f80 fs/btrfs/ctree.c:2154 btrfs_lookup_inode+0xdc/0x480 fs/btrfs/inode-item.c:412 btrfs_read_locked_inode fs/btrfs/inode.c:3892 [inline] btrfs_iget_path+0x2d9/0x1520 fs/btrfs/inode.c:5716 btrfs_search_path_in_tree_user fs/btrfs/ioctl.c:1961 [inline] btrfs_ioctl_ino_lookup_user+0x77a/0xf50 fs/btrfs/ioctl.c:2105 btrfs_ioctl+0xb0b/0xd40 fs/btrfs/ioctl.c:4683 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f0bec94ea39 Fix this simply by releasing the path before calling btrfs_iget() as at point we don't need the path anymore. Reported-by: syzbot+bf66ad948981797d2f1d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000045fa140603c4a969@google.com/ Fixes: `23d0b79dfa` ("btrfs: Add unprivileged version of ino_lookup ioctl") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:10:40 +02:00
Filipe Manana	2d6cd791e6	btrfs: fix race between finishing block group creation and its item update Commit `675dfe1223` ("btrfs: fix block group item corruption after inserting new block group") fixed one race that resulted in not persisting a block group's item when its "used" bytes field decreases to zero. However there's another race that can happen in a much shorter time window that results in the same problem. The following sequence of steps explains how it can happen: 1) Task A creates a metadata block group X, its "used" and "commit_used" fields are initialized to 0; 2) Two extents are allocated from block group X, so its "used" field is updated to 32K, and its "commit_used" field remains as 0; 3) Transaction commit starts, by some task B, and it enters btrfs_start_dirty_block_groups(). There it tries to update the block group item for block group X, which currently has its "used" field with a value of 32K and its "commit_used" field with a value of 0. However that fails since the block group item was not yet inserted, so at update_block_group_item(), the btrfs_search_slot() call returns 1, and then we set 'ret' to -ENOENT. Before jumping to the label 'fail'... 4) The block group item is inserted by task A, when for example btrfs_create_pending_block_groups() is called when releasing its transaction handle. This results in insert_block_group_item() inserting the block group item in the extent tree (or block group tree), with a "used" field having a value of 32K and setting "commit_used", in struct btrfs_block_group, to the same value (32K); 5) Task B jumps to the 'fail' label and then resets the "commit_used" field to 0. At btrfs_start_dirty_block_groups(), because -ENOENT was returned from update_block_group_item(), we add the block group again to the list of dirty block groups, so that we will try again in the critical section of the transaction commit when calling btrfs_write_dirty_block_groups(); 6) Later the two extents from block group X are freed, so its "used" field becomes 0; 7) If no more extents are allocated from block group X before we get into btrfs_write_dirty_block_groups(), then when we call update_block_group_item() again for block group X, we will not update the block group item to reflect that it has 0 bytes used, because the "used" and "commit_used" fields in struct btrfs_block_group have the same value, a value of 0. As a result after committing the transaction we have an empty block group with its block group item having a 32K value for its "used" field. This will trigger errors from fsck ("btrfs check" command) and after mounting again the fs, the cleaner kthread will not automatically delete the empty block group, since its "used" field is not 0. Possibly there are other issues due to this inconsistency. When this issue happens, the error reported by fsck is like this: [1/7] checking root items [2/7] checking extents block group [1104150528 1073741824] used 39796736 but extent items used 0 ERROR: errors found in extent allocation tree or chunk allocation (...) So fix this by not resetting the "commit_used" field of a block group when we don't find the block group item at update_block_group_item(). Fixes: `7248e0cebb` ("btrfs: skip update of block group item if used bytes are the same") CC: stable@vger.kernel.org # 6.2+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-09-08 14:10:36 +02:00
Steven Rostedt (Google)	9879e5e1c5	tracefs/eventfs: Use dput to free the toplevel events directory Currently when rmdir on an instance is done, eventfs_remove_events_dir() is called and it does a dput on the dentry and then frees the eventfs_inode that represents the events directory. But there's no protection against a reader reading the top level events directory at the same time and we can get a use after free error. Instead, use the dput() associated to the dentry to also free the eventfs_inode associated to the events directory, as that will get called when the last reference to the directory is released. This issue triggered the following KASAN report: ================================================================== BUG: KASAN: slab-use-after-free in eventfs_root_lookup+0x88/0x1b0 Read of size 8 at addr ffff888120130ca0 by task ftracetest/1201 CPU: 4 PID: 1201 Comm: ftracetest Not tainted 6.5.0-test-10737-g469e0a8194e7 #13 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x57/0x90 print_report+0xcf/0x670 ? __pfx_ring_buffer_record_off+0x10/0x10 ? _raw_spin_lock_irqsave+0x2b/0x70 ? __virt_addr_valid+0xd9/0x160 kasan_report+0xd4/0x110 ? eventfs_root_lookup+0x88/0x1b0 ? eventfs_root_lookup+0x88/0x1b0 eventfs_root_lookup+0x88/0x1b0 ? eventfs_root_lookup+0x33/0x1b0 __lookup_slow+0x194/0x2a0 ? __pfx___lookup_slow+0x10/0x10 ? down_read+0x11c/0x330 walk_component+0x166/0x220 link_path_walk.part.0.constprop.0+0x3a3/0x5a0 ? seqcount_lockdep_reader_access+0x82/0x90 ? __pfx_link_path_walk.part.0.constprop.0+0x10/0x10 path_openat+0x143/0x11f0 ? __lock_acquire+0xa1a/0x3220 ? __pfx_path_openat+0x10/0x10 ? __pfx___lock_acquire+0x10/0x10 do_filp_open+0x166/0x290 ? __pfx_do_filp_open+0x10/0x10 ? lock_is_held_type+0xce/0x120 ? preempt_count_sub+0xb7/0x100 ? _raw_spin_unlock+0x29/0x50 ? alloc_fd+0x1a0/0x320 do_sys_openat2+0x126/0x160 ? rcu_is_watching+0x34/0x60 ? __pfx_do_sys_openat2+0x10/0x10 ? __might_resched+0x2cf/0x3b0 ? __fget_light+0xdf/0x100 __x64_sys_openat+0xcd/0x140 ? __pfx___x64_sys_openat+0x10/0x10 ? syscall_enter_from_user_mode+0x22/0x90 ? lockdep_hardirqs_on+0x7d/0x100 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7f1dceef5e51 Code: 75 57 89 f0 25 00 00 41 00 3d 00 00 41 00 74 49 80 3d 9a 27 0e 00 00 74 6d 89 da 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 93 00 00 00 48 8b 54 24 28 64 48 2b 14 25 RSP: 002b:00007fff2cddf380 EFLAGS: 00000202 ORIG_RAX: 0000000000000101 RAX: ffffffffffffffda RBX: 0000000000000241 RCX: 00007f1dceef5e51 RDX: 0000000000000241 RSI: 000055d7520677d0 RDI: 00000000ffffff9c RBP: 000055d7520677d0 R08: 000000000000001e R09: 0000000000000001 R10: 00000000000001b6 R11: 0000000000000202 R12: 0000000000000000 R13: 0000000000000003 R14: 000055d752035678 R15: 000055d752067788 </TASK> Allocated by task 1200: kasan_save_stack+0x2f/0x50 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x8b/0x90 eventfs_create_events_dir+0x54/0x220 create_event_toplevel_files+0x42/0x130 event_trace_add_tracer+0x33/0x180 trace_array_create_dir+0x52/0xf0 trace_array_create+0x361/0x410 instance_mkdir+0x6b/0xb0 tracefs_syscall_mkdir+0x57/0x80 vfs_mkdir+0x275/0x380 do_mkdirat+0x1da/0x210 __x64_sys_mkdir+0x74/0xa0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Freed by task 1251: kasan_save_stack+0x2f/0x50 kasan_set_track+0x21/0x30 kasan_save_free_info+0x27/0x40 __kasan_slab_free+0x106/0x180 __kmem_cache_free+0x149/0x2e0 event_trace_del_tracer+0xcb/0x120 __remove_instance+0x16a/0x340 instance_rmdir+0x77/0xa0 tracefs_syscall_rmdir+0x77/0xc0 vfs_rmdir+0xed/0x2d0 do_rmdir+0x235/0x280 __x64_sys_rmdir+0x5f/0x90 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 The buggy address belongs to the object at ffff888120130ca0 which belongs to the cache kmalloc-16 of size 16 The buggy address is located 0 bytes inside of freed 16-byte region [ffff888120130ca0, ffff888120130cb0) The buggy address belongs to the physical page: page:000000004dbddbb0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x120130 flags: 0x17ffffc0000800(slab\|node=0\|zone=2\|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 0017ffffc0000800 ffff8881000423c0 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000800080 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888120130b80: 00 00 fc fc 00 05 fc fc 00 00 fc fc 00 02 fc fc ffff888120130c00: 00 07 fc fc 00 00 fc fc 00 00 fc fc fa fb fc fc >ffff888120130c80: 00 00 fc fc fa fb fc fc 00 00 fc fc 00 00 fc fc ^ ffff888120130d00: 00 00 fc fc 00 00 fc fc 00 00 fc fc fa fb fc fc ffff888120130d80: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc ================================================================== Link: https://lkml.kernel.org/r/20230907024803.250873643@goodmis.org Link: https://lore.kernel.org/all/1cb3aee2-19af-c472-e265-05176fe9bd84@huawei.com/ Cc: Ajay Kaher <akaher@vmware.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Fixes: `5bdcd5f533` eventfs: ("Implement removal of meta data from eventfs") Tested-by: Linux Kernel Functional Testing <lkft@linaro.org> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Reported-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-09-07 16:04:47 -04:00
Ritesh Harjani (IBM)	147d4a092e	jbd2: Remove page size assumptions jbd2_alloc() allocates a buffer from slab when the block size is smaller than PAGE_SIZE, and slab may be using a compound page. Before commit `8147c4c454`, we set b_page to the precise page containing the buffer and this code worked well. Now we set b_page to the head page of the allocation, so we can no longer use offset_in_page(). While we could do a 1:1 replacement with offset_in_folio(), use the more idiomatic bh_offset() and the folio APIs to map the buffer. This isn't enough to support a b_size larger than PAGE_SIZE on HIGHMEM machines, but this is good enough to fix the actual bug we're seeing. Fixes: `8147c4c454` ("jbd2: use a folio in jbd2_journal_write_metadata_buffer()") Reported-by: Zorro Lang <zlang@kernel.org> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> [converted to be more folio] Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>	2023-09-07 15:17:02 -04:00
Christian Brauner	78a06688a4	ntfs3: drop inode references in ntfs_put_super() Recently we moved most cleanup from ntfs_put_super() into ntfs3_kill_sb() as part of a bigger cleanup. This accidently also moved dropping inode references stashed in ntfs3's sb->s_fs_info from @sb->put_super() to @sb->kill_sb(). But generic_shutdown_super() verifies that there are no busy inodes past sb->put_super(). Fix this and disentangle dropping inode references from freeing @sb->s_fs_info. Fixes: `a4f64a300a` ("ntfs3: free the sbi in ->kill_sb") # mainline only Reported-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-09-07 10:23:37 -07:00
Linus Torvalds	9013c51c63	vfs: mostly undo glibc turning 'fstat()' into 'fstatat(AT_EMPTY_PATH)' Mateusz reports that glibc turns 'fstat()' calls into 'fstatat()', and that seems to have been going on for quite a long time due to glibc having tried to simplify its stat logic into just one point. This turns out to cause completely unnecessary overhead, where we then go off and allocate the kernel side pathname, and actually look up the empty path. Sure, our path lookup is quite optimized, but it still causes a fair bit of allocation overhead and a couple of completely unnecessary rounds of lockref accesses etc. This is all hopefully getting fixed in user space, and there is a patch floating around for just having glibc use the native fstat() system call. But even with the current situation we can at least improve on things by catching the situation and short-circuiting it. Note that this is still measurably slower than just a plain 'fstat()', since just checking that the filename is actually empty is somewhat expensive due to inevitable user space access overhead from the kernel (ie verifying pointers, and SMAP on x86). But it's still quite a bit faster than actually looking up the path for real. To quote numers from Mateusz: "Sapphire Rapids, will-it-scale, ops/s stock fstat 5088199 patched fstat 7625244 (+49%) real fstat 8540383 (+67% / +12%)" where that 'stock fstat' is the glibc translation of fstat into fstatat() with an empty path, the 'patched fstat' is with this short circuiting of the path lookup, and the 'real fstat' is the actual native fstat() system call with none of this overhead. Link: https://lore.kernel.org/lkml/20230903204858.lv7i3kqvw6eamhgz@f/ Reported-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-09-07 09:40:30 -07:00
Steve French	30bded94dc	cifs: update internal module version number for cifs.ko From 2.44 to 2.45 Signed-off-by: Steve French <stfrench@microsoft.com>	2023-09-07 00:06:04 -05:00
Steve French	6a50d71d0f	smb3: allow controlling maximum number of cached directories Allow adjusting the maximum number of cached directories per share (defaults to 16) via mount parm "max_cached_dirs" Signed-off-by: Steve French <stfrench@microsoft.com>	2023-09-07 00:06:04 -05:00
Steve French	feeec636b6	smb3: add trace point for queryfs (statfs) In debugging a recent performance problem with statfs, it would have been helpful to be able to trace the smb3 query fs info request more narrowly. Add a trace point "smb3_qfs_done" Which displays: stat-68950 [008] ..... 1472.360598: smb3_qfs_done: xid=14 sid=0xaa9765e4 tid=0x95a76f54 unc_name=\\localhost\test rc=0 Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-09-07 00:05:56 -05:00
Linus Torvalds	7ba2090ca6	Mixed with some fixes and cleanups, this brings in reasonably complete fscrypt support to CephFS! The list of things which don't work with encryption should be fairly short, mostly around the edges: fallocate (not supported well in CephFS to begin with), copy_file_range (requires re-encryption), non-default striping patterns. This was a multi-year effort principally by Jeff Layton with assistance from Xiubo Li, Luís Henriques and others, including several dependant changes in the MDS, netfs helper library and fscrypt framework itself. -----BEGIN PGP SIGNATURE----- iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAmT4pl4THGlkcnlvbW92 QGdtYWlsLmNvbQAKCRBKf944AhHzi5kzB/4sMgzZyUa3T1vA/G2pPvEkyy1qDxsW y+o4dDMWA9twcrBVpNuGd54wbXpmO/LAekHEdorjayH+f0zf10MsnP1ePz9WB3NG jr7RRujb+Gpd2OFYJXGSEbd3faTg8M2kpGCCrVe7SFNoyu8z9NwFItwWMog5aBjX ODGQrq+kA4ARA6xIqwzF5gP0zr+baT9rWhQdm7Xo9itWdosnbyDLJx1dpEfLuqBX te3SmifDzedn3Gw73hdNo/+ybw0kHARoK+RmXCTsoDDQw+JsoO9KxZF5Q8QcDELq 2woPNp0Hl+Dm4MkzGnPxv56Qj8ZDViS59syXC0CfGRmu4nzF1Rw+0qn5 =/WlE -----END PGP SIGNATURE----- Merge tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client Pull ceph updates from Ilya Dryomov: "Mixed with some fixes and cleanups, this brings in reasonably complete fscrypt support to CephFS! The list of things which don't work with encryption should be fairly short, mostly around the edges: fallocate (not supported well in CephFS to begin with), copy_file_range (requires re-encryption), non-default striping patterns. This was a multi-year effort principally by Jeff Layton with assistance from Xiubo Li, Luís Henriques and others, including several dependant changes in the MDS, netfs helper library and fscrypt framework itself" * tag 'ceph-for-6.6-rc1' of https://github.com/ceph/ceph-client: (53 commits) ceph: make num_fwd and num_retry to __u32 ceph: make members in struct ceph_mds_request_args_ext a union rbd: use list_for_each_entry() helper libceph: do not include crypto/algapi.h ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper ceph: fix updating i_truncate_pagecache_size for fscrypt ceph: wait for OSD requests' callbacks to finish when unmounting ceph: drop messages from MDS when unmounting ceph: update documentation regarding snapshot naming limitations ceph: prevent snapshot creation in encrypted locked directories ceph: add support for encrypted snapshot names ceph: invalidate pages when doing direct/sync writes ceph: plumb in decryption during reads ceph: add encryption support to writepage and writepages ceph: add read/modify/write to ceph_sync_write ceph: align data in pages in ceph_sync_write ceph: don't use special DIO path for encrypted inodes ceph: add truncate size handling support for fscrypt ceph: add object version support for sync read libceph: allow ceph_osdc_new_request to accept a multi-op read ...	2023-09-06 12:10:15 -07:00
Steven Rostedt (Google)	e24709454c	tracefs/eventfs: Add missing lockdown checks All the eventfs external functions do not check if TRACEFS_LOCKDOWN was set or not. This can caused some functions to return success while others fail, which can trigger unexpected errors. Add the missing lockdown checks. Link: https://lkml.kernel.org/r/20230905182711.899724045@goodmis.org Link: https://lore.kernel.org/all/202309050916.58201dc6-oliver.sang@intel.com/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Cc: Ching-lin Yu <chinglinyu@google.com> Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-09-05 21:14:08 -04:00
Steven Rostedt (Google)	51aab5ffce	tracefs: Add missing lockdown check to tracefs_create_dir() The function tracefs_create_dir() was missing a lockdown check and was called by the RV code. This gave an inconsistent behavior of this function returning success while other tracefs functions failed. This caused the inode being freed by the wrong kmem_cache. Link: https://lkml.kernel.org/r/20230905182711.692687042@goodmis.org Link: https://lore.kernel.org/all/202309050916.58201dc6-oliver.sang@intel.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ajay Kaher <akaher@vmware.com> Cc: Ching-lin Yu <chinglinyu@google.com> Fixes: `bf8e602186` ("tracing: Do not create tracefs files if tracefs lockdown is in effect") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>	2023-09-05 21:13:48 -04:00
Linus Torvalds	65d6e954e3	gfs2 fixes - Fix a glock state (non-)transition bug when a dlm request times out and is canceled, and we have locking requests that can now be granted immediately. - Various fixes and cleanups in how the logd and quotad daemons are woken up and terminated. - Fix several bugs in the quota data reference counting and shrinking. Free quota data objects synchronously in put_super() instead of letting call_rcu() run wild. - Make sure not to deallocate quota data during a withdraw; rather, defer quota data deallocation to put_super(). Withdraws can happen in contexts in which callers on the stack are holding quota data references. - Many minor quota fixes and cleanups by Bob. - Update the the mailing list address for gfs2 and dlm. (It's the same list for both and we are moving it to gfs2@lists.linux.dev.) - Various other minor cleanups. -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmT3T7UUHGFncnVlbmJh QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqxhw/+IWp+4cY4htNkTRG7xkheTeQ+5whG NU40mp7Hj+WY5GoHqsk676q1pBkVAq5mNN1kt9S/oC6lLHrdu1HLpdIkgFow2nAC nDqlEqx9/Da9Q4H/+K442usO90S4o1MmOXOE9xcGcvJLqK4FLDOVDXbUWa43OXrK 4HxgjgGSNPF4itD+U0o/V4f19jQ+4cNwmo6hGLyhsYillaUHiIXJezQlH7XycByM qGJqlG6odJ56wE38NG8Bt9Lj+91PsLLqO1TJxSYyzpf0h9QGQ2ySvu6/esTXwxWO XRuT4db7yjyAUhJoJMw+YU77xWQTz0/jriIDS7VqzvR1ns3GPaWdtb31TdUTBG4H IvBA8ep3oxHtcYFoPzCLBXgOIDej6KjAgS3RSv51yLeaZRHFUBc21fTSXbcDTIUs gkusZlRNQ9ANdBCVyf8hZxbE54HnaBJ8dKMZtynOXJEHs0EtGV8YKCNIpkFLxOvE vZkKcRsmVtuZ9fVhX1iH7dYmcsCMPI8RNo47k7hHk2EG8dU+eqyPSbi4QCmErNFf DlqX+fIuiDtOkbmWcrb2qdphn6j6bMLhDaOMJGIBOmgOPi+AU9dNAfmtu1cG4u1b 2TFyUISayiwqHJQgguzvDed15fxexYdgoLB7O9t9TMbCENxisguNa5TsAN6ZkiLQ 0hY6h80xSR2kCPU= =EonA -----END PGP SIGNATURE----- Merge tag 'gfs2-v6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Fix a glock state (non-)transition bug when a dlm request times out and is canceled, and we have locking requests that can now be granted immediately - Various fixes and cleanups in how the logd and quotad daemons are woken up and terminated - Fix several bugs in the quota data reference counting and shrinking. Free quota data objects synchronously in put_super() instead of letting call_rcu() run wild - Make sure not to deallocate quota data during a withdraw; rather, defer quota data deallocation to put_super(). Withdraws can happen in contexts in which callers on the stack are holding quota data references - Many minor quota fixes and cleanups by Bob - Update the the mailing list address for gfs2 and dlm. (It's the same list for both and we are moving it to gfs2@lists.linux.dev) - Various other minor cleanups * tag 'gfs2-v6.5-rc5-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: (51 commits) MAINTAINERS: Update dlm mailing list MAINTAINERS: Update gfs2 mailing list gfs2: change qd_slot_count to qd_slot_ref gfs2: check for no eligible quota changes gfs2: Remove useless assignment gfs2: simplify slot_get gfs2: Simplify qd2offset gfs2: introduce qd_bh_get_or_undo gfs2: Remove quota allocation info from quota file gfs2: use constant for array size gfs2: Set qd_sync_gen in do_sync gfs2: Remove useless err set gfs2: Small gfs2_quota_lock cleanup gfs2: move qdsb_put and reduce redundancy gfs2: improvements to sysfs status gfs2: Don't try to sync non-changes gfs2: Simplify function need_sync gfs2: remove unneeded pg_oflow variable gfs2: remove unneeded variable done gfs2: pass sdp to gfs2_write_buf_to_page ...	2023-09-05 13:00:28 -07:00
Linus Torvalds	9e310ea5c8	fuse update for 6.6 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZPYlzAAKCRDh3BK/laaZ PEcxAP4suFAlonGntKJ5ltR+7ZN+WYdiraQ+5c6ISBFc+pFXgQD7B0xhztV4umSF III+pbD6lE5gP5u7+Kw/pOnTI42yTQ8= =aPjn -----END PGP SIGNATURE----- Merge tag 'fuse-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Revert non-waiting FLUSH due to a regression - Fix a lookup counter leak in readdirplus - Add an option to allow shared mmaps in no-cache mode - Add btime support and statx intrastructure to the protocol - Invalidate positive/negative dentry on failed create/delete * tag 'fuse-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: conditionally fill kstat in fuse_do_statx() fuse: invalidate dentry on EEXIST creates or ENOENT deletes fuse: cache btime fuse: implement statx fuse: add ATTR_TIMEOUT macro fuse: add STATX request fuse: handle empty request_mask in statx fuse: write back dirty pages before direct write in direct_io_relax mode fuse: add a new fuse init flag to relax restrictions in no cache mode fuse: invalidate page cache pages before direct write fuse: nlookup missing decrement in fuse_direntplus_link Revert "fuse: in fuse_flush only wait if someone wants the return code"	2023-09-05 12:45:55 -07:00
Linus Torvalds	5eea5820c7	- Stefan Roesch has added ksm statistics to /proc/pid/smaps - Also a number of singleton patches, mainly cleanups and leftovers. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZPZGXwAKCRDdBJ7gKXxA jkjpAP9F0t5xy3JGs8Iew47Yqva+fvvrZdUSx3aHIZ/C3HyaJwEAi7DwzqludyHi 851+qSdyX3bWnDEuejuNeMykh2QF1wo= =pw9A -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-09-04-14-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more MM updates from Andrew Morton: - Stefan Roesch has added ksm statistics to /proc/pid/smaps - Also a number of singleton patches, mainly cleanups and leftovers * tag 'mm-stable-2023-09-04-14-00' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/kmemleak: move up cond_resched() call in page scanning loop mm: page_alloc: remove stale CMA guard code MAINTAINERS: add rmap.h to mm entry rmap: remove anon_vma_link() nommu stub proc/ksm: add ksm stats to /proc/pid/smaps mm/hwpoison: rename hwp_walk* to hwpoison_walk* mm: memory-failure: add PageOffline() check	2023-09-05 10:56:27 -07:00
Bob Peterson	0e072cac92	gfs2: change qd_slot_count to qd_slot_ref Variable qd_slot_count is a reference count, not a count of slots. This patch renames it to qd_slot_ref to make that more clear. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	06aa6fd31a	gfs2: check for no eligible quota changes Before this patch, function gfs2_quota_sync would always allocate a page full of memory and increment its quota sync generation number. This happened even when the system was completely idle or if no blocks were allocated or quota changes made. This patch adds function qd_changed to determine if any changes have been made that qualify for a quota sync. If not, it avoids the memory allocation and bumping the generation number, along with all the additional work it would do. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	36a740916a	gfs2: Remove useless assignment This assignment is unnecessary because if error was not already 0, it would have branched to an error label already. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	9ab7b78a13	gfs2: simplify slot_get Simplify function slot_get and get rid of the goto that jumps into the middle of an else branch. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	8f190c97a4	gfs2: Simplify qd2offset This is a minor cleanup of function qd2offset. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	7dbc6ae60d	gfs2: introduce qd_bh_get_or_undo This patch is an attempt to force some consistency in quota sync processing. Two functions (qd_fish and gfs2_quota_unlock) called qd_check_sync, after which they both called bh_get, and if that failed, they took the same steps to undo the actions of qd_check_sync. This patch introduces a new function, qd_bh_get_or_undo, which performs the same steps, reducing code redundancy. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	3932e50730	gfs2: Remove quota allocation info from quota file Function do_sync called gfs2_qa_get and put for quota allocation data. But the inode in question is the system master quota file, which is never subject to quotas. Therefore, a qa structure should be unnecessary and if anything accesses it, it's probably a bug. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	c9ff3c65c2	gfs2: use constant for array size Function gfs2_quota_unlock declared an array of 4 qd elements. We have a constant for that, we should be using it. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	fce17cb0ee	gfs2: Set qd_sync_gen in do_sync Func do_sync was called in two places: gfs2_quota_unlock and gfs2_quota_sync. In gfs2_quota_sync it updated qd_sync_gen to the latest superblock sync gen, if do_sync was successful. In gfs2_quota_unlock it didn't update the value. That can only lead to extra work, for example, if the value is synced by gfs2_quota_unlock but still has the old value. This patch moves the setting of qd_sync_gen inside do_sync so we are guaranteed consistency. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	dec64ae37b	gfs2: Remove useless err set Function gfs2_adjust_quota set variable err, then set it again to a different value. This patch removes the redundant set. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	f511e60a55	gfs2: Small gfs2_quota_lock cleanup No need to set error = 0 since it's set further down. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	a4d22e337d	gfs2: move qdsb_put and reduce redundancy This patch looks more invasive than it is. It simply moves function qdsb_put before qd_unlock, then changes qd_unlock to call it rather than open coding it. Again, this reduces redundancy. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	03d468f1c0	gfs2: improvements to sysfs status This patch adds some new fields to the gfs2 status file in sysfs to aid in debugging. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	9f494e9bdc	gfs2: Don't try to sync non-changes Function need_sync is supposed to determine if a qd element needs to be synced. If the "change" (qd_change) is zero, it does not need to be synced because there's literally no change in the value. Before this patch need_sync returned false if value < 0. That should be <= 0. This patch changes the check to <=. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	2a4f651167	gfs2: Simplify function need_sync This patch simplifies function need_sync by eliminating a variable in favor of just returning the appropriate value as soon as we know it. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:18 +02:00
Bob Peterson	e34c16c9c6	gfs2: remove unneeded pg_oflow variable Function gfs2_write_disk_quota checks if its write overflows onto another page, and if so, does a second write. Before this patch it kept two variables for this, but only one is needed. This patch simplifies it by eliminating pg_oflow. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Bob Peterson	f0418e4b56	gfs2: remove unneeded variable done Function gfs2_write_buf_to_page uses variable done to exit its loop, but it's unnecessary if we just code an infinite loop and exit when we need. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Bob Peterson	d96dad2715	gfs2: pass sdp to gfs2_write_buf_to_page This patch passes the superblock pointer to gfs2_write_buf_to_page so it becomes more apparent it's dealing with the system quota file. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Bob Peterson	adfd2b5e4f	gfs2: pass sdp in to gfs2_write_disk_quota Like the previous patch, we now pass the superblock pointer to function gfs2_write_disk_quota. This makes the code more understandable, since it only operates on the quota inode. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Bob Peterson	ee1768e467	gfs2: Pass sdp to gfs2_adjust_quota Before this change function gfs2_adjust_quota's first parameter was an gfs2_inode pointer. But it always pointed to the quota inode. Here we switch that to pass the superblock pointer, sdp, so it is easier to read the code and understand that it's only dealing with the quota inode. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Bob Peterson	768963ab07	gfs2: remove dead code for quota writes Since patch `845802b112` function gfs2_write_buf_to_page checks if the target inode is jdata or ordered. This function only operates on the system quota file, which is always jdata, so the check for jdata is useless. This patch removes it. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Bob Peterson	eef46ab713	gfs2: Introduce new quota=quiet mount option This patch adds a new mount option quota=quiet which is the same as quota=on but it suppresses gfs2 quota error messages. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	267d1a011e	gfs2: Add device name to gfs2_logd and gfs2_quotad Add the device name to the names of the gfs2_logd and gfs2_quotad kernel threads to allow for easier identification. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	ab8eecf5d0	gfs2: Rename "freeze_workqueue" to "gfs2_freeze" Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	5c0dc371a2	gfs2: Rename "gfs_recovery" workqueue to "gfs2_recovery" Rename the "gfs_recovery" workqueue to "gfs2_recovery", and gfs_recovery_wq to gfs2_recovery_wq. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	e3da6be3d7	gfs2: Fix withdraw race Function gfs2_withdraw() tries to synchronize concurrent callers by atomically setting the SDF_WITHDRAWN flag in the first caller, setting the SDF_WITHDRAW_IN_PROG flag to indicate that a withdraw is in progress, performing the actual withdraw, and clearing the SDF_WITHDRAW_IN_PROG flag when done. All other callers wait for the SDF_WITHDRAW_IN_PROG flag to be cleared before returning. This leaves a small window in which callers can find the SDF_WITHDRAWN flag set before the SDF_WITHDRAW_IN_PROG flag has been set, causing them to return prematurely, before the withdraw has been completed. Fix that by setting the SDF_WITHDRAWN and SDF_WITHDRAW_IN_PROG flags atomically. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	fe0690f0a6	gfs2: Sanitize kthread stopping Immediately stop the logd and quotad kernel threads when a filesystem withdraw is detected: those threads aren't doing anything useful after a withdraw. (Depends on the extra logd and quotad task struct references held since commit 7a109f383fa3 ("gfs2: Fix asynchronous thread destruction").) In addition, check for kthread_should_stop() in the wait condition in gfs2_quotad() to stop immediately when kthread_stop() is called. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	e4a8b5481c	gfs2: Switch to wait_event in gfs2_quotad In gfs2_quotad(), switch from an open-coded wait loop to wait_event_interruptible_timeout(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	fe4f7940d2	gfs2: Fix asynchronous thread destruction The kernel threads are currently stopped and destroyed synchronously by gfs2_make_fs_ro() and gfs2_put_super(), and asynchronously by signal_our_withdraw(), with no synchronization, so the synchronous and asynchronous contexts can race with each other. First, when creating the kernel threads, take an extra task struct reference so that the task struct won't go away immediately when they terminate. This allows those kthreads to terminate immediately when they're done rather than hanging around as zombies until they are reaped by kthread_stop(). When kthread_stop() is called on a terminated kthread, it will return immediately. Second, in signal_our_withdraw(), once the SDF_JOURNAL_LIVE flag has been cleared, wake up the logd and quotad wait queues instead of stopping the logd and quotad kthreads. The kthreads are then expected to terminate automatically within short time, but if they cannot, they will not block the withdraw. For example, if a user process and one of the kthread decide to withdraw at the same time, only one of them will perform the actual withdraw and the other will wait for it to be done. If the kthread ends up being the one to wait, the withdrawing user process won't be able to stop it. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	f66af88e33	gfs2: Stop using gfs2_make_fs_ro for withdraw [ 81.372851][ T5532] CPU: 1 PID: 5532 Comm: syz-executor.0 Not tainted 6.2.0-rc1-syzkaller-dirty #0 [ 81.382080][ T5532] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023 [ 81.392343][ T5532] Call Trace: [ 81.395654][ T5532] <TASK> [ 81.398603][ T5532] dump_stack_lvl+0x1b1/0x290 [ 81.418421][ T5532] gfs2_assert_warn_i+0x19a/0x2e0 [ 81.423480][ T5532] gfs2_quota_cleanup+0x4c6/0x6b0 [ 81.428611][ T5532] gfs2_make_fs_ro+0x517/0x610 [ 81.457802][ T5532] gfs2_withdraw+0x609/0x1540 [ 81.481452][ T5532] gfs2_inode_refresh+0xb2d/0xf60 [ 81.506658][ T5532] gfs2_instantiate+0x15e/0x220 [ 81.511504][ T5532] gfs2_glock_wait+0x1d9/0x2a0 [ 81.516352][ T5532] do_sync+0x485/0xc80 [ 81.554943][ T5532] gfs2_quota_sync+0x3da/0x8b0 [ 81.559738][ T5532] gfs2_sync_fs+0x49/0xb0 [ 81.564063][ T5532] sync_filesystem+0xe8/0x220 [ 81.568740][ T5532] generic_shutdown_super+0x6b/0x310 [ 81.574112][ T5532] kill_block_super+0x79/0xd0 [ 81.578779][ T5532] deactivate_locked_super+0xa7/0xf0 [ 81.584064][ T5532] cleanup_mnt+0x494/0x520 [ 81.593753][ T5532] task_work_run+0x243/0x300 [ 81.608837][ T5532] exit_to_user_mode_loop+0x124/0x150 [ 81.614232][ T5532] exit_to_user_mode_prepare+0xb2/0x140 [ 81.619820][ T5532] syscall_exit_to_user_mode+0x26/0x60 [ 81.625287][ T5532] do_syscall_64+0x49/0xb0 [ 81.629710][ T5532] entry_SYSCALL_64_after_hwframe+0x63/0xcd In this backtrace, gfs2_quota_sync() takes quota data references and then calls do_sync(). Function do_sync() encounters filesystem corruption and withdraws the filesystem, which (among other things) calls gfs2_quota_cleanup(). Function gfs2_quota_cleanup() wrongly assumes that nobody is holding any quota data references anymore, and destroys all quota data objects. When gfs2_quota_sync() then resumes and dereferences the quota data objects it is holding, those objects are no longer there. Function gfs2_quota_cleanup() deals with resource deallocation and can easily be delayed until gfs2_put_super() in the case of a filesystem withdraw. In fact, most of the other work gfs2_make_fs_ro() does is unnecessary during a withdraw as well, so change signal_our_withdraw() to skip gfs2_make_fs_ro() and perform the necessary steps directly instead. Thanks to Edward Adam Davis <eadavis@sina.com> for the initial patches. Link: https://lore.kernel.org/all/0000000000002b5e2405f14e860f@google.com Reported-by: syzbot+3f6a670108ce43356017@syzkaller.appspotmail.com Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	a475c5dd16	gfs2: Free quota data objects synchronously In gfs2_quota_cleanup(), wait for the quota data objects to be freed before returning. Otherwise, there is no guarantee that the quota data objects will be gone when their kmem cache is destroyed. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	bb73ae8ff3	gfs2: Fix initial quota data refcount Fix the refcount of quota data objects created directly by gfs2_quota_init(): those are placed into the in-memory quota "database" for eventual syncing to the main quota file, but they are not actively held and should thus have an initial refcount of 0. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:17 +02:00
Andreas Gruenbacher	fae2e73a55	gfs2: No more quota complaints after withdraw Once a filesystem is withdrawn, don't complain about quota changes that can't be synced to the main quota file anymore. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	faada74a90	gfs2: Factor out duplicate quota data disposal code Rename gfs2_qd_dispose() to gfs2_qd_dispose_list(). Move some code duplicated in gfs2_qd_dispose_list() and gfs2_quota_cleanup() into a new gfs2_qd_dispose() function. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	961fe3422e	gfs2: Use gfs2_qd_dispose in gfs2_quota_cleanup Change gfs2_quota_cleanup() to move the quota data objects to dispose of on a dispose list and call gfs2_qd_dispose() on that list, like gfs2_qd_shrink_scan() does, instead of disposing of the quota data objects directly. This may look a bit pointless by itself, but it will make more sense in combination with a fix that follows. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	6b0e9a5f1e	gfs2: Fix wrong quota shrinker return value Function gfs2_qd_isolate must only return LRU_REMOVED when removing the item from the lru list; otherwise, the number of items on the list will go wrong. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	e7beb8b6de	gfs2: Rename SDF_DEACTIVATING to SDF_KILL Rename the SDF_DEACTIVATING flag to SDF_KILL to make it more obvious that this relates to the kill_sb filesystem operation. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	3c69c437bf	gfs2: Rename sd_{ glock => kill }_wait Rename sd_glock_wait to sd_kill_wait: we'll use it for other things related to "killing" a filesystem on unmount soon (kill_sb). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Bob Peterson	481f6e7d73	gfs2: Use qd_sbd more consequently Before this patch many of the functions in quota.c got their superblock pointer, sdp, from the quota_data's glock pointer. That's silly because the qd already has its own pointer to the superblock (qd_sbd). This patch changes references to use that instead, eliminating a level of indirection. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	db77789bae	gfs2: journal flush threshold fixes and cleanup Commit `f07b352021` ("GFS2: Made logd daemon take into account log demand") changed gfs2_ail_flush_reqd() and gfs2_jrnl_flush_reqd() to take sd_log_blks_needed into account, but the checks in gfs2_log_commit() were not updated correspondingly. Once that is fixed, gfs2_jrnl_flush_reqd() and gfs2_ail_flush_reqd() can be used in gfs2_log_commit(). Make those two helpers available to gfs2_log_commit() by defining them above gfs2_log_commit(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	b6b8f72a11	gfs2: Fix logd wakeup on I/O error When quotad detects an I/O error, it sets sd_log_error and then it wakes up logd to withdraw the filesystem. However, logd doesn't wake up when sd_log_error is set. Fix that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	b74cd55aa9	gfs2: low-memory forced flush fixes First, function gfs2_ail_flush_reqd checks the SDF_FORCE_AIL_FLUSH flag to determine if an AIL flush should be forced in low-memory situations. However, it also immediately clears the flag, and when called repeatedly as in function gfs2_logd, the flag will be lost. Fix that by pulling the SDF_FORCE_AIL_FLUSH flag check out of gfs2_ail_flush_reqd. Second, function gfs2_writepages sets the SDF_FORCE_AIL_FLUSH flag whether or not enough pages were written. If enough pages could be written, flushing the AIL is unnecessary, though. Third, gfs2_writepages doesn't wake up logd after setting the SDF_FORCE_AIL_FLUSH flag, so it can take a long time for logd to react. It would be preferable to wake up logd, but that hurts the performance of some workloads and we don't quite understand why so far, so don't wake up logd so far. Fixes: `b066a4eebd` ("gfs2: forcibly flush ail to relieve memory pressure") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	6df373b09b	gfs2: Switch to wait_event in gfs2_logd In gfs2_logd(), switch from an open-coded wait loop to wait_event_interruptible_timeout(). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Bob Peterson	66fa9912ec	gfs2: conversion deadlock do_promote bypass Consider the following case: 1. A glock is held in shared mode. 2. A process requests the glock in exclusive mode (rename). 3. Before the lock is granted, more processes (read / ls) request the glock in shared mode again. 4. gfs2 sends a request to dlm for the lock in exclusive mode because that holder is at the head of the queue. 5. Somehow the dlm request gets canceled, so dlm sends us back a response with state == LM_ST_SHARED and LM_OUT_CANCELED. So at that point, the glock is still held in shared mode. 6. finish_xmote gets called to process the response from dlm. It detects that the glock is not in the requested mode and no demote is in progress, so it moves the canceled holder to the tail of the queue and finds the new holder at the head of the queue. That holder is requesting the glock in shared mode. 7. finish_xmote calls do_xmote to transition the glock into shared mode, but the glock is already in shared mode and so do_xmote complains about that with: GLOCK_BUG_ON(gl, gl->gl_state == gl->gl_target); Instead, in finish_xmote, after moving the canceled holder to the tail of the queue, check if any new holders can be granted. Only call do_xmote to repeat the dlm request if the holder at the head of the queue is requesting the glock in a mode that is incompatible with the mode the glock is currently held in. Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	0b93bac227	gfs2: Remove LM_FLAG_PRIORITY flag The last user of this flag was removed in commit `b77b4a4815` ("gfs2: Rework freeze / thaw logic"). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	de3e7f97ae	gfs2: do_promote cleanup Change function do_promote to return true on success, and false otherwise. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	dc0b943523	gfs: Don't use GFP_NOFS in gfs2_unstuff_dinode Revert the rest of commit `220cca2a4f` ("GFS2: Change truncate page allocation to be GFP_NOFS"): In gfs2_unstuff_dinode(), there is no need to carry out the page cache allocation under GFP_NOFS because inodes on the "regular" filesystem are never un-inlined under memory pressure, so switch back from find_or_create_page() to grab_cache_page() here as well. Inodes on the "metadata" filesystem can theoretically be un-inlined under memory pressure, but any page cache allocations in that context would happen in GFP_NOFS context because those inodes have inode->i_mapping->gfp_mask set to GFP_NOFS (see the previous patch). Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:16 +02:00
Andreas Gruenbacher	111c7d27a1	gfs2: Use mapping->gfp_mask for metadata inodes Set mapping->gfp mask to GFP_NOFS for all metadata inodes so that allocating pages in the address space of those inodes won't call back into the filesystem. This allows to switch back from find_or_create_page() to grab_cache_page() in two places. Partially reverts commit `220cca2a4f` ("GFS2: Change truncate page allocation to be GFP_NOFS"). Thanks to Dan Carpenter <dan.carpenter@linaro.org> for pointing out a Smatch static checker warning. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:15 +02:00
Minjie Du	5f02d16868	gfs2: increase usage of folio_next_index() helper Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index(). Signed-off-by: Minjie Du <duminjie@vivo.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>	2023-09-05 15:58:15 +02:00
Amir Goldstein	724768a393	ovl: fix incorrect fdput() on aio completion ovl_{read,write}_iter() always call fdput(real) to put one or zero refcounts of the real file, but for aio, whether it was submitted or not, ovl_aio_put() also calls fdput(), which is not balanced. This is only a problem in the less common case when FDPUT_FPUT flag is set. To fix the problem use get_file() to take file refcount and use fput() instead of fdput() in ovl_aio_put(). Fixes: `2406a307ac` ("ovl: implement async IO routines") Cc: <stable@vger.kernel.org> # v5.6 Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2023-09-04 18:27:38 +03:00
Amir Goldstein	ab04830202	ovl: fix failed copyup of fileattr on a symlink Some local filesystems support setting persistent fileattr flags (e.g. FS_NOATIME_FL) on directories and regular files via ioctl. Some of those persistent fileattr flags are reflected to vfs as in-memory inode flags (e.g. S_NOATIME). Overlayfs uses the in-memory inode flags (e.g. S_NOATIME) on a lower file as an indication that a the lower file may have persistent inode fileattr flags (e.g. FS_NOATIME_FL) that need to be copied to upper file. However, in some cases, the S_NOATIME in-memory flag could be a false indication for persistent FS_NOATIME_FL fileattr. For example, with NFS and FUSE lower fs, as was the case in the two bug reports, the S_NOATIME flag is set unconditionally for all inodes. Users cannot set persistent fileattr flags on symlinks and special files, but in some local fs, such as ext4/btrfs/tmpfs, the FS_NOATIME_FL fileattr flag are inheritted to symlinks and special files from parent directory. In both cases described above, when lower symlink has the S_NOATIME flag, overlayfs will try to copy the symlink's fileattrs and fail with error ENOXIO, because it could not open the symlink for the ioctl security hook. To solve this failure, do not attempt to copyup fileattrs for anything other than directories and regular files. Reported-by: Ruiwen Zhao <ruiwen@google.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217850 Fixes: `72db82115d` ("ovl: copy up sync/noatime fileattr flags") Cc: <stable@vger.kernel.org> # v5.15 Reviewed-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Amir Goldstein <amir73il@gmail.com>	2023-09-04 18:27:18 +03:00
Steve French	f5069159f3	ksmbd: remove experimental warning ksmbd has made significant improvements over the past two years and is regularly tested and used. Remove the experimental warning. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-09-03 21:06:36 -05:00
Linus Torvalds	92901222f8	f2fs update for 6.6-rc1 In this cycle, we don't have a highlighted feature enhancement, but mostly have fixed issues mainly in two parts: 1) zoned block device, 2) compression support. For zoned block device, we've tried to improve the power-off recovery flow as much as possible. For compression, we found some corner cases caused by wrong compression policy and logics. Other than them, there were some reverts and stat corrections. Bug fix: - use finish zone command when closing a zone - check zone type before sending async reset zone command - fix to assign compress_level for lz4 correctly - fix error path of f2fs_submit_page_read() - don't {,de}compress non-full cluster - send small discard commands during checkpoint back - flush inode if atomic file is aborted - correct to account gc/cp stats And, there are minor bug fixes, avoiding false lockdep warning, and clean-ups. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmTyemoACgkQQBSofoJI UNLLKQ//bupYGPOqAgKbd/s7FhtULMiiRmFVy7W2eoMIc/oeeXOGrzDAF/1NifLC WLV4uBNVTS4PS8D1vRzxZNEZt9aqPS0vQ8hxW/3nTI9Z425NX3nz7gLSxxmwIkIe xj++V6tvKPcCH0BfKvfFCtcxj09PsflgdEuT8w/sIkH6p4o+VEMFs1Lc9PQsjUmh epznK7JGBwpAxmHqI74n1eAw2CI6W+oKx23YDTNMBD6hmXTU0fkTeBURrOlSsUHZ nhafPecsrCEI+OpAj03G/7e/zt+iTUKdmHx9O5ir/P00vF/c+SU2vSwB97FiHqBi B4UmocTM0MAsU80PQcmE6aU3zgQFI0Yun5yZ24VeWjKTu76ssZSmT2HA/4RL+LLf AeAW4FSyfh76pls8X5IWfilxGLWq6kTzSZA0MF7dH2q7qlj5apL5wKpm/XH6POqn qELY/Y9+P1QuCcNL8BiYrgA5xBqVJ7Uw/6/6U3Y77PElc+Pwl3vI8UZ7uCOBrsXL e0TLXy23AJA6AS2DyLLziy669nXAZRb95B8TWMfEeVZIMFvCeeqYc74N8jOFa0T8 q6uQFZs+0cETLZA8MSZdlNhzvhJmbW6wgSIz++CEdikWSLBZMKWxBVjCPkkCY9uc DMh8zruSVbYPZWBTcxkMFEBJKKrU43++e7pb8ZoqTj4Pq1317b0= =Qa8+ -----END PGP SIGNATURE----- Merge tag 'f2fs-for-6-6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this cycle, we don't have a highlighted feature enhancement, but mostly have fixed issues mainly in two parts: 1) zoned block device, and 2) compression support. For zoned block device, we've tried to improve the power-off recovery flow as much as possible. For compression, we found some corner cases caused by wrong compression policy and logics. Other than them, there were some reverts and stat corrections. Bug fixes: - use finish zone command when closing a zone - check zone type before sending async reset zone command - fix to assign compress_level for lz4 correctly - fix error path of f2fs_submit_page_read() - don't {,de}compress non-full cluster - send small discard commands during checkpoint back - flush inode if atomic file is aborted - correct to account gc/cp stats And, there are minor bug fixes, avoiding false lockdep warning, and clean-ups" * tag 'f2fs-for-6-6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (25 commits) f2fs: use finish zone command when closing a zone f2fs: compress: fix to assign compress_level for lz4 correctly f2fs: fix error path of f2fs_submit_page_read() f2fs: clean up error handling in sanity_check_{compress_,}inode() f2fs: avoid false alarm of circular locking Revert "f2fs: do not issue small discard commands during checkpoint" f2fs: doc: fix description of max_small_discards f2fs: should update REQ_TIME for direct write f2fs: fix to account cp stats correctly f2fs: fix to account gc stats correctly f2fs: remove unneeded check condition in __f2fs_setxattr() f2fs: fix to update i_ctime in __f2fs_setxattr() Revert "f2fs: fix to do sanity check on extent cache correctly" f2fs: increase usage of folio_next_index() helper f2fs: Only lfs mode is allowed with zoned block device feature f2fs: check zone type before sending async reset zone command f2fs: compress: don't {,de}compress non-full cluster f2fs: allow f2fs_ioc_{,de}compress_file to be interrupted f2fs: don't reopen the main block device in f2fs_scan_devices f2fs: fix to avoid mmap vs set_compress_option case ...	2023-09-02 15:37:59 -07:00
Stefan Roesch	8b47933544	proc/ksm: add ksm stats to /proc/pid/smaps With madvise and prctl KSM can be enabled for different VMA's. Once it is enabled we can query how effective KSM is overall. However we cannot easily query if an individual VMA benefits from KSM. This commit adds a KSM section to the /prod/<pid>/smaps file. It reports how many of the pages are KSM pages. Note that KSM-placed zeropages are not included, only actual KSM pages. Here is a typical output: 7f420a000000-7f421a000000 rw-p 00000000 00:00 0 Size: 262144 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 51212 kB Pss: 8276 kB Shared_Clean: 172 kB Shared_Dirty: 42996 kB Private_Clean: 196 kB Private_Dirty: 7848 kB Referenced: 15388 kB Anonymous: 51212 kB KSM: 41376 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB FilePmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 202016 kB SwapPss: 3882 kB Locked: 0 kB THPeligible: 0 ProtectionKey: 0 ksm_state: 0 ksm_skip_base: 0 ksm_skip_count: 0 VmFlags: rd wr mr mw me nr mg anon This information also helps with the following workflow: - First enable KSM for all the VMA's of a process with prctl. - Then analyze with the above smaps report which VMA's benefit the most - Change the application (if possible) to add the corresponding madvise calls for the VMA's that benefit the most [shr@devkernel.io: v5] Link: https://lkml.kernel.org/r/20230823170107.1457915-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20230822180539.1424843-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-09-02 15:17:33 -07:00
Linus Torvalds	82c5561b57	pstore fix for v6.6-rc1 - Adjust sizes of buffers just avoid uncompress failures (Ard Biesheuvel) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTyLIIWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJrdjD/9Vl4t3OmP4ahaYVaRXAC3iTCdr ZB2idXOdGnpDDUywthjEKHUvco9eyHr0ApjnXjRRctRByf+nBN0DnTbokF8Uv+93 Bb4/Aqi5y8yioRMTyZ6zt4ZgYpjwiB0fknpOW2NGSVP0CpgDEE33zRU0x64drMbT XOcOQ41o75NN39BrZ7ccp5jKor/XIYz063/zpqYE78HwTJ9Op9+SQettIgZfdk2/ imTNMpSiqpbMxbTnS5cldhPgpr93kRoItr6CceF/7+bd8azWJsUEe7XXcAmK9RJN InO6XxnySv3eML94kVMBpTdWuXwM3O4BxHRBz1lunQZfD1kES/B74UJ2iF0IfpFe gT6kgGy3rgIuY7rkmcFHjzIZ7zV0tynlfppTWSf/Y1lzZNBFdtX0pXBNweDEDD53 LA7DrKbusRSMx9srIJnVFiGYSndaaTyViPYD1esVxppv2+wS6DWAHGpxLfYmBA2A i5geEblp3W25zAtpg6G1Uu2sTY8f1X54V3W74D9AkalnoMDSyqVz67QS/q+wsYyf jQZ+JgTuJBHOCYzgfRlnuQpkiGlz6EKoNlgJ59H8ls6sAygPYS0gKmdnRGDrjF9R cSmjVBebBcOV0FiGIoRlh03FoQyTG81DXvXpro6IeqKk4miV9PQtmCeMtIwZ9/cq rVxyOHK2u+Q9S0230A== =rf8u -----END PGP SIGNATURE----- Merge tag 'pstore-v6.6-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore fix from Kees Cook: - Adjust sizes of buffers just avoid uncompress failures (Ard Biesheuvel) * tag 'pstore-v6.6-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore: Base compression input buffer size on estimated compressed size	2023-09-02 10:45:17 -07:00
Linus Torvalds	34232fcfe9	Tracing updates for 6.6: User visible changes: - Added a way to easier filter with cpumasks: # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter - Show actual size of ring buffer after modifying the ring buffer size via buffer_size_kb. Currently it just returns what was written, but the actual size rounds up to the sub buffer size. Show that real size instead. Major changes: - Added "eventfs". This is the code that handles the inodes and dentries of tracefs/events directory. As there are thousands of events, and each event has several inodes and dentries that currently exist even when tracing is never used, they take up precious memory. Instead, eventfs will allocate the inodes and dentries in a JIT way (similar to what procfs does). There is now metadata that handles the events and subdirectories, and will create the inodes and dentries when they are used. Note, I also have patches that remove the subdirectory meta data, but will wait till the next merge window before applying them. It's a little more complex, and I want to make sure the dynamic code works properly before adding more complexity, making it easier to revert if need be. Minor changes: - Optimization to user event list traversal. - Remove intermediate permission of tracefs files (note the intermediate permission removes all access to the files so it is not a security concern, but just a clean up.) - Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic. - Other minor clean ups. -----BEGIN PGP SIGNATURE----- iQJIBAABCgAyFiEEXtmkj8VMCiLR0IBM68Js21pW3nMFAmTwtAsUHHJvc3RlZHRA Z29vZG1pcy5vcmcACgkQ68Js21pW3nNOXRAAsslQT6alY4OeplC4x47+V6+6NiIA oDtOmWAqf7TsH9bukzRFD36rUly42O20RJDx9z0Q3iRc3vGxEawId8z6P0HmBwRb VSl5BryWvL5Wc5w94xS8EeCuC1MRfhVDyfbtVFmWigzfvd/f+hp71ViMPHUvrRJX KhzzNSBc4ir5E1lzfwa7meYTXzDwrQlZbYfdf5aH94IWAkqDj85PUZDJ7UmLZhXG CIglSpNFXZ0j19Wo/U6KZlHR1XfunBKungCzJ5Dbznc9YLWZTQXOIZF4YPKfPIJL ulRG9chwXY0nQWhG3xM1UHZLsAMSWw5i13a4ZN4d8FCNOgv8ttcJnfDk7ZYUS0Oz RmY1dGcSRKAZTUTjm8ZBtmyiUCc9kZAIk0fyEfIHtoDYXmhnvni3wuTnbRSdXaSi q4YkxPaLfX8Fn3QloCqqddt8iONu7BnbpZOhUCl2AtBib52gnTTF7+rQ6/0D3rjo SSuvEHhnjJhzk+3jM2odxjmTAztNT+yu6FbKXZUKPt1Kj9YHv1J9cEQw9/Etw+GV 8jQBe979D8hFJmDOJOT/O/TdPqE9mQoMNBt6Y8QnE4nbJWM+i/MBrThFpUSQhRCr 0Ya/HgR2QyRH7RmZW5o2H9mNtN+V9c7RxZW8erYzRbUs0YofK2OpGi9SrPzxWCke w6j0VVZHaxdPguM= =/s+e -----END PGP SIGNATURE----- Merge tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing updates from Steven Rostedt: "User visible changes: - Added a way to easier filter with cpumasks: # echo 'cpumask & CPUS{17-42}' > /sys/kernel/tracing/events/ipi_send_cpumask/filter - Show actual size of ring buffer after modifying the ring buffer size via buffer_size_kb. Currently it just returns what was written, but the actual size rounds up to the sub buffer size. Show that real size instead. Major changes: - Added "eventfs". This is the code that handles the inodes and dentries of tracefs/events directory. As there are thousands of events, and each event has several inodes and dentries that currently exist even when tracing is never used, they take up precious memory. Instead, eventfs will allocate the inodes and dentries in a JIT way (similar to what procfs does). There is now metadata that handles the events and subdirectories, and will create the inodes and dentries when they are used. Note, I also have patches that remove the subdirectory meta data, but will wait till the next merge window before applying them. It's a little more complex, and I want to make sure the dynamic code works properly before adding more complexity, making it easier to revert if need be. Minor changes: - Optimization to user event list traversal - Remove intermediate permission of tracefs files (note the intermediate permission removes all access to the files so it is not a security concern, but just a clean up) - Add the complex fix to FORTIFY_SOURCE to the kernel stack event logic - Other minor cleanups" * tag 'trace-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (29 commits) tracefs: Remove kerneldoc from struct eventfs_file tracefs: Avoid changing i_mode to a temp value tracing/user_events: Optimize safe list traversals ftrace: Remove empty declaration ftrace_enable_daemon() and ftrace_disable_daemon() tracing: Remove unused function declarations tracing/filters: Document cpumask filtering tracing/filters: Further optimise scalar vs cpumask comparison tracing/filters: Optimise CPU vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise scalar vs cpumask filtering when the user mask is a single CPU tracing/filters: Optimise cpumask vs cpumask filtering when user mask is a single CPU tracing/filters: Enable filtering the CPU common field by a cpumask tracing/filters: Enable filtering a scalar field by a cpumask tracing/filters: Enable filtering a cpumask field by another cpumask tracing/filters: Dynamically allocate filter_pred.regex test: ftrace: Fix kprobe test for eventfs eventfs: Move tracing/events to eventfs eventfs: Implement removal of meta data from eventfs eventfs: Implement functions to create files and dirs when accessed eventfs: Implement eventfs lookup, read, open functions eventfs: Implement eventfs file add functions ...	2023-09-01 16:34:25 -07:00
Linus Torvalds	1c9f8dff62	Char/Misc driver changes for 6.6-rc1 Here is the big set of char/misc and other small driver subsystem changes for 6.6-rc1. Stuff all over the place here, lots of driver updates and changes and new additions. Short summary is: - new IIO drivers and updates - Interconnect driver updates - fpga driver updates and additions - fsi driver updates - mei driver updates - coresight driver updates - nvmem driver updates - counter driver updates - lots of smaller misc and char driver updates and additions All of these have been in linux-next for a long time with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZPH64g8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ynr2QCfd3RKeR+WnGzyEOFhksl30UJJhiIAoNZtYT5+ t9KG0iMDXRuTsOqeEQbd =tVnk -----END PGP SIGNATURE----- Merge tag 'char-misc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char/misc driver updates from Greg KH: "Here is the big set of char/misc and other small driver subsystem changes for 6.6-rc1. Stuff all over the place here, lots of driver updates and changes and new additions. Short summary is: - new IIO drivers and updates - Interconnect driver updates - fpga driver updates and additions - fsi driver updates - mei driver updates - coresight driver updates - nvmem driver updates - counter driver updates - lots of smaller misc and char driver updates and additions All of these have been in linux-next for a long time with no reported problems" * tag 'char-misc-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (267 commits) nvmem: core: Notify when a new layout is registered nvmem: core: Do not open-code existing functions nvmem: core: Return NULL when no nvmem layout is found nvmem: core: Create all cells before adding the nvmem device nvmem: u-boot-env:: Replace zero-length array with DECLARE_FLEX_ARRAY() helper nvmem: sec-qfprom: Add Qualcomm secure QFPROM support dt-bindings: nvmem: sec-qfprom: Add bindings for secure qfprom dt-bindings: nvmem: Add compatible for QCM2290 nvmem: Kconfig: Fix typo "drive" -> "driver" nvmem: Explicitly include correct DT includes nvmem: add new NXP QorIQ eFuse driver dt-bindings: nvmem: Add t1023-sfp efuse support dt-bindings: nvmem: qfprom: Add compatible for MSM8226 nvmem: uniphier: Use devm_platform_get_and_ioremap_resource() nvmem: qfprom: do some cleanup nvmem: stm32-romem: Use devm_platform_get_and_ioremap_resource() nvmem: rockchip-efuse: Use devm_platform_get_and_ioremap_resource() nvmem: meson-mx-efuse: Convert to devm_platform_ioremap_resource() nvmem: lpc18xx_otp: Convert to devm_platform_ioremap_resource() nvmem: brcm_nvram: Use devm_platform_get_and_ioremap_resource() ...	2023-09-01 09:53:54 -07:00
Linus Torvalds	28a4f91f5f	Driver core changes for 6.6-rc1 Here is a small set of driver core updates and additions for 6.6-rc1. Included in here are: - stable kernel documentation updates - class structure const work from Ivan on various subsystems - kernfs tweaks - driver core tests! - kobject sanity cleanups - kobject structure reordering to save space - driver core error code handling fixups - other minor driver core cleanups All of these have been in linux-next for a while with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCZPH77Q8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylZMACePk8SitfaJc6FfFf5I7YK7Nq0V8MAn0nUjgsR i8NcNpu/Yv4HGrDgTdh/ =PJbk -----END PGP SIGNATURE----- Merge tag 'driver-core-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is a small set of driver core updates and additions for 6.6-rc1. Included in here are: - stable kernel documentation updates - class structure const work from Ivan on various subsystems - kernfs tweaks - driver core tests! - kobject sanity cleanups - kobject structure reordering to save space - driver core error code handling fixups - other minor driver core cleanups All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (32 commits) driver core: Call in reversed order in device_platform_notify_remove() driver core: Return proper error code when dev_set_name() fails kobject: Remove redundant checks for whether ktype is NULL kobject: Add sanity check for kset->kobj.ktype in kset_register() drivers: base: test: Add missing MODULE_* macros to root device tests drivers: base: test: Add missing MODULE_* macros for platform devices tests drivers: base: Free devm resources when unregistering a device drivers: base: Add basic devm tests for platform devices drivers: base: Add basic devm tests for root devices kernfs: fix missing kernfs_iattr_rwsem locking docs: stable-kernel-rules: mention that regressions must be prevented docs: stable-kernel-rules: fine-tune various details docs: stable-kernel-rules: make the examples for option 1 a proper list docs: stable-kernel-rules: move text around to improve flow docs: stable-kernel-rules: improve structure by changing headlines base/node: Remove duplicated include kernfs: attach uuid for every kernfs and report it in fsid kernfs: add stub helper for kernfs_generic_poll() x86/resctrl: make pseudo_lock_class a static const structure x86/MSR: make msr_class a static const structure ...	2023-09-01 09:43:18 -07:00
Linus Torvalds	e0152e7481	RISC-V Patches for the 6.6 Merge Window, Part 1 * Support for the new "riscv,isa-extensions" and "riscv,isa-base" device tree interfaces for probing extensions. * Support for userspace access to the performance counters. * Support for more instructions in kprobes. * Crash kernels can be allocated above 4GiB. * Support for KCFI. * Support for ELFs in !MMU configurations. * ARCH_KMALLOC_MINALIGN has been reduced to 8. * mmap() defaults to sv48-sized addresses, with longer addresses hidden behind a hint (similar to Arm and Intel). * Also various fixes and cleanups. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEKzw3R0RoQ7JKlDp6LhMZ81+7GIkFAmTx96kTHHBhbG1lckBk YWJiZWx0LmNvbQAKCRAuExnzX7sYiVjRD/9DYVLlkQ/OEDJjPaEcYCP49xgIVUUU lhs3XbSs2VNHBaiG114f6Q0AaT/uNi+uqSej3CeTmEot2kZkBk/f2yu+UNIriPZ9 GQiZsdyXhu921C+5VFtiI47KDWOVZ+Jpy3M1ll61IWt3yPSQHr1xOP0AOiyHHqe3 cmqpNnzjajlfVDoXPc2mGGzUJt/7ar4thcwnMNi98raXR5Qh7SP6rrHjoQhE1oFk LMP3CHqEAcHE2tE4CxZVpc6HOQ5m0LpQIOK7ypufGMyoIYESm5dt/JOT4MlhTtDw 6JzyVKtiM7lartUnUaW3ZoX4trQYT5gbXxWrJ2gCnUGy3VulikoXr1Rpz0qfdeOR XN8OLkVAqHfTGFI7oKk24f9Adw96R5NPZcdCay90h4J/kMfCiC7ZyUUI1XIa5iy1 np5pZCkf8HNcdywML7qcFd5n2O0wchyFnRLFZo6kJP9Ls5cEi6kBx/1jSdTcNgx/ fUKXyoEcriGoQiiwn29+4RZnU69gJV3zqQNLPpuwDQ5F/Q1zHTlrr+dqzezKkzcO dRTV2d2Q4A5vIDXPptzNNLlRQdrc8qxPJ1lxQVkPIU4/mtqczmZBwlyY2u9zwPyS sehJgJZnoAf+jm71NgQAKLck4MUBsMnMogOWunhXkVRCoZlbbkUWX4ECZYwPKsVk W7zVPmLvSM0l5g== =/tXb -----END PGP SIGNATURE----- Merge tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V updates from Palmer Dabbelt: - Support for the new "riscv,isa-extensions" and "riscv,isa-base" device tree interfaces for probing extensions - Support for userspace access to the performance counters - Support for more instructions in kprobes - Crash kernels can be allocated above 4GiB - Support for KCFI - Support for ELFs in !MMU configurations - ARCH_KMALLOC_MINALIGN has been reduced to 8 - mmap() defaults to sv48-sized addresses, with longer addresses hidden behind a hint (similar to Arm and Intel) - Also various fixes and cleanups * tag 'riscv-for-linus-6.6-mw1' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: (51 commits) lib/Kconfig.debug: Restrict DEBUG_INFO_SPLIT for RISC-V riscv: support PREEMPT_DYNAMIC with static keys riscv: Move create_tmp_mapping() to init sections riscv: Mark KASAN tmp* page tables variables as static riscv: mm: use bitmap_zero() API riscv: enable DEBUG_FORCE_FUNCTION_ALIGN_64B riscv: remove redundant mv instructions RISC-V: mm: Document mmap changes RISC-V: mm: Update pgtable comment documentation RISC-V: mm: Add tests for RISC-V mm RISC-V: mm: Restrict address space for sv39,sv48,sv57 riscv: enable DMA_BOUNCE_UNALIGNED_KMALLOC for !dma_coherent riscv: allow kmalloc() caches aligned to the smallest value riscv: support the elf-fdpic binfmt loader binfmt_elf_fdpic: support 64-bit systems riscv: Allow CONFIG_CFI_CLANG to be selected riscv/purgatory: Disable CFI riscv: Add CFI error handling riscv: Add ftrace_stub_graph riscv: Add types to indirectly called assembly functions ...	2023-09-01 08:09:48 -07:00
Linus Torvalds	99d99825fc	NFS CLient Updates for Linux 6.6 New Features: * Enable the NFS v4.2 READ_PLUS operation by default Stable Fixes: * NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info * NFS: Fix a potential data corruption Bugfixes: * Fix various READ_PLUS issues including: * smatch warnings * xdr size calculations * scratch buffer handling * 32bit / highmem xdr page handling * Fix checkpatch errors in file.c * Fix redundant readdir request after an EOF * Fix handling of COPY ERR_OFFLOAD_NO_REQ * Fix assignment of xprtdata.cred Cleanups: * Remove unused xprtrdma function declarations * Clean up an integer overflow check to avoid a warning * Clean up #includes in dns_resolve.c * Clean up nfs4_get_device_info so we don't pass a NULL pointer to __free_page() * Clean up sunrpc TCP socket timeout configuration * Guard against READDIR loops when entry names are too long * Use EXCHID4_FLAG_USE_PNFS_DS for DS servers -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmTwzwYACgkQ18tUv7Cl QOtIhBAA+BOh7MB6yjlyctFxABJiXz2x2Dehxy7Ox15LfnyStqQAUEpk35CXWvjC iNxpZJ486+WrzM76WGEaRbECK9nTQLK1yacR3V1zpnDwHWIJA6VHN6qU4JAfSMu7 XbhWkHWry6d7PXhvqHlaiYvPX2pF39wUzfH+vLlzS2QLIkpT6LnG0zVRJTQvLCmq zE5xD+NCQ1Dpo9VnouuzW7VVfm532hI7GQNrpo0E0vWKgeQD+/fOpDu23MW8A1Ua ZgVMAc7vScgDZH8/20Ze5PH4jAEB4gwEIzjreQlIXr7Tf+mE7qn435lgOuvdMQCW eHhdNriZ2X6HMLhNFFpup8bkRKGCCTooTHC1W66n9CuxIAuVT5DNwBbakpagHSZf J4ho81hEgBfc5zppISVjV6eFK4brM0rF9AliaIw8r/qGcMmO1CILi9tLGiheiJcT LuId7U2sE/vfIa6SiBt7rx37/MkrgLlAgjpk4dCRJQW+gKVBi09sMGnDlgaRvCZz T0WCsK4DgI9q2rScpwJYJbNWbC2Q8qUtYWW9LSRvwhbNdm/VbRnEHWA7eOwqqm8r KkkF4chyoTJqpnF3SjxT/lyFk6GwsD+wXafOmEeuFA6Si3dHDU9i3aUf+cCXhwRI uUzCUHYcCKnv4QVGPuAbIdxMgueNCuLoeWgTClVlqidv7GRyz7Y= =rjmq -----END PGP SIGNATURE----- Merge tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client updates from Anna Schumaker: "New Features: - Enable the NFS v4.2 READ_PLUS operation by default Stable Fixes: - NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info - NFS: Fix a potential data corruption Bugfixes: - Fix various READ_PLUS issues including: - smatch warnings - xdr size calculations - scratch buffer handling - 32bit / highmem xdr page handling - Fix checkpatch errors in file.c - Fix redundant readdir request after an EOF - Fix handling of COPY ERR_OFFLOAD_NO_REQ - Fix assignment of xprtdata.cred Cleanups: - Remove unused xprtrdma function declarations - Clean up an integer overflow check to avoid a warning - Clean up #includes in dns_resolve.c - Clean up nfs4_get_device_info so we don't pass a NULL pointer to __free_page() - Clean up sunrpc TCP socket timeout configuration - Guard against READDIR loops when entry names are too long - Use EXCHID4_FLAG_USE_PNFS_DS for DS servers" * tag 'nfs-for-6.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (22 commits) pNFS: Fix assignment of xprtdata.cred NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQ NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver SUNRPC: Don't override connect timeouts in rpc_clnt_add_xprt() SUNRPC: Allow specification of TCP client connect timeout at setup SUNRPC: Refactor and simplify connect timeout SUNRPC: Set the TCP_SYNCNT to match the socket timeout NFS: Fix a potential data corruption nfs: fix redundant readdir request after get eof nfs/blocklayout: Use the passed in gfp flags filemap: Fix errors in file.c NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info NFS: Move common includes outside ifdef SUNRPC: clean up integer overflow check xprtrdma: Remove unused function declaration rpcrdma_bc_post_recv() NFS: Enable the READ_PLUS operation by default SUNRPC: kmap() the xdr pages during decode NFSv4.2: Rework scratch handling for READ_PLUS (again) ...	2023-08-31 15:36:41 -07:00
Linus Torvalds	f35d170615	NFSD 6.6 Release Notes I'm thrilled to announce that the Linux in-kernel NFS server now offers NFSv4 write delegations. A write delegation enables a client to cache data and metadata for a single file more aggressively, reducing network round trips and server workload. Many thanks to Dai Ngo for contributing this facility, and to Jeff Layton and Neil Brown for reviewing and testing it. This release also sees the removal of all support for DES- and triple-DES-based Kerberos encryption types in the kernel's SunRPC implementation. These encryption types have been deprecated by the Internet community for years and are considered insecure. This change affects both the in-kernel NFS client and server. The server's UDP and TCP socket transports have now fully adopted David Howells' new bio_vec iterator so that no more than one sendmsg() call is needed to transmit each RPC message. In particular, this helps kTLS optimize record boundaries when sending RPC-with-TLS replies, and it takes the server a baby step closer to handling file I/O via folios. We've begun work on overhauling the SunRPC thread scheduler to remove a costly linked-list walk when looking for an idle RPC service thread to wake. The pre-requisites are included in this release. Thanks to Neil Brown for his ongoing work on this improvement. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmTwoa0ACgkQM2qzM29m f5cZvw/8CmFVNC27aMrJEhRRhwwrXbLzUkWh9GCYkG98PHYiLxLTvZ6qELXAax/a UjSgIDSRcWl4z8M/tyBQtgsw7NADr+7XWqEoXR7HZ5pEEC/KNGM0oQWQ92ojjKYy JmHdB02uaDJfcd9ioFTU13cw7q2BQfoe2xLI8yqis2vcVSu92AM7aIw+cvJIpwQB inA3TIIsYTV/gPByXSfEtvmYACadoFiMvfvYwaWhjFS9MdSzFmcVG0Dp3EFIig29 odmWEofcz6uIvUWvUswWEGdoSu7uOKIztSuAI4PlTwaofUaSKG6e5kmtpr3cLERD Uhg2lm5JgqkXBd7QHObNimJ4DtQzFwHmhA08qo8rd/zba75mn/Hr5IF0q3Rxs99J SRYHcAeP8afKn5Ge0yzoTgCNcqhfz8KLRfoCQX49mljr+muNxld4nMklD2KdUwJi XEB512/q3E4nUgopXZiSJYQYAq1CfdR5WpGipZ9X0XK9HZBDF/qhXGtk1YQuNWyj ZxJS3bfBza4oVIvP5/ehjCIQwOvqkcrC5zZGDIgDvw9Q6L3L1wqmVntsdCLCLRcJ jB4IOsj+DECfJ6w2vP2SZ3GeMtFnyuTQjhUTkjPuAKTBBiKo4Tj0o/agpfDYbWZy 1l3a2yH5jqJgkm4MaVh3YHRJGc0ub0ccpIrs3QQ4jvjMLQ/3Gcs= =XGHs -----END PGP SIGNATURE----- Merge tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd updates from Chuck Lever: "I'm thrilled to announce that the Linux in-kernel NFS server now offers NFSv4 write delegations. A write delegation enables a client to cache data and metadata for a single file more aggressively, reducing network round trips and server workload. Many thanks to Dai Ngo for contributing this facility, and to Jeff Layton and Neil Brown for reviewing and testing it. This release also sees the removal of all support for DES- and triple-DES-based Kerberos encryption types in the kernel's SunRPC implementation. These encryption types have been deprecated by the Internet community for years and are considered insecure. This change affects both the in-kernel NFS client and server. The server's UDP and TCP socket transports have now fully adopted David Howells' new bio_vec iterator so that no more than one sendmsg() call is needed to transmit each RPC message. In particular, this helps kTLS optimize record boundaries when sending RPC-with-TLS replies, and it takes the server a baby step closer to handling file I/O via folios. We've begun work on overhauling the SunRPC thread scheduler to remove a costly linked-list walk when looking for an idle RPC service thread to wake. The pre-requisites are included in this release. Thanks to Neil Brown for his ongoing work on this improvement" * tag 'nfsd-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits) Documentation: Add missing documentation for EXPORT_OP flags SUNRPC: Remove unused declaration rpc_modcount() SUNRPC: Remove unused declarations NFSD: da_addr_body field missing in some GETDEVICEINFO replies SUNRPC: Remove return value of svc_pool_wake_idle_thread() SUNRPC: make rqst_should_sleep() idempotent() SUNRPC: Clean up svc_set_num_threads SUNRPC: Count ingress RPC messages per svc_pool SUNRPC: Deduplicate thread wake-up code SUNRPC: Move trace_svc_xprt_enqueue SUNRPC: Add enum svc_auth_status SUNRPC: change svc_xprt::xpt_flags bits to enum SUNRPC: change svc_rqst::rq_flags bits to enum SUNRPC: change svc_pool::sp_flags bits to enum SUNRPC: change cache_head.flags bits to enum SUNRPC: remove timeout arg from svc_recv() SUNRPC: change svc_recv() to return void. SUNRPC: call svc_process() from svc_recv(). nfsd: separate nfsd_last_thread() from nfsd_put() nfsd: Simplify code around svc_exit_thread() call in nfsd() ...	2023-08-31 15:32:18 -07:00
Linus Torvalds	8ae5d298ef	ten ksmbd server fixes -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmTuK5AACgkQiiy9cAdy T1EggQv/RNMaOguGItp1qUgOg9XuHSboLVkCJgwoCTl0cb7VYWFlj8G4IOi37WoQ M3SXVQ/DjxB3eSms2LaIFKFjyVZXzZZ9GFjb+dBGssLWB5Zdk6Ez+IJLOpwNUar1 nwSC0kU/Lqj/gUnmUsmDlmV2Y/14uTeZEh6RiA1IzDMOAEr8KEuakgFzRg5DudYM CgZWfO466A48N/YXGmNNqq+8RVEtKaM3A31NZKgAsm4Lw03+V8JwYK/sSx8HWRBx heb8Goa7AUIbpggtoVnWf6PPzJsWOgELrVzvUYdyj7JD5HzaY0LDVOJ6YYyuRTBP M4n7yZT0mlAFDflHMvydaOKNJS+6HlE94xVPySo/S8uJJ9hHWcMqe8oJov11h6CT a76Q7bMkBBXK6GfEjetIY6qWwhN78M1d/Rf9EJRll+d4vIU1i5gPpCTptvfTqMCc A53bROyc3TPcH7A5PBWK0ecENIJ0S4wQd+7UzspQjXj+dk429CYF0+bksfWhijjf ubEIo9fE =4EB/ -----END PGP SIGNATURE----- Merge tag '6.6-rc-ksmbd-fixes-part1' of git://git.samba.org/ksmbd Pull smb server updates from Steve French: - fix potential overflows in decoding create and in session setup requests - cleanup fixes - compounding fixes, including one for MacOS compounded read requests - session setup error handling fix - fix mode bit bug when applying force_directory_mode and force_create_mode - RDMA (smbdirect) write fix * tag '6.6-rc-ksmbd-fixes-part1' of git://git.samba.org/ksmbd: ksmbd: add missing calling smb2_set_err_rsp() on error ksmbd: replace one-element array with flex-array member in struct smb2_ea_info ksmbd: fix slub overflow in ksmbd_decode_ntlmssp_auth_blob() ksmbd: fix wrong DataOffset validation of create context ksmbd: Fix one kernel-doc comment ksmbd: reduce descriptor size if remaining bytes is less than request size ksmbd: fix `force create mode' and `force directory mode' ksmbd: fix wrong interim response on compound ksmbd: add support for read compound ksmbd: switch to use kmemdup_nul() helper	2023-08-31 15:28:26 -07:00
Linus Torvalds	7e5cd6f697	A few small fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAmTwrykACgkQNqiEXrVA jGSIxg/9EkmkMiFyAFymr1EYVavngY7RsTwG4CpCv3jKCthjrsBi5PN9whJPTfBg FCV3tvzGSk8pIpEADqywuYp9+e+0/gqNwExlyr1+JCatPtXeFpBN8yJN/a2u7zon +CXmcSn6veKuVHWptBdxoQVwjjhznw12psa+kPiGPe/q4uZyFIvVAnDUkEeURV3f dT7yOG6KEMMq7NZis1t2Tf9fuflzYpKOmF7qzTWAGOCXhbJbHWB51wMpFKJSyqP8 kxZQ9GvdjDMnI3V+IbV7WktN07ztGGiJ3SGRNuQFbkL8xCf6KTySgGnieTj8vBod lg/UFEZrd2ZL9f+hUTyWeta+dhEVAAqnUJpMuyfMWBGg1ae4U6IO2t+Q7xM1zGLg qGHfxka9C5tvKToldLsaoFBfW+9+KxCxyrI25FkxSXzJBJWnSaq/IC1/QEbubqiY 2zAD7hh/B8c3rzLIwIfGptRDoeMu8yiWx3I5jISZHZG5Azkui1VqC7slXCpcqhLF 7PoJHZ4hemK2zkPwCjZ914lHuCtePDtvvHkEL5G1tK8kW3e9k1Sk314zck69Oyjw IuXICm14Qu5Pp8QLBrXTzXenoUXKiIwm+GIW7UkIzGRrKaLCMc8YyDvvdp4UoG5H Pg+8Y93P/fvRbRcfm9jk1BWqaUFuIWRyzxQnMv8pN1xxabrgnGQ= =W5Xa -----END PGP SIGNATURE----- Merge tag 'jfs-6.6' of github.com:kleikamp/linux-shaggy Pull jfs updates from Dave Kleikamp: "A few small fixes" * tag 'jfs-6.6' of github.com:kleikamp/linux-shaggy: jfs: validate max amount of blocks before allocation. jfs: remove redundant initialization to pointer ip jfs: fix invalid free of JFS_IP(ipimap)->i_imap in diUnmount FS: JFS: (trivial) Fix grammatical error in extAlloc fs/jfs: prevent double-free in dbUnmount() after failed jfs_remount()	2023-08-31 15:25:01 -07:00
Linus Torvalds	3ef96fcfd5	Many ext4 and jbd2 cleanups and bug fixes for v6.6-rc1. * Cleanups in the ext4 remount code when going to and from read-only * Cleanups in ext4's multiblock allocator * Cleanups in the jbd2 setup/mounting code paths * Performance improvements when appending to a delayed allocation file * Miscenallenous syzbot and other bug fixes -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAmTwqUMACgkQ8vlZVpUN gaMqgwf6Aui6MlrtNJx6CrJt4dxLANQ8G6bsJ2Zr+6QNS1X/GAUrCCyLWWom1dfb OJ/n4/JUCNc9v5yLCTqHOE5ZFTdQItOBJUKXbJYff8EdnR+zCUULpj6bPbEs5BKp U7CiiZ9TIi9S2TWezvIJKIa2VxgPej7CH/HOt8ISh/Msq8nHvcEEJIyOEvVk9odQ LEkiQCsikWaljB7qEOIYo+xgFffMZfttc4zuTkdr/h1I6OWhvQYmlwSnTuAiE7BS BVf3ebD2Dg8TChUMXOsk2d743iZNWf/+yTfbXVu93/uEM9vgF0+HO6EerTK8RMeM yxhshg9z7ccuFjdY/2NYDXe6pEuDKw== =cMIX -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Many ext4 and jbd2 cleanups and bug fixes: - Cleanups in the ext4 remount code when going to and from read-only - Cleanups in ext4's multiblock allocator - Cleanups in the jbd2 setup/mounting code paths - Performance improvements when appending to a delayed allocation file - Miscellaneous syzbot and other bug fixes" * tag 'ext4_for_linus-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (60 commits) ext4: fix slab-use-after-free in ext4_es_insert_extent() libfs: remove redundant checks of s_encoding ext4: remove redundant checks of s_encoding ext4: reject casefold inode flag without casefold feature ext4: use LIST_HEAD() to initialize the list_head in mballoc.c ext4: do not mark inode dirty every time when appending using delalloc ext4: rename s_error_work to s_sb_upd_work ext4: add periodic superblock update check ext4: drop dio overwrite only flag and associated warning ext4: add correct group descriptors and reserved GDT blocks to system zone ext4: remove unused function declaration ext4: mballoc: avoid garbage value from err ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple() ext4: change the type of blocksize in ext4_mb_init_cache() ext4: fix unttached inode after power cut with orphan file feature enabled jbd2: correct the end of the journal recovery scan range ext4: ext4_get_{dev}_journal return proper error value ext4: cleanup ext4_get_dev_journal() and ext4_get_journal() jbd2: jbd2_journal_init_{dev,inode} return proper error return value jbd2: drop useless error tag in jbd2_journal_wipe() ...	2023-08-31 15:18:15 -07:00
Linus Torvalds	659b3613fc	dlm for 6.6 Changes include: - Allow blocking posix lock requests to be interrupted while waiting. This requires a cancel request to be sent to the userspace daemon where posix lock requests are processed across the cluster. - Fix a posix lock patch from the previous cycle in which lock requests from different file systems could be mixed up. - Fix some long standing problems with nfs posix lock cancelation. - Add a new debugfs file for printing queued callbacks. - Stop modifying buffers that have been used to receive a message. - Misc cleanups and some refactoring. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJk8KCgAAoJEDgbc8f8gGmqfk4P/jB4L2qwaamq2mNRxFPXSzpp y5UiNoMG8Mw4OT9vytu2xzmmrYT7d1TvZ4lNcLYjkNYmcyuTZzu8o/kvGwt9gnXC 94DPmGQb0RQY/pZOdTMcIBplXiCSFpooweFOQjiWo7wlwVlYGVcfEIv9xQTNIT2/ m0niBFEWDDbVudbWXXaa4lnvo07RTmSxiHjtxqbkea2jLUgxw9mYOR8C6De3rlJf Uh450Kitktak9tywBZa3yj8Cgy8SbiWNHlNvcV1DI3QE7LKOM5+6qVuwERYYx9lw JbdtEoRr97QFf4w40YrJpAxFBiHCLXAquz3D3cJI8mW0RDqDuGUFU6SfsCfQEza6 Dau6XrtfuumArMn/zViBIase9xkSb36RNFopr2Si6mUoLpPalUPuLr+42qmxZY3c KOvWis4UFq4OiOqZY5gBBS6IKoJ+X4pVnNJswScvKFA2VBLCf9fucKRoEVOAUTbg BoJEwOjBQCoaATbGBHjwdjZ4yX/x/tLN0LsPW202QOMGdfSdeD6Wr+COyS916eVK 8Nk3lcBcU21Nhulf2Ci3Zr6B9nG09UqDRHYfH0LJJX0dq++SBRvQvjI2lcdJ0Dvj We7nVqhcW/r486oS/r8kTXOtctYYMxecoQFYPcVufQAIU8+6YZUD53wui8EyVL/2 3GmejZgMomvGn8D4kNPC =BBCe -----END PGP SIGNATURE----- Merge tag 'dlm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: - Allow blocking posix lock requests to be interrupted while waiting. This requires a cancel request to be sent to the userspace daemon where posix lock requests are processed across the cluster. - Fix a posix lock patch from the previous cycle in which lock requests from different file systems could be mixed up. - Fix some long standing problems with nfs posix lock cancelation. - Add a new debugfs file for printing queued callbacks. - Stop modifying buffers that have been used to receive a message. - Misc cleanups and some refactoring. * tag 'dlm-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: fix plock lookup when using multiple lockspaces fs: dlm: don't use RCOM_NAMES for version detection fs: dlm: create midcomms nodes when configure fs: dlm: constify receive buffer fs: dlm: drop rxbuf manipulation in dlm_recover_master_copy fs: dlm: drop rxbuf manipulation in dlm_copy_master_names fs: dlm: get recovery sequence number as parameter fs: dlm: cleanup lock order fs: dlm: remove clear_members_cb fs: dlm: add plock dev tracepoints fs: dlm: check on plock ops when exit dlm fs: dlm: debugfs for queued callbacks fs: dlm: remove unused processed_nodes fs: dlm: add missing spin_unlock fs: dlm: fix F_CANCELLK to cancel pending request fs: dlm: allow to F_SETLKW getting interrupted fs: dlm: remove twice newline	2023-08-31 15:02:12 -07:00
Linus Torvalds	e7e9423db4	v6.6-vfs.super.fixes.2 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZPBy4AAKCRCRxhvAZXjc ok3jAP9+iZREbmcPgrAUGZOjq7+Gx1kJ297Uw/LKiWmxZeX2NwD/cKyv239YXHBM CB4dCwk3pvBZ8uD4dUonDX3PJYFauAU= =knzN -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.super.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull more superblock follow-on fixes from Christian Brauner: "This contains two more small follow-up fixes for the super work this cycle. I went through all filesystems once more and detected two minor issues that still needed fixing: - Some filesystems support mtd devices (e.g., mount -t jffs2 mtd2 /mnt). The mtd infrastructure uses the sb->s_mtd pointer to find an existing superblock. When the mtd device is put and sb->s_mtd cleared the superblock can still be found fs_supers and so this risks a use-after-free. Add a small patch that aligns mtd with what we did for regular block devices and switch keying to rely on sb->s_dev. (This was tested with mtd devices and jffs2 as xfstests doesn't support mtd devices.) - Switch nfs back to rely on kill_anon_super() so the superblock is removed from the list of active supers before sb->s_fs_info is freed" * tag 'v6.6-vfs.super.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: NFS: switch back to using kill_anon_super mtd: key superblock by device number fs: export sget_dev()	2023-08-31 14:52:20 -07:00
Ard Biesheuvel	9416006239	pstore: Base compression input buffer size on estimated compressed size Commit `1756ddea69` ("pstore: Remove worst-case compression size logic") removed some clunky per-algorithm worst case size estimation routines on the basis that we can always store pstore records uncompressed, and these worst case estimations are about how much the size might inadvertently increase due to encapsulation overhead when the input cannot be compressed at all. So if compression results in a size increase, we just store the original data instead. However, it seems that the original code was misinterpreting these calculations as an estimation of how much uncompressed data might fit into a compressed buffer of a given size, and it was using the results to consume the input data in larger chunks than the pstore record size, relying on the compression to ensure that what ultimately gets stored fits into the available space. One result of this, as observed and reported by Linus, is that upgrading to a newer kernel that includes the given commit may result in pstore decompression errors reported in the kernel log. This is due to the fact that the existing records may unexpectedly decompress to a size that is larger than the pstore record size. Another potential problem caused by this change is that we may underutilize the fixed sized records on pstore backends such as ramoops. And on pstore backends with variable sized records such as EFI, we will end up creating many more entries than before to store the same amount of compressed data. So let's fix both issues, by bringing back the typical case estimation of how much ASCII text captured from the dmesg log might fit into a pstore record of a given size after compression. The original implementation used the computation given below for zlib: switch (size) { /* buffer range for efivars / case 1000 ... 2000: cmpr = 56; break; case 2001 ... 3000: cmpr = 54; break; case 3001 ... 3999: cmpr = 52; break; / buffer range for nvram, erst / case 4000 ... 10000: cmpr = 45; break; default: cmpr = 60; break; } return (size 100) / cmpr; We will use the previous worst-case of 60% for compression. For decompression go extra large (3x) so we make sure there's enough space for anything. While at it, rate limit the error message so we don't flood the log unnecessarily on systems that have accumulated a lot of pstore history. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/20230830212238.135900-1-ardb@kernel.org Co-developed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Kees Cook <keescook@chromium.org>	2023-08-31 13:58:49 -07:00
Linus Torvalds	df57721f9a	Add x86 shadow stack support Convert IBT selftest to asm to fix objtool warning -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEV76QKkVc4xCGURexaDWVMHDJkrAFAmTv1QQACgkQaDWVMHDJ krAUwhAAn6TOwHJK8BSkHeiQhON1nrlP3c5cv0AyZ2NP8RYDrZrSZvhpYBJ6wgKC Cx5CGq5nn9twYsYS3KsktLKDfR3lRdsQ7K9qtyFtYiaeaVKo+7gEKl/K+klwai8/ gninQWHk0zmSCja8Vi77q52WOMkQKapT8+vaON9EVDO8dVEi+CvhAIfPwMafuiwO Rk4X86SzoZu9FP79LcCg9XyGC/XbM2OG9eNUTSCKT40qTTKm5y4gix687NvAlaHR ko5MTsdl0Wfp6Qk0ohT74LnoA2c1g/FluvZIM33ci/2rFpkf9Hw7ip3lUXqn6CPx rKiZ+pVRc0xikVWkraMfIGMJfUd2rhelp8OyoozD7DB7UZw40Q4RW4N5tgq9Fhe9 MQs3p1v9N8xHdRKl365UcOczUxNAmv4u0nV5gY/4FMC6VjldCl2V9fmqYXyzFS4/ Ogg4FSd7c2JyGFKPs+5uXyi+RY2qOX4+nzHOoKD7SY616IYqtgKoz5usxETLwZ6s VtJOmJL0h//z0A7tBliB0zd+SQ5UQQBDC2XouQH2fNX2isJMn0UDmWJGjaHgK6Hh 8jVp6LNqf+CEQS387UxckOyj7fu438hDky1Ggaw4YqowEOhQeqLVO4++x+HITrbp AupXfbJw9h9cMN63Yc0gVxXQ9IMZ+M7UxLtZ3Cd8/PVztNy/clA= =3UUm -----END PGP SIGNATURE----- Merge tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 shadow stack support from Dave Hansen: "This is the long awaited x86 shadow stack support, part of Intel's Control-flow Enforcement Technology (CET). CET consists of two related security features: shadow stacks and indirect branch tracking. This series implements just the shadow stack part of this feature, and just for userspace. The main use case for shadow stack is providing protection against return oriented programming attacks. It works by maintaining a secondary (shadow) stack using a special memory type that has protections against modification. When executing a CALL instruction, the processor pushes the return address to both the normal stack and to the special permission shadow stack. Upon RET, the processor pops the shadow stack copy and compares it to the normal stack copy. For more information, refer to the links below for the earlier versions of this patch set" Link: https://lore.kernel.org/lkml/20220130211838.8382-1-rick.p.edgecombe@intel.com/ Link: https://lore.kernel.org/lkml/20230613001108.3040476-1-rick.p.edgecombe@intel.com/ * tag 'x86_shstk_for_6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (47 commits) x86/shstk: Change order of __user in type x86/ibt: Convert IBT selftest to asm x86/shstk: Don't retry vm_munmap() on -EINTR x86/kbuild: Fix Documentation/ reference x86/shstk: Move arch detail comment out of core mm x86/shstk: Add ARCH_SHSTK_STATUS x86/shstk: Add ARCH_SHSTK_UNLOCK x86: Add PTRACE interface for shadow stack selftests/x86: Add shadow stack test x86/cpufeatures: Enable CET CR4 bit for shadow stack x86/shstk: Wire in shadow stack interface x86: Expose thread features in /proc/$PID/status x86/shstk: Support WRSS for userspace x86/shstk: Introduce map_shadow_stack syscall x86/shstk: Check that signal frame is shadow stack mem x86/shstk: Check that SSP is aligned on sigreturn x86/shstk: Handle signals for shadow stack x86/shstk: Introduce routines modifying shstk x86/shstk: Handle thread shadow stack x86/shstk: Add user-mode shadow stack support ...	2023-08-31 12:20:12 -07:00
Dr. David Alan Gilbert	f48d4d35ad	nls: Hide new NLS_UCS2_UTILS NLS_UCS2_UTILS is an option selected by filesystems that need it, don't expose it to users. Fixes: `089f7f5913` ("fs/smb: Swing unicode common code from smb->NLS") Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-31 12:07:34 -05:00
Steve French	238b351d09	smb3: allow controlling length of time directory entries are cached with dir leases Currently with directory leases we cache directory contents for a fixed period of time (default 30 seconds) but for many workloads this is too short. Allow configuring the maximum amount of time directory entries are cached when a directory lease is held on that directory. Add module load parm "max_dir_cache" For example to set the timeout to 10 minutes you would do: echo 600 > /sys/module/cifs/parameters/dir_cache_timeout or to disable caching directory contents: echo 0 > /sys/module/cifs/parameters/dir_cache_timeout Reviewed-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-31 10:06:36 -05:00
Xiubo Li	ce0d5bd3a6	ceph: make num_fwd and num_retry to __u32 The num_fwd in MClientRequestForward is int32_t, while the num_fwd in ceph_mds_request_head is __u8. This is buggy when the num_fwd is larger than 256 it will always be truncate to 0 again. But the client couldn't recoginize this. This will make them to __u32 instead. Because the old cephs will directly copy the raw memories when decoding the reqeust's head, so we need to make sure this kclient will be compatible with old cephs. For newer cephs they will decode the requests depending the version, which will be much simpler and easier to extend new members. Link: https://tracker.ceph.com/issues/62145 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-31 14:56:27 +02:00
Christoph Hellwig	5069ba84b5	NFS: switch back to using kill_anon_super NFS switch to open coding kill_anon_super in `7b14a21389` ("nfs: don't call bdi_unregister") to avoid the extra bdi_unregister call. At that point bdi_destroy was called in nfs_free_server and thus it required a later freeing of the anon dev_t. But since `0db10944a7` ("nfs: Convert to separately allocated bdi") the bdi has been free implicitly by the sb destruction, so this isn't needed anymore. By not open coding kill_anon_super, nfs now inherits the fix in `dc3216b141` ("super: ensure valid info"), and we remove the only open coded version of kill_anon_super. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230831052940.256193-1-hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-31 12:47:16 +02:00
Christian Brauner	69881be3d9	fs: export sget_dev() They will be used for mtd devices as well. Acked-by: Richard Weinberger <richard@nod.at> Reviewed-by: Jan Kara <jack@suse.cz> Message-Id: <20230829-vfs-super-mtd-v1-1-fecb572e5df3@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-31 12:47:15 +02:00
Katya Orlova	efc0b0bcff	smb: propagate error code of extract_sharename() In addition to the EINVAL, there may be an ENOMEM. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: `70431bfd82` ("cifs: Support fscache indexing rewrite") Signed-off-by: Katya Orlova <e.orlova@ispras.ru> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-30 23:38:49 -05:00
Linus Torvalds	b97d64c722	22 smb3/cifs client fixes and two related changes (for unicode mapping) -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmTvk44ACgkQiiy9cAdy T1GT+wwAkiM+BFVoW0QhLK9ztPptYofGiTKr1AySV09AWos9Fwdhv0aS4LDKhauW PVsORfnFcLdyAHtgX2DlhJHMpLWDz3Z51KWiUSo7AAZjIp/4K0yEarg4WKPtUPN0 PMET2OuqAfIfYLCxSZYFjiGK6xgSJEz+xIhX0qJPRZsyJp50WlFlyZRUfFa+6hXt pguatCVw4qhP9hkdcklCY8rwlFDdWEHj9wD/PB2Qschw4gzxDUMwOJjDgT6PNxjA SAC6J+NQVtMcnASd5pn0+Mbc+vNfKZ0PM+KZcDrJphcBz+arY6Hu57v3/yu2y++L DqRI6QtEwVmHzytM51x5JaWFE0Asj/NsH69LVm4bXVIkkcXBut6lLhrd/KVSP+xN LY4EcYEoufAAaecrQrMO4x2Tm10f+GMi1Fh9NvpLZVRrUXy4rdxLP2aC+q+i3uY3 34FaAbpjQ7NJq2yZTL8xDOdCvi8E3t58DsBv4jA9Y/SYGWYao8Kw0vxhCt0SVZPc HaoMfkxl =CtQO -----END PGP SIGNATURE----- Merge tag '6.6-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6 Pull smb client updates from Steve French: - fixes for excessive stack usage - multichannel reconnect improvements - DFS fix and cleanup patches - move UCS-2 conversion code to fs/nls and update cifs and jfs to use them - cleanup patch for compounding, one to fix confusing function name - inode number collision fix - reparse point fixes (including avoiding an extra unneeded query on symlinks) and a minor cleanup - directory lease (caching) improvement * tag '6.6-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: (24 commits) fs/jfs: Use common ucs2 upper case table fs/smb/client: Use common code in client fs/smb: Swing unicode common code from smb->NLS fs/smb: Remove unicode 'lower' tables SMB3: rename macro CIFS_SERVER_IS_CHAN to avoid confusion [SMB3] send channel sequence number in SMB3 requests after reconnects cifs: update desired access while requesting for directory lease smb: client: reduce stack usage in smb2_query_reparse_point() smb: client: reduce stack usage in smb2_query_info_compound() smb: client: reduce stack usage in smb2_set_ea() smb: client: reduce stack usage in smb_send_rqst() smb: client: reduce stack usage in cifs_demultiplex_thread() smb: client: reduce stack usage in cifs_try_adding_channels() smb: cilent: set reparse mount points as automounts smb: client: query reparse points in older dialects smb: client: do not query reparse points twice on symlinks smb: client: parse reparse point flag in create response smb: client: get rid of dfs code dep in namespace.c smb: client: get rid of dfs naming in automount code smb: client: rename cifs_dfs_ref.c to namespace.c ...	2023-08-30 21:01:40 -07:00
Linus Torvalds	53ea7f624f	New code for 6.6: * Chandan Babu will be taking over as the XFS release manager. He has reviewed all the patches that are in this branch, though I'm signing the branch one last time since I'm still technically maintainer. :P * Create a maintainer entry profile for XFS in which we lay out the various roles that I have played for many years. Aside from release manager, the remaining roles are as yet unfilled. * Start merging online repair -- we now have in-memory pageable memory for staging btrees, a bunch of pending fixes, and we've started the process of refactoring the scrub support code to support more of repair. In particular, reaping of old blocks from damaged structures. * Scrub the realtime summary file. * Fix a bug where scrub's quota iteration only ever returned the root dquot. Oooops. * Fix some typos. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZOQE2AAKCRBKO3ySh0YR pvmZAQDe+KceaVx6Dv2f9ihckeS2dILSpDTo1bh9BeXnt005VwD/ceHTaJxEl8lp u/dixFDkRgp9RYtoTAK2WNiUxYetsAc= =oZN6 -----END PGP SIGNATURE----- Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Chandan Babu: - Chandan Babu will be taking over as the XFS release manager. He has reviewed all the patches that are in this branch, though I'm signing the branch one last time since I'm still technically maintainer. :P - Create a maintainer entry profile for XFS in which we lay out the various roles that I have played for many years. Aside from release manager, the remaining roles are as yet unfilled. - Start merging online repair -- we now have in-memory pageable memory for staging btrees, a bunch of pending fixes, and we've started the process of refactoring the scrub support code to support more of repair. In particular, reaping of old blocks from damaged structures. - Scrub the realtime summary file. - Fix a bug where scrub's quota iteration only ever returned the root dquot. Oooops. - Fix some typos. [ Pull request from Chandan Babu, but signed tag and description from Darrick Wong, thus the first person singular above is Darrick, not Chandan ] * tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (37 commits) fs/xfs: Fix typos in comments xfs: fix dqiterate thinko xfs: don't check reflink iflag state when checking cow fork xfs: simplify returns in xchk_bmap xfs: rewrite xchk_inode_is_allocated to work properly xfs: hide xfs_inode_is_allocated in scrub common code xfs: fix agf_fllast when repairing an empty AGFL xfs: allow userspace to rebuild metadata structures xfs: clear pagf_agflreset when repairing the AGFL xfs: allow the user to cancel repairs before we start writing xfs: don't complain about unfixed metadata when repairs were injected xfs: implement online scrubbing of rtsummary info xfs: always rescan allegedly healthy per-ag metadata after repair xfs: move the realtime summary file scrubber to a separate source file xfs: wrap ilock/iunlock operations on sc->ip xfs: get our own reference to inodes that we want to scrub xfs: track usage statistics of online fsck xfs: improve xfarray quicksort pivot xfs: create scaffolding for creating debugfs entries xfs: cache pages used for xfarray quicksort convergence ...	2023-08-30 12:34:12 -07:00
Linus Torvalds	1500e7e072	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmTvFssACgkQnJ2qBz9k QNl7HggAuY154urYJdh7M+mbKDSywhcK0YT5pNNkcXVpv/t2c073Ce57+ObDCBaS xetyFgH2XlvuAJ4dWmRDwBEzJ0jquKzvYJEMiXAexgy47ctnNPx5kLPsXpt3g+2q pro7sK1b5BmX/zrgOontbJ8/YAwX85XToD4Cv5XyNSx/ex6/zsd5FProfdiY/HAt qAcv7NkNTBbJBEBHhBNQSL2wOj3LzQV1U8v0XEcsBvTUxlX2jH8J4CsuFIotXqCF 37SNvZPk2c04HbaLgyU4Ura69qD0fn4vTMocuCoaf0CN2PL5jblRAwsAO2bfSqJE AxZFq3afI0YV3Y9OrVlzHtSALuiZMQ== =QPEQ -----END PGP SIGNATURE----- Merge tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, quota, and udf updates from Jan Kara: - fixes for possible use-after-free issues with quota when racing with chown - fixes for ext2 crashing when xattr allocation races with another block allocation to the same file from page writeback code - fix for block number overflow in ext2 - marking of reiserfs as obsolete in MAINTAINERS - assorted minor cleanups * tag 'for_v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: Fix kernel-doc warnings ext2: improve consistency of ext2_fsblk_t datatype usage ext2: dump current reservation window info ext2: fix race between setxattr and write back ext2: introduce new flags argument for ext2_new_blocks() ext2: remove ext2_new_block() ext2: fix datatype of block number in ext2_xattr_set2() udf: Drop pointless aops assignment quota: use lockdep_assert_held_write in dquot_load_quota_sb MAINTAINERS: change reiserfs status to obsolete udf: Fix -Wstringop-overflow warnings quota: simplify drop_dquot_ref() quota: fix dqput() to follow the guarantees dquot_srcu should provide quota: add new helper dquot_active() quota: rename dquot_active() to inode_quota_active() quota: factor out dquot_write_dquot() ext2: remove redundant assignment to variable desc and variable best_desc	2023-08-30 12:10:50 -07:00
Linus Torvalds	63580f669d	overlayfs update for 6.6 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE9zuTYTs0RXF+Ke33EVvVyTe/1WoFAmTu0QoACgkQEVvVyTe/ 1WpbzBAAjIZXzhn8KldDpG0muw9JKaSOxM45uhZE1s/2uKsVCyp4k3lubTbxxYO1 S9rUjhF2gSJFOfuSOK/XXEKXyu4MGT7iy7pKswu0k8+AHDDRBksPXJKA/AkhLPUr vX1pU6aWw2OSn1xdhIgY+F4DveyzYQL/CEoUzFyRPxSB0G/yjktRAjdZ2HL4cAvN eVXPyTj0bd4LVj1ITla4uj8DbgivrqmRJbZ9bKnSRE8GXWBriJhV//M2Q3QRno+W 04TtAvyh+klQeqZFVOQ0reZUFZzYBBZZTmqoFiUzTny7oljWl5F0+JfJOHhRGknG LYZCia34+T6TZPhOnZzT/szTDoXVvNJhEf+vBQCqhaCugqJc/2uJdw9CW8ZcDvA9 ZNOMxEbXE4VgGjJ0HM6MoDMUoIEUiNWEnXWEaKyCAfOPqgYwPy+QeDO4JtBPQpRn fwZx7Xpc1FLpTc9feHxzox9o81S8rPRMycUBg2c3KZB6TFnYNDxWIIo365naMCzz A8IDVGf+gd+S4NaZvh9FUijciIslYfyFgqwQERZmJnpDk1d1NyeUC7Nn7EkmUpyp guRaC+rUcqYP4CpuSHTCPle94qHqiAkbsKSJWebZ2M1j9fjZ+okPw0k83Nih79vu vRhs70Ah51v1lpBb0mlDjsV3vKm3Apv8nMJKZvVuC+Cw6Qiob5s= =F4Hi -----END PGP SIGNATURE----- Merge tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs Pull overlayfs updates from Amir Goldstein: - add verification feature needed by composefs (Alexander Larsson) - improve integration of overlayfs and fanotify (Amir Goldstein) - fortify some overlayfs code (Andrea Righi) * tag 'ovl-update-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs: ovl: validate superblock in OVL_FS() ovl: make consistent use of OVL_FS() ovl: Kconfig: introduce CONFIG_OVERLAY_FS_DEBUG ovl: auto generate uuid for new overlay filesystems ovl: store persistent uuid/fsid with uuid=on ovl: add support for unique fsid per instance ovl: support encoding non-decodable file handles ovl: Handle verity during copy-up ovl: Validate verity xattr when resolving lowerdata ovl: Add versioned header for overlay.metacopy xattr ovl: Add framework for verity support	2023-08-30 11:54:09 -07:00
Anna Schumaker	c4a123d2e8	pNFS: Fix assignment of xprtdata.cred The comma at the end of the line was leftover from an earlier refactor of the _nfs4_pnfs_v3_ds_connect() function. This is technically valid C, so the compilers didn't catch it, but if I'm understanding how it works correctly it assigns the return value of rpc_clnt_add_xprtr() to xprtdata.cred. Reported-by: Olga Kornievskaia <kolga@netapp.com> Fixes: `a12f996d34` ("NFSv4/pNFS: Use connections to a DS that are all of the same protocol family") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-30 14:31:31 -04:00
Olga Kornievskaia	5690eed941	NFSv4.2: fix handling of COPY ERR_OFFLOAD_NO_REQ If the client sent a synchronous copy and the server replied with ERR_OFFLOAD_NO_REQ indicating that it wants an asynchronous copy instead, the client should retry with asynchronous copy. Fixes: `539f57b3e0` ("NFS handle COPY ERR_OFFLOAD_NO_REQS") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-30 11:08:27 -04:00
Benjamin Coddington	f67b55b658	NFS: Guard against READDIR loop when entry names exceed MAXNAMELEN Commit `64cfca85ba` asserts the only valid return values for nfs2/3_decode_dirent should not include -ENAMETOOLONG, but for a server that sends a filename3 which exceeds MAXNAMELEN in a READDIR response the client's behavior will be to endlessly retry the operation. We could map -ENAMETOOLONG into -EBADCOOKIE, but that would produce truncated listings without any error. The client should return an error for this case to clearly assert that the server implementation must be corrected. Fixes: `64cfca85ba` ("NFS: Return valid errors from nfs2/3_decode_dirent()") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-30 11:08:27 -04:00
Dr. David Alan Gilbert	f3a9b3758e	fs/jfs: Use common ucs2 upper case table Use the UCS-2 upper case tables from nls, that are shared with smb. This code in JFS is hard to test, so we're only reusing the same tables (which are identical), not trying to reuse the rest of the helper functions. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-30 08:55:52 -05:00
Dr. David Alan Gilbert	de54845290	fs/smb/client: Use common code in client Now we've got the common code, use it for the client as well. Note there's a change here where we're using the server version of UniStrcat now which had different types (__le16 vs wchar_t) but it's not interpreting the value other than checking for 0, however we do need casts to keep sparse happy. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-30 08:55:52 -05:00
Dr. David Alan Gilbert	089f7f5913	fs/smb: Swing unicode common code from smb->NLS Swing most of the inline functions and unicode tables into nls from the copy in smb/server. This is UCS-2 rather than most of the rest of the code in NLS, but it currently seems like the best place for it. The actual unicode.c implementations vary much more between server and client so they're unmoved. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-30 08:55:51 -05:00
Dr. David Alan Gilbert	9e74938954	fs/smb: Remove unicode 'lower' tables The unicode glue in smb//..uniupr.h has a section guarded by 'ifndef UNIUPR_NOLOWER' - but that's always defined in smb//..unicode.h. Nuke those tables. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-30 08:55:51 -05:00
Steve French	b3773b19d4	SMB3: rename macro CIFS_SERVER_IS_CHAN to avoid confusion Since older dialects such as CIFS do not support multichannel the macro CIFS_SERVER_IS_CHAN can be confusing (it requires SMB 3 or later) so shorten its name to "SERVER_IS_CHAN" Suggested-by: Tom Talpey <tom@talpey.com> Acked-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-30 08:55:02 -05:00
Linus Torvalds	3d3dfeb3ae	for-6.6/block-2023-08-28 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmTs08EQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqa4EACu/zKE+omGXBV0Q7kEpVsChjp0ElGtSDIJ tJfTuvnWqQjrqRv4ksmZvGdx8SkqFuXri4/7oBXlsaqeUVbIQdWJUpLErBye6nxa lUb6nXOFWwyG94cMRYs71lN0loosjb7aiVw7oVLAIhntq3p3doFl/cyy3ndMZrUE pZbsrWSt4QiOKhcO0TtIjfAwsr31AN51qFiNNITEiZl3UjXfkGRCK81X0yM2N8zZ 7Y0h1ldPBsZ/olNWeRyaW1uB64nKM0buR7/nDxCV/NI05nndJ34bIgo/JIj4xy0v SiBj2+y86+oMJZt17yYENwOQdtX3hbyESGuVm9dCrO0t9/byVQxkUk0OMm65BM/l l2d+gmMQZTbHziqfLlgq9i3i9+B4C2hsb7iBpuo7SW/FPbM45POgi3lpiZycaZyu krQo1qwL4KSGXzGN9CabEuKDcJcXqLxqMDOyEDA3R5Kz06V9tNuM+Di/mr4vuZHK sVHUfHuWBO9ionLlGPdc3fH/CuMqic8SHjumiAm2menBZV6cSzRDxpm6H4CyLt7y tWmw7BNU7dfHFGd+Jw0Ld49sAuEybszEXq6qYv5uYBVfJNqDvOvEeVoQp0RN2jJA AG30hymcZgxn9n7gkIgkPQDgIGUjnzUR8B2mE2UFU1CYVHXYXAXU55CCI5oeTkbs d0Y/zCZf1A== =p1bd -----END PGP SIGNATURE----- Merge tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: "Pretty quiet round for this release. This contains: - Add support for zoned storage to ublk (Andreas, Ming) - Series improving performance for drivers that mark themselves as needing a blocking context for issue (Bart) - Cleanup the flush logic (Chengming) - sed opal keyring support (Greg) - Fixes and improvements to the integrity support (Jinyoung) - Add some exports for bcachefs that we can hopefully delete again in the future (Kent) - deadline throttling fix (Zhiguo) - Series allowing building the kernel without buffer_head support (Christoph) - Sanitize the bio page adding flow (Christoph) - Write back cache fixes (Christoph) - MD updates via Song: - Fix perf regression for raid0 large sequential writes (Jan) - Fix split bio iostat for raid0 (David) - Various raid1 fixes (Heinz, Xueshi) - raid6test build fixes (WANG) - Deprecate bitmap file support (Christoph) - Fix deadlock with md sync thread (Yu) - Refactor md io accounting (Yu) - Various non-urgent fixes (Li, Yu, Jack) - Various fixes and cleanups (Arnd, Azeem, Chengming, Damien, Li, Ming, Nitesh, Ruan, Tejun, Thomas, Xu)" * tag 'for-6.6/block-2023-08-28' of git://git.kernel.dk/linux: (113 commits) block: use strscpy() to instead of strncpy() block: sed-opal: keyring support for SED keys block: sed-opal: Implement IOC_OPAL_REVERT_LSP block: sed-opal: Implement IOC_OPAL_DISCOVERY blk-mq: prealloc tags when increase tagset nr_hw_queues blk-mq: delete redundant tagset map update when fallback blk-mq: fix tags leak when shrink nr_hw_queues ublk: zoned: support REQ_OP_ZONE_RESET_ALL md: raid0: account for split bio in iostat accounting md/raid0: Fix performance regression for large sequential writes md/raid0: Factor out helper for mapping and submitting a bio md raid1: allow writebehind to work on any leg device set WriteMostly md/raid1: hold the barrier until handle_read_error() finishes md/raid1: free the r1bio before waiting for blocked rdev md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init drivers/rnbd: restore sysfs interface to rnbd-client md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid() raid6: test: only check for Altivec if building on powerpc hosts raid6: test: make sure all intermediate and artifact files are .gitignored ...	2023-08-29 20:21:42 -07:00
Linus Torvalds	adfd671676	sysctl-6.6-rc1 Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and placings sysctls to their own sybsystem or file to help avoid merge conflicts. Matthew Wilcox pointed out though that if we're going to do that we might as well also save space while at it and try to remove the extra last sysctl entry added at the end of each array, a sentintel, instead of bloating the kernel by adding a new sentinel with each array moved. Doing that was not so trivial, and has required slowing down the moves of kernel/sysctl.c arrays and measuring the impact on size by each new move. The complex part of the effort to help reduce the size of each sysctl is being done by the patient work of el señor Don Joel Granados. A lot of this is truly painful code refactoring and testing and then trying to measure the savings of each move and removing the sentinels. Although Joel already has code which does most of this work, experience with sysctl moves in the past shows is we need to be careful due to the slew of odd build failures that are possible due to the amount of random Kconfig options sysctls use. To that end Joel's work is split by first addressing the major housekeeping needed to remove the sentinels, which is part of this merge request. The rest of the work to actually remove the sentinels will be done later in future kernel releases. At first I was only going to send his first 7 patches of his patch series, posted 1 month ago, but in retrospect due to the testing the changes have received in linux-next and the minor changes they make this goes with the entire set of patches Joel had planned: just sysctl house keeping. There are networking changes but these are part of the house keeping too. The preliminary math is showing this will all help reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array where we are able to remove each sentinel in the future. That also means there is no more bloating the kernel with the extra ~64 bytes per array moved as no new sentinels are created. Most of this has been in linux-next for about a month, the last 7 patches took a minor refresh 2 week ago based on feedback. -----BEGIN PGP SIGNATURE----- iQJGBAABCgAwFiEENnNq2KuOejlQLZofziMdCjCSiKcFAmTuVnMSHG1jZ3JvZkBr ZXJuZWwub3JnAAoJEM4jHQowkoinIckP/imvRlfkO6L0IP7MmJBRPtwY01rsTAKO Q14dZ//bG4DVQeGl1FdzrF6hhuLgekU0qW1YDFIWiCXO7CbaxaNBPSUkeW6ReVoC R/VHNUPxSR1PWQy1OTJV2t4XKri2sB7ijmUsfsATtISwhei9bggTHEysShtP4tv+ U87DzhoqMnbYIsfMo49KCqOa1Qm7TmjC1a7WAp6Fph3GJuXAzZR5pXpsd0NtOZ9x Ud5RT22icnQpMl7K+yPsqY6XcS5JkgBe/WbSzMAUkYZvBZFBq9t2D+OW5h9TZMhw piJWQ9X0Rm7qI2D15mJfXwaOhhyDhWci391hzdJmS6DI0prf6Ma2NFdAWOt/zomI uiRujS4bGeBUaK5F4TX2WQ1+jdMtAZ+0FncFnzt4U8q7dzUc91uVCm6iHW3gcfAb N7OEg2ZL0gkkgCZHqKxN8wpNQiC2KwnNk+HLAbnL2a/oJYfBtdopQmlxWfrN2hpF xxROiENqk483BRdMXDq6DR/gyDZmZWCobXIglSzlqCOjCOcLbDziIJ7pJk83ok09 h/QnXTYHf9protBq9OIQesgh2pwNzBBLifK84KZLKcb7IbdIKjpQrW5STp04oNGf wcGJzEz8tXUe0UKyMM47AcHQGzIy6cdXNLjyF8a+m7rnZzr1ndnMqZyRStZzuQin AUg2VWHKPmW9 =sq2p -----END PGP SIGNATURE----- Merge tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux Pull sysctl updates from Luis Chamberlain: "Long ago we set out to remove the kitchen sink on kernel/sysctl.c arrays and placings sysctls to their own sybsystem or file to help avoid merge conflicts. Matthew Wilcox pointed out though that if we're going to do that we might as well also save space while at it and try to remove the extra last sysctl entry added at the end of each array, a sentintel, instead of bloating the kernel by adding a new sentinel with each array moved. Doing that was not so trivial, and has required slowing down the moves of kernel/sysctl.c arrays and measuring the impact on size by each new move. The complex part of the effort to help reduce the size of each sysctl is being done by the patient work of el señor Don Joel Granados. A lot of this is truly painful code refactoring and testing and then trying to measure the savings of each move and removing the sentinels. Although Joel already has code which does most of this work, experience with sysctl moves in the past shows is we need to be careful due to the slew of odd build failures that are possible due to the amount of random Kconfig options sysctls use. To that end Joel's work is split by first addressing the major housekeeping needed to remove the sentinels, which is part of this merge request. The rest of the work to actually remove the sentinels will be done later in future kernel releases. The preliminary math is showing this will all help reduce the overall build time size of the kernel and run time memory consumed by the kernel by about ~64 bytes per array where we are able to remove each sentinel in the future. That also means there is no more bloating the kernel with the extra ~64 bytes per array moved as no new sentinels are created" * tag 'sysctl-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux: sysctl: Use ctl_table_size as stopping criteria for list macro sysctl: SIZE_MAX->ARRAY_SIZE in register_net_sysctl vrf: Update to register_net_sysctl_sz networking: Update to register_net_sysctl_sz netfilter: Update to register_net_sysctl_sz ax.25: Update to register_net_sysctl_sz sysctl: Add size to register_net_sysctl function sysctl: Add size arg to __register_sysctl_init sysctl: Add size to register_sysctl sysctl: Add a size arg to __register_sysctl_table sysctl: Add size argument to init_header sysctl: Add ctl_table_size to ctl_table_header sysctl: Use ctl_table_header in list_for_each_table_entry sysctl: Prefer ctl_table_header in proc_sysctl	2023-08-29 17:39:15 -07:00
Linus Torvalds	d68b4b6f30	- An extensive rework of kexec and crash Kconfig from Eric DeVolder ("refactor Kconfig to consolidate KEXEC and CRASH options"). - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a couple of macros to args.h"). - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper commands"). - vsprintf inclusion rationalization from Andy Shevchenko ("lib/vsprintf: Rework header inclusions"). - Switch the handling of kdump from a udev scheme to in-kernel handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory hot un/plug"). - Many singleton patches to various parts of the tree -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO2GpAAKCRDdBJ7gKXxA juW3AQD1moHzlSN6x9I3tjm5TWWNYFoFL8af7wXDJspp/DWH/AD/TO0XlWWhhbYy QHy7lL0Syha38kKLMXTM+bN6YQHi9AU= =WJQa -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - An extensive rework of kexec and crash Kconfig from Eric DeVolder ("refactor Kconfig to consolidate KEXEC and CRASH options") - kernel.h slimming work from Andy Shevchenko ("kernel.h: Split out a couple of macros to args.h") - gdb feature work from Kuan-Ying Lee ("Add GDB memory helper commands") - vsprintf inclusion rationalization from Andy Shevchenko ("lib/vsprintf: Rework header inclusions") - Switch the handling of kdump from a udev scheme to in-kernel handling, by Eric DeVolder ("crash: Kernel handling of CPU and memory hot un/plug") - Many singleton patches to various parts of the tree * tag 'mm-nonmm-stable-2023-08-28-22-48' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (81 commits) document while_each_thread(), change first_tid() to use for_each_thread() drivers/char/mem.c: shrink character device's devlist[] array x86/crash: optimize CPU changes crash: change crash_prepare_elf64_headers() to for_each_possible_cpu() crash: hotplug support for kexec_load() x86/crash: add x86 crash hotplug support crash: memory and CPU hotplug sysfs attributes kexec: exclude elfcorehdr from the segment digest crash: add generic infrastructure for crash hotplug support crash: move a few code bits to setup support of crash hotplug kstrtox: consistently use _tolower() kill do_each_thread() nilfs2: fix WARNING in mark_buffer_dirty due to discarded buffer reuse scripts/bloat-o-meter: count weak symbol sizes treewide: drop CONFIG_EMBEDDED lockdep: fix static memory detection even more lib/vsprintf: declare no_hash_pointers in sprintf.h lib/vsprintf: split out sprintf() and friends kernel/fork: stop playing lockless games for exe_file replacement adfs: delete unused "union adfs_dirtail" definition ...	2023-08-29 14:53:51 -07:00
Chuck Lever	6372e2ee62	NFSD: da_addr_body field missing in some GETDEVICEINFO replies The XDR specification in RFC 8881 looks like this: struct device_addr4 { layouttype4 da_layout_type; opaque da_addr_body<>; }; struct GETDEVICEINFO4resok { device_addr4 gdir_device_addr; bitmap4 gdir_notification; }; union GETDEVICEINFO4res switch (nfsstat4 gdir_status) { case NFS4_OK: GETDEVICEINFO4resok gdir_resok4; case NFS4ERR_TOOSMALL: count4 gdir_mincount; default: void; }; Looking at nfsd4_encode_getdeviceinfo() .... When the client provides a zero gd_maxcount, then the Linux NFS server implementation encodes the da_layout_type field and then skips the da_addr_body field completely, proceeding directly to encode gdir_notification field. There does not appear to be an option in the specification to skip encoding da_addr_body. Moreover, Section 18.40.3 says: > If the client wants to just update or turn off notifications, it > MAY send a GETDEVICEINFO operation with gdia_maxcount set to zero. > In that event, if the device ID is valid, the reply's da_addr_body > field of the gdir_device_addr field will be of zero length. Since the layout drivers are responsible for encoding the da_addr_body field, put this fix inside the ->encode_getdeviceinfo methods. Fixes: `9cf514ccfa` ("nfsd: implement pNFS operations") Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Tom Haynes <loghyr@gmail.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Chuck Lever	78c542f916	SUNRPC: Add enum svc_auth_status In addition to the benefits of using an enum rather than a set of macros, we now have a named type that can improve static type checking of function return values. As part of this change, I removed a stale comment from svcauth.h; the return values from current implementations of the auth_ops::release method are all zero/negative errno, not the SVC_OK enum values as the old comment suggested. Suggested-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
NeilBrown	c743b4259c	SUNRPC: remove timeout arg from svc_recv() Most svc threads have no interest in a timeout. nfsd sets it to 1 hour, but this is a wart of no significance. lockd uses the timeout so that it can call nlmsvc_retry_blocked(). It also sometimes calls svc_wake_up() to ensure this is called. So change lockd to be consistent and always use svc_wake_up() to trigger nlmsvc_retry_blocked() - using a timer instead of a timeout to svc_recv(). And change svc_recv() to not take a timeout arg. This makes the sp_threads_timedout counter always zero. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
NeilBrown	7b719e2bf3	SUNRPC: change svc_recv() to return void. svc_recv() currently returns a 0 on success or one of two errors: - -EAGAIN means no message was successfully received - -EINTR means the thread has been told to stop Previously nfsd would stop as the result of a signal as well as following kthread_stop(). In that case the difference was useful: EINTR means stop unconditionally. EAGAIN means stop if kthread_should_stop(), continue otherwise. Now threads only exit when kthread_should_stop() so we don't need the distinction. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
NeilBrown	f78116d3bf	SUNRPC: call svc_process() from svc_recv(). All callers of svc_recv() go on to call svc_process() on success. Simplify callers by having svc_recv() do that for them. This loses one call to validate_process_creds() in nfsd. That was debugging code added 14 years ago. I don't think we need to keep it. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
NeilBrown	9f28a971ee	nfsd: separate nfsd_last_thread() from nfsd_put() Now that the last nfsd thread is stopped by an explicit act of calling svc_set_num_threads() with a count of zero, we only have a limited number of places that can happen, and don't need to call nfsd_last_thread() in nfsd_put() So separate that out and call it at the two places where the number of threads is set to zero. Move the clearing of ->nfsd_serv and the call to svc_xprt_destroy_all() into nfsd_last_thread(), as they are really part of the same action. nfsd_put() is now a thin wrapper around svc_put(), so make it a static inline. nfsd_put() cannot be called after nfsd_last_thread(), so in a couple of places we have to use svc_put() instead. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
NeilBrown	18e4cf9155	nfsd: Simplify code around svc_exit_thread() call in nfsd() Previously a thread could exit asynchronously (due to a signal) so some care was needed to hold nfsd_mutex over the last svc_put() call. Now a thread can only exit when svc_set_num_threads() is called, and this is always called under nfsd_mutex. So no care is needed. Not only is the mutex held when a thread exits now, but the svc refcount is elevated, so the svc_put() in svc_exit_thread() will never be a final put, so the mutex isn't even needed at this point in the code. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
NeilBrown	3903902401	nfsd: don't allow nfsd threads to be signalled. The original implementation of nfsd used signals to stop threads during shutdown. In Linux 2.3.46pre5 nfsd gained the ability to shutdown threads internally it if was asked to run "0" threads. After this user-space transitioned to using "rpc.nfsd 0" to stop nfsd and sending signals to threads was no longer an important part of the API. In commit `3ebdbe5203` ("SUNRPC: discard svo_setup and rename svc_set_num_threads_sync()") (v5.17-rc1~75^2~41) we finally removed the use of signals for stopping threads, using kthread_stop() instead. This patch makes the "obvious" next step and removes the ability to signal nfsd threads - or any svc threads. nfsd stops allowing signals and we don't check for their delivery any more. This will allow for some simplification in later patches. A change worth noting is in nfsd4_ssc_setup_dul(). There was previously a signal_pending() check which would only succeed when the thread was being shut down. It should really have tested kthread_should_stop() as well. Now it just does the latter, not the former. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
NeilBrown	8db14cad28	lockd: remove SIGKILL handling lockd allows SIGKILL and responds by dropping all locks and restarting the grace period. This functionality has been present since 2.1.32 when lockd was added to Linux. This functionality is undocumented and most likely added as a useful debug aid. When there is a need to drop locks, the better approach is to use /proc/fs/nfsd/unlock_*. This patch removes SIGKILL handling as part of preparation for removing all signal handling from sunrpc service threads. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Su Hui	de8d38cf44	fs: lockd: avoid possible wrong NULL parameter clang's static analysis warning: fs/lockd/mon.c: line 293, column 2: Null pointer passed as 2nd argument to memory copy function. Assuming 'hostname' is NULL and calling 'nsm_create_handle()', this will pass NULL as 2nd argument to memory copy function 'memcpy()'. So return NULL if 'hostname' is invalid. Fixes: `77a3ef33e2` ("NSM: More clean up of nsm_get_handle()") Signed-off-by: Su Hui <suhui@nfschina.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Zhu Wang	7afdc0c902	exportfs: remove kernel-doc warnings in exportfs Remove kernel-doc warning in exportfs: fs/exportfs/expfs.c:395: warning: Function parameter or member 'parent' not described in 'exportfs_encode_inode_fh' Signed-off-by: Zhu Wang <wangzhu9@huawei.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Jeff Layton	d424797032	nfsd: inherit required unset default acls from effective set A well-formed NFSv4 ACL will always contain OWNER@/GROUP@/EVERYONE@ ACEs, but there is no requirement for inheritable entries for those entities. POSIX ACLs must always have owner/group/other entries, even for a default ACL. nfsd builds the default ACL from inheritable ACEs, but the current code just leaves any unspecified ACEs zeroed out. The result is that adding a default user or group ACE to an inode can leave it with unwanted deny entries. For instance, a newly created directory with no acl will look something like this: # NFSv4 translation by server A::OWNER@:rwaDxtTcCy A::GROUP@:rxtcy A::EVERYONE@:rxtcy # POSIX ACL of underlying file user::rwx group::r-x other::r-x ...if I then add new v4 ACE: nfs4_setfacl -a A:fd:1000:rwx /mnt/local/test ...I end up with a result like this today: user::rwx user:1000:rwx group::r-x mask::rwx other::r-x default:user::--- default:user:1000:rwx default:group::--- default😷:rwx default:other::--- A::OWNER@:rwaDxtTcCy A::1000:rwaDxtcy A::GROUP@:rxtcy A::EVERYONE@:rxtcy D:fdi:OWNER@:rwaDx A:fdi:OWNER@:tTcCy A:fdi:1000:rwaDxtcy A:fdi:GROUP@:tcy A:fdi:EVERYONE@:tcy ...which is not at all expected. Adding a single inheritable allow ACE should not result in everyone else losing access. The setfacl command solves a silimar issue by copying owner/group/other entries from the effective ACL when none of them are set: "If a Default ACL entry is created, and the Default ACL contains no owner, owning group, or others entry, a copy of the ACL owner, owning group, or others entry is added to the Default ACL. Having nfsd do the same provides a more sane result (with no deny ACEs in the resulting set): user::rwx user:1000:rwx group::r-x mask::rwx other::r-x default:user::rwx default:user:1000:rwx default:group::r-x default😷:rwx default:other::r-x A::OWNER@:rwaDxtTcCy A::1000:rwaDxtcy A::GROUP@:rxtcy A::EVERYONE@:rxtcy A:fdi:OWNER@:rwaDxtTcCy A:fdi:1000:rwaDxtcy A:fdi:GROUP@:rxtcy A:fdi:EVERYONE@:rxtcy Reported-by: Ondrej Valousek <ondrej.valousek@diasemi.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2136452 Suggested-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Alexander Aring	be2be5f7f4	lockd: nlm_blocked list race fixes This patch fixes races when lockd accesses the global nlm_blocked list. It was mostly safe to access the list because everything was accessed from the lockd kernel thread context but there exist cases like nlmsvc_grant_deferred() that could manipulate the nlm_blocked list and it can be called from any context. Signed-off-by: Alexander Aring <aahringo@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Jeff Layton	f2b7019d2e	nfsd: set missing after_change as before_change + 1 In the event that we can't fetch post_op_attr attributes, we still need to set a value for the after_change. The operation has already happened, so we're not able to return an error at that point, but we do want to ensure that the client knows that its cache should be invalidated. If we weren't able to fetch post-op attrs, then just set the after_change to before_change + 1. The atomic flag should already be clear in this case. Suggested-by: Neil Brown <neilb@suse.de> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Jeff Layton	976626073a	nfsd: remove unsafe BUG_ON from set_change_info At one time, nfsd would scrape inode information directly out of struct inode in order to populate the change_info4. At that time, the BUG_ON in set_change_info made some sense, since having it unset meant a coding error. More recently, it calls vfs_getattr to get this information, which can fail. If that fails, fh_pre_saved can end up not being set. While this situation is unfortunate, we don't need to crash the box. Move set_change_info to nfs4proc.c since all of the callers are there. Revise the condition for setting "atomic" to also check for fh_pre_saved. Drop the BUG_ON and just have it zero out both change_attr4s when this occurs. Reported-by: Boyang Xue <bxue@redhat.com> Closes: https://bugzilla.redhat.com/show_bug.cgi?id=2223560 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Jeff Layton	a332018a91	nfsd: handle failure to collect pre/post-op attrs more sanely Collecting pre_op_attrs can fail, in which case it's probably best to fail the whole operation. Change fh_fill_pre_attrs and fh_fill_both_attrs to return __be32, and have the callers check the return code and abort the operation if it's not nfs_ok. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Jeff Layton	5865bafa19	nfsd: add a MODULE_DESCRIPTION I got this today from modpost: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/nfsd/nfsd.o Add a module description. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Chuck Lever	e7421ce714	NFSD: Rename struct svc_cacherep The svc_ prefix is identified with the SunRPC layer. Although the duplicate reply cache caches RPC replies, it is only for the NFS protocol. Rename the struct to better reflect its purpose. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Chuck Lever	cb18eca4b8	NFSD: Remove svc_rqst::rq_cacherep Over time I'd like to see NFS-specific fields moved out of struct svc_rqst, which is an RPC layer object. These fields are layering violations. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Chuck Lever	c135e1269f	NFSD: Refactor the duplicate reply cache shrinker Avoid holding the bucket lock while freeing cache entries. This change also caps the number of entries that are freed when the shrinker calls to reduce the shrinker's impact on the cache's effectiveness. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Chuck Lever	a9507f6af1	NFSD: Replace nfsd_prune_bucket() Enable nfsd_prune_bucket() to drop the bucket lock while calling kfree(). Use the same pattern that Jeff recently introduced in the NFSD filecache. A few percpu operations are moved outside the lock since they temporarily disable local IRQs which is expensive and does not need to be done while the lock is held. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Chuck Lever	ff0d169329	NFSD: Rename nfsd_reply_cache_alloc() For readability, rename to match the other helpers. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Chuck Lever	35308e7f0f	NFSD: Refactor nfsd_reply_cache_free_locked() To reduce contention on the bucket locks, we must avoid calling kfree() while each bucket lock is held. Start by refactoring nfsd_reply_cache_free_locked() into a helper that removes an entry from the bucket (and must therefore run under the lock) and a second helper that frees the entry (which does not need to hold the lock). For readability, rename the helpers nfsd_cacherep_<verb>. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Dai Ngo	1d3dd1d56c	NFSD: Enable write delegation support This patch grants write delegations for OPEN with NFS4_SHARE_ACCESS_WRITE if there is no conflict with other OPENs. Write delegation conflicts with another OPEN, REMOVE, RENAME and SETATTR are handled the same as read delegation using notify_change, try_break_deleg. The NFSv4.0 protocol does not enable a server to determine that a conflicting GETATTR originated from the client holding the delegation versus coming from some other client. With NFSv4.1 and later, the SEQUENCE operation that begins each COMPOUND contains a client ID, so delegation recall can be safely squelched in this case. With NFSv4.0, however, the server must recall or send a CB_GETATTR (per RFC 7530 Section 16.7.5) even when the GETATTR originates from the client holding that delegation. An NFSv4.0 client can trigger a pathological situation if it always sends a DELEGRETURN preceded by a conflicting GETATTR in the same COMPOUND. COMPOUND execution will always stop at the GETATTR and the DELEGRETURN will never get executed. The server eventually revokes the delegation, which can result in loss of open or lock state. Tracepoint added to track whether read or write delegation is granted. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Chuck Lever	50bce06f0e	NFSD: Report zero space limit for write delegations Replace the -1 (no limit) with a zero (no reserved space). This prevents certain non-determinant client behavior, such as silly-renaming a file when the only open reference is a write delegation. Such a rename can leave unexpected .nfs files in a directory that is otherwise supposed to be empty. Note that other server implementations that support write delegation also set this field to zero. Suggested-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Dai Ngo	fd19ca36fd	NFSD: handle GETATTR conflict with write delegation If the GETATTR request on a file that has write delegation in effect and the request attributes include the change info and size attribute then the write delegation is recalled. If the delegation is returned within 30ms then the GETATTR is serviced as normal otherwise the NFS4ERR_DELAY error is returned for the GETATTR. Add counter for write delegation recall due to conflict GETATTR. This is used to evaluate the need to implement CB_GETATTR to adoid recalling the delegation with conflit GETATTR. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Dai Ngo	d67cd907cf	locks: allow support for write delegation Remove the check for F_WRLCK in generic_add_lease to allow file_lock to be used for write delegation. First consumer is NFSD. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-29 17:45:22 -04:00
Linus Torvalds	b96a3e9142	- Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZO1JUQAKCRDdBJ7gKXxA jrMwAP47r/fS8vAVT3zp/7fXmxaJYTK27CTAM881Gw1SDhFM/wEAv8o84mDenCg6 Nfio7afS1ncD+hPYT8947UnLxTgn+ww= =Afws -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list") - Peter Xu has a series (mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages. - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()"). - Sergey Senozhatsky has improved te performance of zsmalloc during compaction (zsmalloc: small compaction improvements"). - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program"). - xu xin has doe some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages"). - Jeff Xu has fixes the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED"). - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache"). - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD"). - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check"). - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup"). - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU"). - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages"). - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check"). - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio"). - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext"). - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way"). - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration"). - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree"). - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade"). - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64"). - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction"). - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock"). - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64"). - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header"). - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups"). - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan"). - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()"). - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets"). - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction"). - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy"). - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()"). - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64"). - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype"). - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page"). - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec"). - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h"). - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output"). - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized"). - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order"). - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults"). - pagetable handling cleanups from Matthew Wilcox ("New page table range API"). - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups"). - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault"). - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation"). * tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits) maple_tree: shrink struct maple_tree maple_tree: clean up mas_wr_append() secretmem: convert page_is_secretmem() to folio_is_secretmem() nios2: fix flush_dcache_page() for usage from irq context hugetlb: add documentation for vma_kernel_pagesize() mm: add orphaned kernel-doc to the rst files. mm: fix clean_record_shared_mapping_range kernel-doc mm: fix get_mctgt_type() kernel-doc mm: fix kernel-doc warning from tlb_flush_rmaps() mm: remove enum page_entry_size mm: allow ->huge_fault() to be called without the mmap_lock held mm: move PMD_ORDER to pgtable.h mm: remove checks for pte_index memcg: remove duplication detection for mem_cgroup_uncharge_swap mm/huge_memory: work on folio->swap instead of page->private when splitting folio mm/swap: inline folio_set_swap_entry() and folio_swap_entry() mm/swap: use dedicated entry for swap in folio mm/swap: stop using page->private on tail pages for THP_SWAP selftests/mm: fix WARNING comparing pointer to 0 selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check ...	2023-08-29 14:25:26 -07:00
Linus Torvalds	9d6b14cd1e	flexible-array transformations for 6.6-rc1 Hi Linus, Please, pull the following flexible-array transformations. These patches have been baking in linux-next for a while. Thanks -- Gustavo -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmTtBEQACgkQRwW0y0cG 2zHgeBAAm1D5H+4THFOuYVsl93igowMbcPwraoo2xxnNtLqQR4y6p9SiWcjh2vk3 bmabPtID3jPZytjpV6QUYe6JlcOoWc2LsNTWqwnt2X2fhcovJfGMc/K3+7bRQzE8 uBqooxkiaZvPiUATeo0zdBSN9aYWXvEpYrBR5J9EtJWyS93LiYb/nn2DY+0d1yrY 89koxTV9/eOKJ7xi6UB0zWbyayor+JawX5RI2ysMFHO6UB2aVrNjqUCJP51dyywa 2QKtwkY/8vQIo8Y8c7ReApHR0t+s0jmxRi9ip+4rh/0VNl8ON5SoL+EcRB5FLCTn E/VqQcebjIwo0AfeEzUQIc68E1hZJLYhGGJupF1Wsb9IvmbdTTMhhfLJmWStz5+E GnNpXx9gTjCoM5+FqJlMkHgxT9JpHYSV5LtnZKmM7wiexHND+5T/ukAxRUhFjLY3 saMbXnJ85kHsNYLAxIDUfdRcXEDNBJl7XiVI2YRCI9fttD+GVjdFtATdNZFsFAge kHZCAFloFHwe0lTcksN5I/NVdDc6Jv7ABjrTKNezfrjfRn24zvR61V54/ebPp5Fs 2JA5kVJpgGnUdIM2uot2lYLCmLDjhGNfXE95AjK0SdJXrZtzYqzTcKJg6w1gz3ZJ VHWK4snsu48KxTx66f+v38ovKXLtlA8Rk8kYOHWXYgB0ZXOOAy4= =L9kF -----END PGP SIGNATURE----- Merge tag 'flex-array-transformations-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull flexible-array updates from Gustavo A. R. Silva. * tag 'flex-array-transformations-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: fs: omfs: Use flexible-array member in struct omfs_extent sparc: openpromio: Address -Warray-bounds warning reiserfs: Replace one-element array with flexible-array member	2023-08-29 12:48:12 -07:00
Linus Torvalds	468e28d4ac	v6.6-vfs.super.fixes -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZO2teQAKCRCRxhvAZXjc omN9AP9F3rtueiMv0kfdwRhXl4GITY+o5OpiEpLjHdPC4nEalwEAvt8h4nvNmTg6 B54+2wNX2vc3t5UVPuFlqSHtUxoKTgA= =DMxT -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.super.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock fixes from Christian Brauner: "Two follow-up fixes for the super work this cycle: - Move a misplaced lockep assertion before we potentially free the object containing the lock. - Ensure that filesystems which match superblocks in sget{_fc}() based on sb->s_fs_info are guaranteed to see a valid sb->s_fs_info as long as a superblock still appears on the filesystem type's superblock list. What we want as a proper solution for next cycle is to split sb->free_sb() out of sb->kill_sb() so that we can simply call kill_super_notify() after sb->kill_sb() but before sb->free_sb(). Currently, this is lumped together in sb->kill_sb()" * tag 'v6.6-vfs.super.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: super: ensure valid info super: move lockdep assert	2023-08-29 11:59:37 -07:00
Namjae Jeon	0e2378eaa2	ksmbd: add missing calling smb2_set_err_rsp() on error If some error happen on smb2_sess_setup(), Need to call smb2_set_err_rsp() to set error response. This patch add missing calling smb2_set_err_rsp() on error. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:20 -05:00
Namjae Jeon	0ba5439d9a	ksmbd: replace one-element array with flex-array member in struct smb2_ea_info UBSAN complains about out-of-bounds array indexes on 1-element arrays in struct smb2_ea_info. UBSAN: array-index-out-of-bounds in fs/smb/server/smb2pdu.c:4335:15 index 1 is out of range for type 'char [1]' CPU: 1 PID: 354 Comm: kworker/1:4 Not tainted 6.5.0-rc4 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020 Workqueue: ksmbd-io handle_ksmbd_work [ksmbd] Call Trace: <TASK> __dump_stack linux/lib/dump_stack.c:88 dump_stack_lvl+0x48/0x70 linux/lib/dump_stack.c:106 dump_stack+0x10/0x20 linux/lib/dump_stack.c:113 ubsan_epilogue linux/lib/ubsan.c:217 __ubsan_handle_out_of_bounds+0xc6/0x110 linux/lib/ubsan.c:348 smb2_get_ea linux/fs/smb/server/smb2pdu.c:4335 smb2_get_info_file linux/fs/smb/server/smb2pdu.c:4900 smb2_query_info+0x63ae/0x6b20 linux/fs/smb/server/smb2pdu.c:5275 __process_request linux/fs/smb/server/server.c:145 __handle_ksmbd_work linux/fs/smb/server/server.c:213 handle_ksmbd_work+0x348/0x10b0 linux/fs/smb/server/server.c:266 process_one_work+0x85a/0x1500 linux/kernel/workqueue.c:2597 worker_thread+0xf3/0x13a0 linux/kernel/workqueue.c:2748 kthread+0x2b7/0x390 linux/kernel/kthread.c:389 ret_from_fork+0x44/0x90 linux/arch/x86/kernel/process.c:145 ret_from_fork_asm+0x1b/0x30 linux/arch/x86/entry/entry_64.S:304 </TASK> Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:20 -05:00
Namjae Jeon	4b081ce0d8	ksmbd: fix slub overflow in ksmbd_decode_ntlmssp_auth_blob() If authblob->SessionKey.Length is bigger than session key size(CIFS_KEY_SIZE), slub overflow can happen in key exchange codes. cifs_arc4_crypt copy to session key array from SessionKey from client. Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21940 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:20 -05:00
Namjae Jeon	17d5b135bb	ksmbd: fix wrong DataOffset validation of create context If ->DataOffset of create context is 0, DataBuffer size is not correctly validated. This patch change wrong validation code and consider tag length in request. Cc: stable@vger.kernel.org Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21824 Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:20 -05:00
Yang Li	bf26f1b4e0	ksmbd: Fix one kernel-doc comment Fix one kernel-doc comment to silence the warning: fs/smb/server/smb2pdu.c:4160: warning: Excess function parameter 'infoclass_size' description in 'buffer_check_err' Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:20 -05:00
Namjae Jeon	e628bf939a	ksmbd: reduce descriptor size if remaining bytes is less than request size Create 3 kinds of files to reproduce this problem. dd if=/dev/urandom of=127k.bin bs=1024 count=127 dd if=/dev/urandom of=128k.bin bs=1024 count=128 dd if=/dev/urandom of=129k.bin bs=1024 count=129 When copying files from ksmbd share to windows or cifs.ko, The following error message happen from windows client. "The file '129k.bin' is too large for the destination filesystem." We can see the error logs from ksmbd debug prints [48394.611537] ksmbd: RDMA r/w request 0x0: token 0x669d, length 0x20000 [48394.612054] ksmbd: smb_direct: RDMA write, len 0x20000, needed credits 0x1 [48394.612572] ksmbd: filename 129k.bin, offset 131072, len 131072 [48394.614189] ksmbd: nbytes 1024, offset 132096 mincount 0 [48394.614585] ksmbd: Failed to process 8 [-22] And we can reproduce it with cifs.ko, e.g. dd if=129k.bin of=/dev/null bs=128KB count=2 This problem is that ksmbd rdma return error if remaining bytes is less than Length of Buffer Descriptor V1 Structure. smb_direct_rdma_xmit() ... if (desc_buf_len == 0 \|\| total_length > buf_len \|\| total_length > t->max_rdma_rw_size) return -EINVAL; This patch reduce descriptor size with remaining bytes and remove the check for total_length and buf_len. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:20 -05:00
Atte Heikkilä	65656f5242	ksmbd: fix `force create mode' and` force directory mode' `force create mode' and `force directory mode' should be bitwise ORed with the perms after `create mask' and `directory mask' have been applied, respectively. Signed-off-by: Atte Heikkilä <atteh.mailbox@gmail.com> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:20 -05:00
Namjae Jeon	041bba4414	ksmbd: fix wrong interim response on compound If smb2_lock or smb2_open request is compound, ksmbd could send wrong interim response to client. ksmbd allocate new interim buffer instead of using resonse buffer to support compound request. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:19 -05:00
Namjae Jeon	e2b76ab8b5	ksmbd: add support for read compound MacOS sends a compound request including read to the server (e.g. open-read-close). So far, ksmbd has not handled read as a compound request. For compatibility between ksmbd and an OS that supports SMB, This patch provides compound support for read requests. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:19 -05:00
Yang Yingliang	084ba46fc4	ksmbd: switch to use kmemdup_nul() helper Use kmemdup_nul() helper instead of open-coding to simplify the code. Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2023-08-29 12:30:19 -05:00
Alexei Filippov	0225e10972	jfs: validate max amount of blocks before allocation. The lack of checking bmp->db_max_freebud in extBalloc() can lead to shift out of bounds, so this patch prevents undefined behavior, because bmp->db_max_freebud == -1 only if there is no free space. Signed-off-by: Aleksei Filippov <halip0503@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Fixes: `1da177e4c3` ("Linux-2.6.12-rc2") Reported-and-tested-by: syzbot+5f088f29593e6b4c8db8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?id=01abadbd6ae6a08b1f1987aa61554c6b3ac19ff2	2023-08-29 12:25:47 -05:00
Colin Ian King	87098a0d9e	jfs: remove redundant initialization to pointer ip The pointer ip is being initialized with a value that is never read, it is being re-assigned later on. The assignment is redundant and can be removed. Cleans up clang scan warning: fs/jfs/namei.c:886:16: warning: Value stored to 'ip' during its initialization is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2023-08-29 12:20:50 -05:00
Bernd Schubert	f73016b63b	fuse: conditionally fill kstat in fuse_do_statx() The code path fuse_update_attributes fuse_update_get_attr fuse_do_statx has the risk to use a NULL pointer for struct kstat *stat, although current callers of fuse_update_attributes() only set request_mask to values that will trigger the call of fuse_do_getattr(), which already handles the NULL pointer. Future updates might miss that fuse_do_statx() does not handle it it is safer to add a condition already right now. Signed-off-by: Bernd Schubert <bschubert@ddn.com> Fixes: `d3045530bd` ("fuse: implement statx") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2023-08-29 14:58:48 +02:00
Christian Brauner	dc3216b141	super: ensure valid info For keyed filesystems that recycle superblocks based on s_fs_info or information contained therein s_fs_info must be kept as long as the superblock is on the filesystem type super list. This isn't guaranteed as s_fs_info will be freed latest in sb->kill_sb(). The fix is simply to perform notification and list removal in kill_anon_super(). Any filesystem needs to free s_fs_info after they call the kill_*() helpers. If they don't they risk use-after-free right now so fixing it here is guaranteed that s_fs_info remain valid. For block backed filesystems notifying in pass sb->kill_sb() in deactivate_locked_super() remains unproblematic and is required because multiple other block devices can be shut down after kill_block_super() has been called from a filesystem's sb->kill_sb() handler. For example, ext4 and xfs close additional devices. Block based filesystems don't depend on s_fs_info (btrfs does use s_fs_info but also uses kill_anon_super() and not kill_block_super().). Sorry for that braino. Goal should be to unify this behavior during this cycle obviously. But let's please do a simple bugfix now. Fixes: `2c18a63b76` ("super: wait until we passed kill super") Fixes: syzbot+5b64180f8d9e39d3f061@syzkaller.appspotmail.com Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Reported-by: syzbot+5b64180f8d9e39d3f061@syzkaller.appspotmail.com Message-Id: <20230828-vfs-super-fixes-v1-2-b37a4a04a88f@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-29 10:13:04 +02:00
Christian Brauner	345a5c4a0b	super: move lockdep assert Fix braino and move the lockdep assertion after put_super() otherwise we risk a use-after-free. Fixes: `2c18a63b76` ("super: wait until we passed kill super") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> Message-Id: <20230828-vfs-super-fixes-v1-1-b37a4a04a88f@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2023-08-29 10:13:04 +02:00
Linus Torvalds	5b07aaca18	pstore updates for v6.6-rc1 - Greatly simplify compression support (Ard Biesheuvel). - Avoid crashes for corrupted offsets when prz size is 0 (Enlin Mu). - Expand range of usable record sizes (Yuxiao Zhang). - Fix kernel-doc warning (Matthew Wilcox). -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmTs5TUWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJkn8D/9mdBm32Wfx/if84YejxHJpzHmV nKPRgib89vNZdL5ORP02ZTonJBZn4NC7KtJfBHSfdoW1U+5GCC/cHOpECUHQui9Q CN22VFm37JdmBZq2+YmPug5y7z94wbFkD79otCR9VlMt5uwbNIGxUaI10fK2M97n 3avg/RZzz6kI9Y6BChZfBDLKXXi6ytnIRQOa9ZqZyDylN1nTLi8vqrxf0P8Am0jE 1s2GumYj54NuuNTdqvlz0XhTyCM5pk5omTqlq1VW9Trr0fLa2CLvEBWxWo8G7odC Yav5p8e0jX0GjDFM3NHPgRcXTcY0vkWGnJLdZGNyEkxPq96GH09j5rhFOIo9+KPz Y3fhYWzZyNWjy7YujWupDyL6lozWObhOcjBRnFmW7gJHjoO2G0GT2ufW2fb9cD4q fTGPiX2Fum1Zl6b0CXF+j4wDaazsBxGGAGzTqj7yp2Je0rPJPotd69q8LT2bbVcP ZahXJsFNn/YmVKv9MhNZjOuxGZoR4Cgco114V+sU5aYZMcZ68fQNzTzMydkbbdch SMapAV9a99H1D8ldT9dhm+HlKZFzIrOtBDrDoIbF4qQB8OWhjEK6Ot3oBbXvLl7w 72i1niDVRj+v/hUSc/7XYfZkUG7NYJQqXbaJp20LWvEs5OALdWRC3T6vXnvh53Qd 9ErztYLmF6k3W/h/xA== =bu02 -----END PGP SIGNATURE----- Merge tag 'pstore-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull pstore updates from Kees Cook: - Greatly simplify compression support (Ard Biesheuvel) - Avoid crashes for corrupted offsets when prz size is 0 (Enlin Mu) - Expand range of usable record sizes (Yuxiao Zhang) - Fix kernel-doc warning (Matthew Wilcox) * tag 'pstore-v6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: pstore: Fix kernel-doc warning pstore: Support record sizes larger than kmalloc() limit pstore/ram: Check start of empty przs during init pstore: Replace crypto API compression with zlib_deflate library calls pstore: Remove worst-case compression size logic	2023-08-28 12:36:04 -07:00
Linus Torvalds	547635c6ac	for-6.6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTskOwACgkQxWXV+ddt WDsNJw/8CCi41Z7e3LdJsQd2iy3/+oJZUvIGuT5YvshYxTLCbV7AL+diBPnSQs4Q /KFMGL7RZBgJzwVoSQtXnESXXgX8VOVfN1zY//k5g6z7BscCEQd73H/M0B8ciZy/ aBygm9tJ7EtWbGZWNR8yad8YtOgl6xoClrPnJK/DCLwMGPy2o+fnKP3Y9FOKY5KM 1Sl0Y4FlJ9dTJpxIwYbx4xmuyHrh2OivjU/KnS9SzQlHu0nl6zsIAE45eKem2/EG 1figY5aFBYPpPYfopbLDalEBR3bQGiViZVJuNEop3AimdcMOXw9jBF3EZYUb5Tgn MleMDgmmjLGOE/txGhvTxKj9kci2aGX+fJn3jXbcIMksAA0OQFLPqzGvEQcrs6Ok HA0RsmAkS5fWNDCuuo4ZPXEyUPvluTQizkwyoulOfnK+UPJCWaRqbEBMTsvm6M6X wFT2czwLpaEU/W6loIZkISUhfbRqVoA3DfHy398QXNzRhSrg8fQJjma1f7mrHvTi CzU+OD5YSC2nXktVOnklyTr0XT+7HF69cumlDbr8TS8u1qu8n1keU/7M3MBB4xZk BZFJDz8pnsAqpwVA4T434E/w45MDnYlwBw5r+U8Xjyso8xlau+sYXKcim85vT2Q0 yx/L91P6tdekR1y97p4aDdxw/PgTzdkNGMnsTBMVzgtCj+5pMmE= =N7Yn -----END PGP SIGNATURE----- Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "No new features, the bulk of the changes are fixes, refactoring and cleanups. The notable fix is the scrub performance restoration after rewrite in 6.4, though still only partial. Fixes: - scrub performance drop due to rewrite in 6.4 partially restored: - do IO grouping by blg_plug/blk_unplug again - avoid unnecessary tree searches when processing stripes, in extent and checksum trees - the drop is noticeable on fast PCIe devices, -66% and restored to -33% of the original - backports to 6.4 planned - handle more corner cases of transaction commit during orphan cleanup or delayed ref processing - use correct fsid/metadata_uuid when validating super block - copy directory permissions and time when creating a stub subvolume Core: - debugging feature integrity checker deprecated, to be removed in 6.7 - in zoned mode, zones are activated just before the write, making error handling easier, now the overcommit mechanism can be enabled again which improves performance by avoiding more frequent flushing - v0 extent handling completely removed, deprecated long time ago - error handling improvements - tests: - extent buffer bitmap tests - pinned extent splitting tests - cleanups and refactoring: - compression writeback - extent buffer bitmap - space flushing, ENOSPC handling" * tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits) btrfs: zoned: skip splitting and logical rewriting on pre-alloc write btrfs: tests: test invalid splitting when skipping pinned drop extent_map btrfs: tests: add a test for btrfs_add_extent_mapping btrfs: tests: add extent_map tests for dropping with odd layouts btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker() btrfs: scrub: don't go ordered workqueue for dev-replace btrfs: scrub: fix grouping of read IO btrfs: scrub: avoid unnecessary csum tree search preparing stripes btrfs: scrub: avoid unnecessary extent tree search preparing stripes btrfs: copy dir permission and time when creating a stub subvolume btrfs: remove pointless empty list check when reading delayed dir indexes btrfs: drop redundant check to use fs_devices::metadata_uuid btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super btrfs: use the correct superblock to compare fsid in btrfs_validate_super btrfs: simplify memcpy either of metadata_uuid or fsid btrfs: add a helper to read the superblock metadata_uuid btrfs: remove v0 extent handling btrfs: output extra debug info if we failed to find an inline backref btrfs: move the !zoned assert into run_delalloc_cow btrfs: consolidate the error handling in run_delalloc_nocow ...	2023-08-28 12:26:57 -07:00
Linus Torvalds	f678c890c6	affs-for-6.6-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTsk7MACgkQxWXV+ddt WDsKjRAAoKY1qtApAMWl7vX+fw4pyoexrZISvzsJpCmPDV0SlKZ8CqOfYGhyT845 /pYUB8oxCm8UgpbswCTDVzppFANVqJBRIQEQc9d6yej4bA62Z/4Z561xROE1BEsO UR2UIKw4gtpAmSlCoLYljt4Eg0izRt0MNmYvIk9ccQTIzVO8u2E8keNW76OSbAW1 J9AH0KgrxPgK9GPN/MlWg8THyuWgP3p/Qf4QS6JcDRCNFr4gmNwxP8Cj1kifYVT/ iMxghN+Z2iFPN8gHIeebjYpyoJJ4ZHRVL9mUZ0SLwcbt6ZfV5DnYn2yepkxxHfyL CaUrRNJHE3Cb4vyJgtvVoAHE78Bqyk4WHKirNnrHly5j8K9ba3PIEpL+XjDVUckA 597xlrBsaQhlvMHXn+sMOMAMs7EVkUfR/gzSoLirWWOlLJ9skl3H8vK/BunK1bo+ +YYwMJwPtv3HmCdMbb00Rm9D2vVSVYM4iydEjmW6ILWBgecQII5O14CmkSH9zoG7 sip4O5oDkxmnoq9TuKPjS6AEPi+6UsbCsv8Z8boXdj97w4+bJIZVGbDHXvdAIEkj 2NmrYgqCqcbB46cSIc2VF1Fx0WO8Jh+GTFYGAPGFrabLFDVWeuwtdJahbYnSrmuL whMVNSH4sooZflCWjwANhV8H5wlIatnn+xh+lX3TWpKiBeqIGpw= =wPHe -----END PGP SIGNATURE----- Merge tag 'affs-for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull affs updates from David Sterba: "Two minor updates for AFFS: - reimplement writepage() address space callback on top of migrate_folio() - fix a build warning, local parameters 'toupper' collide with the standard ctype.h name" * tag 'affs-for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: affs: rename local toupper() to fn() to avoid confusion affs: remove writepage implementation	2023-08-28 12:18:26 -07:00
Linus Torvalds	3bb156a556	fsverity updates for 6.6 Several cleanups for fs/verity/, including two commits that make the builtin signature support more cleanly separated from the base feature. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZOwXSRQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK+kzAP9UsTtZjjQLvEDF6OYlysFLQppDrmk0 zh9iH/Z7qT8BQwEA0azRHW5/FDLvkHZWsU0QCLjDpUPrNHc712aeM1pLpgc= =zOdL -----END PGP SIGNATURE----- Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux Pull fsverity updates from Eric Biggers: "Several cleanups for fs/verity/, including two commits that make the builtin signature support more cleanly separated from the base feature" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux: fsverity: skip PKCS#7 parser when keyring is empty fsverity: move sysctl registration out of signature.c fsverity: simplify handling of errors during initcall fsverity: explicitly check that there is no algorithm 0	2023-08-28 12:16:42 -07:00
Linus Torvalds	6016fc9162	New code for 6.6: * Make large writes to the page cache fill sparse parts of the cache with large folios, then use large memcpy calls for the large folio. * Track the per-block dirty state of each large folio so that a buffered write to a single byte on a large folio does not result in a (potentially) multi-megabyte writeback IO. * Allow some directio completions to be performed in the initiating task's context instead of punting through a workqueue. This will reduce latency for some io_uring requests. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZM0Z1AAKCRBKO3ySh0YR pp7BAQCzkKejCM0185tNIH/faHjzidSisNQkJ5HoB4Opq9U66AEA6IPuAdlPlM/J FPW1oPq33Yn7AV4wXjUNFfDLzVb/Fgg= =dFBU -----END PGP SIGNATURE----- Merge tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull iomap updates from Darrick Wong: "We've got some big changes for this release -- I'm very happy to be landing willy's work to enable large folios for the page cache for general read and write IOs when the fs can make contiguous space allocations, and Ritesh's work to track sub-folio dirty state to eliminate the write amplification problems inherent in using large folios. As a bonus, io_uring can now process write completions in the caller's context instead of bouncing through a workqueue, which should reduce io latency dramatically. IOWs, XFS should see a nice performance bump for both IO paths. Summary: - Make large writes to the page cache fill sparse parts of the cache with large folios, then use large memcpy calls for the large folio. - Track the per-block dirty state of each large folio so that a buffered write to a single byte on a large folio does not result in a (potentially) multi-megabyte writeback IO. - Allow some directio completions to be performed in the initiating task's context instead of punting through a workqueue. This will reduce latency for some io_uring requests" * tag 'iomap-6.6-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (26 commits) iomap: support IOCB_DIO_CALLER_COMP io_uring/rw: add write support for IOCB_DIO_CALLER_COMP fs: add IOCB flags related to passing back dio completions iomap: add IOMAP_DIO_INLINE_COMP iomap: only set iocb->private for polled bio iomap: treat a write through cache the same as FUA iomap: use an unsigned type for IOMAP_DIO_* defines iomap: cleanup up iomap_dio_bio_end_io() iomap: Add per-block dirty state tracking to improve performance iomap: Allocate ifs in ->write_begin() early iomap: Refactor iomap_write_delalloc_punch() function out iomap: Use iomap_punch_t typedef iomap: Fix possible overflow condition in iomap_write_delalloc_scan iomap: Add some uptodate state handling helpers for ifs state bitmap iomap: Drop ifs argument from iomap_set_range_uptodate() iomap: Rename iomap_page to iomap_folio_state and others iomap: Copy larger chunks from userspace iomap: Create large folios in the buffered write path filemap: Allow __filemap_get_folio to allocate large folios filemap: Add fgf_t typedef ...	2023-08-28 11:59:52 -07:00
Linus Torvalds	dd2c0198a8	Changes since last update: - Support xattr bloom filter to optimize negative xattr lookups; - Support DEFLATE compression algorithm as an alternative; - Fix a regression that ztailpacking pclusters don't release properly; - Avoid warning dedupe and fragments features anymore; - Some folio conversions and cleanups. -----BEGIN PGP SIGNATURE----- iIcEABYIAC8WIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCZOvhIBEceGlhbmdAa2Vy bmVsLm9yZwAKCRA5NzHcH7XmBFgqAP4/gcxH5vhgxMunxmgBkSxMFBQf/W7CfOiN QkGHjSKl8gEA78EBwAJ3vDJ1JgQRTb9/9UBrtW7n2hzj/eVS/LIyYQI= =o3Bx -----END PGP SIGNATURE----- Merge tag 'erofs-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs Pull erofs updates from Gao Xiang: "In this cycle, a xattr bloom filter feature is introduced to speed up negative xattr lookups, which was originally suggested by Alexander for Composefs use cases. Additionally, the DEFLATE algorithm is now supported, which can be used together with hardware accelerators for our cloud workloads. Each supported compression algorithm can be selected on a per-file basis for specific access patterns too. There are also some random fixes and cleanups as usual: - Support xattr bloom filter to optimize negative xattr lookups - Support DEFLATE compression algorithm as an alternative - Fix a regression that ztailpacking pclusters don't release properly - Avoid warning dedupe and fragments features anymore - Some folio conversions and cleanups" * tag 'erofs-for-6.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs: erofs: release ztailpacking pclusters properly erofs: don't warn dedupe and fragments features anymore erofs: adapt folios for z_erofs_read_folio() erofs: adapt folios for z_erofs_readahead() erofs: get rid of fe->backmost for cache decompression erofs: drop z_erofs_page_mark_eio() erofs: tidy up z_erofs_do_read_page() erofs: move preparation logic into z_erofs_pcluster_begin() erofs: avoid obsolete {collector,collection} terms erofs: simplify z_erofs_read_fragment() erofs: remove redundant erofs_fs_type declaration in super.c erofs: add necessary kmem_cache_create flags for erofs inode cache erofs: clean up redundant comment and adjust code alignment erofs: refine warning messages for zdata I/Os erofs: boost negative xattr lookup with bloom filter erofs: update on-disk format for xattr name filter erofs: DEFLATE compression support	2023-08-28 11:52:10 -07:00
Linus Torvalds	f20ae9cf5b	File locking fixes for v6.6 -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmTngQgACgkQAA5oQRlW ghXfYg//cbUlOoba3RLBmKTeN68BWaLTIgeibJ17ioOSjVnxxMqyt4tLH3FD1pnT 2QisXiR3qrJav+sbwClpC23tafCHDfTo4uGbdXlehgAu4OSJ1+8gF2k1SsdAd4Qe tyfP1I2LO29/Yfres2Oy86JQi99Ds4slxjj63z/pYBdo4LEDJc83m/9limgMv4ro Geop6f9lV3Z03rOZIixQinMtOQTXW0NJbiXMEGzFS1Y05xQ34kX+v33MBAvtnbY9 nhbWuJKYlgZZhH5CghZUAvKfq2a4suvpZbnzmEZLJ0DdRE0h/4ctIxTSK8tnrq5d 07cs34NOGWljlguJc7cYotrKv51lu6yGbSCHyNdAEdBotuIXJ4Q3krAUHBnuGzy3 fa/a7nozRO7RDzshGTO4dBRerTEyS7bWUEnIhRp6BHR2ny9jhK6zMa1ys3eAMG4F 5a61AH/3qS+i8gQQSF9HFRyD+D0qDVMg2YCE/lKveF5tNBeKqrfj/Wz3HnY2zB/L luQY0esX5idWTH8YYJlntxgmWEZBvNgmuWcbDVu2Jj/jXUGP6S3dnY0MZio2Sbdr zM7/HenAkvXTlbtJsLmdGGZSMnINuiJi4sT2aYn/8JG4HbHFlAU89ILhIGrF3ayP 669VpO6g/DoEpVRyOnzC3MCHE09SWoiHcXo4VCsz/270uXOpcVs= =f9CF -----END PGP SIGNATURE----- Merge tag 'filelock-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking updates from Jeff Layton: - new functionality for F_OFD_GETLK: requesting a type of F_UNLCK will find info about whatever lock happens to be first in the given range, regardless of type. - an OFD lock selftest - bugfix involving a UAF in a tracepoint - comment typo fix * tag 'filelock-v6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock fs/locks: Fix typo selftests: add OFD lock tests fs/locks: F_UNLCK extension for F_OFD_GETLK	2023-08-28 11:47:24 -07:00
Linus Torvalds	b4a04f92a4	v6.6-fs.proc.uapi -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT2QAKCRCRxhvAZXjc olkFAQCT4nRkRTpBvbiv4DgvCIy+URqLNfHGxCxdAX1B09o3UwEAyepf1tz7aFpB wB67V265JFDMWtvQkSx4ORNpAjZ9Kg0= =Opqi -----END PGP SIGNATURE----- Merge tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull procfs fixes from Christian Brauner: "Mode changes to files under /proc/<pid>/ aren't supported ever since commit `6d76fa58b0` ("Don't allow chmod() on the /proc/<pid>/ files"). Due to an oversight in commit `1b3044e39a` ("procfs: fix pthread cross-thread naming if !PR_DUMPABLE") in switching from REG to NOD, mode changes on /proc/thread-self/comm were accidently allowed. Similar, mode changes for all files beneath /proc/<pid>/net/ are blocked but mode changes on /proc/<pid>/net itself were accidently allowed. Both issues come down to not using the generic proc_setattr() helper which blocks all mode changes. This is rectified with this pull request. This also removes a strange nolibc test that abused /proc/<pid>/net for testing mode changes. Using procfs for this test never made a lot of sense given procfs has special semantics for almost everything anway. Both changes are minor user-visible changes. It is however very unlikely that mode changes on proc/<pid>/net and /proc/thread-self/comm are something that userspace relies on" * tag 'v6.6-fs.proc.uapi' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: procfs: block chmod on /proc/thread-self/comm proc: use generic setattr() for /proc/$PID/net selftests/nolibc: drop test chmod_net	2023-08-28 11:43:19 -07:00
Linus Torvalds	2e0afa7e78	v6.6-vfs.autofs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXUDgAKCRCRxhvAZXjc ogplAQCYXt+zcfs1GMhCUtPFzyyCwNsraMNzVwTdFbMz4R1JuQD9HL82VKyvMwmZ uo6uGVd9xN6cEy61Lpz9K8dn59uVAQE= =851o -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull autofs fixes from Christian Brauner: "This fixes a memory leak in autofs reported by syzkaller and a missing conversion from uninterruptible to interruptible wake up when autofs is in catatonic mode" * tag 'v6.6-vfs.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: autofs: use wake_up() instead of wake_up_interruptible(() autofs: fix memory leak of waitqueues in autofs_catatonic_mode	2023-08-28 11:39:14 -07:00
Linus Torvalds	475d4df827	v6.6-vfs.fchmodat2 -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXT7QAKCRCRxhvAZXjc ort3AP0VIK/oJk5skgjpinQrCfvtVz0XOtawuBtn0f1weIfb6AD9Hg1rqOKnQD5z dkvn3xaEr3gPOVzqU5SvFwVoCM0cMwA= =24Ha -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fchmodat2 system call from Christian Brauner: "This adds the fchmodat2() system call. It is a revised version of the fchmodat() system call, adding a missing flag argument. Support for both AT_SYMLINK_NOFOLLOW and AT_EMPTY_PATH are included. Adding this system call revision has been a longstanding request but so far has always fallen through the cracks. While the kernel implementation of fchmodat() does not have a flag argument the libc provided POSIX-compliant fchmodat(3) version does. Both glibc and musl have to implement a workaround in order to support AT_SYMLINK_NOFOLLOW (see [1] and [2]). The workaround is brittle because it relies not just on O_PATH and O_NOFOLLOW semantics and procfs magic links but also on our rather inconsistent symlink semantics. This gives userspace a proper fchmodat2() system call that libcs can use to properly implement fchmodat(3) and allows them to get rid of their hacks. In this case it will immediately benefit them as the current workaround is already defunct because of aformentioned inconsistencies. In addition to AT_SYMLINK_NOFOLLOW, give userspace the ability to use AT_EMPTY_PATH with fchmodat2(). This is already possible with fchownat() so there's no reason to not also support it for fchmodat2(). The implementation is simple and comes with selftests. Implementation of the system call and wiring up the system call are done as separate patches even though they could arguably be one patch. But in case there are merge conflicts from other system call additions it can be beneficial to have separate patches" Link: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35 [1] Link: https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28 [2] * tag 'v6.6-vfs.fchmodat2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: selftests: fchmodat2: remove duplicate unneeded defines fchmodat2: add support for AT_EMPTY_PATH selftests: Add fchmodat2 selftest arch: Register fchmodat2, usually as syscall 452 fs: Add fchmodat2() Non-functional cleanup of a "__user * filename"	2023-08-28 11:25:27 -07:00
Linus Torvalds	511fb5bafe	v6.6-vfs.super -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXpbgAKCRCRxhvAZXjc oi8PAQCtXelGZHmTcmevsO8p4Qz7hFpkonZ/TnxKf+RdnlNgPgD+NWi+LoRBpaAj xk4z8SqJaTTP4WXrG5JZ6o7EQkUL8gE= =2e9I -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull superblock updates from Christian Brauner: "This contains the super rework that was ready for this cycle. The first part changes the order of how we open block devices and allocate superblocks, contains various cleanups, simplifications, and a new mechanism to wait on superblock state changes. This unblocks work to ultimately limit the number of writers to a block device. Jan has already scheduled follow-up work that will be ready for v6.7 and allows us to restrict the number of writers to a given block device. That series builds on this work right here. The second part contains filesystem freezing updates. Overview: The generic superblock changes are rougly organized as follows (ignoring additional minor cleanups): (1) Removal of the bd_super member from struct block_device. This was a very odd back pointer to struct super_block with unclear rules. For all relevant places we have other means to get the same information so just get rid of this. (2) Simplify rules for superblock cleanup. Roughly, everything that is allocated during fs_context initialization and that's stored in fs_context->s_fs_info needs to be cleaned up by the fs_context->free() implementation before the superblock allocation function has been called successfully. After sget_fc() returned fs_context->s_fs_info has been transferred to sb->s_fs_info at which point sb->kill_sb() if fully responsible for cleanup. Adhering to these rules means that cleanup of sb->s_fs_info in fill_super() is to be avoided as it's brittle and inconsistent. Cleanup shouldn't be duplicated between sb->put_super() as sb->put_super() is only called if sb->s_root has been set aka when the filesystem has been successfully born (SB_BORN). That complexity should be avoided. This also means that block devices are to be closed in sb->kill_sb() instead of sb->put_super(). More details in the lower section. (3) Make it possible to lookup or create a superblock before opening block devices There's a subtle dependency on (2) as some filesystems did rely on fill_super() to be called in order to correctly clean up sb->s_fs_info. All these filesystems have been fixed. (4) Switch most filesystem to follow the same logic as the generic mount code now does as outlined in (3). (5) Use the superblock as the holder of the block device. We can now easily go back from block device to owning superblock. (6) Export and extend the generic fs_holder_ops and use them as holder ops everywhere and remove the filesystem specific holder ops. (7) Call from the block layer up into the filesystem layer when the block device is removed, allowing to shut down the filesystem without risk of deadlocks. (8) Get rid of get_super(). We can now easily go back from the block device to owning superblock and can call up from the block layer into the filesystem layer when the device is removed. So no need to wade through all registered superblock to find the owning superblock anymore" Link: https://lore.kernel.org/lkml/20230824-prall-intakt-95dbffdee4a0@brauner/ * tag 'v6.6-vfs.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (47 commits) super: use higher-level helper for {freeze,thaw} super: wait until we passed kill super super: wait for nascent superblocks super: make locking naming consistent super: use locking helpers fs: simplify invalidate_inodes fs: remove get_super block: call into the file system for ioctl BLKFLSBUF block: call into the file system for bdev_mark_dead block: consolidate __invalidate_device and fsync_bdev block: drop the "busy inodes on changed media" log message dasd: also call __invalidate_device when setting the device offline amiflop: don't call fsync_bdev in FDFMTBEG floppy: call disk_force_media_change when changing the format block: simplify the disk_force_media_change interface nbd: call blk_mark_disk_dead in nbd_clear_sock_ioctl xfs use fs_holder_ops for the log and RT devices xfs: drop s_umount over opening the log and RT devices ext4: use fs_holder_ops for the log device ext4: drop s_umount over opening the log device ...	2023-08-28 11:04:18 -07:00
Linus Torvalds	de16588a77	v6.6-vfs.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTxQAKCRCRxhvAZXjc okaVAP94WAlItvDRt/z2Wtzf0+RqPZeTXEdGTxua8+RxqCyYIQD+OO5nRfKQPHlV AqqGJMKItQMSMIYgB5ftqVhNWZfnHgM= =pSEW -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual miscellaneous features, cleanups, and fixes for vfs and individual filesystems. Features: - Block mode changes on symlinks and rectify our broken semantics - Report file modifications via fsnotify() for splice - Allow specifying an explicit timeout for the "rootwait" kernel command line option. This allows to timeout and reboot instead of always waiting indefinitely for the root device to show up - Use synchronous fput for the close system call Cleanups: - Get rid of open-coded lockdep workarounds for async io submitters and replace it all with a single consolidated helper - Simplify epoll allocation helper - Convert simple_write_begin and simple_write_end to use a folio - Convert page_cache_pipe_buf_confirm() to use a folio - Simplify __range_close to avoid pointless locking - Disable per-cpu buffer head cache for isolated cpus - Port ecryptfs to kmap_local_page() api - Remove redundant initialization of pointer buf in pipe code - Unexport the d_genocide() function which is only used within core vfs - Replace printk(KERN_ERR) and WARN_ON() with WARN() Fixes: - Fix various kernel-doc issues - Fix refcount underflow for eventfds when used as EFD_SEMAPHORE - Fix a mainly theoretical issue in devpts - Check the return value of __getblk() in reiserfs - Fix a racy assert in i_readcount_dec - Fix integer conversion issues in various functions - Fix LSM security context handling during automounts that prevented NFS superblock sharing" * tag 'v6.6-vfs.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits) cachefiles: use kiocb_{start,end}_write() helpers ovl: use kiocb_{start,end}_write() helpers aio: use kiocb_{start,end}_write() helpers io_uring: use kiocb_{start,end}_write() helpers fs: create kiocb_{start,end}_write() helpers fs: add kerneldoc to file_{start,end}_write() helpers io_uring: rename kiocb_end_write() local helper splice: Convert page_cache_pipe_buf_confirm() to use a folio libfs: Convert simple_write_begin and simple_write_end to use a folio fs/dcache: Replace printk and WARN_ON by WARN fs/pipe: remove redundant initialization of pointer buf fs: Fix kernel-doc warnings devpts: Fix kernel-doc warnings doc: idmappings: fix an error and rephrase a paragraph init: Add support for rootwait timeout parameter vfs: fix up the assert in i_readcount_dec fs: Fix one kernel-doc comment docs: filesystems: idmappings: clarify from where idmappings are taken fs/buffer.c: disable per-CPU buffer_head cache for isolated CPUs vfs, security: Fix automount superblock LSM init problem, preventing NFS sb sharing ...	2023-08-28 10:17:14 -07:00
Linus Torvalds	ecd7db2047	v6.6-vfs.tmpfs -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTkgAKCRCRxhvAZXjc ouZsAPwNBHB2aPKtzWURuKx5RX02vXTzHX+A/LpuDz5WBFe8zQD+NlaBa4j0MBtS rVYM+CjOXnjnsLc8W0euMnfYNvViKgQ= =L2+2 -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull libfs and tmpfs updates from Christian Brauner: "This cycle saw a lot of work for tmpfs that required changes to the vfs layer. Andrew, Hugh, and I decided to take tmpfs through vfs this cycle. Things will go back to mm next cycle. Features ======== - By far the biggest work is the quota support for tmpfs. New tmpfs quota infrastructure is added to support it and a new QFMT_SHMEM uapi option is exposed. This offers user and group quotas to tmpfs (project quotas will be added later). Similar to other filesystems tmpfs quota are not supported within user namespaces yet. - Add support for user xattrs. While tmpfs already supports security xattrs (security.) and POSIX ACLs for a long time it lacked support for user xattrs (user.). With this pull request tmpfs will be able to support a limited number of user xattrs. This is accompanied by a fix (see below) to limit persistent simple xattr allocations. - Add support for stable directory offsets. Currently tmpfs relies on the libfs provided cursor-based mechanism for readdir. This causes issues when a tmpfs filesystem is exported via NFS. NFS clients do not open directories. Instead, each server-side readdir operation opens the directory, reads it, and then closes it. Since the cursor state for that directory is associated with the opened file it is discarded after each readdir operation. Such directory offsets are not just cached by NFS clients but also various userspace libraries based on these clients. As it stands there is no way to invalidate the caches when directory offsets have changed and the whole application depends on unchanging directory offsets. At LSFMM we discussed how to solve this problem and decided to support stable directory offsets. libfs now allows filesystems like tmpfs to use an xarrary to map a directory offset to a dentry. This mechanism is currently only used by tmpfs but can be supported by others as well. Fixes ===== - Change persistent simple xattrs allocations in libfs from GFP_KERNEL to GPF_KERNEL_ACCOUNT so they're subject to memory cgroup limits. Since this is a change to libfs it affects both tmpfs and kernfs. - Correctly verify {g,u}id mount options. A new filesystem context is created via fsopen() which records the namespace that becomes the owning namespace of the superblock when fsconfig(FSCONFIG_CMD_CREATE) is called for filesystems that are mountable in namespaces. However, fsconfig() calls can occur in a namespace different from the namespace where fsopen() has been called. Currently, when fsconfig() is called to set {g,u}id mount options the requested {g,u}id is mapped into a k{g,u}id according to the namespace where fsconfig() was called from. The resulting k{g,u}id is not guaranteed to be resolvable in the namespace of the filesystem (the one that fsopen() was called in). This means it's possible for an unprivileged user to create files owned by any group in a tmpfs mount since it's possible to set the setid bits on the tmpfs directory. The contract for {g,u}id mount options and {g,u}id values in general set from userspace has always been that they are translated according to the caller's idmapping. In so far, tmpfs has been doing the correct thing. But since tmpfs is mountable in unprivileged contexts it is also necessary to verify that the resulting {k,g}uid is representable in the namespace of the superblock to avoid such bugs. The new mount api's cross-namespace delegation abilities are already widely used. Having talked to a bunch of userspace this is the most faithful solution with minimal regression risks" * tag 'v6.6-vfs.tmpfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrs mm: invalidation check mapping before folio_contains tmpfs: trivial support for direct IO tmpfs,xattr: enable limited user extended attributes tmpfs: track free_ispace instead of free_inodes xattr: simple_xattr_set() return old_xattr to be freed tmpfs: verify {g,u}id mount options correctly shmem: move spinlock into shmem_recalc_inode() to fix quota support libfs: Remove parent dentry locking in offset_iterate_dir() libfs: Add a lock class for the offset map's xa_lock shmem: stable directory offsets shmem: Refactor shmem_symlink() libfs: Add directory operations for stable offsets shmem: fix quota lock nesting in huge hole handling shmem: Add default quota limit mount options shmem: quota support shmem: prepare shmem quota infrastructure quota: Check presence of quota operation structures instead of ->quota_read and ->quota_write callbacks shmem: make shmem_get_inode() return ERR_PTR instead of NULL shmem: make shmem_inode_acct_block() return error	2023-08-28 09:55:25 -07:00
Linus Torvalds	615e95831e	v6.6-vfs.ctime -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXTKAAKCRCRxhvAZXjc oifJAQCzi/p+AdQu8LA/0XvR7fTwaq64ZDCibU4BISuLGT2kEgEAuGbuoFZa0rs2 XYD/s4+gi64p9Z01MmXm2XO1pu3GPg0= =eJz5 -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs timestamp updates from Christian Brauner: "This adds VFS support for multi-grain timestamps and converts tmpfs, xfs, ext4, and btrfs to use them. This carries acks from all relevant filesystems. The VFS always uses coarse-grained timestamps when updating the ctime and mtime after a change. This has the benefit of allowing filesystems to optimize away a lot of metadata updates, down to around 1 per jiffy, even when a file is under heavy writes. Unfortunately, this has always been an issue when we're exporting via NFSv3, which relies on timestamps to validate caches. A lot of changes can happen in a jiffy, so timestamps aren't sufficient to help the client decide to invalidate the cache. Even with NFSv4, a lot of exported filesystems don't properly support a change attribute and are subject to the same problems with timestamp granularity. Other applications have similar issues with timestamps (e.g., backup applications). If we were to always use fine-grained timestamps, that would improve the situation, but that becomes rather expensive, as the underlying filesystem would have to log a lot more metadata updates. This introduces fine-grained timestamps that are used when they are actively queried. This uses the 31st bit of the ctime tv_nsec field to indicate that something has queried the inode for the mtime or ctime. When this flag is set, on the next mtime or ctime update, the kernel will fetch a fine-grained timestamp instead of the usual coarse-grained one. As POSIX generally mandates that when the mtime changes, the ctime must also change the kernel always stores normalized ctime values, so only the first 30 bits of the tv_nsec field are ever used. Filesytems can opt into this behavior by setting the FS_MGTIME flag in the fstype. Filesystems that don't set this flag will continue to use coarse-grained timestamps. Various preparatory changes, fixes and cleanups are included: - Fixup all relevant places where POSIX requires updating ctime together with mtime. This is a wide-range of places and all maintainers provided necessary Acks. - Add new accessors for inode->i_ctime directly and change all callers to rely on them. Plain accesses to inode->i_ctime are now gone and it is accordingly rename to inode->__i_ctime and commented as requiring accessors. - Extend generic_fillattr() to pass in a request mask mirroring in a sense the statx() uapi. This allows callers to pass in a request mask to only get a subset of attributes filled in. - Rework timestamp updates so it's possible to drop the @now parameter the update_time() inode operation and associated helpers. - Add inode_update_timestamps() and convert all filesystems to it removing a bunch of open-coding" * tag 'v6.6-vfs.ctime' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (107 commits) btrfs: convert to multigrain timestamps ext4: switch to multigrain timestamps xfs: switch to multigrain timestamps tmpfs: add support for multigrain timestamps fs: add infrastructure for multigrain timestamps fs: drop the timespec64 argument from update_time xfs: have xfs_vn_update_time gets its own timestamp fat: make fat_update_time get its own timestamp fat: remove i_version handling from fat_update_time ubifs: have ubifs_update_time use inode_update_timestamps btrfs: have it use inode_update_timestamps fs: drop the timespec64 arg from generic_update_time fs: pass the request_mask to generic_fillattr fs: remove silly warning from current_time gfs2: fix timestamp handling on quota inodes fs: rename i_ctime field to __i_ctime selinux: convert to ctime accessor functions security: convert to ctime accessor functions apparmor: convert to ctime accessor functions sunrpc: convert to ctime accessor functions ...	2023-08-28 09:31:32 -07:00
Linus Torvalds	84ab1277ce	v6.6-vfs.fs_context -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZOXUHQAKCRCRxhvAZXjc opWuAQC5wYyKWMwpxc3GaGcHiC7nq0uyYCcVgzeebsw1eGzFvgD9FoYRphC2pqi1 p8qUexEK2aOZmPquFWmRDTRMcZ23YAk= =UKnx -----END PGP SIGNATURE----- Merge tag 'v6.6-vfs.fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull mount API updates from Christian Brauner: "This introduces FSCONFIG_CMD_CREATE_EXCL which allows userspace to implement something like $ mount -t ext4 --exclusive /dev/sda /B which fails if a superblock for the requested filesystem does already exist instead of silently reusing an existing superblock. Without it, in the sequence $ move-mount -f xfs -o source=/dev/sda4 /A $ move-mount -f xfs -o noacl,source=/dev/sda4 /B the initial mounter will create a superblock. The second mounter will reuse the existing superblock, creating a bind-mount (see [1] for the source of the move-mount binary). The problem is that reusing an existing superblock means all mount options other than read-only and read-write will be silently ignored even if they are incompatible requests. For example, the second mount has requested no POSIX ACL support but since the existing superblock is reused POSIX ACL support will remain enabled. Such silent superblock reuse can easily become a security issue. After adding support for FSCONFIG_CMD_CREATE_EXCL to mount(8) in util-linux this can be fixed: $ move-mount -f xfs --exclusive -o source=/dev/sda4 /A $ move-mount -f xfs --exclusive -o noacl,source=/dev/sda4 /B Device or resource busy \| move-mount.c: 300: do_fsconfig: i xfs: reusing existing filesystem not allowed This requires the new mount api. With the old mount api it would be necessary to plumb this through every legacy filesystem's file_system_type->mount() method. If they want this feature they are most welcome to switch to the new mount api" Link: https://github.com/brauner/move-mount-beneath [1] Link: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner Link: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner Link: https://lore.kernel.org/linux-fsdevel/20230725-einnahmen-warnschilder-17779aec0a97@brauner Link: https://lore.kernel.org/lkml/20230824-anzog-allheilmittel-e8c63e429a79@brauner/ * tag 'v6.6-vfs.fs_context' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: fs: add FSCONFIG_CMD_CREATE_EXCL fs: add vfs_cmd_reconfigure() fs: add vfs_cmd_create() super: remove get_tree_single_reconf()	2023-08-28 09:00:09 -07:00
Baokun Li	768d612f79	ext4: fix slab-use-after-free in ext4_es_insert_extent() Yikebaer reported an issue: ================================================================== BUG: KASAN: slab-use-after-free in ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894 Read of size 4 at addr ffff888112ecc1a4 by task syz-executor/8438 CPU: 1 PID: 8438 Comm: syz-executor Not tainted 6.5.0-rc5 #1 Call Trace: [...] kasan_report+0xba/0xf0 mm/kasan/report.c:588 ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] Allocated by task 8438: [...] kmem_cache_zalloc include/linux/slab.h:693 [inline] __es_alloc_extent fs/ext4/extents_status.c:469 [inline] ext4_es_insert_extent+0x672/0xcb0 fs/ext4/extents_status.c:873 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] Freed by task 8438: [...] kmem_cache_free+0xec/0x490 mm/slub.c:3823 ext4_es_try_to_merge_right fs/ext4/extents_status.c:593 [inline] __es_insert_extent+0x9f4/0x1440 fs/ext4/extents_status.c:802 ext4_es_insert_extent+0x2ca/0xcb0 fs/ext4/extents_status.c:882 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] ================================================================== The flow of issue triggering is as follows: 1. remove es raw es es removed es1 \|-------------------\| -> \|----\|.......\|------\| 2. insert es es insert es1 merge with es es1 merge with es and free es1 \|----\|.......\|------\| -> \|------------\|------\| -> \|-------------------\| es merges with newes, then merges with es1, frees es1, then determines if es1->es_len is 0 and triggers a UAF. The code flow is as follows: ext4_es_insert_extent es1 = __es_alloc_extent(true); es2 = __es_alloc_extent(true); __es_remove_extent(inode, lblk, end, NULL, es1) __es_insert_extent(inode, &newes, es1) ---> insert es1 to es tree __es_insert_extent(inode, &newes, es2) ext4_es_try_to_merge_right ext4_es_free_extent(inode, es1) ---> es1 is freed if (es1 && !es1->es_len) // Trigger UAF by determining if es1 is used. We determine whether es1 or es2 is used immediately after calling __es_remove_extent() or __es_insert_extent() to avoid triggering a UAF if es1 or es2 is freed. Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com> Closes: https://lore.kernel.org/lkml/CALcu4raD4h9coiyEBL4Bm0zjDwxC2CyPiTwsP3zFuhot6y9Beg@mail.gmail.com Fixes: `2a69c45008` ("ext4: using nofail preallocation in ext4_es_insert_extent()") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230815070808.3377171-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:13 -04:00
Eric Biggers	af494af385	libfs: remove redundant checks of s_encoding Now that neither ext4 nor f2fs allows inodes with the casefold flag to be instantiated when unsupported, it's unnecessary to repeatedly check for support later on during random filesystem operations. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230814182903.37267-4-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:13 -04:00
Eric Biggers	b814279395	ext4: remove redundant checks of s_encoding Now that ext4 does not allow inodes with the casefold flag to be instantiated when unsupported, it's unnecessary to repeatedly check for support later on during random filesystem operations. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230814182903.37267-3-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:13 -04:00
Eric Biggers	8216776ccf	ext4: reject casefold inode flag without casefold feature It is invalid for the casefold inode flag to be set without the casefold superblock feature flag also being set. e2fsck already considers this case to be invalid and handles it by offering to clear the casefold flag on the inode. __ext4_iget() also already considered this to be invalid, sort of, but it only got so far as logging an error message; it didn't actually reject the inode. Make it reject the inode so that other code doesn't have to handle this case. This matches what f2fs does. Note: we could check 's_encoding != NULL' instead of ext4_has_feature_casefold(). This would make the check robust against the casefold feature being enabled by userspace writing to the page cache of the mounted block device. However, it's unsolvable in general for filesystems to be robust against concurrent writes to the page cache of the mounted block device. Though this very particular scenario involving the casefold feature is solvable, we should not pretend that we can support this model, so let's just check the casefold feature. tune2fs already forbids enabling casefold on a mounted filesystem. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230814182903.37267-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:13 -04:00
Ruan Jinjie	0f6bc57971	ext4: use LIST_HEAD() to initialize the list_head in mballoc.c Use LIST_HEAD() to initialize the list_head instead of open-coding it. Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Link: https://lore.kernel.org/r/20230812071839.3481909-1-ruanjinjie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:13 -04:00
Liu Song	03de20bed2	ext4: do not mark inode dirty every time when appending using delalloc In the delalloc append write scenario, if inode's i_size is extended due to buffer write, there are delalloc writes pending in the range up to i_size, and no need to touch i_disksize since writeback will push i_disksize up to i_size eventually. Offers significant performance improvement in high-frequency append write scenarios. I conducted tests in my 32-core environment by launching 32 concurrent threads to append write to the same file. Each write operation had a length of 1024 bytes and was repeated 100000 times. Without using this patch, the test was completed in 7705 ms. However, with this patch, the test was completed in 5066 ms, resulting in a performance improvement of 34%. Moreover, in test scenarios of Kafka version 2.6.2, using packet size of 2K, with this patch resulted in a 10% performance improvement. Signed-off-by: Liu Song <liusong@linux.alibaba.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230810154333.84921-1-liusong@linux.alibaba.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:13 -04:00
Theodore Ts'o	bb15cea20f	ext4: rename s_error_work to s_sb_upd_work The most common use that s_error_work will get scheduled is now the periodic update of the superblock. So rename it to s_sb_upd_work. Also rename the function flush_stashed_error_work() to update_super_work(). Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Vitaliy Kuznetsov	ff0722de89	ext4: add periodic superblock update check This patch introduces a mechanism to periodically check and update the superblock within the ext4 file system. The main purpose of this patch is to keep the disk superblock up to date. The update will be performed if more than one hour has passed since the last update, and if more than 16MB of data have been written to disk. This check and update is performed within the ext4_journal_commit_callback function, ensuring that the superblock is written while the disk is active, rather than based on a timer that may trigger during disk idle periods. Discussion https://www.spinics.net/lists/linux-ext4/msg85865.html Signed-off-by: Vitaliy Kuznetsov <vk.en.mail@gmail.com> Link: https://lore.kernel.org/r/20230810143852.40228-1-vk.en.mail@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Brian Foster	194505b55d	ext4: drop dio overwrite only flag and associated warning The commit referenced below opened up concurrent unaligned dio under shared locking for pure overwrites. In doing so, it enabled use of the IOMAP_DIO_OVERWRITE_ONLY flag and added a warning on unexpected -EAGAIN returns as an extra precaution, since ext4 does not retry writes in such cases. The flag itself is advisory in this case since ext4 checks for unaligned I/Os and uses appropriate locking up front, rather than on a retry in response to -EAGAIN. As it turns out, the warning check is susceptible to false positives because there are scenarios where -EAGAIN can be expected from lower layers without necessarily having IOCB_NOWAIT set on the iocb. For example, one instance of the warning has been seen where io_uring sets IOCB_HIPRI, which in turn results in REQ_POLLED\|REQ_NOWAIT on the bio. This results in -EAGAIN if the block layer is unable to allocate a request, etc. [Note that there is an outstanding patch to untangle REQ_POLLED and REQ_NOWAIT such that the latter relies on IOCB_NOWAIT, which would also address this instance of the warning.] Another instance of the warning has been reproduced by syzbot. A dio write is interrupted down in __get_user_pages_locked() waiting on the mm lock and returns -EAGAIN up the stack. If the iomap dio iteration layer has made no progress on the write to this point, -EAGAIN returns up to the filesystem and triggers the warning. This use of the overwrite flag in ext4 is precautionary and half-baked. I.e., ext4 doesn't actually implement overwrite checking in the iomap callbacks when the flag is set, so the only extra verification it provides are i_size checks in the generic iomap dio layer. Combined with the tendency for false positives, the added verification is not worth the extra trouble. Remove the flag, associated warning, and update the comments to document when concurrent unaligned dio writes are allowed and why said flag is not used. Cc: stable@kernel.org Reported-by: syzbot+5050ad0fb47527b1808a@syzkaller.appspotmail.com Reported-by: Pengfei Xu <pengfei.xu@intel.com> Fixes: `310ee0902b` ("ext4: allow concurrent unaligned dio overwrites") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230810165559.946222-1-bfoster@redhat.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Wang Jianjian	68228da51c	ext4: add correct group descriptors and reserved GDT blocks to system zone When setup_system_zone, flex_bg is not initialized so it is always 1. Use a new helper function, ext4_num_base_meta_blocks() which does not depend on sbi->s_log_groups_per_flex being initialized. [ Squashed two patches in the Link URL's below together into a single commit, which is simpler to review/understand. Also fix checkpatch warnings. --TYT ] Cc: stable@kernel.org Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com> Link: https://lore.kernel.org/r/tencent_21AF0D446A9916ED5C51492CC6C9A0A77B05@qq.com Link: https://lore.kernel.org/r/tencent_D744D1450CC169AEA77FCF0A64719909ED05@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Cai Xinchen	b6c7d6dc8a	ext4: remove unused function declaration These functions do not have its function implementation. So those function declaration is useless. Remove these Signed-off-by: Cai Xinchen <caixinchen1@huawei.com> Link: https://lore.kernel.org/r/20230802030025.173148-1-caixinchen1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Su Hui	a50bda1474	ext4: mballoc: avoid garbage value from err clang's static analysis warning: fs/ext4/mballoc.c line 4178, column 6, Branch condition evaluates to a garbage value. err is uninitialized and will be judged when 'len <= 0' or it first enters the loop while the condition "!ext4_sb_block_valid()" is true. Although this can't make problems now, it's better to correct it. Signed-off-by: Su Hui <suhui@nfschina.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Link: https://lore.kernel.org/r/20230725043310.1227621-1-suhui@nfschina.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Lu Hongfei	79ebf48c44	ext4: use sbi instead of EXT4_SB(sb) in ext4_mb_new_blocks_simple() Signed-off-by: Lu Hongfei <luhongfei@vivo.com> Link: https://lore.kernel.org/r/20230707115907.26637-1-luhongfei@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Lu Hongfei	89cadf6e22	ext4: change the type of blocksize in ext4_mb_init_cache() The return value type of i_blocksize() is 'unsigned int', so the type of blocksize has been modified from 'int' to 'unsigned int' to ensure data type consistency. Signed-off-by: Lu Hongfei <luhongfei@vivo.com> Link: https://lore.kernel.org/r/20230707105516.9156-1-luhongfei@vivo.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Zhihao Cheng	1524773425	ext4: fix unttached inode after power cut with orphan file feature enabled Running generic/475(filesystem consistent tests after power cut) could easily trigger unattached inode error while doing fsck: Unattached zero-length inode 39405. Clear? no Unattached inode 39405 Connect to /lost+found? no Above inconsistence is caused by following process: P1 P2 ext4_create inode = ext4_new_inode_start_handle // itable records nlink=1 ext4_add_nondir err = ext4_add_entry // ENOSPC ext4_append ext4_bread ext4_getblk ext4_map_blocks // returns ENOSPC drop_nlink(inode) // won't be updated into disk inode ext4_orphan_add(handle, inode) ext4_orphan_file_add ext4_journal_stop(handle) jbd2_journal_commit_transaction // commit success >> power cut << ext4_fill_super ext4_load_and_init_journal // itable records nlink=1 ext4_orphan_cleanup ext4_process_orphan if (inode->i_nlink) // true, inode won't be deleted Then, allocated inode will be reserved on disk and corresponds to no dentries, so e2fsck reports 'unattached inode' problem. The problem won't happen if orphan file feature is disabled, because ext4_orphan_add() will update disk inode in orphan list mode. There are several places not updating disk inode while putting inode into orphan area, such as ext4_add_nondir(), ext4_symlink() and whiteout in ext4_rename(). Fix it by updating inode into disk in all error branches of these places. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217605 Fixes: `02f310fcf4` ("ext4: Speedup ext4 orphan inode handling") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230628132011.650383-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Zhang Yi	2dfba3bb40	jbd2: correct the end of the journal recovery scan range We got a filesystem inconsistency issue below while running generic/475 I/O failure pressure test with fast_commit feature enabled. Symlink /p3/d3/d1c/d6c/dd6/dce/l101 (inode #132605) is invalid. If fast_commit feature is enabled, a special fast_commit journal area is appended to the end of the normal journal area. The journal->j_last point to the first unused block behind the normal journal area instead of the whole log area, and the journal->j_fc_last point to the first unused block behind the fast_commit journal area. While doing journal recovery, do_one_pass(PASS_SCAN) should first scan the normal journal area and turn around to the first block once it meet journal->j_last, but the wrap() macro misuse the journal->j_fc_last, so the recovering could not read the next magic block (commit block perhaps) and would end early mistakenly and missing tN and every transaction after it in the following example. Finally, it could lead to filesystem inconsistency. \| normal journal area \| fast commit area \| +-------------------------------------------------+------------------+ \| tN(rere) \| tN+1 \|~\| tN-x \|...\| tN-1 \| tN(front) \| .... \| +-------------------------------------------------+------------------+ / / / start journal->j_last journal->j_fc_last This patch fix it by use the correct ending journal->j_last. Fixes: `5b849b5f96` ("jbd2: fast commit recovery path") Cc: stable@kernel.org Reported-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/linux-ext4/20230613043120.GB1584772@mit.edu/ Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230626073322.3956567-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:27:12 -04:00
Zhang Yi	ee5c807137	ext4: ext4_get_{dev}_journal return proper error value ext4_get_journal() and ext4_get_dev_journal() return NULL if they failed to init journal, making them return proper error value instead, also rename them to ext4_open_{inode,dev}_journal(). [ Folded fix to ext4_calculate_overhead() to check for an ERR_PTR instead of NULL. ] Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230811063610.2980059-13-yi.zhang@huaweicloud.com Reported-by: syzbot+b3123e6d9842e526de39@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/20230826011029.2023140-1-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2023-08-27 11:13:39 -04:00
Linus Torvalds	6f0edbb833	18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4 issues or aren't considered suitable for a -stable backport. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZOjuGgAKCRDdBJ7gKXxA jkLlAQDY9sYxhQZp1PFLirUIPeOBjEyifVy6L6gCfk9j0snLggEA2iK+EtuJt2Dc SlMfoTq29zyU/YgfKKwZEVKtPJZOHQU= =oTcj -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "18 hotfixes. 13 are cc:stable and the remainder pertain to post-6.4 issues or aren't considered suitable for a -stable backport" * tag 'mm-hotfixes-stable-2023-08-25-11-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: shmem: fix smaps BUG sleeping while atomic selftests: cachestat: catch failing fsync test on tmpfs selftests: cachestat: test for cachestat availability maple_tree: disable mas_wr_append() when other readers are possible madvise:madvise_free_pte_range(): don't use mapcount() against large folio for sharing check madvise:madvise_free_huge_pmd(): don't use mapcount() against large folio for sharing check madvise:madvise_cold_or_pageout_pte_range(): don't use mapcount() against large folio for sharing check mm: multi-gen LRU: don't spin during memcg release mm: memory-failure: fix unexpected return value in soft_offline_page() radix tree: remove unused variable mm: add a call to flush_cache_vmap() in vmap_pfn() selftests/mm: FOLL_LONGTERM need to be updated to 0x100 nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers() mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via GUP-fast selftests: cgroup: fix test_kmem_basic less than error mm: enable page walking API to lock vmas during the walk smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd() mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULT	2023-08-25 11:44:43 -07:00
Daeho Jeong	3b71661214	f2fs: use finish zone command when closing a zone Use the finish zone command first when a zone should be closed. Signed-off-by: Daeho Jeong <daehojeong@google.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2023-08-25 10:30:37 -07:00
Alexander Aring	7c53e847ff	dlm: fix plock lookup when using multiple lockspaces All posix lock ops, for all lockspaces (gfs2 file systems) are sent to userspace (dlm_controld) through a single misc device. The dlm_controld daemon reads the ops from the misc device and sends them to other cluster nodes using separate, per-lockspace cluster api communication channels. The ops for a single lockspace are ordered at this level, so that the results are received in the same sequence that the requests were sent. When the results are sent back to the kernel via the misc device, they are again funneled through the single misc device for all lockspaces. When the dlm code in the kernel processes the results from the misc device, these results will be returned in the same sequence that the requests were sent, on a per-lockspace basis. A recent change in this request/reply matching code missed the "per-lockspace" check (fsid comparison) when matching request and reply, so replies could be incorrectly matched to requests from other lockspaces. Cc: stable@vger.kernel.org Reported-by: Barry Marson <bmarson@redhat.com> Fixes: `57e2c2f2d9` ("fs: dlm: fix mismatch of plock results from userspace") Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2023-08-25 10:31:39 -05:00
Steve French	09ee7a3bf8	[SMB3] send channel sequence number in SMB3 requests after reconnects The ChannelSequence field in the SMB3 header is supposed to be increased after reconnect to allow the server to distinguish requests from before and after the reconnect. We had always been setting it to zero. There are cases where incrementing ChannelSequence on requests after network reconnects can reduce the chance of data corruptions. See MS-SMB2 3.2.4.1 and 3.2.7.1 Signed-off-by: Steve French <stfrench@microsoft.com> Cc: stable@vger.kernel.org # 5.16+	2023-08-24 23:37:06 -05:00
Oleg Nesterov	dce8f8ed1d	document while_each_thread(), change first_tid() to use for_each_thread() Add the comment to explain that while_each_thread(g,t) is not rcu-safe unless g is stable (e.g. current). Even if g is a group leader and thus can't exit before t, t or another sub-thread can exec and remove g from the thread_group list. The only lockless user of while_each_thread() is first_tid() and it is fine in that it can't loop forever, yet for_each_thread() looks better and I am going to change while_each_thread/next_thread. Link: https://lkml.kernel.org/r/20230823170806.GA11724@redhat.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:25:15 -07:00
Matthew Wilcox (Oracle)	1d024e7a8d	mm: remove enum page_entry_size Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order(). The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value. If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method. [willy@infradead.org: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:20:30 -07:00
Matthew Wilcox (Oracle)	051ddcfeb1	mm: move PMD_ORDER to pgtable.h Patch series "Change calling convention for ->huge_fault", v2. There are two unrelated changes to the calling convention for ->huge_fault. I've bundled them together to help people notice the change. The first is to improve scalability of DAX page faults by allowing them to be handled under the VMA lock. The second is to remove enum page_entry_size since it's really unnecessary. The changelogs and documentation updates hopefully work to that end. This patch (of 3): Allow this to be used in generic code. Also add PUD_ORDER. Link: https://lkml.kernel.org/r/20230818202335.2739663-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:20:29 -07:00
Jann Horn	004a9a38e2	mm: userfaultfd: remove stale comment about core dump locking Since commit `7f3bfab52c` ("mm/gup: take mmap_lock in get_dump_page()"), which landed in v5.10, core dumping doesn't enter fault handling without holding the mmap_lock anymore. Remove the stale parts of the comments, but leave the behavior as-is - letting core dumping block on userfault handling would be a bad idea and could lead to deadlocks if the dumping process was handling its own userfaults. Link: https://lkml.kernel.org/r/20230815212216.264445-1-jannh@google.com Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:20:27 -07:00
Matthew Wilcox (Oracle)	f9bff0e318	minmax: add in_range() macro Patch series "New page table range API", v6. This patchset changes the API used by the MM to set up page table entries. The four APIs are: set_ptes(mm, addr, ptep, pte, nr) update_mmu_cache_range(vma, addr, ptep, nr) flush_dcache_folio(folio) flush_icache_pages(vma, page, nr) flush_dcache_folio() isn't technically new, but no architecture implemented it, so I've done that for them. The old APIs remain around but are mostly implemented by calling the new interfaces. The new APIs are based around setting up N page table entries at once. The N entries belong to the same PMD, the same folio and the same VMA, so ptep++ is a legitimate operation, and locking is taken care of for you. Some architectures can do a better job of it than just a loop, but I have hesitated to make too deep a change to architectures I don't understand well. One thing I have changed in every architecture is that PG_arch_1 is now a per-folio bit instead of a per-page bit when used for dcache clean/dirty tracking. This was something that would have to happen eventually, and it makes sense to do it now rather than iterate over every page involved in a cache flush and figure out if it needs to happen. The point of all this is better performance, and Fengwei Yin has measured improvement on x86. I suspect you'll see improvement on your architecture too. Try the new will-it-scale test mentioned here: https://lore.kernel.org/linux-mm/20230206140639.538867-5-fengwei.yin@intel.com/ You'll need to run it on an XFS filesystem and have CONFIG_TRANSPARENT_HUGEPAGE set. This patchset is the basis for much of the anonymous large folio work being done by Ryan, so it's received quite a lot of testing over the last few months. This patch (of 38): Determine if a value lies within a range more efficiently (subtraction + comparison vs two comparisons and an AND). It also has useful (under some circumstances) behaviour if the range exceeds the maximum value of the type. Convert all the conflicting definitions of in_range() within the kernel; some can use the generic definition while others need their own definition. Link: https://lkml.kernel.org/r/20230802151406.3735276-1-willy@infradead.org Link: https://lkml.kernel.org/r/20230802151406.3735276-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:20:18 -07:00
Suren Baghdasaryan	29a22b9e08	mm: handle userfaults under VMA lock Enable handle_userfault to operate under VMA lock by releasing VMA lock instead of mmap_lock and retrying. Note that FAULT_FLAG_RETRY_NOWAIT should never be used when handling faults under per-VMA lock protection because that would break the assumption that lock is dropped on retry. [surenb@google.com: fix a lockdep issue in vma_assert_write_locked] Link: https://lkml.kernel.org/r/20230712195652.969194-1-surenb@google.com Link: https://lkml.kernel.org/r/20230630211957.1341547-7-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hdanton@sina.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Laurent Dufour <ldufour@linux.ibm.com> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michel Lespinasse <michel@lespinasse.org> Cc: Minchan Kim <minchan@google.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-08-24 16:20:17 -07:00
Linus Torvalds	f8d6ff4490	nfsd-6.5 fixes: - Close race window when handling FREE_STATEID operations - Fix regression in /proc/fs/nfsd/v4_end_grace introduced in v6.5-rc -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmTntHEACgkQM2qzM29m f5cGgQ//ZvUW1Vvp+s86puw+EyEtwu15Ms19kTYNR+AKebpPz/c9K9iEF3nmZXEL bRn25fELtzXYi8rqavrv8fMj7dQhmkT3DE0WaJcTtCLD5N5bGDO3mQeoQ1fKGR1r rHITp0jC25Viur7kXhXU6qIcvu0VthK+feW/DMlkKsmSlQE5V4utxUGYZp8gfZGU 7cbYRpCqF2J1bJSPxH/lKpg5ZHztpZW6aPXG7frHcg04qsfqrMRS0HqG8KYaAKXh BObBqSYDo8agOa3u361pBZoVZHF2/7gFXlZKIZdp+6F5/B1IjoN+7eWnI7hFxiH7 zf5jLa9xlWrXr2vQTuPEJa9dCr756Ixzq7IJ7ZzIMOpVypixZ04jBLfnuhcnayu9 8k/0CFqQwmfvIcXgJEpTJ+OKm0kDqI3n7WE9gkeYBkRewEvJQXaFZ/vqTYi7bp9H eWlwQ4bHE5touERBMp0HmDdct/ZdUn8dS6MDcdGFXrVf5m+Jt6hZCXTnpU3Ah+zF d0uK4IEwJ2yC9FhBqOYZ6+XBr1JA+40vdnHOBvKdpAzQnIdwnNa4rzR0Eab6+m4i fmhI63s9slPBcBMroRC0mhftcdkd7LjBhhWbsDu8nemKmmHcOKzwTda78EayQYnm /zJUVr8BqzkgaJG1PUn9y0g4IOfgTiokDmBdLu6bTAanRtekhVY= =BF07 -----END PGP SIGNATURE----- Merge tag 'nfsd-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: "Two last-minute one-liners for v6.5-rc. One got lost in the shuffle, and the other was reported just this morning" - Close race window when handling FREE_STATEID operations - Fix regression in /proc/fs/nfsd/v4_end_grace introduced in v6.5-rc" * tag 'nfsd-6.5-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Fix a thinko introduced by recent trace point changes nfsd: Fix race to FREE_STATEID and cl_revoked	2023-08-24 14:30:47 -07:00
Olga Kornievskaia	51d674a5e4	NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server After receiving the location(s) of the DS server(s) in the GETDEVINCEINFO, create the request for the clientid to such server and indicate that the client is connecting to a DS. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-24 13:24:15 -04:00
Trond Myklebust	537935f72e	NFS/pNFS: Set the connect timeout for the pNFS flexfiles driver Ensure that the connect timeout for the pNFS flexfiles driver is of the same order as the I/O timeout, so that we can fail over quickly when trying to read from a data server that is down. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-24 13:24:15 -04:00
Trond Myklebust	88975a5596	NFS: Fix a potential data corruption We must ensure that the subrequests are joined back into the head before we can retransmit a request. If the head was not on the commit lists, because the server wrote it synchronously, we still need to add it back to the retransmission list. Add a call that mirrors the effect of nfs_cancel_remove_inode() for O_DIRECT. Fixes: `ed5d588fe4` ("NFS: Try to join page groups before an O_DIRECT retransmission") Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-24 13:24:15 -04:00
Kinglong Mee	14e7316a3c	nfs: fix redundant readdir request after get eof When a directory contains 17 files (except . and ..), nfs client sends a redundant readdir request after get eof. A simple reproduce, At NFS server, create a directory with 17 files under exported directory. # mkdir test # cd test # for i in {0..16} ; do touch $i; done At NFS client, no matter mounting through nfsv3 or nfsv4, does ls (or ll) at the created test directory. A tshark output likes following (for nfsv4), # tshark -i eth0 tcp port 2049 -Tfields -e ip.src -e ip.dst -e nfs -e nfs.cookie4 srcip dstip SEQUENCE, PUTFH, READDIR 0 dstip srcip SEQUENCE PUTFH READDIR 909539109313539306,2108391201987888856,2305312124304486544,2566335452463141496,2978225129081509984,4263037479923412583,4304697173036510679,4666703455469210097,4759208201298769007,4776701232145978803,5338408478512081262,5949498658935544804,5971526429894832903,6294060338267709855,6528840566229532529,8600463293536422524,9223372036854775807 srcip dstip srcip dstip SEQUENCE, PUTFH, READDIR 9223372036854775807 dstip srcip SEQUENCE PUTFH READDIR The READDIR with cookie 9223372036854775807(0x7FFFFFFFFFFFFFFF) is redundant. Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-24 13:24:15 -04:00
Dan Carpenter	08b45fcb2d	nfs/blocklayout: Use the passed in gfp flags This allocation should use the passed in GFP_ flags instead of GFP_KERNEL. One places where this matters is in filelayout_pg_init_write() which uses GFP_NOFS as the allocation flags. Fixes: `5c83746a0c` ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-24 13:24:15 -04:00
huzhi001@208suo.com	a841c9cb9b	filemap: Fix errors in file.c The following checkpatch errors are removed: ERROR: "foo * bar" should be "foo bar" "foo bar" should be "foo *bar" Signed-off-by: ZhiHu <huzhi001@208suo.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-24 13:24:15 -04:00
Fedor Pchelkin	96562c45af	NFSv4/pnfs: minor fix for cleanup path in nfs4_get_device_info It is an almost improbable error case but when page allocating loop in nfs4_get_device_info() fails then we should only free the already allocated pages, as __free_page() can't deal with NULL arguments. Found by Linux Verification Center (linuxtesting.org). Cc: stable@vger.kernel.org Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-24 13:24:15 -04:00
GUO Zihua	08be82ba0c	NFS: Move common includes outside ifdef module.h, clnt.h, addr.h and dns_resolve.h is always included whether CONFIG_NFS_USE_KERNEL_DNS is set or not and their order does not seems to matter. Move them outside the ifdef to simplify code and avoid checkincludes message. Signed-off-by: GUO Zihua <guozihua@huawei.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2023-08-24 13:24:15 -04:00
Chuck Lever	8073a98e95	NFSD: Fix a thinko introduced by recent trace point changes The fixed commit erroneously removed a call to nfsd_end_grace(), which makes calls to write_v4_end_grace() a no-op. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202308241229.68396422-oliver.sang@intel.com Fixes: `39d432fc76` ("NFSD: trace nfsctl operations") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>	2023-08-24 10:56:28 -04:00
Will Shiu	74f6f59126	locks: fix KASAN: use-after-free in trace_event_raw_event_filelock_lock As following backtrace, the struct file_lock request , in posix_lock_inode is free before ftrace function using. Replace the ftrace function ahead free flow could fix the use-after-free issue. [name:report&]=============================================== BUG:KASAN: use-after-free in trace_event_raw_event_filelock_lock+0x80/0x12c [name:report&]Read at addr f6ffff8025622620 by task NativeThread/16753 [name:report_hw_tags&]Pointer tag: [f6], memory tag: [fe] [name:report&] BT: Hardware name: MT6897 (DT) Call trace: dump_backtrace+0xf8/0x148 show_stack+0x18/0x24 dump_stack_lvl+0x60/0x7c print_report+0x2c8/0xa08 kasan_report+0xb0/0x120 __do_kernel_fault+0xc8/0x248 do_bad_area+0x30/0xdc do_tag_check_fault+0x1c/0x30 do_mem_abort+0x58/0xbc el1_abort+0x3c/0x5c el1h_64_sync_handler+0x54/0x90 el1h_64_sync+0x68/0x6c trace_event_raw_event_filelock_lock+0x80/0x12c posix_lock_inode+0xd0c/0xd60 do_lock_file_wait+0xb8/0x190 fcntl_setlk+0x2d8/0x440 ... [name:report&] [name:report&]Allocated by task 16752: ... slab_post_alloc_hook+0x74/0x340 kmem_cache_alloc+0x1b0/0x2f0 posix_lock_inode+0xb0/0xd60 ... [name:report&] [name:report&]Freed by task 16752: ... kmem_cache_free+0x274/0x5b0 locks_dispose_list+0x3c/0x148 posix_lock_inode+0xc40/0xd60 do_lock_file_wait+0xb8/0x190 fcntl_setlk+0x2d8/0x440 do_fcntl+0x150/0xc18 ... Signed-off-by: Will Shiu <Will.Shiu@mediatek.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>	2023-08-24 10:42:19 -04:00
Jakub Wilk	bd4c4680c0	fs/locks: Fix typo Signed-off-by: Jakub Wilk <jwilk@jwilk.net> Signed-off-by: Jeff Layton <jlayton@kernel.org>	2023-08-24 10:42:19 -04:00
Luís Henriques	d9ae977d2d	ceph: switch ceph_lookup/atomic_open() to use new fscrypt helper Instead of setting the no-key dentry, use the new fscrypt_prepare_lookup_partial() helper. We still need to mark the directory as incomplete if the directory was just unlocked. In ceph_atomic_open() this fixes a bug where a dentry is incorrectly set with DCACHE_NOKEY_NAME when 'dir' has been evicted but the key is still available (for example, where there's a drop_caches). Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:37 +02:00
Xiubo Li	295fc4aa7d	ceph: fix updating i_truncate_pagecache_size for fscrypt When fscrypt is enabled we will align the truncate size up to the CEPH_FSCRYPT_BLOCK_SIZE always, so if we truncate the size in the same block more than once, the latter ones will be skipped being invalidated from the page caches. This will force invalidating the page caches by using the smaller size than the real file size. At the same time add more debug log and fix the debug log for truncate code. Link: https://tracker.ceph.com/issues/58834 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Xiubo Li	1464de9f81	ceph: wait for OSD requests' callbacks to finish when unmounting The sync_filesystem() will flush all the dirty buffer and submit the osd reqs to the osdc and then is blocked to wait for all the reqs to finish. But the when the reqs' replies come, the reqs will be removed from osdc just before the req->r_callback()s are called. Which means the sync_filesystem() will be woke up by leaving the req->r_callback()s are still running. This will be buggy when the waiter require the req->r_callback()s to release some resources before continuing. So we need to make sure the req->r_callback()s are called before removing the reqs from the osdc. WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0 CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1 Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015 RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0 RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202 RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00 RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000 RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000 R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40 R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000 FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> generic_shutdown_super+0x47/0x120 kill_anon_super+0x14/0x30 ceph_kill_sb+0x36/0x90 [ceph] deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x67/0xb0 exit_to_user_mode_prepare+0x23d/0x240 syscall_exit_to_user_mode+0x25/0x60 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fd83dc39e9b We need to increase the blocker counter to make sure all the osd requests' callbacks have been finished just before calling the kill_anon_super() when unmounting. Link: https://tracker.ceph.com/issues/58126 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Xiubo Li	e3dfcab208	ceph: drop messages from MDS when unmounting When unmounting all the dirty buffers will be flushed and after the last osd request is finished the last reference of the i_count will be released. Then it will flush the dirty cap/snap to MDSs, and the unmounting won't wait the possible acks, which will ihold the inodes when updating the metadata locally but makes no sense any more, of this. This will make the evict_inodes() to skip these inodes. If encrypt is enabled the kernel generate a warning when removing the encrypt keys when the skipped inodes still hold the keyring: WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0 CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1 Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015 RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0 RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202 RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00 RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000 RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000 R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40 R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000 FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> generic_shutdown_super+0x47/0x120 kill_anon_super+0x14/0x30 ceph_kill_sb+0x36/0x90 [ceph] deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x67/0xb0 exit_to_user_mode_prepare+0x23d/0x240 syscall_exit_to_user_mode+0x25/0x60 do_syscall_64+0x40/0x80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fd83dc39e9b Later the kernel will crash when iput() the inodes and dereferencing the "sb->s_master_keys", which has been released by the generic_shutdown_super(). Link: https://tracker.ceph.com/issues/59162 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Luís Henriques	abd4fc7758	ceph: prevent snapshot creation in encrypted locked directories With snapshot names encryption we can not allow snapshots to be created in locked directories because the names wouldn't be encrypted. This patch forces the directory to be unlocked to allow a snapshot to be created. Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Luís Henriques	dd66df0053	ceph: add support for encrypted snapshot names Since filenames in encrypted directories are encrypted and shown as a base64-encoded string when the directory is locked, make snapshot names show a similar behaviour. When creating a snapshot, .snap directories for every subdirectory will show the snapshot name in the "long format": # mkdir .snap/my-snap # ls my-dir/.snap/ _my-snap_1099511627782 Encrypted snapshots will need to be able to handle these by encrypting/decrypting only the snapshot part of the string ('my-snap'). Also, since the MDS prevents snapshot names to be bigger than 240 characters it is necessary to adapt CEPH_NOHASH_NAME_MAX to accommodate this extra limitation. [ idryomov: drop const on !CONFIG_FS_ENCRYPTION branch too ] Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Luís Henriques	b422f11504	ceph: invalidate pages when doing direct/sync writes When doing a direct/sync write, we need to invalidate the page cache in the range being written to. If we don't do this, the cache will include invalid data as we just did a write that avoided the page cache. In the event that invalidation fails, just ignore the error. That likely just means that we raced with another task doing a buffered write, in which case we want to leave the page intact anyway. [ jlayton: minor comment update ] Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Jeff Layton	f0fe1e54cf	ceph: plumb in decryption during reads Force the use of sparse reads when the inode is encrypted, and add the appropriate code to decrypt the extent map after receiving. Note that the crypto block may be smaller than a page, but the reverse cannot be true. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Jeff Layton	d55207717d	ceph: add encryption support to writepage and writepages Allow writepage to issue encrypted writes. Extend out the requested size and offset to cover complete blocks, and then encrypt and write them to the OSDs. Add the appropriate machinery to write back dirty data with encryption. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Jeff Layton	33a5f1709a	ceph: add read/modify/write to ceph_sync_write When doing a synchronous write on an encrypted inode, we have no guarantee that the caller is writing crypto block-aligned data. When that happens, we must do a read/modify/write cycle. First, expand the range to cover complete blocks. If we had to change the original pos or length, issue a read to fill the first and/or last pages, and fetch the version of the object from the result. We then copy data into the pages as usual, encrypt the result and issue a write prefixed by an assertion that the version hasn't changed. If it has changed then we restart the whole thing again. If there is no object at that position in the file (-ENOENT), we prefix the write on an exclusive create of the object instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Jeff Layton	b294fa295f	ceph: align data in pages in ceph_sync_write Encrypted files will need to be dealt with in block-sized chunks and once we do that, the way that ceph_sync_write aligns the data in the bounce buffer won't be acceptable. Change it to align the data the same way it would be aligned in the pagecache. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00
Jeff Layton	8cff8f5374	ceph: don't use special DIO path for encrypted inodes Eventually I want to merge the synchronous and direct read codepaths, possibly via new netfs infrastructure. For now, the direct path is not crypto-enabled, so use the sync read/write paths instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Milind Changire <mchangir@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2023-08-24 11:24:36 +02:00

... 3 4 5 6 7 ...

83953 Commits