linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-24 05:02:12 +00:00

Author	SHA1	Message	Date
Linus Torvalds	bbe85027ce	Recalling the first round of new code for 5.10, in which we added: - New feature: Widen inode timestamps and quota grace expiration timestamps to support dates through the year 2486. - New feature: storing inode btree counts in the AGI to speed up certain mount time per-AG block reservation operatoins and add a little more metadata redundancy. For the second round of new code for 5.10: - Deprecate the V4 filesystem format, some disused mount options, and some legacy sysctl knobs now that we can support dates into the 25th century. Note that removal of V4 support will not happen until the early 2030s. - Fix some probles with inode realtime flag propagation. - Fix some buffer handling issues when growing a rt filesystem. - Fix a problem where a BMAP_REMAP unmap call would free rt extents even though the purpose of BMAP_REMAP is to avoid freeing the blocks. - Strengthen the dabtree online scrubber to check hash values on child dabtree blocks. - Actually log new intent items created as part of recovering log intent items. - Fix a bug where quotas weren't attached to an inode undergoing bmap intent item recovery. - Fix a buffer overrun problem with specially crafted log buffer headers. - Various cleanups to type usage and slightly inaccurate comments. - More cleanups to the xattr, log, and quota code. - Don't run the (slower) shared-rmap operations on attr fork mappings. - Fix a bug where we failed to check the LSN of finobt blocks during replay and could therefore overwrite newer data with older data. - Clean up the ugly nested transaction mess that log recovery uses to stage intent item recovery in the correct order by creating a proper data structure to capture recovered chains. - Use the capture structure to resume intent item chains with the same log space and block reservations as when they were captured. - Fix a UAF bug in bmap intent item recovery where we failed to maintain our reference to the incore inode if the bmap operation needed to relog itself to continue. - Rearrange the defer ops mechanism to finish newly created subtasks of a parent task before moving on to the next parent task. - Automatically relog intent items in deferred ops chains if doing so would help us avoid pinning the log tail. This will help fix some log scaling problems now and will facilitate atomic file updates later. - Fix a deadlock in the GETFSMAP implementation by using an internal memory buffer to reduce indirect calls and copies to userspace, thereby improving its performance by ~20%. - Fix various problems when calling growfs on a realtime volume would not fully update the filesystem metadata. - Fix broken Kconfig asking about deprecated XFS when XFS is disabled. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+KRuUACgkQ+H93GTRK tOtqsBAAm+AZ92DRjOD7/TbU3vJALRKBBUCc6weEYUJKaZUkdYpx7Fn8Si3K8Nu0 Pxo8qLO8WtP3ECyd+CZgkQgZAhHrjRG+FnCOuNyj1yMguX9CDu4cK0dOh/M64+pM BvWPqLfd99mzr7HkQ0SuLIyDMeio3leU4lySAIVpADO3V7WF5ZgHCfEETpOh5Di1 oIzhYlxHyfK+32u4sXSWsPnogQZwjyn4CyQ+6humK0d089pVB1wbjHaTym7exjSa cFhMqS1XDbpMuoF4BXMcx31UTOb+8/S6TKCVsRl61j3XKGzbYKSrLmrSb/r6gFWn wyXJGmLok0I2UDnX1ZArIWstJHcgPlTelWrssG8wAnopLSJoU10f8o88m43d0krF fCUCac1rKPcisg7CS5njgUkOBknSLeBCeztl59N/8acnkaETPQr0tReDpB4wGGaW aGEWBrCbz1QZyfDBttNPQLcreROGukZ8R8MMRl4GiAQwZz5UrTUFeoK6thplHVvp ANhpYGdJy4jJ79wt4MNVYUF8U8IRWdn0ddsRx08pLWchC1PH8HH944qrUXAVPYZ+ MohSQqKtjvaKwZLP86SvCJFs20wEEUxzCSQbz4LTO5aBz3uDfo0LQSOYzPBU+OKp E33SNds13nsjeH8HBLtXH3lr3absywLcV2ZMaIGsQSLpE2p8AHA= =LuWg -----END PGP SIGNATURE----- Merge tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull more xfs updates from Darrick Wong: "The second large pile of new stuff for 5.10, with changes even more monumental than last week! We are formally announcing the deprecation of the V4 filesystem format in 2030. All users must upgrade to the V5 format, which contains design improvements that greatly strengthen metadata validation, supports reflink and online fsck, and is the intended vehicle for handling timestamps past 2038. We're also deprecating the old Irix behavioral tweaks in September 2025. Coming along for the ride are two design changes to the deferred metadata ops subsystem. One of the improvements is to retain correct logical ordering of tasks and subtasks, which is a more logical design for upper layers of XFS and will become necessary when we add atomic file range swaps and commits. The second improvement to deferred ops improves the scalability of the log by helping the log tail to move forward during long-running operations. This reduces log contention when there are a large number of threads trying to run transactions. In addition to that, this fixes numerous small bugs in log recovery; refactors logical intent log item recovery to remove the last remaining place in XFS where we could have nested transactions; fixes a couple of ways that intent log item recovery could fail in ways that wouldn't have happened in the regular commit paths; fixes a deadlock vector in the GETFSMAP implementation (which improves its performance by 20%); and fixes serious bugs in the realtime growfs, fallocate, and bitmap handling code. Summary: - Deprecate the V4 filesystem format, some disused mount options, and some legacy sysctl knobs now that we can support dates into the 25th century. Note that removal of V4 support will not happen until the early 2030s. - Fix some probles with inode realtime flag propagation. - Fix some buffer handling issues when growing a rt filesystem. - Fix a problem where a BMAP_REMAP unmap call would free rt extents even though the purpose of BMAP_REMAP is to avoid freeing the blocks. - Strengthen the dabtree online scrubber to check hash values on child dabtree blocks. - Actually log new intent items created as part of recovering log intent items. - Fix a bug where quotas weren't attached to an inode undergoing bmap intent item recovery. - Fix a buffer overrun problem with specially crafted log buffer headers. - Various cleanups to type usage and slightly inaccurate comments. - More cleanups to the xattr, log, and quota code. - Don't run the (slower) shared-rmap operations on attr fork mappings. - Fix a bug where we failed to check the LSN of finobt blocks during replay and could therefore overwrite newer data with older data. - Clean up the ugly nested transaction mess that log recovery uses to stage intent item recovery in the correct order by creating a proper data structure to capture recovered chains. - Use the capture structure to resume intent item chains with the same log space and block reservations as when they were captured. - Fix a UAF bug in bmap intent item recovery where we failed to maintain our reference to the incore inode if the bmap operation needed to relog itself to continue. - Rearrange the defer ops mechanism to finish newly created subtasks of a parent task before moving on to the next parent task. - Automatically relog intent items in deferred ops chains if doing so would help us avoid pinning the log tail. This will help fix some log scaling problems now and will facilitate atomic file updates later. - Fix a deadlock in the GETFSMAP implementation by using an internal memory buffer to reduce indirect calls and copies to userspace, thereby improving its performance by ~20%. - Fix various problems when calling growfs on a realtime volume would not fully update the filesystem metadata. - Fix broken Kconfig asking about deprecated XFS when XFS is disabled" * tag 'xfs-5.10-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits) xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n xfs: fix high key handling in the rt allocator's query_range function xfs: annotate grabbing the realtime bitmap/summary locks in growfs xfs: make xfs_growfs_rt update secondary superblocks xfs: fix realtime bitmap/summary file truncation when growing rt volume xfs: fix the indent in xfs_trans_mod_dquot xfs: do the ASSERT for the arguments O_{u,g,p}dqpp xfs: fix deadlock and streamline xfs_getfsmap performance xfs: limit entries returned when counting fsmap records xfs: only relog deferred intent items if free space in the log gets low xfs: expose the log push threshold xfs: periodically relog deferred intent items xfs: change the order in which child and parent defer ops are finished xfs: fix an incore inode UAF in xfs_bui_recover xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering xfs: clean up bmap intent item recovery checking xfs: xfs_defer_capture should absorb remaining transaction reservation xfs: xfs_defer_capture should absorb remaining block reservations xfs: proper replay of deferred ops queued during log recovery xfs: remove XFS_LI_RECOVERED ...	2020-10-19 14:38:46 -07:00
Darrick J. Wong	894645546b	xfs: fix Kconfig asking about XFS_SUPPORT_V4 when XFS_FS=n Pavel Machek complained that the question about supporting deprecated XFS v4 comes up even when XFS is disabled. This clearly makes no sense, so fix Kconfig. Reported-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2020-10-16 15:34:28 -07:00
Darrick J. Wong	d88850bd55	xfs: fix high key handling in the rt allocator's query_range function Fix some off-by-one errors in xfs_rtalloc_query_range. The highest key in the realtime bitmap is always one less than the number of rt extents, which means that the key clamp at the start of the function is wrong. The 4th argument to xfs_rtfind_forw is the highest rt extent that we want to probe, which means that passing 1 less than the high key is wrong. Finally, drop the rem variable that controls the loop because we can compare the iteration point (rtstart) against the high key directly. The sordid history of this function is that the original commit (fb3c3) incorrectly passed (high_rec->ar_startblock - 1) as the 'limit' parameter to xfs_rtfind_forw. This was wrong because the "high key" is supposed to be the largest key for which the caller wants result rows, not the key for the first row that could possibly be outside the range that the caller wants to see. A subsequent attempt (8ad56) to strengthen the parameter checking added incorrect clamping of the parameters to the number of rt blocks in the system (despite the bitmap functions all taking units of rt extents) to avoid querying ranges past the end of rt bitmap file but failed to fix the incorrect _rtfind_forw parameter. The original _rtfind_forw parameter error then survived the conversion of the startblock and blockcount fields to rt extents (a0e5c), and the most recent off-by-one fix (a3a37) thought it was patching a problem when the end of the rt volume is not in use, but none of these fixes actually solved the original problem that the author was confused about the "limit" argument to xfs_rtfind_forw. Sadly, all four of these patches were written by this author and even his own usage of this function and rt testing were inadequate to get this fixed quickly. Original-problem: `fb3c3de2f6` ("xfs: add a couple of queries to iterate free extents in the rtbitmap") Not-fixed-by: `8ad560d256` ("xfs: strengthen rtalloc query range checks") Not-fixed-by: `a0e5c435ba` ("xfs: fix xfs_rtalloc_rec units") Fixes: `a3a374bf18` ("xfs: fix off-by-one error in xfs_rtalloc_query_range") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>	2020-10-16 15:34:28 -07:00
Linus Torvalds	2fc61f25fb	New code for 5.10: - Clean up the buffer ioend calling path so that the retry strategy isn't quite so scattered everywhere. - Clean up m_sb_bp handling. - New feature: storing inode btree counts in the AGI to speed up certain mount time per-AG block reservation operatoins and add a little more metadata redundancy. - New feature: Widen inode timestamps and quota grace expiration timestamps to support dates through the year 2486. - Get rid of more of our custom buffer allocation API wrappers. - Use a proper VLA for shortform xattr structure namevals. - Force the log after reflinking or deduping into a file that is opened with O_SYNC or O_DSYNC. - Fix some math errors in the realtime allocator. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl9iOCEACgkQ+H93GTRK tOvn8Q//VYzuMUIxoc9pVfyd0L3ThROKEO2cwzbGEauXXIginmMqfITVCMWtLg9/ siDnXGFDetqF9g+1DjyzLucVGAleEd0QSCrWqTxNjHUeUu6AtNrv716jQ09BTZD+ SoD9oZhG0k9seGWrVkQRSfyVyARxw/OEGbnGe5xCR3VTXG/xMTRFOJMsbCM6trhL 1nHJDhsoWSDeSYi59GEm/nscZ3yuZOcnnkTMpQrdWX+Y2BHzlUqrXuSf1ZxWmfQm 2RPVxdRRt8Mt0n28oo7eGQ1tC7nYHDqVRlZcM8IuBGIu3kDrPhNVWOIkjzaaBYf9 goZG67RZ/Rm3GDtlaWtz0KRDpJUKOWV+SuSh3MtpqZSb91llAN2tVbiHKYHW72zG Bi+9RCadHKpuU1iyLGP+eaMPkVGV0H2co1DuENJYPy9wQTdRg2LD3qGJuNr7YwMR Rs9QBDntjFO0WbC9UiGkIxSryYdgUctLLv3/eT0LpwvoygabFjNQ69hguFRHhfP0 d1AxS8o2qzyf1v+0NVqM9FAPhiqSY1SBd+seawF5tlQL4OY2BMUgAwdtVqRJapyU BoLSih507nbw+R0FqayKpcUraU7OFrY5SZ21hOsHKt9AR3XW+ntU9ySZMM4EdAXx DunsgaAk6hvZy99y10O1f1Wo30jZKjgnEGHi/IgjG7FKD3iSIvw= =99iy -----END PGP SIGNATURE----- Merge tag 'xfs-5.10-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs updates from Darrick Wong: "The biggest changes are two new features for the ondisk metadata: one to record the sizes of the inode btrees in the AG to increase redundancy checks and to improve mount times; and a second new feature to support timestamps until the year 2486. We also fixed a problem where reflinking into a file that requires synchronous writes wouldn't actually flush the updates to disk; clean up a fair amount of cruft; and started fixing some bugs in the realtime volume code. Summary: - Clean up the buffer ioend calling path so that the retry strategy isn't quite so scattered everywhere. - Clean up m_sb_bp handling. - New feature: storing inode btree counts in the AGI to speed up certain mount time per-AG block reservation operatoins and add a little more metadata redundancy. - New feature: Widen inode timestamps and quota grace expiration timestamps to support dates through the year 2486. - Get rid of more of our custom buffer allocation API wrappers. - Use a proper VLA for shortform xattr structure namevals. - Force the log after reflinking or deduping into a file that is opened with O_SYNC or O_DSYNC. - Fix some math errors in the realtime allocator" * tag 'xfs-5.10-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (42 commits) xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt extent size xfs: make sure the rt allocator doesn't run off the end xfs: Remove unneeded semicolon xfs: force the log after remapping a synchronous-writes file xfs: Convert xfs_attr_sf macros to inline functions xfs: Use variable-size array for nameval in xfs_attr_sf_entry xfs: Remove typedef xfs_attr_shortform_t xfs: remove typedef xfs_attr_sf_entry_t xfs: Remove kmem_zalloc_large() xfs: enable big timestamps xfs: trace timestamp limits xfs: widen ondisk quota expiration timestamps to handle y2038+ xfs: widen ondisk inode timestamps to deal with y2038+ xfs: redefine xfs_ictimestamp_t xfs: redefine xfs_timestamp_t xfs: move xfs_log_dinode_to_disk to the log recovery code xfs: refactor quota timestamp coding xfs: refactor default quota grace period setting code xfs: refactor quota expiration timer modification xfs: explicitly define inode timestamp range ...	2020-10-14 14:06:06 -07:00
Darrick J. Wong	ace74e797a	xfs: annotate grabbing the realtime bitmap/summary locks in growfs Use XFS_ILOCK_RT{BITMAP,SUM} to annotate grabbing the rt bitmap and summary locks when we grow the realtime volume, just like we do most everywhere else. This shuts up lockdep warnings about grabbing the ILOCK class of locks recursively: ============================================ WARNING: possible recursive locking detected 5.9.0-rc4-djw #rc4 Tainted: G O -------------------------------------------- xfs_growfs/4841 is trying to acquire lock: ffff888035acc230 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs] but task is already holding lock: ffff888035acedb0 (&xfs_nondir_ilock_class){++++}-{3:3}, at: xfs_ilock+0xac/0x1a0 [xfs] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&xfs_nondir_ilock_class); lock(&xfs_nondir_ilock_class); * DEADLOCK * May be due to missing lock nesting notation Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>	2020-10-13 08:41:31 -07:00
Darrick J. Wong	7249c95a3f	xfs: make xfs_growfs_rt update secondary superblocks When we call growfs on the data device, we update the secondary superblocks to reflect the updated filesystem geometry. We need to do this for growfs on the realtime volume too, because a future xfs_repair run could try to fix the filesystem using a backup superblock. This was observed by the online superblock scrubbers while running xfs/233. One can also trigger this by growing an rt volume, cycling the mount, and creating new rt files. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>	2020-10-13 08:41:31 -07:00
Darrick J. Wong	f4c32e87de	xfs: fix realtime bitmap/summary file truncation when growing rt volume The realtime bitmap and summary files are regular files that are hidden away from the directory tree. Since they're regular files, inode inactivation will try to purge what it thinks are speculative preallocations beyond the incore size of the file. Unfortunately, xfs_growfs_rt forgets to update the incore size when it resizes the inodes, with the result that inactivating the rt inodes at unmount time will cause their contents to be truncated. Fix this by updating the incore size when we change the ondisk size as part of updating the superblock. Note that we don't do this when we're allocating blocks to the rt inodes because we actually want those blocks to get purged if the growfs fails. This fixes corruption complaints from the online rtsummary checker when running xfs/233. Since that test requires rmap, one can also trigger this by growing an rt volume, cycling the mount, and creating rt files. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>	2020-10-13 08:41:31 -07:00
Kaixu Xia	e5b23740db	xfs: fix the indent in xfs_trans_mod_dquot The formatting is strange in xfs_trans_mod_dquot, so do a reindent. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-10-07 08:40:29 -07:00
Kaixu Xia	97611f9366	xfs: do the ASSERT for the arguments O_{u,g,p}dqpp If we pass in XFS_QMOPT_{U,G,P}QUOTA flags and different uid/gid/prid than them currently associated with the inode, the arguments O_{u,g,p}dqpp shouldn't be NULL, so add the ASSERT for them. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-10-07 08:40:29 -07:00
Darrick J. Wong	8ffa90e114	xfs: fix deadlock and streamline xfs_getfsmap performance Refactor xfs_getfsmap to improve its performance: instead of indirectly calling a function that copies one record to userspace at a time, create a shadow buffer in the kernel and copy the whole array once at the end. On the author's computer, this reduces the runtime on his /home by ~20%. This also eliminates a deadlock when running GETFSMAP against the realtime device. The current code locks the rtbitmap to create fsmappings and copies them into userspace, having not released the rtbitmap lock. If the userspace buffer is an mmap of a sparse file that itself resides on the realtime device, the write page fault will recurse into the fs for allocation, which will deadlock on the rtbitmap lock. Fixes: `4c934c7dd6` ("xfs: report realtime space information via the rtbitmap") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>	2020-10-07 08:40:29 -07:00
Darrick J. Wong	acd1ac3aa2	xfs: limit entries returned when counting fsmap records If userspace asked fsmap to count the number of entries, we cannot return more than UINT_MAX entries because fmh_entries is u32. Therefore, stop counting if we hit this limit or else we will waste time to return truncated results. Fixes: `e89c041338` ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>	2020-10-07 08:40:29 -07:00
Darrick J. Wong	74f4d6a1e0	xfs: only relog deferred intent items if free space in the log gets low Now that we have the ability to ask the log how far the tail needs to be pushed to maintain its free space targets, augment the decision to relog an intent item so that we only do it if the log has hit the 75% full threshold. There's no point in relogging an intent into the same checkpoint, and there's no need to relog if there's plenty of free space in the log. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-10-07 08:40:29 -07:00
Darrick J. Wong	ed1575daf7	xfs: expose the log push threshold Separate the computation of the log push threshold and the push logic in xlog_grant_push_ail. This enables higher level code to determine (for example) that it is holding on to a logged intent item and the log is so busy that it is more than 75% full. In that case, it would be desirable to move the log item towards the head to release the tail, which we will cover in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-10-07 08:40:29 -07:00
Darrick J. Wong	4e919af782	xfs: periodically relog deferred intent items There's a subtle design flaw in the deferred log item code that can lead to pinning the log tail. Taking up the defer ops chain examples from the previous commit, we can get trapped in sequences like this: Caller hands us a transaction t0 with D0-D3 attached. The defer ops chain will look like the following if the transaction rolls succeed: t1: D0(t0), D1(t0), D2(t0), D3(t0) t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) In transaction 9, we finish d9 and try to roll to t10 while holding onto an intent item for D3 that we logged in t0. The previous commit changed the order in which we place new defer ops in the defer ops processing chain to reduce the maximum chain length. Now make xfs_defer_finish_noroll capable of relogging the entire chain periodically so that we can always move the log tail forward. Most chains will never get relogged, except for operations that generate very long chains (large extents containing many blocks with different sharing levels) or are on filesystems with small logs and a lot of ongoing metadata updates. Callers are now required to ensure that the transaction reservation is large enough to handle logging done items and new intent items for the maximum possible chain length. Most callers are careful to keep the chain lengths low, so the overhead should be minimal. The decision to relog an intent item is made based on whether the intent was logged in a previous checkpoint, since there's no point in relogging an intent into the same checkpoint. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-10-07 08:40:28 -07:00
Darrick J. Wong	27dada070d	xfs: change the order in which child and parent defer ops are finished The defer ops code has been finishing items in the wrong order -- if a top level defer op creates items A and B, and finishing item A creates more defer ops A1 and A2, we'll put the new items on the end of the chain and process them in the order A B A1 A2. This is kind of weird, since it's convenient for programmers to be able to think of A and B as an ordered sequence where all the sub-tasks for A must finish before we move on to B, e.g. A A1 A2 D. Right now, our log intent items are not so complex that this matters, but this will become important for the atomic extent swapping patchset. In order to maintain correct reference counting of extents, we have to unmap and remap extents in that order, and we want to complete that work before moving on to the next range that the user wants to swap. This patch fixes defer ops to satsify that requirement. The primary symptom of the incorrect order was noticed in an early performance analysis of the atomic extent swap code. An astonishingly large number of deferred work items accumulated when userspace requested an atomic update of two very fragmented files. The cause of this was traced to the same ordering bug in the inner loop of xfs_defer_finish_noroll. If the ->finish_item method of a deferred operation queues new deferred operations, those new deferred ops are appended to the tail of the pending work list. To illustrate, say that a caller creates a transaction t0 with four deferred operations D0-D3. The first thing defer ops does is roll the transaction to t1, leaving us with: t1: D0(t0), D1(t0), D2(t0), D3(t0) Let's say that finishing each of D0-D3 will create two new deferred ops. After finish D0 and roll, we'll have the following chain: t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1) d4 and d5 were logged to t1. Notice that while we're about to start work on D1, we haven't actually completed all the work implied by D0 being finished. So far we've been careful (or lucky) to structure the dfops callers such that D1 doesn't depend on d4 or d5 being finished, but this is a potential logic bomb. There's a second problem lurking. Let's see what happens as we finish D1-D3: t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2) t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3) t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) Let's say that d4-d11 are simple work items that don't queue any other operations, which means that we can complete each d4 and roll to t6: t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) ... t11: d10(t4), d11(t4) t12: d11(t4) <done> When we try to roll to transaction #12, we're holding defer op d11, which we logged way back in t4. This means that the tail of the log is pinned at t4. If the log is very small or there are a lot of other threads updating metadata, this means that we might have wrapped the log and cannot get roll to t11 because there isn't enough space left before we'd run into t4. Let's shift back to the original failure. I mentioned before that I discovered this flaw while developing the atomic file update code. In that scenario, we have a defer op (D0) that finds a range of file blocks to remap, creates a handful of new defer ops to do that, and then asks to be continued with however much work remains. So, D0 is the original swapext deferred op. The first thing defer ops does is rolls to t1: t1: D0(t0) We try to finish D0, logging d1 and d2 in the process, but can't get all the work done. We log a done item and a new intent item for the work that D0 still has to do, and roll to t2: t2: D0'(t1), d1(t1), d2(t1) We roll and try to finish D0', but still can't get all the work done, so we log a done item and a new intent item for it, requeue D0 a second time, and roll to t3: t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2) If it takes 48 more rolls to complete D0, then we'll finally dispense with D0 in t50: t50: D<fifty primes>(t49), d1(t1), ..., d102(t50) We then try to roll again to get a chain like this: t51: d1(t1), d2(t1), ..., d101(t50), d102(t50) ... t152: d102(t50) <done> Notice that in rolling to transaction #51, we're holding on to a log intent item for d1 that was logged in transaction #1. This means that the tail of the log is pinned at t1. If the log is very small or there are a lot of other threads updating metadata, this means that we might have wrapped the log and cannot roll to t51 because there isn't enough space left before we'd run into t1. This is of course problem #2 again. But notice the third problem with this scenario: we have 102 defer ops tied to this transaction! Each of these items are backed by pinned kernel memory, which means that we risk OOM if the chains get too long. Yikes. Problem #1 is a subtle logic bomb that could hit someone in the future; problem #2 applies (rarely) to the current upstream, and problem #3 applies to work under development. This is not how incremental deferred operations were supposed to work. The dfops design of logging in the same transaction an intent-done item and a new intent item for the work remaining was to make it so that we only have to juggle enough deferred work items to finish that one small piece of work. Deferred log item recovery will find that first unfinished work item and restart it, no matter how many other intent items might follow it in the log. Therefore, it's ok to put the new intents at the start of the dfops chain. For the first example, the chains look like this: t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) For the second example, the chains look like this: t1: D0(t0) t2: d1(t1), d2(t1), D0'(t1) t3: d2(t1), D0'(t1) t4: D0'(t1) t5: d1(t4), d2(t4), D0''(t4) ... t148: D0<50 primes>(t147) t149: d101(t148), d102(t148) t150: d102(t148) <done> This actually sucks more for pinning the log tail (we try to roll to t10 while holding an intent item that was logged in t1) but we've solved problem #1. We've also reduced the maximum chain length from: sum(all the new items) + nr_original_items to: max(new items that each original item creates) + nr_original_items This solves problem #3 by sharply reducing the number of defer ops that can be attached to a transaction at any given time. The change makes the problem of log tail pinning worse, but is improvement we need to solve problem #2. Actually solving #2, however, is left to the next patch. Note that a subsequent analysis of some hard-to-trigger reflink and COW livelocks on extremely fragmented filesystems (or systems running a lot of IO threads) showed the same symptoms -- uncomfortably large numbers of incore deferred work items and occasional stalls in the transaction grant code while waiting for log reservations. I think this patch and the next one will also solve these problems. As originally written, the code used list_splice_tail_init instead of list_splice_init, so change that, and leave a short comment explaining our actions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-10-07 08:40:28 -07:00
Darrick J. Wong	ff4ab5e02a	xfs: fix an incore inode UAF in xfs_bui_recover In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-10-07 08:40:28 -07:00
Darrick J. Wong	64a3f3315b	xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock ordering In most places in XFS, we have a specific order in which we gather resources: grab the inode, allocate a transaction, then lock the inode. xfs_bui_item_recover doesn't do it in that order, so fix it to be more consistent. This also makes the error bailout code a bit less weird. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-10-07 08:40:28 -07:00
Darrick J. Wong	919522e89f	xfs: clean up bmap intent item recovery checking The bmap intent item checking code in xfs_bui_item_recover is spread all over the function. We should check the recovered log item at the top before we allocate any resources or do anything else, so do that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-10-07 08:40:28 -07:00
Darrick J. Wong	929b92f640	xfs: xfs_defer_capture should absorb remaining transaction reservation When xfs_defer_capture extracts the deferred ops and transaction state from a transaction, it should record the transaction reservation type from the old transaction so that when we continue the dfops chain, we still use the same reservation parameters. Doing this means that the log item recovery functions get to determine the transaction reservation instead of abusing tr_itruncate in yet another part of xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-10-07 08:40:28 -07:00
Darrick J. Wong	4f9a60c480	xfs: xfs_defer_capture should absorb remaining block reservations When xfs_defer_capture extracts the deferred ops and transaction state from a transaction, it should record the remaining block reservations so that when we continue the dfops chain, we can reserve the same number of blocks to use. We capture the reservations for both data and realtime volumes. This adds the requirement that every log intent item recovery function must be careful to reserve enough blocks to handle both itself and all defer ops that it can queue. On the other hand, this enables us to do away with the handwaving block estimation nonsense that was going on in xlog_finish_defer_ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-10-07 08:40:28 -07:00
Darrick J. Wong	e6fff81e48	xfs: proper replay of deferred ops queued during log recovery When we replay unfinished intent items that have been recovered from the log, it's possible that the replay will cause the creation of more deferred work items. As outlined in commit `509955823c` ("xfs: log recovery should replay deferred ops in order"), later work items have an implicit ordering dependency on earlier work items. Therefore, recovery must replay the items (both recovered and created) in the same order that they would have been during normal operation. For log recovery, we enforce this ordering by using an empty transaction to collect deferred ops that get created in the process of recovering a log intent item to prevent them from being committed before the rest of the recovered intent items. After we finish committing all the recovered log items, we allocate a transaction with an enormous block reservation, splice our huge list of created deferred ops into that transaction, and commit it, thereby finishing all those ops. This is /really/ hokey -- it's the one place in XFS where we allow nested transactions; the splicing of the defer ops list is is inelegant and has to be done twice per recovery function; and the broken way we handle inode pointers and block reservations cause subtle use-after-free and allocator problems that will be fixed by this patch and the two patches after it. Therefore, replace the hokey empty transaction with a structure designed to capture each chain of deferred ops that are created as part of recovering a single unfinished log intent. Finally, refactor the loop that replays those chains to do so using one transaction per chain. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-10-07 08:40:28 -07:00
Darrick J. Wong	901219bb25	xfs: remove XFS_LI_RECOVERED The ->iop_recover method of a log intent item removes the recovered intent item from the AIL by logging an intent done item and committing the transaction, so it's superfluous to have this flag check. Nothing else uses it, so get rid of the flag entirely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-10-07 08:40:27 -07:00
Darrick J. Wong	b80b29d602	xfs: remove xfs_defer_reset Remove this one-line helper since the assert is trivially true in one call site and the rest obscures a bitmask operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-10-07 08:40:27 -07:00
Dave Chinner	671459676a	xfs: fix finobt btree block recovery ordering Nathan popped up on #xfs and pointed out that we fail to handle finobt btree blocks in xlog_recover_get_buf_lsn(). This means they always fall through the entire magic number matching code to "recover immediately". Whilst most of the time this is the correct behaviour, occasionally it will be incorrect and could potentially overwrite more recent metadata because we don't check the LSN in the on disk metadata at all. This bug has been present since the finobt was first introduced, and is a potential cause of the occasional xfs_iget_check_free_state() failures we see that indicate that the inode btree state does not match the on disk inode state. Fixes: `aafc3c2465` ("xfs: support the XFS_BTNUM_FINOBT free inode btree type") Reported-by: Nathan Scott <nathans@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-09-30 07:28:52 -07:00
Pavel Reichl	3442de9cc3	xfs: remove deprecated sysctl options These optionr were for Irix compatibility, probably for clustered XFS clients in a heterogenous cluster which contained both Irix & Linux machines, so that behavior would be consistent. That doesn't exist anymore and it's no longer needed. Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: actually state when the sysctls go away] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-09-25 11:34:08 -07:00
Pavel Reichl	c23c393eaa	xfs: remove deprecated mount options ikeep/noikeep was a workaround for old DMAPI code which is no longer relevant. attr2/noattr2 - is for controlling upgrade behaviour from fixed attribute fork sizes in the inode (attr1) and dynamic attribute fork sizes (attr2). mkfs has defaulted to setting attr2 since 2007, hence just about every XFS filesystem out there in production right now uses attr2. Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix minor typos] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-09-25 11:34:08 -07:00
Kaixu Xia	c9c626b354	xfs: directly call xfs_generic_create() for ->create() and ->mkdir() The current create and mkdir handlers both call the xfs_vn_mknod() which is a wrapper routine around xfs_generic_create() function. Actually the create and mkdir handlers can directly call xfs_generic_create() function and reduce the call chain. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-09-25 11:34:08 -07:00
Darrick J. Wong	d7884e6e90	xfs: avoid shared rmap operations for attr fork extents During code review, I noticed that the rmap code uses the (slower) shared mappings rmap functions for any extent of a reflinked file, even if those extents are for the attr fork, which doesn't support sharing. We can speed up rmap a tiny bit by optimizing out this case. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:08 -07:00
Gao Xiang	b38e07401e	xfs: drop the obsolete comment on filestream locking Since commit `1c1c6ebcf5` ("xfs: Replace per-ag array with a radix tree"), there is no m_peraglock anymore, so it's hard to understand the described situation since per-ag is no longer an array and no need to reallocate, call xfs_filestream_flush() in growfs. In addition, the race condition for shrink feature is quite confusing to me currently as well. Get rid of it instead. Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:08 -07:00
Kaixu Xia	61ef523051	xfs: code cleanup in xfs_attr_leaf_entsize_{remote,local} Cleanup the typedef usage, the unnecessary parentheses, the unnecessary backslash and use the open-coded round_up call in xfs_attr_leaf_entsize_{remote,local}. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:08 -07:00
Kaixu Xia	d6b8fc6c7a	xfs: do the assert for all the log done items in xfs_trans_cancel We should do the assert for all the log intent-done items if they appear here. This patch detect intent-done items by the fact that their item ops don't have iop_unpin and iop_push methods and also move the helper xlog_item_is_intent to xfs_trans.h. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-09-25 11:34:07 -07:00
Kaixu Xia	74af4c1770	xfs: remove the unused parameter id from xfs_qm_dqattach_one Since we never use the second parameter id, so remove it from xfs_qm_dqattach_one() function. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:07 -07:00
Kaixu Xia	3feb4ffbf6	xfs: remove the redundant crc feature check in xfs_attr3_rmt_verify We already check whether the crc feature is enabled before calling xfs_attr3_rmt_verify(), so remove the redundant feature check in that function. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:07 -07:00
Kaixu Xia	a647d109e0	xfs: fix some comments Fix the comments to help people understand the code. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> [darrick: fix the indenting problems too] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:07 -07:00
Kaixu Xia	5aff6750d5	xfs: remove the unnecessary xfs_dqid_t type cast Since the type prid_t and xfs_dqid_t both are uint32_t, seems the type cast is unnecessary, so remove it. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:07 -07:00
Kaixu Xia	9c0fce4c16	xfs: use the existing type definition for di_projid We have already defined the project ID type prid_t, so maybe should use it here. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:07 -07:00
Kaixu Xia	c63290e300	xfs: remove the unused SYNCHRONIZE macro There are no callers of the SYNCHRONIZE() macro, so remove it. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-25 11:34:07 -07:00
Gao Xiang	0c771b99d6	xfs: clean up calculation of LR header blocks Let's use DIV_ROUND_UP() to calculate log record header blocks as what did in xlog_get_iclog_buffer_size() and wrap up a common helper for log recovery. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-09-23 09:24:17 -07:00
Gao Xiang	f692d09e9c	xfs: avoid LR buffer overrun due to crafted h_len Currently, crafted h_len has been blocked for the log header of the tail block in commit `a70f9fe52d` ("xfs: detect and handle invalid iclog size set by mkfs"). However, each log record could still have crafted h_len and cause log record buffer overrun. So let's check h_len vs buffer size for each log record as well. Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>	2020-09-23 08:58:52 -07:00
Darrick J. Wong	384ff09ba2	xfs: don't release log intent items when recovery fails Nowadays, log recovery will call ->release on the recovered intent items if recovery fails. Therefore, it's redundant to release them from inside the ->recover functions when they're about to return an error. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-09-23 08:58:52 -07:00
Darrick J. Wong	2dbf872c04	xfs: attach inode to dquot in xfs_bui_item_recover In the bmap intent item recovery code, we must be careful to attach the inode to its dquots (if quotas are enabled) so that a change in the shape of the bmap btree doesn't cause the quota counters to be incorrect. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-09-23 08:58:52 -07:00
Darrick J. Wong	93293bcbde	xfs: log new intent items created as part of finishing recovered intent items During a code inspection, I found a serious bug in the log intent item recovery code when an intent item cannot complete all the work and decides to requeue itself to get that done. When this happens, the item recovery creates a new incore deferred op representing the remaining work and attaches it to the transaction that it allocated. At the end of _item_recover, it moves the entire chain of deferred ops to the dummy parent_tp that xlog_recover_process_intents passed to it, but fail to log a new intent item for the remaining work before committing the transaction for the single unit of work. xlog_finish_defer_ops logs those new intent items once recovery has finished dealing with the intent items that it recovered, but this isn't sufficient. If the log is forced to disk after a recovered log item decides to requeue itself and the system goes down before we call xlog_finish_defer_ops, the second log recovery will never see the new intent item and therefore has no idea that there was more work to do. It will finish recovery leaving the filesystem in a corrupted state. The same logic applies to /any/ deferred ops added during intent item recovery, not just the one handling the remaining work. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-09-23 08:58:51 -07:00
Darrick J. Wong	e581c9397a	xfs: check dabtree node hash values when loading child blocks When xchk_da_btree_block is loading a non-root dabtree block, we know that the parent block had to have a (hashval, address) pointer to the block that we just loaded. Check that the hashval in the parent matches the block we just loaded. This was found by fuzzing nbtree[3].hashval = ones in xfs/394. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-09-23 08:58:51 -07:00
Darrick J. Wong	8df0fa39bd	xfs: don't free rt blocks when we're doing a REMAP bunmapi call When callers pass XFS_BMAPI_REMAP into xfs_bunmapi, they want the extent to be unmapped from the given file fork without the extent being freed. We do this for non-rt files, but we forgot to do this for realtime files. So far this isn't a big deal since nobody makes a bunmapi call to a rt file with the REMAP flag set, but don't leave a logic bomb. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2020-09-23 08:58:51 -07:00
Chandan Babu R	c54e14d155	xfs: Set xfs_buf's b_ops member when zeroing bitmap/summary files In xfs_growfs_rt(), we enlarge bitmap and summary files by allocating new blocks for both files. For each of the new blocks allocated, we allocate an xfs_buf, zero the payload, log the contents and commit the transaction. Hence these buffers will eventually find themselves appended to list at xfs_ail->ail_buf_list. Later, xfs_growfs_rt() loops across all of the new blocks belonging to the bitmap inode to set the bitmap values to 1. In doing so, it allocates a new transaction and invokes the following sequence of functions, - xfs_rtfree_range() - xfs_rtmodify_range() - xfs_rtbuf_get() We pass '&xfs_rtbuf_ops' as the ops pointer to xfs_trans_read_buf(). - xfs_trans_read_buf() We find the xfs_buf of interest in per-ag hash table, invoke xfs_buf_reverify() which ends up assigning '&xfs_rtbuf_ops' to xfs_buf->b_ops. On the other hand, if xfs_growfs_rt_alloc() had allocated a few blocks for the bitmap inode and returned with an error, all the xfs_bufs corresponding to the new bitmap blocks that have been allocated would continue to be on xfs_ail->ail_buf_list list without ever having a non-NULL value assigned to their b_ops members. An AIL flush operation would then trigger the following warning message to be printed on the console, XFS (loop0): _xfs_buf_ioapply: no buf ops on daddr 0x58 len 8 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ CPU: 3 PID: 449 Comm: xfsaild/loop0 Not tainted 5.8.0-rc4-chandan-00038-g4d8c2b9de9ab-dirty #37 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0x57/0x70 _xfs_buf_ioapply+0x37c/0x3b0 ? xfs_rw_bdev+0x1e0/0x1e0 ? xfs_buf_delwri_submit_buffers+0xd4/0x210 __xfs_buf_submit+0x6d/0x1f0 xfs_buf_delwri_submit_buffers+0xd4/0x210 xfsaild+0x2c8/0x9e0 ? __switch_to_asm+0x42/0x70 ? xfs_trans_ail_cursor_first+0x80/0x80 kthread+0xfe/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x22/0x30 This message indicates that the xfs_buf had its b_ops member set to NULL. This commit fixes the issue by assigning "&xfs_rtbuf_ops" to b_ops member of each of the xfs_bufs logged by xfs_growfs_rt_alloc(). Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-09-23 08:58:51 -07:00
Chandan Babu R	72cc95132a	xfs: Set xfs_buf type flag when growing summary/bitmap files The following sequence of commands, mkfs.xfs -f -m reflink=0 -r rtdev=/dev/loop1,size=10M /dev/loop0 mount -o rtdev=/dev/loop1 /dev/loop0 /mnt xfs_growfs /mnt ... causes the following call trace to be printed on the console, XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) \|\| (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 331 Call Trace: xfs_buf_item_format+0x632/0x680 ? kmem_alloc_large+0x29/0x90 ? kmem_alloc+0x70/0x120 ? xfs_log_commit_cil+0x132/0x940 xfs_log_commit_cil+0x26f/0x940 ? xfs_buf_item_init+0x1ad/0x240 ? xfs_growfs_rt_alloc+0x1fc/0x280 __xfs_trans_commit+0xac/0x370 xfs_growfs_rt_alloc+0x1fc/0x280 xfs_growfs_rt+0x1a0/0x5e0 xfs_file_ioctl+0x3fd/0xc70 ? selinux_file_ioctl+0x174/0x220 ksys_ioctl+0x87/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x3e/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This occurs because the buffer being formatted has the value of XFS_BLFT_UNKNOWN_BUF assigned to the 'type' subfield of bip->bli_formats->blf_flags. This commit fixes the issue by assigning one of XFS_BLFT_RTSUMMARY_BUF and XFS_BLFT_RTBITMAP_BUF to the 'type' subfield of bip->bli_formats->blf_flags before committing the corresponding transaction. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-09-21 09:54:29 -07:00
Brian Foster	6dd379c7fa	xfs: drop extra transaction roll from inode extent truncate The inode extent truncate path unmaps extents from the inode block mapping, finishes deferred ops to free the associated extents and then explicitly rolls the transaction before processing the next extent. The latter extent roll is spurious as xfs_defer_finish() always returns a clean transaction and automatically relogs inodes attached to the transaction (with lock_flags == 0). This can unnecessarily increase the number of log ticket regrants that occur during a long running truncate operation. Remove the explicit transaction roll. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-09-21 09:54:29 -07:00
Matthew Wilcox (Oracle)	24addd848a	fs: Introduce i_blocks_per_page This helper is useful for both THPs and for supporting block size larger than page size. Convert all users that I could find (we have a few different ways of writing this idiom, and I may have missed some). Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>	2020-09-21 08:59:26 -07:00
Darrick J. Wong	b96cb835e3	xfs: deprecate the V4 format The V4 filesystem format contains known weaknesses in the on-disk format that make metadata verification diffiult. In addition, the format does not support dates past 2038 and will not be upgraded to do so. We should start the process of retiring the old format to close off attack surfaces and to encourage users to migrate onto V5. Therefore, make XFS V4 support a configurable option. For the first period it will be default Y in case some distributors want to withdraw support early; for the second period it will be default N so that anyone who wishes to continue support can do so; and after that, support will be removed from the kernel. Dates for these events have been added to the upstream kernel. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>	2020-09-15 20:52:43 -07:00
Darrick J. Wong	d4f2c14cc9	xfs: don't propagate RTINHERIT -> REALTIME when there is no rtdev While running generic/042 with -drtinherit=1 set in MKFS_OPTIONS, I observed that the kernel will gladly set the realtime flag on any file created on the loopback filesystem even though that filesystem doesn't actually have a realtime device attached. This leads to verifier failures and doesn't make any sense, so be smarter about this. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-09-15 20:52:43 -07:00

1 2 3 4 5 ...

6563 Commits