Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:
static inline
struct page *rmqueue(struct zone *preferred_zone,
struct zone *zone, unsigned int order,
gfp_t gfp_flags, unsigned int alloc_flags,
int migratetype)
{
struct page *page;
/*
* We most definitely don't want callers attempting to
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...
Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().
This is not an issue that productions systems will see as
filesystems with block size > page size cannot be mounted by the
kernel; Pankaj is developing this functionality right now.
Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea82 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Link: https://lore.kernel.org/r/20240822135018.1931258-8-kernel@pankajraghav.com
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The pag in xfs_ag_resv_rmapbt_alloc() is already held when the struct
xfs_btree_cur is initialized in xfs_rmapbt_init_cursor(), so there is no
need to get pag again.
On the other hand, in xfs_rmapbt_free_block(), the similar function
xfs_ag_resv_rmapbt_free() was removed in commit 92a005448f ("xfs: get
rid of unnecessary xfs_perag_{get,put} pairs"), xfs_ag_resv_rmapbt_alloc()
was left because scrub used it, but now scrub has removed it. Therefore,
we could get rid of xfs_ag_resv_rmapbt_alloc() just like the rmap free
block, make the code cleaner.
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Currently the AIL attempts to keep 25% of the "log space" free,
where the current used space is tracked by the reserve grant head.
That is, it tracks both physical space used plus the amount reserved
by transactions in progress.
When we start tail pushing, we are trying to make space for new
reservations by writing back older metadata and the log is generally
physically full of dirty metadata, and reservations for modifications
in flight take up whatever space the AIL can physically free up.
Hence we don't really need to take into account the reservation
space that has been used - we just need to keep the log tail moving
as fast as we can to free up space for more reservations to be made.
We know exactly how much physical space the journal is consuming in
the AIL (i.e. max LSN - min LSN) so we can base push thresholds
directly on this state rather than have to look at grant head
reservations to determine how much to physically push out of the
log.
This also allows code that needs to know if log items in the current
transaction need to be pushed or re-logged to simply sample the
current target - they don't need to calculate the current target
themselves. This avoids the need for any locking when doing such
checks.
Further, moving to a physical target means we don't need "push all
until empty semantics" like were introduced in the previous patch.
We can now test and clear the "push all" as a one-shot command to
set the target to the current head of the AIL. This allows the
xfsaild to maximise the use of log space right up to the point where
conditions indicate that the xfsaild is not keeping up with load and
it needs to work harder, and as soon as those constraints go away
(i.e. external code no longer needs everything pushed) the xfsaild
will return to maintaining the normal 25% free space thresholds.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
We have a mechanism that checks the amount of log space remaining
available every time we make a transaction reservation. If the
amount of space is below a threshold (25% free) we push on the AIL
to tell it to do more work. To do this, we end up calculating the
LSN that the AIL needs to push to on every reservation and updating
the push target for the AIL with that new target LSN.
This is silly and expensive. The AIL is perfectly capable of
calculating the push target itself, and it will always be running
when the AIL contains objects.
What the target does is determine if the AIL needs to do
any work before it goes back to sleep. If we haven't run out of
reservation space or memory (or some other push all trigger), it
will simply go back to sleep for a while if there is more than 25%
of the journal space free without doing anything.
If there are items in the AIL at a lower LSN than the target, it
will try to push up to the target or to the point of getting stuck
before going back to sleep and trying again soon after.`
Hence we can modify the AIL to calculate it's own 25% push target
before it starts a push using the same reserve grant head based
calculation as is currently used, and remove all the places where we
ask the AIL to push to a new 25% free target. We can also drop the
minimum free space size of 256BBs from the calculation because the
25% of a minimum sized log is *always going to be larger than
256BBs.
This does still require a manual push in certain circumstances.
These circumstances arise when the AIL is not full, but the
reservation grants consume the entire of the free space in the log.
In this case, we still need to push on the AIL to free up space, so
when we hit this condition (i.e. reservation going to sleep to wait
on log space) we do a single push to tell the AIL it should empty
itself. This will keep the AIL moving as new reservations come in
and want more space, rather than keep queuing them and having to
push the AIL repeatedly.
The reason for using the "push all" when grant space runs out is
that we can run out of grant space when there is more than 25% of
the log free. Small logs are notorious for this, and we have a hack
in the log callback code (xlog_state_set_callback()) where we push
the AIL because the *head* moved) to ensure that we kick the AIL
when we consume space in it because that can push us over the "less
than 25% available" available that starts tail pushing back up
again.
Hence when we run out of grant space and are going to sleep, we have
to consider that the grant space may be consuming almost all the log
space and there is almost nothing in the AIL. In this situation, the
AIL pins the tail and moving the tail forwards is the only way the
grant space will come available, so we have to force the AIL to push
everything to guarantee grant space will eventually be returned.
Hence triggering a "push all" just before sleeping removes all the
nasty corner cases we have in other parts of the code that work
around the "we didn't ask the AIL to push enough to free grant
space" condition that leads to log space hangs...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
A concurrent file creation and little writing could unexpectedly return
-ENOSPC error since there is a race window that the allocator could get
the wrong agf->agf_longest.
Write file process steps:
1) Find the entry that best meets the conditions, then calculate the start
address and length of the remaining part of the entry after allocation.
2) Delete this entry and update the -current- agf->agf_longest.
3) Insert the remaining unused parts of this entry based on the
calculations in 1), and update the agf->agf_longest again if necessary.
Create file process steps:
1) Check whether there are free inodes in the inode chunk.
2) If there is no free inode, check whether there has space for creating
inode chunks, perform the no-lock judgment first.
3) If the judgment succeeds, the judgment is performed again with agf lock
held. Otherwire, an error is returned directly.
If the write process is in step 2) but not go to 3) yet, the create file
process goes to 2) at this time, it may be mistaken for no space,
resulting in the file system still has space but the file creation fails.
We have sent two different commits to the community in order to fix this
problem[1][2]. Unfortunately, both solutions have flaws. In [2], I
discussed with Dave and Darrick, realized that a better solution to this
problem requires the "last cnt record tracking" to be ripped out of the
generic btree code. And surprisingly, Dave directly provided his fix code.
This patch includes appropriate modifications based on his tmp-code to
address this issue.
The entire fix can be roughly divided into two parts:
1) Delete the code related to lastrec-update in the generic btree code.
2) Place the process of updating longest freespace with cntbt separately
to the end of the cntbt modifications. Move the cursor to the rightmost
firstly, and update the longest free extent based on the record.
Note that we can not update the longest with xfs_alloc_get_rec() after
find the longest record, as xfs_verify_agbno() may not pass because
pag->block_count is updated on the outside. Therefore, use
xfs_btree_get_rec() as a replacement.
[1] https://lore.kernel.org/all/20240419061848.1032366-2-yebin10@huawei.com
[2] https://lore.kernel.org/all/20240604071121.3981686-1-wozizhi@huawei.com
Reported by: Ye Bin <yebin10@huawei.com>
Signed-off-by: Zizhi Wo <wozizhi@huawei.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Move the code that adds the incore xfs_refcount_update_item deferred
work data to a transaction live with the CUI log item code. This means
that the refcount code no longer has to know about the inner workings of
the CUI log items.
As a consequence, we can get rid of the _{get,put}_group helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Only update rcur when we know the final *pcur value.
Inspired-by: Christoph Hellwig <hch@lst.de>
[djwong: don't leave the caller with a dangling ref]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
In xfs_refcount_finish_one we know the cursor is non-zero when calling
xfs_refcount_finish_one_cleanup and we pass a 0 error variable. This
means xfs_refcount_finish_one_cleanup is just doing a
xfs_btree_del_cursor.
Open code that and move xfs_refcount_finish_one_cleanup to
fs/xfs/xfs_refcount_item.c.
Inspired-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pass the incore refcount intent structure to the tracepoints instead of
open-coding the argument passing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Prepare the rest of refcount btree tracepoints for use with realtime
reflink by making them take the btree cursor object as a parameter.
This will save us a lot of trouble later on.
Remove the xfs_refcount_recover_extent tracepoint since it's already
covered by other refcount tracepoints.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
The only user of the "ag" tracepoint event classes is the refcount
btree, so rename them to make that obvious and make them take the btree
cursor to simplify the arguments. This will save us a lot of trouble
later on.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Convert all the refcount tracepoints to use the btree error tracepoint
class.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the code that adds the incore xfs_rmap_update_item deferred work
data to a transaction to live with the RUI log item code. This means
that the rmap code no longer has to know about the inner workings of the
RUI log items.
As a consequence, we can get rid of the _{get,put}_group helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Only update rcur when we know the final *pcur value.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[djwong: don't leave the caller with a dangling ref]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
In xfs_rmap_finish_one we known the cursor is non-zero when calling
xfs_rmap_finish_one_cleanup and we pass a 0 error variable. This means
xfs_rmap_finish_one_cleanup is just doing a xfs_btree_del_cursor.
Open code that and move xfs_rmap_finish_one_cleanup to
fs/xfs/xfs_rmap_item.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
[djwong: minor porting changes]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Pass the incore rmap structure to the tracepoints instead of open-coding
the argument passing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Prepare the rmap btree tracepoints for use with realtime rmap btrees by
making them take the btree cursor object as a parameter. This will save
us a lot of trouble later on.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new tracepoint class for btree-related errors, then convert all
the rmap tracepoints to use it. Also fix the one tracepoint that was
abusing the old class by making it a separate tracepoint.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the code that adds the incore xfs_extent_free_item deferred work
data to a transaction to live with the EFI log item code. This means
that the allocator code no longer has to know about the inner workings
of the EFI log items.
As a consequence, we can get rid of the _{get,put}_group helpers.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
xfs_free_extent_later can handle the extra AGFL special casing with
very little extra logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The bno/len verification is already done by the calls to
xfs_verify_rtbext / xfs_verify_fsbext, and reporting a corruption error
seem like the better handling than tripping an assert anyway.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Convert the boolean to skip discard on free into a proper flags field so
that we can add more flags in the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Pass the incore EFI structure to the tracepoints instead of open-coding
the argument passing. This cleans up the call sites a bit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Currently, the XFS_SB_CRC_OFF macro uses the incore superblock struct
(xfs_sb) to compute the address of sb_crc within the ondisk superblock
struct (xfs_dsb). This is a landmine if we ever change the layout of
the incore superblock (as we're about to do), so redefine the macro
to use xfs_dsb to compute the layout of xfs_dsb.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the directory entry update hook code to xfs_dir2 so that it is
mostly consolidated with the higher level directory functions. Retain
the exports so that online fsck can still send notifications through the
hooks.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new libxfs function to rename two directory entries. The
upcoming metadata directory feature will need this to replace a metadata
inode directory entry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new libxfs function to exchange two directory entries.
The upcoming metadata directory feature will need this to replace a
metadata inode directory entry.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new libxfs function to remove a (name, inode) entry from a
directory. The upcoming metadata directory feature will need this to
create a metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a libxfs helper function that marks an inode free on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new libxfs function to link an existing inode into a directory.
The upcoming metadata directory feature will need this to create a
metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Create a new libxfs function to link a newly created inode into a
directory. The upcoming metadata directory feature will need this to
create a metadata directory tree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
INIT_XATTRS is overloaded here -- it's set during the creat process when
we think that we're immediately going to set some ACL xattrs to save
time. However, it's also used by the parent pointers code to enable the
attr fork in preparation to receive ppptr xattrs. This results in
xfs_has_parent() branches scattered around the codebase to turn on
INIT_XATTRS.
Linkable files are created far more commonly than unlinkable temporary
files or directory tree roots, so we should centralize this logic in
xfs_inode_init. For the three callers that don't want parent pointers
(online repiar tempfiles, unlinkable tempfiles, rootdir creation) we
provide an UNLINKABLE flag to skip attr fork initialization.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move all the code that initializes a new inode's attributes from the
icreate_args structure and the parent directory into libxfs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
There are two parts to initializing a newly allocated inode: setting up
the incore structures, and initializing the new inode core based on the
parent inode and the current user's environment. The initialization
code is not specific to the kernel, so we would like to share that with
userspace by hoisting it to libxfs. Therefore, split xfs_icreate into
separate functions to prepare for the next few patches.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Enable xfs_trans_ichgtime to change the inode access time so that we can
use this function to set inode times when allocating inodes instead of
open-coding it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Callers that want to create an inode currently pass all possible file
attribute values for the new inode into xfs_init_new_inode as ten
separate parameters. This causes two code maintenance issues: first, we
have large multi-line call sites which programmers must read carefully
to make sure they did not accidentally invert a value. Second, all
three file id parameters must be passed separately to the quota
functions; any discrepancy results in quota count errors.
Clean this up by creating a new icreate_args structure to hold all this
information, some helpers to initialize them properly, and make the
callers pass this structure through to the creation function, whose name
we shorten to xfs_icreate. This eliminates the issues, enables us to
keep the inode init code in sync with userspace via libxfs, and is
needed for future metadata directory tree management.
(A subsequent cleanup will also fix the quota alloc calls.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Hoist the inode flag conversion functions into libxfs so that we can
keep them in sync. Do this by creating a new xfs_inode_util.c file in
libxfs.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Move the extent size helpers to xfs_bmap.c in libxfs since they're used
there already.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Following warning is reported, so remove these duplicated header
including:
./fs/xfs/libxfs/xfs_trans_resv.c: xfs_da_format.h is included more than once.
./fs/xfs/scrub/quota_repair.c: xfs_format.h is included more than once.
./fs/xfs/xfs_handle.c: xfs_da_btree.h is included more than once.
./fs/xfs/xfs_qm_bhv.c: xfs_mount.h is included more than once.
./fs/xfs/xfs_trace.c: xfs_bmap.h is included more than once.
This is just a clean code, no logic changed.
Signed-off-by: Wenchao Hao <haowenchao22@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
This adds sanity checks for xfs_dir2_data_unused and xfs_dir2_data_entry
to make sure don't stray beyond valid memory region. Before patching, the
loop simply checks that the start offset of the dup and dep is within the
range. So in a crafted image, if last entry is xfs_dir2_data_unused, we
can change dup->length to dup->length-1 and leave 1 byte of space. In the
next traversal, this space will be considered as dup or dep. We may
encounter an out of bound read when accessing the fixed members.
In the patch, we make sure that the remaining bytes large enough to hold
an unused entry before accessing xfs_dir2_data_unused and
xfs_dir2_data_unused is XFS_DIR2_DATA_ALIGN byte aligned. We also make
sure that the remaining bytes large enough to hold a dirent with a
single-byte name before accessing xfs_dir2_data_entry.
Signed-off-by: lei lu <llfamsec@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
Currently AGFL blocks can be filled from the following three sources:
- allocbt free blocks, as in xfs_allocbt_free_block();
- rmapbt free blocks, as in xfs_rmapbt_free_block();
- refilled from freespace btrees, as in xfs_alloc_fix_freelist().
Originally, allocbt free blocks would be marked as stale only when they
put back in the general free space pool as Dave mentioned on IRC, "we
don't stale AGF metadata btree blocks when they are returned to the
AGFL .. but once they get put back in the general free space pool, we
have to make sure the buffers are marked stale as the next user of
those blocks might be user data...."
However, after commit ca250b1b3d ("xfs: invalidate allocbt blocks
moved to the free list") and commit edfd9dd549 ("xfs: move buffer
invalidation to xfs_btree_free_block"), even allocbt / bmapbt free
blocks will be invalidated immediately since they may fail to pass
V5 format validation on writeback even writeback to free space would be
safe.
IOWs, IMHO currently there is actually no difference of free blocks
between AGFL freespace pool and the general free space pool. So let's
avoid extra redundant AGFL buffer invalidation, since otherwise we're
currently facing unnecessary xfs_log_force() due to xfs_trans_binval()
again on buffers already marked as stale before as below:
[ 333.507469] Call Trace:
[ 333.507862] xfs_buf_find+0x371/0x6a0 <- xfs_buf_lock
[ 333.508451] xfs_buf_get_map+0x3f/0x230
[ 333.509062] xfs_trans_get_buf_map+0x11a/0x280
[ 333.509751] xfs_free_agfl_block+0xa1/0xd0
[ 333.510403] xfs_agfl_free_finish_item+0x16e/0x1d0
[ 333.511157] xfs_defer_finish_noroll+0x1ef/0x5c0
[ 333.511871] xfs_defer_finish+0xc/0xa0
[ 333.512471] xfs_itruncate_extents_flags+0x18a/0x5e0
[ 333.513253] xfs_inactive_truncate+0xb8/0x130
[ 333.513930] xfs_inactive+0x223/0x270
xfs_log_force() will take tens of milliseconds with AGF buffer locked.
It becomes an unnecessary long latency especially on our PMEM devices
with FSDAX enabled and fsops like xfs_reflink_find_shared() at the same
time are stuck due to the same AGF lock. Removing the double
invalidation on the AGFL blocks does not make this issue go away, but
this patch fixes for our workloads in reality and it should also work
by the code analysis.
Note that I'm not sure I need to remove another redundant one in
xfs_alloc_ag_vextent_small() since it's unrelated to our workloads.
Also fstests are passed with this patch.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
The kernel reads userspace's buffer but does not write it back.
Therefore this is really an _IOW ioctl. Change this before 6.10 final
releases.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
For a very very long time, inode inactivation has set the inode size to
zero before unmapping the extents associated with the data fork.
Unfortunately, commit 3c6f46eacd changed the inode verifier to
prohibit zero-length symlinks and directories. If an inode happens to
get logged in this state and the system crashes before freeing the
inode, log recovery will also fail on the broken inode.
Therefore, allow zero-size symlinks and directories as long as the link
count is zero; nobody will be able to open these files by handle so
there isn't any risk of data exposure.
Fixes: 3c6f46eacd ("xfs: sanity check directory inode di_size")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
xfs/205 produces the following failure when always_cow is enabled:
--- a/tests/xfs/205.out 2024-02-28 16:20:24.437887970 -0800
+++ b/tests/xfs/205.out.bad 2024-06-03 21:13:40.584000000 -0700
@@ -1,4 +1,5 @@
QA output created by 205
*** one file
+ !!! disk full (expected)
*** one file, a few bytes at a time
*** done
This is the result of overly aggressive attempts to align cow fork
delalloc reservations to the CoW extent size hint. Looking at the trace
data, we're trying to append a single fsblock to the "fred" file.
Trying to create a speculative post-eof reservation fails because
there's not enough space.
We then set @prealloc_blocks to zero and try again, but the cowextsz
alignment code triggers, which expands our request for a 1-fsblock
reservation into a 39-block reservation. There's not enough space for
that, so the whole write fails with ENOSPC even though there's
sufficient space in the filesystem to allocate the single block that we
need to land the write.
There are two things wrong here -- first, we shouldn't be attempting
speculative preallocations beyond what was requested when we're low on
space. Second, if we've already computed a posteof preallocation, we
shouldn't bother trying to align that to the cowextsize hint.
Fix both of these problems by adding a flag that only enables the
expansion of the delalloc reservation to the cowextsize if we're doing a
non-extending write, and only if we're not doing an ENOSPC retry. This
requires us to move the ENOSPC retry logic to xfs_bmapi_reserve_delalloc.
I probably should have caught this six years ago when 6ca30729c2 was
being reviewed, but oh well. Update the comments to reflect what the
code does now.
Fixes: 6ca30729c2 ("xfs: bmap code cleanup")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
A user with a completely full filesystem experienced an unexpected
shutdown when the filesystem tried to write the superblock during
runtime.
kernel shows the following dmesg:
[ 8.176281] XFS (dm-4): Metadata corruption detected at xfs_sb_write_verify+0x60/0x120 [xfs], xfs_sb block 0x0
[ 8.177417] XFS (dm-4): Unmount and run xfs_repair
[ 8.178016] XFS (dm-4): First 128 bytes of corrupted metadata buffer:
[ 8.178703] 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 01 90 00 00 XFSB............
[ 8.179487] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 8.180312] 00000020: cf 12 dc 89 ca 26 45 29 92 e6 e3 8d 3b b8 a2 c3 .....&E)....;...
[ 8.181150] 00000030: 00 00 00 00 01 00 00 06 00 00 00 00 00 00 00 80 ................
[ 8.182003] 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................
[ 8.182004] 00000050: 00 00 00 01 00 64 00 00 00 00 00 04 00 00 00 00 .....d..........
[ 8.182004] 00000060: 00 00 64 00 b4 a5 02 00 02 00 00 08 00 00 00 00 ..d.............
[ 8.182005] 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 17 00 00 19 ................
[ 8.182008] XFS (dm-4): Corruption of in-memory data detected. Shutting down filesystem
[ 8.182010] XFS (dm-4): Please unmount the filesystem and rectify the problem(s)
When xfs_log_sb writes super block to disk, b_fdblocks is fetched from
m_fdblocks without any lock. As m_fdblocks can experience a positive ->
negative -> positive changing when the FS reaches fullness (see
xfs_mod_fdblocks). So there is a chance that sb_fdblocks is negative, and
because sb_fdblocks is type of unsigned long long, it reads super big.
And sb_fdblocks being bigger than sb_dblocks is a problem during log
recovery, xfs_validate_sb_write() complains.
Fix:
As sb_fdblocks will be re-calculated during mount when lazysbcount is
enabled, We just need to make xfs_validate_sb_write() happy -- make sure
sb_fdblocks is not nenative. This patch also takes care of other percpu
counters in xfs_log_sb.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
An async dio write to a sparse file can generate a lot of extents
and when we unlink this file (using rm), the kernel can be busy in umapping
and freeing those extents as part of transaction processing.
Similarly xfs reflink remapping path can also iterate over a million
extent entries in xfs_reflink_remap_blocks().
Since we can busy loop in these two functions, so let's add cond_resched()
to avoid softlockup messages like these.
watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435]
CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S L 6.9.0-rc5-0-default #1
Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker
NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290
LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290
Call Trace:
xfs_alloc_get_rec+0x54/0x1b0 (unreliable)
xfs_alloc_compute_aligned+0x5c/0x144
xfs_alloc_ag_vextent_size+0x238/0x8d4
xfs_alloc_fix_freelist+0x540/0x694
xfs_free_extent_fix_freelist+0x84/0xe0
__xfs_free_extent+0x74/0x1ec
xfs_extent_free_finish_item+0xcc/0x214
xfs_defer_finish_one+0x194/0x388
xfs_defer_finish_noroll+0x1b4/0x5c8
xfs_defer_finish+0x2c/0xc4
xfs_bunmapi_range+0xa4/0x100
xfs_itruncate_extents_flags+0x1b8/0x2f4
xfs_inactive_truncate+0xe0/0x124
xfs_inactive+0x30c/0x3e0
xfs_inodegc_worker+0x140/0x234
process_scheduled_works+0x240/0x57c
worker_thread+0x198/0x468
kthread+0x138/0x140
start_kernel_thread+0x14/0x18
run fstests generic/175 at 2024-02-02 04:40:21
[ C17] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679]
CPU: 17 PID: 7679 Comm: xfs_io Kdump: loaded Tainted: G X 6.4.0
NIP [c008000005e3ec94] xfs_rmapbt_diff_two_keys+0x54/0xe0 [xfs]
LR [c008000005e08798] xfs_btree_get_leaf_keys+0x110/0x1e0 [xfs]
Call Trace:
0xc000000014107c00 (unreliable)
__xfs_btree_updkeys+0x8c/0x2c0 [xfs]
xfs_btree_update_keys+0x150/0x170 [xfs]
xfs_btree_lshift+0x534/0x660 [xfs]
xfs_btree_make_block_unfull+0x19c/0x240 [xfs]
xfs_btree_insrec+0x4e4/0x630 [xfs]
xfs_btree_insert+0x104/0x2d0 [xfs]
xfs_rmap_insert+0xc4/0x260 [xfs]
xfs_rmap_map_shared+0x228/0x630 [xfs]
xfs_rmap_finish_one+0x2d4/0x350 [xfs]
xfs_rmap_update_finish_item+0x44/0xc0 [xfs]
xfs_defer_finish_noroll+0x2e4/0x740 [xfs]
__xfs_trans_commit+0x1f4/0x400 [xfs]
xfs_reflink_remap_extent+0x2d8/0x650 [xfs]
xfs_reflink_remap_blocks+0x154/0x320 [xfs]
xfs_file_remap_range+0x138/0x3a0 [xfs]
do_clone_file_range+0x11c/0x2f0
vfs_clone_file_range+0x60/0x1c0
ioctl_file_clone+0x78/0x140
sys_ioctl+0x934/0x1270
system_call_exception+0x158/0x320
system_call_vectored_common+0x15c/0x2ec
Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Tested-by: Disha Goel<disgoel@linux.ibm.com>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
An internal user complained about log recovery failing on a symlink
("Bad dinode after recovery") with the following (excerpted) format:
core.magic = 0x494e
core.mode = 0120777
core.version = 3
core.format = 2 (extents)
core.nlinkv2 = 1
core.nextents = 1
core.size = 297
core.nblocks = 1
core.naextents = 0
core.forkoff = 0
core.aformat = 2 (extents)
u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
0:[0,12,1,0]
This is a symbolic link with a 297-byte target stored in a disk block,
which is to say this is a symlink with a remote target. The forkoff is
0, which is to say that there's 512 - 176 == 336 bytes in the inode core
to store the data fork.
Eventually, testing of generic/388 failed with the same inode corruption
message during inode recovery. In writing a debugging patch to call
xfs_dinode_verify on dirty inode log items when we're committing
transactions, I observed that xfs/298 can reproduce the problem quite
quickly.
xfs/298 creates a symbolic link, adds some extended attributes, then
deletes them all. The test failure occurs when the final removexattr
also deletes the attr fork because that does not convert the remote
symlink back into a shortform symlink. That is how we trip this test.
The only reason why xfs/298 only triggers with the debug patch added is
that it deletes the symlink, so the final iflush shows the inode as
free.
I wrote a quick fstest to emulate the behavior of xfs/298, except that
it leaves the symlinks on the filesystem after inducing the "corrupt"
state. Kernels going back at least as far as 4.18 have written out
symlink inodes in this manner and prior to 1eb70f54c4 they did not
object to reading them back in.
Because we've been writing out inodes this way for quite some time, the
only way to fix this is to relax the check for symbolic links.
Directories don't have this problem because di_size is bumped to
blocksize during the sf->data conversion.
Fixes: 1eb70f54c4 ("xfs: validate inode fork size against fork format")
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>