Commit Graph

7285 Commits

Author SHA1 Message Date
Darrick J. Wong
996b2329b2 xfs: constify the name argument to various directory functions
Various directory functions do not modify their @name parameter,
so mark it const to make that clear.  This will enable us to mark
the global xfs_name_dotdot variable as const to prevent mischief.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
41667260bc xfs: reserve quota for target dir expansion when renaming files
XFS does not reserve quota for directory expansion when renaming
children into a directory.  This means that we don't reject the
expansion with EDQUOT when we're at or near a hard limit, which means
that unprivileged userspace can use rename() to exceed quota.

Rename operations don't always expand the target directory, and we allow
a rename to proceed with no space reservation if we don't need to add a
block to the target directory to handle the addition.  Moreover, the
unlink operation on the source directory generally does not expand the
directory (you'd have to free a block and then cause a btree split) and
it's probably of little consequence to leave the corner case that
renaming a file out of a directory can increase its size.

As with link and unlink, there is a further bug in that we do not
trigger the blockgc workers to try to clear space when we're out of
quota.

Because rename is its own special tricky animal, we'll patch xfs_rename
directly to reserve quota to the rename transaction.  We'll leave
cleaning up the rest of xfs_rename for the metadata directory tree
patchset.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
871b9316e7 xfs: reserve quota for dir expansion when linking/unlinking files
XFS does not reserve quota for directory expansion when linking or
unlinking children from a directory.  This means that we don't reject
the expansion with EDQUOT when we're at or near a hard limit, which
means that unprivileged userspace can use link()/unlink() to exceed
quota.

The fix for this is nuanced -- link operations don't always expand the
directory, and we allow a link to proceed with no space reservation if
we don't need to add a block to the directory to handle the addition.
Unlink operations generally do not expand the directory (you'd have to
free a block and then cause a btree split) and we can defer the
directory block freeing if there is no space reservation.

Moreover, there is a further bug in that we do not trigger the blockgc
workers to try to clear space when we're out of quota.

To fix both cases, create a new xfs_trans_alloc_dir function that
allocates the transaction, locks and joins the inodes, and reserves
quota for the directory.  If there isn't sufficient space or quota,
we'll switch the caller to reservationless mode.  This should prevent
quota usage overruns with the least restriction in functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
dd3b015dd8 xfs: refactor user/group quota chown in xfs_setattr_nonsize
Combine if tests to reduce the indentation levels of the quota chown
calls in xfs_setattr_nonsize.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
2022-03-14 10:23:17 -07:00
Darrick J. Wong
e014f37db1 xfs: use setattr_copy to set vfs inode attributes
Filipe Manana pointed out that XFS' behavior w.r.t. setuid/setgid
revocation isn't consistent with btrfs[1] or ext4.  Those two
filesystems use the VFS function setattr_copy to convey certain
attributes from struct iattr into the VFS inode structure.

Andrey Zhadchenko reported[2] that XFS uses the wrong user namespace to
decide if it should clear setgid and setuid on a file attribute update.
This is a second symptom of the problem that Filipe noticed.

XFS, on the other hand, open-codes setattr_copy in xfs_setattr_mode,
xfs_setattr_nonsize, and xfs_setattr_time.  Regrettably, setattr_copy is
/not/ a simple copy function; it contains additional logic to clear the
setgid bit when setting the mode, and XFS' version no longer matches.

The VFS implements its own setuid/setgid stripping logic, which
establishes consistent behavior.  It's a tad unfortunate that it's
scattered across notify_change, should_remove_suid, and setattr_copy but
XFS should really follow the Linux VFS.  Adapt XFS to use the VFS
functions and get rid of the old functions.

[1] https://lore.kernel.org/fstests/CAL3q7H47iNQ=Wmk83WcGB-KBJVOEtR9+qGczzCeXJ9Y2KCV25Q@mail.gmail.com/
[2] https://lore.kernel.org/linux-xfs/20220221182218.748084-1-andrey.zhadchenko@virtuozzo.com/

Fixes: 7fa294c899 ("userns: Allow chown and setgid preservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
2022-03-14 10:23:16 -07:00
Darrick J. Wong
eba0549bc7 xfs: don't generate selinux audit messages for capability testing
There are a few places where we test the current process' capability set
to decide if we're going to be more or less generous with resource
acquisition for a system call.  If the process doesn't have the
capability, we can continue the call, albeit in a degraded mode.

These are /not/ the actual security decisions, so it's not proper to use
capable(), which (in certain selinux setups) causes audit messages to
get logged.  Switch them to has_capability_noaudit.

Fixes: 7317a03df7 ("xfs: refactor inode ownership change transaction/inode/quota allocation idiom")
Fixes: ea9a46e1c4 ("xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2022-03-09 10:32:06 -08:00
Gao Xiang
1a39ae415c xfs: add missing cmap->br_state = XFS_EXT_NORM update
COW extents are already converted into written real extents after
xfs_reflink_convert_cow_locked(), therefore cmap->br_state should
reflect it.

Otherwise, there is another necessary unwritten convertion
triggered in xfs_dio_write_end_io() for direct I/O cases.

Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-09 10:32:06 -08:00
Linus Torvalds
3bd9dd8138 Bug fixes for 5.17-rc4:
- Only call sync_filesystem when we're remounting the filesystem
    readonly readonly, and actually check its return value.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmIEnhUACgkQ+H93GTRK
 tOsn7g//e3R/lqpkx6jJtg6SqiC1KiI9euwD0wBdIvrCWSJZ6IdjOSvfRRG13vN0
 S1spU0joiLLlVhzLIQdysgZkRub57P0mRmq3zVpHYxMOWKBvPH1OmZtdu83HOiAv
 /zjy3tNFc/1ZaqHudAZv3+4780qMZtQTmL7DbgLnvFspCBf4PdBlT0d7Wbf982w8
 dMWF7Pla8DhLVFbMsGdyXlnGROz+pw3jofVwY9P6f+PaY37mo+lZu65GrTlNecnc
 QfTyX45VAWFO/XZtXm7pXCr8211eK2SnrOFZXZH9u3qxSD5vo1NWf9KPKVkYxc8/
 7icz+Yp5t61HQg3o0z7cNAQZp7CQl0BWz6gp2YXMWHS3ZJMnd6H4zTDBdV2MSA5/
 alT4kcwncRVcmHtFET7JAsnQkWNeREBqhqCRoAf8hW8uxpjkXw6sPop7+hbZtoJw
 VAp1TxbEMbPGTZb76Kw4nZt1eZ3SyJOl6ByzsJMxekEFiMYVh4yxO+a3Q6KNOkMM
 O62JpzdE1EeFgV7qmoZ8QzCZuD7z7KC99iv5QtyacFITCqv5y0h/RLGCsOwJ0EMc
 fJGKN7uQOZrBIJYInx53S7fCYGGMm0+HUUXMUatBe4RK3dADyqapLzQb0tCGamAf
 NQra6NotwfNq8SN+Sn17PJ1KifSRKfw6l7Q+6pt9LA2eVbr2jV4=
 =6ODm
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "Nothing exciting, just more fixes for not returning sync_filesystem
  error values (and eliding it when it's not necessary).

  Summary:

   - Only call sync_filesystem when we're remounting the filesystem
     readonly readonly, and actually check its return value"

* tag 'xfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: only bother with sync_filesystem during readonly remount
2022-02-26 09:53:19 -08:00
Gustavo A. R. Silva
5224f79096 treewide: Replace zero-length arrays with flexible-array members
There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

This code was transformed with the help of Coccinelle:
(next-20220214$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file script.cocci --include-headers --dir . > output.patch)

@@
identifier S, member, array;
type T1, T2;
@@

struct S {
  ...
  T1 member;
  T2 array[
- 0
  ];
};

UAPI and wireless changes were intentionally excluded from this patch
and will be sent out separately.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/78
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2022-02-17 07:00:39 -06:00
Darrick J. Wong
b97cca3ba9 xfs: only bother with sync_filesystem during readonly remount
In commit 02b9984d64, we pushed a sync_filesystem() call from the VFS
into xfs_fs_remount.  The only time that we ever need to push dirty file
data or metadata to disk for a remount is if we're remounting the
filesystem read only, so this really could be moved to xfs_remount_ro.

Once we've moved the call site, actually check the return value from
sync_filesystem.

Fixes: 02b9984d64 ("fs: push sync_filesystem() down to the file system's remount_fs()")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-02-09 21:07:24 -08:00
Linus Torvalds
fbc04bf01a Fixes for 5.17-rc3:
- Fix fallocate so that it drops all file privileges when files are
    modified instead of open-coding that incompletely.
  - Fix fallocate to flush the log if the caller wanted synchronous file
    updates.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmH7/7wACgkQ+H93GTRK
 tOut/BAAl2aLNDdBjjp9rBbNdnZtXSQ3JKBo/6FrGX+7M2rr5/Ppskzfm2FE6HQj
 TS4CNAg1Gid/gbK956iOU+E3LvGR1lUA3yhtm0OZ7yffQ+1zGA2RUCTy39SZEUTI
 PTpzF6gH/oZAKrfv2r9yLEFJdPyBoo8BBBHBRshw5TiDqMG1pJ3toTySehLXHsRf
 Bf4pXGyOy0+q9zkIprUQg5l3cEEdoOW5mLwv/1soiLx+mz6Go/MHFaAO/U3m2KBq
 mwNUZcXoRSihyd7xKEqLQfYuy7abrtICsupoMGvCzFI1ox1ZzPN5/aNAI1LBXz8+
 tK+gRaJg7stGVuI28zIEw4mmcLWG+h46fyuunhY6EIcoXZzpcXL9gxQzle6ypE8e
 ScRtRnjUd4CsDyRNzDez082K1T6M6pD3QXMWPn29/WnjKElbEKwqDRjri4HOqwyq
 A3buwhvuzK2ZS8f2k1/YLrNqZLCZwd9LRsz085GI7HUp7cardiTFERNzt4Nt935k
 T6lDFNKXVW9MZvSXGWNGmwHAQRvmkW+i7pyhE1LCsTwF0Q6YNwbhZAl9SJK5S5Zk
 TJEam4d6xY86Ppu7kXcYWtv8gJxMJyiGbywI8xZDz1VrkwNbfB0ZK/8A9KEMGc1r
 fuUxTZF12BY5Sm5YaxlmrDnkkvn3K9CHQ3DI7BiP+IldwDdhuOU=
 =dqTg
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "I was auditing operations in XFS that clear file privileges, and
  realized that XFS' fallocate implementation drops suid/sgid but
  doesn't clear file capabilities the same way that file writes and
  reflink do.

  There are VFS helpers that do it correctly, so refactor XFS to use
  them. I also noticed that we weren't flushing the log at the correct
  point in the fallocate operation, so that's fixed too.

  Summary:

   - Fix fallocate so that it drops all file privileges when files are
     modified instead of open-coding that incompletely.

   - Fix fallocate to flush the log if the caller wanted synchronous
     file updates"

* tag 'xfs-5.17-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: ensure log flush at the end of a synchronous fallocate call
  xfs: move xfs_update_prealloc_flags() to xfs_pnfs.c
  xfs: set prealloc flag in xfs_alloc_file_space()
  xfs: fallocate() should call file_modified()
  xfs: remove XFS_PREALLOC_SYNC
  xfs: reject crazy array sizes being fed to XFS_IOC_GETBMAP*
2022-02-05 09:21:55 -08:00
Linus Torvalds
ea7b3e6d42 Fixes for 5.17-rc3:
- Fix a bug where callers of ->sync_fs (e.g. sync_filesystem and
    syncfs(2)) ignore the return value.
  - Fix a bug where callers of sync_filesystem (e.g. fs freeze) ignore
    the return value.
  - Fix a bug in XFS where xfs_fs_sync_fs never passed back error
    returns.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmH2xBcACgkQ+H93GTRK
 tOvcFQ//Wo3mMeyte2IkA4Km6tKxB5TMk3oEAYRcp5H2w4tmAnqrgukFRHR3kee8
 b7/qBqAeI66FsneT74Rvb5gR+X2gPgiwKg88ZKmCnufJruL2YNL5D0u7e8sTaqO6
 DLIf8EXw2l9KZOsMjPy2wBbflvufhM1LciXR25YwslYXUySaIDqNJ6R58Xyx7xni
 WNNxkM4rHry2AoLBGnyLti88xguLCSGdS5vEyiNj/dKHZn47/J8xXPppFyDyzGA2
 2oVfGCIgUZGL/Hy99iM8kTIN50GS0Eq1tiM6qUnad2U8KUWj1g2l9yqL6pCC3iuT
 PLn4iB7IDV6zs+FOLdgEYLp7iw/2BxUzocBTaZQlZm0VMJ8UYQEvOeecnOqnly+9
 tt12ZLk3iFVSTUejw3PofqrHc2nTUJr8ojrzKrwIc0Pmur6YlIlO99RHUxy+Li4o
 IVh+D5cqOiwCul5cdGk9120dBACEICI+Q70vjYpnwrgjozHg1tML8d1VtPI9+fs3
 fSzjkLlIc+vgLYgsxvmZg8kSEd2hDb3lq2aSirx42aaEvX94QDly/SPNz78FdehZ
 JWrSewp8rBAW8+yvfBKKcpyhCyULLCGIQjKOziuwLpjjEHMOd6CpJ5w/2c/fKxzy
 B775+6lSb+P/FG3ikW7PKYNa5wWfgp0/Maz79XiuAcUJI5FK9R8=
 =BdrI
 -----END PGP SIGNATURE-----

Merge tag 'vfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull vfs fixes from Darrick Wong:
 "I was auditing the sync_fs code paths recently and noticed that most
  callers of ->sync_fs ignore its return value (and many implementations
  never return nonzero even if the fs is broken!), which means that
  internal fs errors and corruption are not passed up to userspace
  callers of syncfs(2) or FIFREEZE. Hence fixing the common code and
  XFS, and I'll start working on the ext4/btrfs folks if this is merged.

  Summary:

   - Fix a bug where callers of ->sync_fs (e.g. sync_filesystem and
     syncfs(2)) ignore the return value.

   - Fix a bug where callers of sync_filesystem (e.g. fs freeze) ignore
     the return value.

   - Fix a bug in XFS where xfs_fs_sync_fs never passed back error
     returns"

* tag 'vfs-5.17-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: return errors in xfs_fs_sync_fs
  quota: make dquot_quota_sync return errors from ->sync_fs
  vfs: make sync_filesystem return errors from ->sync_fs
  vfs: make freeze_super abort when sync_filesystem returns error
2022-02-05 09:13:51 -08:00
Christoph Hellwig
49add4966d block: pass a block_device and opf to bio_init
Pass the block_device that we plan to use this bio for and the
operation to bio_init to optimize the assignment.  A NULL block_device
can be passed, both for the passthrough case on a raw request_queue and
to temporarily avoid refactoring some nasty code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-19-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:49:59 -07:00
Christoph Hellwig
07888c665b block: pass a block_device and opf to bio_alloc
Pass the block_device and operation that we plan to use this bio for to
bio_alloc to optimize the assignment.  NULL/0 can be passed, both for the
passthrough case on a raw request_queue and to temporarily avoid
refactoring some nasty code.

Also move the gfp_mask argument after the nr_vecs argument for a much
more logical calling convention matching what most of the kernel does.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20220124091107.642561-18-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-02-02 07:49:59 -07:00
Dave Chinner
cea267c235 xfs: ensure log flush at the end of a synchronous fallocate call
Since we've started treating fallocate more like a file write, we
should flush the log to disk if the user has asked for synchronous
writes either by setting it via fcntl flags, or inode flags, or with
the sync mount option.  We've already got a helper for this, so use
it.

[The original patch by Darrick was massaged by Dave to fit this patchset]

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Dave Chinner
b39a04636f xfs: move xfs_update_prealloc_flags() to xfs_pnfs.c
The operations that xfs_update_prealloc_flags() perform are now
unique to xfs_fs_map_blocks(), so move xfs_update_prealloc_flags()
to be a static function in xfs_pnfs.c and cut out all the
other functionality that is doesn't use anymore.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Dave Chinner
0b02c8c0d7 xfs: set prealloc flag in xfs_alloc_file_space()
Now that we only call xfs_update_prealloc_flags() from
xfs_file_fallocate() in the case where we need to set the
preallocation flag, do this in xfs_alloc_file_space() where we
already have the inode joined into a transaction and get
rid of the call to xfs_update_prealloc_flags() from the fallocate
code.

This also means that we now correctly avoid setting the
XFS_DIFLAG_PREALLOC flag when xfs_is_always_cow_inode() is true, as
these inodes will never have preallocated extents.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Dave Chinner
fbe7e52003 xfs: fallocate() should call file_modified()
In XFS, we always update the inode change and modification time when
any fallocate() operation succeeds.  Furthermore, as various
fallocate modes can change the file contents (extending EOF,
punching holes, zeroing things, shifting extents), we should drop
file privileges like suid just like we do for a regular write().
There's already a VFS helper that figures all this out for us, so
use that.

The net effect of this is that we no longer drop suid/sgid if the
caller is root, but we also now drop file capabilities.

We also move the xfs_update_prealloc_flags() function so that it now
is only called by the scope that needs to set the the prealloc flag.

Based on a patch from Darrick Wong.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Dave Chinner
472c6e46f5 xfs: remove XFS_PREALLOC_SYNC
Callers can acheive the same thing by calling xfs_log_force_inode()
after making their modifications. There is no need for
xfs_update_prealloc_flags() to do this.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-02-01 14:14:48 -08:00
Darrick J. Wong
29d650f7e3 xfs: reject crazy array sizes being fed to XFS_IOC_GETBMAP*
Syzbot tripped over the following complaint from the kernel:

WARNING: CPU: 2 PID: 15402 at mm/util.c:597 kvmalloc_node+0x11e/0x125 mm/util.c:597

While trying to run XFS_IOC_GETBMAP against the following structure:

struct getbmap fubar = {
	.bmv_count	= 0x22dae649,
};

Obviously, this is a crazy huge value since the next thing that the
ioctl would do is allocate 37GB of memory.  This is enough to make
kvmalloc mad, but isn't large enough to trip the validation functions.
In other words, I'm fussing with checks that were **already sufficient**
because that's easier than dealing with 644 internal bug reports.  Yes,
that's right, six hundred and forty-four.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com>
2022-01-31 13:20:58 -08:00
Darrick J. Wong
2d86293c70 xfs: return errors in xfs_fs_sync_fs
Now that the VFS will do something with the return values from
->sync_fs, make ours pass on error codes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <brauner@kernel.org>
2022-01-30 08:59:47 -08:00
Dave Chinner
ebb7fb1557 xfs, iomap: limit individual ioend chain lengths in writeback
Trond Myklebust reported soft lockups in XFS IO completion such as
this:

 watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
 CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
 Workqueue: xfs-conv/md127 xfs_end_io [xfs]
 RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
 Call Trace:
  wake_up_page_bit+0x8a/0x110
  iomap_finish_ioend+0xd7/0x1c0
  iomap_finish_ioends+0x7f/0xb0
  xfs_end_ioend+0x6b/0x100 [xfs]
  xfs_end_io+0xb9/0xe0 [xfs]
  process_one_work+0x1a7/0x360
  worker_thread+0x1fa/0x390
  kthread+0x116/0x130
  ret_from_fork+0x35/0x40

Ioends are processed as an atomic completion unit when all the
chained bios in the ioend have completed their IO. Logically
contiguous ioends can also be merged and completed as a single,
larger unit.  Both of these things can be problematic as both the
bio chains per ioend and the size of the merged ioends processed as
a single completion are both unbound.

If we have a large sequential dirty region in the page cache,
write_cache_pages() will keep feeding us sequential pages and we
will keep mapping them into ioends and bios until we get a dirty
page at a non-sequential file offset. These large sequential runs
can will result in bio and ioend chaining to optimise the io
patterns. The pages iunder writeback are pinned within these chains
until the submission chaining is broken, allowing the entire chain
to be completed. This can result in huge chains being processed
in IO completion context.

We get deep bio chaining if we have large contiguous physical
extents. We will keep adding pages to the current bio until it is
full, then we'll chain a new bio to keep adding pages for writeback.
Hence we can build bio chains that map millions of pages and tens of
gigabytes of RAM if the page cache contains big enough contiguous
dirty file regions. This long bio chain pins those pages until the
final bio in the chain completes and the ioend can iterate all the
chained bios and complete them.

OTOH, if we have a physically fragmented file, we end up submitting
one ioend per physical fragment that each have a small bio or bio
chain attached to them. We do not chain these at IO submission time,
but instead we chain them at completion time based on file
offset via iomap_ioend_try_merge(). Hence we can end up with unbound
ioend chains being built via completion merging.

XFS can then do COW remapping or unwritten extent conversion on that
merged chain, which involves walking an extent fragment at a time
and running a transaction to modify the physical extent information.
IOWs, we merge all the discontiguous ioends together into a
contiguous file range, only to then process them individually as
discontiguous extents.

This extent manipulation is computationally expensive and can run in
a tight loop, so merging logically contiguous but physically
discontigous ioends gains us nothing except for hiding the fact the
fact we broke the ioends up into individual physical extents at
submission and then need to loop over those individual physical
extents at completion.

Hence we need to have mechanisms to limit ioend sizes and
to break up completion processing of large merged ioend chains:

1. bio chains per ioend need to be bound in length. Pure overwrites
go straight to iomap_finish_ioend() in softirq context with the
exact bio chain attached to the ioend by submission. Hence the only
way to prevent long holdoffs here is to bound ioend submission
sizes because we can't reschedule in softirq context.

2. iomap_finish_ioends() has to handle unbound merged ioend chains
correctly. This relies on any one call to iomap_finish_ioend() being
bound in runtime so that cond_resched() can be issued regularly as
the long ioend chain is processed. i.e. this relies on mechanism #1
to limit individual ioend sizes to work correctly.

3. filesystems have to loop over the merged ioends to process
physical extent manipulations. This means they can loop internally,
and so we break merging at physical extent boundaries so the
filesystem can easily insert reschedule points between individual
extent manipulations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-26 09:19:20 -08:00
Linus Torvalds
1cb69c8044 New code for 5.17:
- Minor cleanup of ioctl32 cruft
 - Clean up open coded inodegc workqueue function calls
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHpnxUACgkQ+H93GTRK
 tOtpig/9GzzlPSpbjGMU3BW+dfqvYMLV9YcB3bkZLXPJIb/D2UPXzOQd2SvM9ScJ
 XwM857EjrmV7XifEezbBvla33N9ToNv8/IVdmctcc2as4c6VzDXmkiyze7wD+2D/
 i9UJEOjEtaTvqioOMQJpfaATI7y/9N45UFfcjkbmc4lTP0xOvd1Owx9+ujrCYZ9e
 UccsC7ovUzHxxuK1uFhRMI0up1f5urdFA0wFKvEGsfhIzBthU6BDtYUjuzhC/vLV
 DxqPlQgkRqQfzB9loFQOiNNSC2kVF+SKPXZS7k/HiQIALDOKtFNnSanXG5oPB5+d
 DnuSCvJ+RfejmEoDtMydQu0KXN5fq9g7FodH3tAMPPNn8N1l+qJuC//+6VHZjvIB
 GhfHyWkpg/mzSAAqU0sqeGSvzG1byawYNUzueIIcEWLP1qg4O3QkXW7VJ+yuGb70
 gsO7CQjnW20CmJqeftNsBaBor114UjUJLBQsXR8uowKHq/L1MhNh7Ir6lTfQdti+
 S099EVjrrFV0ZVRou1ZyQ5ZWsdGNtz0p0hEEVEPG2ERHOKfbVWLwbL38s8JY7C56
 hBVwQRwZScdIb/SYTWxPXbBJNR/BXUyxrCikk66isQT0kwFCHXE67YaXbFklxybs
 uzKzVYEBPtYhvDeFTa53GEotGw/LDtZD07E58yowlJUqmJtMLM0=
 =EIqW
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "One of the patches removes some dead code from xfs_ioctl32.h and the
  other fixes broken workqueue flushing in the inode garbage collector.

   - Minor cleanup of ioctl32 cruft

   - Clean up open coded inodegc workqueue function calls"

* tag 'xfs-5.17-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: flush inodegc workqueue tasks before cancel
  xfs: remove unused xfs_ioctl32.h declarations
2022-01-22 11:04:27 +02:00
Linus Torvalds
31d949782e Withdraw the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl definitions.
Remove the header definitions for these ioctls.  The just-removed
 implementation has allowed callers to read stale disk contents for more
 than **21 years** and nobody noticed or complained, which implies a lack
 of users aside from exploit programs.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHlp28ACgkQ+H93GTRK
 tOt77Q/+KP7XtwaloGbkBfMzk9JhVh4wISagQ8BxZ5Jb8cHg9ekWJz585he9siNY
 mfcuaLh5onWiBRsfqo4RXg03X4E3c/U+Q3oqe405O3TZkK2LOZMjkPE1ijDdTej6
 V9WhCTCRGuJA01sKgFuuwFJxJRjIROE29FPgRQP9EvLtEITIxLbHMFZRYKpYh143
 EhNwzQQwwPg/L4/m1qmWfC3L+Z7qTm6EQhOnpyzxlKxyX7qXIjBHi6WvErcDeJ3F
 pS7v1fwcZxctrh7PrKwhXSrkbMmd5J3p5qI/MrCtGEKWNXk+rv6AC0n4gcXvpJ/v
 wL0OTyik9pwA8V0XPQcuWvQXmrm8vR2XvMok6gXkHB1jCfzYAwJsrHQbop4pyCFe
 U3HU46x0g7UFXY7jUjztD8YNIT1+B+ducetCCGAhI97HiQrSsqSvgvPFZNle7Cef
 Oheab4iIs1zUblNrVzyGCQmK42ankypxPbfrrtvLi7SFrLRAGXeWeqDf1RXJnt5b
 xrOqCe1hgXR4RJrkTPWiiQindLlhDuywfa+Q1Y5fYZatsTtgceE/HIOg80x4pPgR
 4Ip7hW9lsjoDckpu0bC0bvYiqhrYM1eztpUToYdy7FeOkQKkPHO9xm/m1tHbqzmi
 bF3hkBo6bLByXiY/ZXzrQGrErJ6OTdNVpsR1vYjoaycrQt6wznI=
 =hq3i
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull more xfs irix ioctl housecleaning from Darrick Wong:
 "Withdraw the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl definitions.

  This is the third and final of a series of small pull requests that
  perform some long overdue housecleaning of XFS ioctls. This time,
  we're withdrawing all variants of the ALLOCSP and FREESP ioctls from
  XFS' userspace API. This might be a little premature since we've only
  just removed the functionality, but as I pointed out in the last pull
  request, nobody (including fstests) noticed that it was broken for 20
  years.

  In response to the patch, we received a single comment from someone
  who stated that they 'augment' the ioctl for their own purposes, but
  otherwise acquiesced to the withdrawal. I still want to try to clobber
  these old ioctl definitions in 5.17.

  So remove the header definitions for these ioctls. The just-removed
  implementation has allowed callers to read stale disk contents for
  more than **21 years** and nobody noticed or complained, which implies
  a lack of users aside from exploit programs"

* tag 'xfs-5.17-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove the XFS_IOC_{ALLOC,FREE}SP* definitions
2022-01-21 08:51:48 +02:00
Linus Torvalds
d701a8ccac Remove the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl families.
Linux has always used fallocate as the space management system call,
 whereas these Irix legacy ioctls only ever worked on XFS, and have been
 the cause of recent stale data disclosure vulnerabilities.  As
 equivalent functionality is available elsewhere, remove the code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHlplQACgkQ+H93GTRK
 tOtAcRAAg11WggF9ycNLwnczUs4NmTV1cwhz8+eTuwr2yul3gl/mrO3MyjMmkrnm
 1rXjwg28GKtps04Ugh+8TTL+QkDn6Uteco27OZbmUf00a0MoC7JG4VkEQVtXjcaK
 zvevfutTH7Vnl49m+YBrLtonrTqmND46quKoPKsv0a5nlbXHSNMouUkayWXDSyOl
 8tRcNWLy76L+XCxEU21cD1NBw3Vr0mCiId4xTcbNFw3TUVAGoZgghzC2d/gHFiwN
 1PM7G51TKUNm3dybH0mt/jLF/fLsVxFnznnlW4bb/XzMuU4geqd0r1AQuIdbwZa9
 uB+PkFWwN5frTEFELYTamAa4LlAe2oQ0hmSGLfC/zEtPcOv4h6qHNgRsN9wfG+H9
 oYUeRY+2zHcD7jYJsaZZt5WCIDVncOlJMclRdpbpujkJzJX9ZjAi++PTgDxdMjFa
 egwDAvOdgijgtz8erN0gglJrqJzQQp6ByNtht5rZjHz7LkrWYtt57TOoS986pW7X
 /MwBLjT/4Xig/XaFVrmMohF3VPrG/eH/DpTnHotzQzZRYQWbKZwCgin6+kKC8cV8
 Y+eE1jKZunL4Ms/GmrxencNzsDSJtkKyR5LkHCqgH8YUPJM3vYDcleZY+UgEKq0a
 z0fw3MZvxM2jsUIk7+J8uQ8esSqUb5hNXkUJsUraUtG3Z6ZeaOg=
 =2QZ3
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs irix ioctl housecleaning from Darrick Wong:
 "Remove the XFS_IOC_ALLOCSP* and XFS_IOC_FREESP* ioctl families.

  This is the second of a series of small pull requests that perform
  some long overdue housecleaning of XFS ioctls. This time, we're
  vacating the implementation of all variants of the ALLOCSP and FREESP
  ioctls, which are holdovers from EFS in Irix, circa 1993. Roughly
  equivalent functionality have been available for both ioctls since
  2.6.25 (April 2008):

   - XFS_IOC_FREESP ftruncates a file.

   - XFS_IOC_ALLOCSP is the equivalent of fallocate.

  As noted in the fix patch for CVE 2021-4155, the ALLOCSP ioctl has
  been serving up stale disk blocks since 2000, and in 21 years
  **nobody** noticed. On those grounds I think it's safe to vacate the
  implementation.

  Note that we lose the ability to preallocate and truncate relative to
  the current file position, but as nobody's ever implemented that for
  the VFS, I conclude that it's not in high demand.

  Linux has always used fallocate as the space management system call,
  whereas these Irix legacy ioctls only ever worked on XFS, and have
  been the cause of recent stale data disclosure vulnerabilities. As
  equivalent functionality is available elsewhere, remove the code"

* tag 'xfs-5.17-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls
2022-01-21 08:47:25 +02:00
Linus Torvalds
12a8fb20f1 Withdraw the ioctl definition for the FSSETDM ioctl.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHlpbEACgkQ+H93GTRK
 tOs8MBAAjOyUswNqVMYgMB6eww7oT8j54jGdTmo+d+N2hIoX4OAV6GdCI6te/7DE
 yz6TtzbQ9fs9lDYsdRsMjfwr18R8N1m0D23cpFSJ6qq6e2NJi0mqVwmC8mPRNAJH
 68yL0p6O4Tk5jj7kae+zsv50nsSANXclfoHjsZQ0DGOuGagmPkNVqT82KIC4n3dn
 aw66xhuaiLLJE+4boLe4NexRVBbyOuHQ2uB7xUnKwc9tHvjAf8EFCIhUV1wp0qZf
 CfA2wg8+Jzwrqz/gVRKUZOjz7LeIY6E2qCBrA+DATv2dcv7QhmvGHaQ9OkrvIE72
 CbvI92IhvOcKFzpfMrRYGhOh7KE6SkxLGqsAXgnjPoFQkCDudgCaExBO96RMMd6u
 cX43mXWZbUl++Sh2GhPD/xkiskLRZFjiHJbKBX/5nwjU2BzTHQY/7Jy07fIkR4c4
 IrkKgiXfSJT4j/KeAMkBpZ7THMjRMSUgwliSWHL0QWUz5Bou8WRnHUl8CMsu9vDJ
 fYeekXDQYuAX+UrcsDlbA0UukigOLSIZiQTAEgSbIkd/+Zb6U6e0IF7pTTZJ9uFs
 bndLFYqZtEAySDrMCBM+W8VYmR48EDxfN8xsdS1kbZIqEdNhmkEMj9tMf0rs+FRi
 lo1vMi08O7VcuyiyNrKs0e1d1Gkd2jwmwIskSQweslP5BbfPzRE=
 =d+fT
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs ioctl housecleaning from Darrick Wong:
 "This is the first of a series of small pull requests that perform some
  long overdue housecleaning of XFS ioctls. This first pull request
  removes the FSSETDM ioctl, which was used to set DMAPI event
  attributes on XFS files. The DMAPI support has never been merged
  upstream and the implementation of FSSETDM itself was removed two
  years ago, so let's withdraw it completely.

   - Withdraw the ioctl definition for the FSSETDM ioctl"

* tag 'xfs-5.17-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove the XFS_IOC_FSSETDM definitions
2022-01-21 08:44:07 +02:00
Brian Foster
6191cf3ad5 xfs: flush inodegc workqueue tasks before cancel
The xfs_inodegc_stop() helper performs a high level flush of pending
work on the percpu queues and then runs a cancel_work_sync() on each
of the percpu work tasks to ensure all work has completed before
returning.  While cancel_work_sync() waits for wq tasks to complete,
it does not guarantee work tasks have started. This means that the
_stop() helper can queue and instantly cancel a wq task without
having completed the associated work. This can be observed by
tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
test:

	xfs_destroy_inode: ... ino 0x83 ...
	xfs_inode_set_need_inactive: ... ino 0x83 ...
	xfs_inodegc_stop: ...
	...
	xfs_inodegc_start: ...
	xfs_inodegc_worker: ...
	xfs_inode_inactivating: ... ino 0x83 ...

The first few lines show that the inode is removed and need inactive
state set, but the inactivation work has not completed before the
inodegc mechanism stops. The inactivation doesn't actually occur
until the fs is unfrozen and the gc mechanism starts back up. Note
that this test requires fsfreeze to reproduce because xfs_freeze
indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().

When this occurs, the workqueue try_to_grab_pending() logic first
tries to steal the pending bit, which does not succeed because the
bit has been set by queue_work_on(). Subsequently, it checks for
association of a pool workqueue from the work item under the pool
lock. This association is set at the point a work item is queued and
cleared when dequeued for processing. If the association exists, the
work item is removed from the queue and cancel_work_sync() returns
true. If the pwq association is cleared, the remove attempt assumes
the task is busy and retries (eventually returning false to the
caller after waiting for the work task to complete).

To avoid this race, we can flush each work item explicitly before
cancel. However, since the _queue_all() already schedules each
underlying work item, the workqueue level helpers are sufficient to
achieve the same ordering effect. E.g., the inodegc enabled flag
prevents scheduling any further work in the _stop() case. Use the
drain_workqueue() helper in this particular case to make the intent
a bit more self explanatory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-19 14:58:26 -08:00
Darrick J. Wong
a8e422af69 xfs: remove unused xfs_ioctl32.h declarations
Remove these unused ia32 compat declarations; all the bits involved have
either been withdrawn or hoisted to the VFS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2022-01-18 10:18:36 -08:00
Darrick J. Wong
b3bb9413e7 xfs: remove the XFS_IOC_{ALLOC,FREE}SP* definitions
Now that we've made these ioctls defunct, move them from xfs_fs.h to
xfs_ioctl.c, which effectively removes them from the publicly supported
ioctl interfaces for XFS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2022-01-17 09:17:11 -08:00
Darrick J. Wong
4d1b97f9ce xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls
According to the glibc compat header for Irix 4, these ioctls originated
in April 1991 as a (somewhat clunky) way to preallocate space at the end
of a file on an EFS filesystem.  XFS, which was released in Irix 5.3 in
December 1993, picked up these ioctls to maintain compatibility and they
were ported to Linux in the early 2000s.

Recently it was pointed out to me they still lurk in the kernel, even
though the Linux fallocate syscall supplanted the functionality a long
time ago.  fstests doesn't seem to include any real functional or stress
tests for these ioctls, which means that the code quality is ... very
questionable.  Most notably, it was a stale disk block exposure vector
for 21 years and nobody noticed or complained.  As mature programmers
say, "If you're not testing it, it's broken."

Given all that, let's withdraw these ioctls from the XFS userspace API.
Normally we'd set a long deprecation process, but I estimate that there
aren't any real users, so let's trigger a warning in dmesg and return
-ENOTTY.

See: CVE-2021-4155

Augments: 983d8e60f5 ("xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-17 09:16:41 -08:00
Darrick J. Wong
9dec0368b9 xfs: remove the XFS_IOC_FSSETDM definitions
Remove the definitions for these ioctls, since the functionality (and,
weirdly, the 32-bit compat ioctl definitions) were removed from the
kernel in November 2019.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-17 09:16:40 -08:00
Linus Torvalds
f56caedaf9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
 "146 patches.

  Subsystems affected by this patch series: kthread, ia64, scripts,
  ntfs, squashfs, ocfs2, vfs, and mm (slab-generic, slab, kmemleak,
  dax, kasan, debug, pagecache, gup, shmem, frontswap, memremap,
  memcg, selftests, pagemap, dma, vmalloc, memory-failure, hugetlb,
  userfaultfd, vmscan, mempolicy, oom-kill, hugetlbfs, migration, thp,
  ksm, page-poison, percpu, rmap, zswap, zram, cleanups, hmm, and
  damon)"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (146 commits)
  mm/damon: hide kernel pointer from tracepoint event
  mm/damon/vaddr: hide kernel pointer from damon_va_three_regions() failure log
  mm/damon/vaddr: use pr_debug() for damon_va_three_regions() failure logging
  mm/damon/dbgfs: remove an unnecessary variable
  mm/damon: move the implementation of damon_insert_region to damon.h
  mm/damon: add access checking for hugetlb pages
  Docs/admin-guide/mm/damon/usage: update for schemes statistics
  mm/damon/dbgfs: support all DAMOS stats
  Docs/admin-guide/mm/damon/reclaim: document statistics parameters
  mm/damon/reclaim: provide reclamation statistics
  mm/damon/schemes: account how many times quota limit has exceeded
  mm/damon/schemes: account scheme actions that successfully applied
  mm/damon: remove a mistakenly added comment for a future feature
  Docs/admin-guide/mm/damon/usage: update for kdamond_pid and (mk|rm)_contexts
  Docs/admin-guide/mm/damon/usage: mention tracepoint at the beginning
  Docs/admin-guide/mm/damon/usage: remove redundant information
  Docs/admin-guide/mm/damon/usage: update for scheme quotas and watermarks
  mm/damon: convert macro functions to static inline functions
  mm/damon: modify damon_rand() macro to static inline function
  mm/damon: move damon_rand() definition into damon.h
  ...
2022-01-15 20:37:06 +02:00
NeilBrown
4034247a0d mm: introduce memalloc_retry_wait()
Various places in the kernel - largely in filesystems - respond to a
memory allocation failure by looping around and re-trying.  Some of
these cannot conveniently use __GFP_NOFAIL, for reasons such as:

 - a GFP_ATOMIC allocation, which __GFP_NOFAIL doesn't work on
 - a need to check for the process being signalled between failures
 - the possibility that other recovery actions could be performed
 - the allocation is quite deep in support code, and passing down an
   extra flag to say if __GFP_NOFAIL is wanted would be clumsy.

Many of these currently use congestion_wait() which (in almost all
cases) simply waits the given timeout - congestion isn't tracked for
most devices.

It isn't clear what the best delay is for loops, but it is clear that
the various filesystems shouldn't be responsible for choosing a timeout.

This patch introduces memalloc_retry_wait() with takes on that
responsibility.  Code that wants to retry a memory allocation can call
this function passing the GFP flags that were used.  It will wait
however is appropriate.

For now, it only considers __GFP_NORETRY and whatever
gfpflags_allow_blocking() tests.  If blocking is allowed without
__GFP_NORETRY, then alloc_page either made some reclaim progress, or
waited for a while, before failing.  So there is no need for much
further waiting.  memalloc_retry_wait() will wait until the current
jiffie ends.  If this condition is not met, then alloc_page() won't have
waited much if at all.  In that case memalloc_retry_wait() waits about
200ms.  This is the delay that most current loops uses.

linux/sched/mm.h needs to be included in some files now,
but linux/backing-dev.h does not.

Link: https://lkml.kernel.org/r/163754371968.13692.1277530886009912421@noble.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Chao Yu <chao@kernel.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-01-15 16:30:29 +02:00
Linus Torvalds
a33f5c380c New code for 5.17:
- Fix a minor locking inconsistency in readdir
  - Fix incorrect fs feature bit validation for secondary superblocks
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHfFFYACgkQ+H93GTRK
 tOvrQA//RnDMit4zLoPo3WeZTDjpv1CQkOXFgQSba+Owgz1RvGn5f29R4ICKW6Uy
 NfnPcq5AeNhysOAi0uZPl4EC6140QgMrD4aJjKY1JAc/BkeBVEN05NOUFfCuBXU/
 RMzL8nGD5zsN7fPIWVGS60iitou8IvoX8ky4dKx7XcsbFzbBMtnJIsUpfqVutY9u
 i+zub6sNRkstr4uBRk+1S8uAqHUAW+21YwfKqB6pgpCoO5BHu2e4eN01ohWvF8Ru
 0ujJ2j4YlfjYmtQypFk3rNgQoI0oXY6mYWZPKr7fYvhDpoKEodUvHRLIJEHqoD+y
 fX6Ey5XxpHmDxSJnWRC7Vznl56VEmMndUcWEq2ZqROp/r/zZp7StXyHLO7DR9nEs
 mp+55a4tcKhHa/KnjbqexAaN4a1NTpryMqjsPHP7VTNu5Dq8CK4kHtrcEVrKCZFq
 ExRFMfoUDxap6iJaxoKAz5CtZyJeuZO8bLCa7jq/2F91EWp+2aclxEU6VY5r6B2X
 dFlIY8XnZEfJxE3xnhH/aDs6IKWH4YmvgtwxNb+RIupyMTfJzpcysjV9NRER8fAv
 9rLfWNz+nx0efNyXle+h+vrzT/zXgyi/0PSjAQS9/xErTPRFAmXFefLl/0oMQSHC
 XFIVivcR0lxKpZhB5CIOfVj1Vu8VM52JvlXjHPU3ObqEbFDxtf4=
 =RYJ4
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:
 "These are the last few obvious fixes that I found while stress testing
  online fsck for XFS prior to initiating a design review of the whole
  giant machinery.

   - Fix a minor locking inconsistency in readdir

   - Fix incorrect fs feature bit validation for secondary superblocks"

* tag 'xfs-5.17-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix online fsck handling of v5 feature bits on secondary supers
  xfs: take the ILOCK when readdir inspects directory mapping data
2022-01-15 07:47:40 +02:00
Linus Torvalds
3acbdbf42e dax + libnvdimm for v5.17
- Simplify the dax_operations API
   - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining
     and applying a partition offset to all its DAX iomap operations.
   - Remove wrappers and device-mapper stacked callbacks for
     ->copy_from_iter() and ->copy_to_iter() in favor of moving
     block_device relative offset responsibility to the
     dax_direct_access() caller.
   - Remove the need for an @bdev in filesystem-DAX infrastructure
   - Remove unused uio helpers copy_from_iter_flushcache() and
     copy_mc_to_iter() as only the non-check_copy_size() versions are
     used for DAX.
 - Prepare XFS for the pending (next merge window) DAX+reflink support
 - Remove deprecated DEV_DAX_PMEM_COMPAT support
 - Cleanup a straggling misuse of the GUID api
 
 Tags offered after the branch was cut:
 Reviewed-by: Mike Snitzer <snitzer@redhat.com>
 Link: https://lore.kernel.org/r/Ydb/3P+8nvjCjYfO@redhat.com
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYd3dTAAKCRDfioYZHlFs
 Z//UAP9zetoTE+O7zJG7CXja4jSopSadbdbh6QKSXaqfKBPvQQD+N4US3wA2bGv8
 f/qCY62j2Hj3hUTGHs9RvTyw3JsSYAA=
 =QvDs
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull dax and libnvdimm updates from Dan Williams:
 "The bulk of this is a rework of the dax_operations API after
  discovering the obstacles it posed to the work-in-progress DAX+reflink
  support for XFS and other copy-on-write filesystem mechanics.

  Primarily the need to plumb a block_device through the API to handle
  partition offsets was a sticking point and Christoph untangled that
  dependency in addition to other cleanups to make landing the
  DAX+reflink support easier.

  The DAX_PMEM_COMPAT option has been around for 4 years and not only
  are distributions shipping userspace that understand the current
  configuration API, but some are not even bothering to turn this option
  on anymore, so it seems a good time to remove it per the deprecation
  schedule. Recall that this was added after the device-dax subsystem
  moved from /sys/class/dax to /sys/bus/dax for its sysfs organization.
  All recent functionality depends on /sys/bus/dax.

  Some other miscellaneous cleanups and reflink prep patches are
  included as well.

  Summary:

   - Simplify the dax_operations API:

      - Eliminate bdev_dax_pgoff() in favor of the filesystem
        maintaining and applying a partition offset to all its DAX iomap
        operations.

      - Remove wrappers and device-mapper stacked callbacks for
        ->copy_from_iter() and ->copy_to_iter() in favor of moving
        block_device relative offset responsibility to the
        dax_direct_access() caller.

      - Remove the need for an @bdev in filesystem-DAX infrastructure

      - Remove unused uio helpers copy_from_iter_flushcache() and
        copy_mc_to_iter() as only the non-check_copy_size() versions are
        used for DAX.

   - Prepare XFS for the pending (next merge window) DAX+reflink support

   - Remove deprecated DEV_DAX_PMEM_COMPAT support

   - Cleanup a straggling misuse of the GUID api"

* tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits)
  iomap: Fix error handling in iomap_zero_iter()
  ACPI: NFIT: Import GUID before use
  dax: remove the copy_from_iter and copy_to_iter methods
  dax: remove the DAXDEV_F_SYNC flag
  dax: simplify dax_synchronous and set_dax_synchronous
  uio: remove copy_from_iter_flushcache() and copy_mc_to_iter()
  iomap: turn the byte variable in iomap_zero_iter into a ssize_t
  memremap: remove support for external pgmap refcounts
  fsdax: don't require CONFIG_BLOCK
  iomap: build the block based code conditionally
  dax: fix up some of the block device related ifdefs
  fsdax: shift partition offset handling into the file systems
  dax: return the partition offset from fs_dax_get_by_bdev
  iomap: add a IOMAP_DAX flag
  xfs: pass the mapping flags to xfs_bmbt_to_iomap
  xfs: use xfs_direct_write_iomap_ops for DAX zeroing
  xfs: move dax device handling into xfs_{alloc,free}_buftarg
  ext4: cleanup the dax handling in ext4_fill_super
  ext2: cleanup the dax handling in ext2_fill_super
  fsdax: decouple zeroing from the iomap buffered I/O code
  ...
2022-01-12 15:46:11 -08:00
Linus Torvalds
f079ab01b5 Convert xfs/iomap to use folios
This should be all that is needed for XFS to use large folios.
 There is no code in this pull request to create large folios, but
 no additional changes should be needed to XFS or iomap once they
 are created.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmHcpaUACgkQDpNsjXcp
 gj4MUAf+ItcKfgFo1QCMT+6Y0mohVqPme/vdyOCNv6yOOfZZqN5ZQc+2hmxXrRz9
 XPOPwZKL0TttlHSYEJmrm8mqwN8UXl0kqMu4kQqOXMziiD9qpVlaLXOZ7iLdkQxu
 z/xe1iACcGfJUaQCsaMP6BZqp6iETA4qP72dBE4jc6PC4H3OI0pN/900gEbAcLxD
 Yn0a5NhrdS/EySU2aHLB6OcwhqnSiHBVjUbFiuXxuvOYyzLaERIh00Kx3jLdj4DR
 82K4TF8h2IZpALfIDSt0JG+gHLCc+EfF7Yd/xkeEv0md3ncyi+jWvFCFPNJbyFjm
 cYoDTSunfbxwszA2n01R4JM8/KkGwA==
 =IeFX
 -----END PGP SIGNATURE-----

Merge tag 'iomap-5.17' of git://git.infradead.org/users/willy/linux

Pull iomap updates from Matthew Wilcox:
 "Convert xfs/iomap to use folios.

  This should be all that is needed for XFS to use large folios. There
  is no code in this pull request to create large folios, but no
  additional changes should be needed to XFS or iomap once they are
  created.

  Usually this would have come from Darrick, and we had intended that it
  would come that route. Between the holidays and various things which
  Darrick needed to work on, he asked if I could send things directly.

  There weren't any other iomap patches pending for this release, which
  probably also played a role"

* tag 'iomap-5.17' of git://git.infradead.org/users/willy/linux: (26 commits)
  iomap: Inline __iomap_zero_iter into its caller
  xfs: Support large folios
  iomap: Support large folios in invalidatepage
  iomap: Convert iomap_migrate_page() to use folios
  iomap: Convert iomap_add_to_ioend() to take a folio
  iomap: Simplify iomap_do_writepage()
  iomap: Simplify iomap_writepage_map()
  iomap,xfs: Convert ->discard_page to ->discard_folio
  iomap: Convert iomap_write_end_inline to take a folio
  iomap: Convert iomap_write_begin() and iomap_write_end() to folios
  iomap: Convert __iomap_zero_iter to use a folio
  iomap: Allow iomap_write_begin() to be called with the full length
  iomap: Convert iomap_page_mkwrite to use a folio
  iomap: Convert readahead and readpage to use a folio
  iomap: Convert iomap_read_inline_data to take a folio
  iomap: Use folio offsets instead of page offsets
  iomap: Convert bio completions to use folios
  iomap: Pass the iomap_page into iomap_set_range_uptodate
  iomap: Add iomap_invalidate_folio
  iomap: Convert iomap_releasepage to use a folio
  ...
2022-01-12 12:51:41 -08:00
Darrick J. Wong
4a9bca8680 xfs: fix online fsck handling of v5 feature bits on secondary supers
While I was auditing the code in xfs_repair that adds feature bits to
existing V5 filesystems, I decided to have a look at how online fsck
handles feature bits, and I found a few problems:

1) ATTR2 is added to the primary super when an xattr is set to a file,
but that isn't consistently propagated to secondary supers.  This isn't
a corruption, merely a discrepancy that repair will fix if it ever has
to restore the primary from a secondary.  Hence, if we find a mismatch
on a secondary, this is a preen condition, not a corruption.

2) There are more compat and ro_compat features now than there used to
be, but we mask off the newer features from testing.  This means we
ignore inconsistencies in the INOBTCOUNT and BIGTIME features, which is
wrong.  Get rid of the masking and compare directly.

3) NEEDSREPAIR, when set on a secondary, is ignored by everyone.  Hence
a mismatch here should also be flagged for preening, and online repair
should clear the flag.  Right now we ignore it due to (2).

4) log_incompat features are ephemeral, since we can clear the feature
bit as soon as the log no longer contains live records for a particular
log feature.  As such, the only copy we care about is the one in the
primary super.  If we find any bits set in the secondary super, we
should flag that for preening, and clear the bits if the user elects to
repair it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-12 09:45:21 -08:00
Darrick J. Wong
65552b02a1 xfs: take the ILOCK when readdir inspects directory mapping data
I was poking around in the directory code while diagnosing online fsck
bugs, and noticed that xfs_readdir doesn't actually take the directory
ILOCK when it calls xfs_dir2_isblock.  xfs_dir_open most probably loaded
the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so
we're protected against writer threads, but we really need to follow the
locking model like we do in other places.

To avoid unnecessarily cycling the ILOCK for fairly small directories,
change the block/leaf _getdents functions to consume the ILOCK hold that
the parent readdir function took to decide on a _getdents implementation.

It is ok to cycle the ILOCK in readdir because the VFS takes the IOLOCK
in the appropriate mode during lookups and writes, and we don't want to
be holding the ILOCK when we copy directory entries to userspace in case
there's a page fault.  We really only need it to protect against data
fork lookups, like we do for other files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-11 15:11:04 -08:00
Linus Torvalds
11fc88c2e4 New code for 5.17:
- Fix log recovery with da btree buffers when metauuid is in use.
  - Fix type coercion problems in xattr buffer size validation.
  - Fix a bug in online scrub dir leaf bestcount checking.
  - Only run COW recovery when recovering the log.
  - Fix symlink target buffer UAF problems and symlink locking problems
    by not exposing xfs innards to the VFS.
  - Fix incorrect quotaoff lock usage.
  - Don't let transactions cancel cleanly if they have deferred work
    items attached.
  - Fix a UAF when we're deciding if we need to relog an intent item.
  - Reduce kvmalloc overhead for log shadow buffers.
  - Clean up sysfs attr group usage.
  - Fix a bug where scrub's bmap/rmap checking could race with a quota
    file block allocation due to insufficient locking.
  - Teach scrub to complain about invalid project ids.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmHXOL0ACgkQ+H93GTRK
 tOulDQ/+PwvFyROQePQo+5Z7fDb92VAKX6UxFYzWZGXj3zhRmTX7fvNSniPxKqHO
 K0GMnZ9rJxvuo/MRuSEPILl+2rKh59Efo9IPcs2TR+xO+vI/EKnch2/gPLBN1y7Y
 A4peQuK6CvUjYTyMazR2PcK8y76ZW8iXyuQMxnpmYldZBCZ+v7myVoVlPCRrZ68W
 EoEQ/cDf4fmsycSG4Eh9WTddbqd9bfl1+jXEz/pzlbhJSvQOaqynR+21ahxhEQ/f
 CzjSM4kXQ9AWTfHOPnEXs+Hf02NfcKRMuguzO69gJZhooEXJlgeYAZXUKqXpdmbC
 2Tc+x/rJnQv9PtzEw/WoWMem4rhwjBX7cVi2E/SJHB46MLI5rjhQilIdWzecWd8q
 nOrjIZlReI0vxYSqe1ifejxiCsV6gv3XdyLNacjxf7ISkU4XF0caO8f+3/ocIScV
 E7k+XN13pArUiRZXOt8UL571Hoa6nhtHjyAUg6i62AFI2R9XYUEjEXKMwX7ovRVO
 8eGdeQqioHkmOkF8HGyyurB9+itccuW8dtl7YOqEQJMGXeyk7rsvpcPahKKKoiDr
 8w3bpZlNiOfZEnRcNjO9B/Egu6LLMudti2ptwlUvfvWsfa8VTO8v0h5pqVhwuwrk
 lwQThWJ5THe/MptWybdZ3kSjhMvzqi4GO75T30Hj6XhecC3dFfo=
 =RYEb
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.17-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "The big new feature here is that the mount code now only bothers to
  try to free stale COW staging extents if the fs unmounted uncleanly.
  This should reduce mount times, particularly on filesystems supporting
  reflink and containing a large number of allocation groups.

  Everything else this cycle are bugfixes, as the iomap folios
  conversion should be plenty enough excitement for anyone. That and I
  ran out of brain bandwidth after Thanksgiving last year.

  Summary:

   - Fix log recovery with da btree buffers when metauuid is in use.

   - Fix type coercion problems in xattr buffer size validation.

   - Fix a bug in online scrub dir leaf bestcount checking.

   - Only run COW recovery when recovering the log.

   - Fix symlink target buffer UAF problems and symlink locking problems
     by not exposing xfs innards to the VFS.

   - Fix incorrect quotaoff lock usage.

   - Don't let transactions cancel cleanly if they have deferred work
     items attached.

   - Fix a UAF when we're deciding if we need to relog an intent item.

   - Reduce kvmalloc overhead for log shadow buffers.

   - Clean up sysfs attr group usage.

   - Fix a bug where scrub's bmap/rmap checking could race with a quota
     file block allocation due to insufficient locking.

   - Teach scrub to complain about invalid project ids"

* tag 'xfs-5.17-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: warn about inodes with project id of -1
  xfs: hold quota inode ILOCK_EXCL until the end of dqalloc
  xfs: Remove redundant assignment of mp
  xfs: reduce kvmalloc overhead for CIL shadow buffers
  xfs: sysfs: use default_groups in kobj_type
  xfs: prevent UAF in xfs_log_item_in_current_chkpt
  xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()
  xfs: Fix comments mentioning xfs_ialloc
  xfs: check sb_meta_uuid for dabuf buffer recovery
  xfs: fix a bug in the online fsck directory leaf1 bestcount check
  xfs: only run COW extent recovery when there are no live extents
  xfs: don't expose internal symlink metadata buffers to the vfs
  xfs: fix quotaoff mutex usage now that we don't support disabling it
  xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
2022-01-11 15:01:50 -08:00
Linus Torvalds
5dfbfe71e3 fs.idmapped.v5.17
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYdRCkgAKCRCRxhvAZXjc
 olrvAQCdp8LWkT8TauJSl8wmUm3mZhNy+5+fXuCUSwe3PyUtTQEAq4fxm41JpG8u
 WCZTrrxVhaXwgUY3aWzzeQnLCZjtEQw=
 =woqV
 -----END PGP SIGNATURE-----

Merge tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull fs idmapping updates from Christian Brauner:
 "This contains the work to enable the idmapping infrastructure to
  support idmapped mounts of filesystems mounted with an idmapping.

  In addition this contains various cleanups that avoid repeated
  open-coding of the same functionality and simplify the code in quite a
  few places.

  We also finish the renaming of the mapping helpers we started a few
  kernel releases back and move them to a dedicated header to not
  continue polluting the fs header needlessly with low-level idmapping
  helpers. With this series the fs header only contains idmapping
  helpers that interact with fs objects.

  Currently we only support idmapped mounts for filesystems mounted
  without an idmapping themselves. This was a conscious decision
  mentioned in multiple places (cf. [1]).

  As explained at length in [3] it is perfectly fine to extend support
  for idmapped mounts to filesystem's mounted with an idmapping should
  the need arise. The need has been there for some time now (cf. [2]).

  Before we can port any filesystem that is mountable with an idmapping
  to support idmapped mounts in the coming cycles, we need to first
  extend the mapping helpers to account for the filesystem's idmapping.
  This again, is explained at length in our documentation at [3] and
  also in the individual commit messages so here's an overview.

  Currently, the low-level mapping helpers implement the remapping
  algorithms described in [3] in a simplified manner as we could rely on
  the fact that all filesystems supporting idmapped mounts are mounted
  without an idmapping.

  In contrast, filesystems mounted with an idmapping are very likely to
  not use an identity mapping and will instead use a non-identity
  mapping. So the translation step from or into the filesystem's
  idmapping in the remapping algorithm cannot be skipped for such
  filesystems.

  Non-idmapped filesystems and filesystems not supporting idmapped
  mounts are unaffected by this change as the remapping algorithms can
  take the same shortcut as before. If the low-level helpers detect that
  they are dealing with an idmapped mount but the underlying filesystem
  is mounted without an idmapping we can rely on the previous shortcut
  and can continue to skip the translation step from or into the
  filesystem's idmapping. And of course, if the low-level helpers detect
  that they are not dealing with an idmapped mount they can simply
  return the relevant id unchanged; no remapping needs to be performed
  at all.

  These checks guarantee that only the minimal amount of work is
  performed. As before, if idmapped mounts aren't used the low-level
  helpers are idempotent and no work is performed at all"

Link: 2ca4dcc490 ("fs/mount_setattr: tighten permission checks") [1]
Link: https://github.com/containers/podman/issues/10374 [2]
Link: Documentations/filesystems/idmappings.rst [3]
Link: a65e58e791 ("fs: document and rename fsid helpers") [4]

* tag 'fs.idmapped.v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  fs: support mapped mounts of mapped filesystems
  fs: add i_user_ns() helper
  fs: port higher-level mapping helpers
  fs: remove unused low-level mapping helpers
  fs: use low-level mapping helpers
  docs: update mapping documentation
  fs: account for filesystem mappings
  fs: tweak fsuidgid_has_mapping()
  fs: move mapping helpers
  fs: add is_idmapped_mnt() helper
2022-01-11 14:26:55 -08:00
Darrick J. Wong
7e937bb3cb xfs: warn about inodes with project id of -1
Inodes aren't supposed to have a project id of -1U (aka 4294967295) but
the kernel hasn't always validated FSSETXATTR correctly.  Flag this as
something for the sysadmin to check out.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06 10:43:30 -08:00
Darrick J. Wong
eae44cb341 xfs: hold quota inode ILOCK_EXCL until the end of dqalloc
Online fsck depends on callers holding ILOCK_EXCL from the time they
decide to update a block mapping until after they've updated the reverse
mapping records to guarantee the stability of both mapping records.
Unfortunately, the quota code drops ILOCK_EXCL at the first transaction
roll in the dquot allocation process, which breaks that assertion.  This
leads to sporadic failures in the online rmap repair code if the repair
code grabs the AGF after bmapi_write maps a new block into the quota
file's data fork but before it can finish the deferred rmap update.

Fix this by rewriting the function to hold the ILOCK until after the
transaction commit like all other bmap updates do, and get rid of the
dqread wrapper that does nothing but complicate the codebase.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06 10:43:30 -08:00
Jiapeng Chong
f4901a182d xfs: Remove redundant assignment of mp
mp is being initialized to log->l_mp but this is never read
as record is overwritten later on. Remove the redundant
assignment.

Cleans up the following clang-analyzer warning:

fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during
its initialization is never read [clang-analyzer-deadcode.DeadStores].

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06 10:43:30 -08:00
Dave Chinner
8dc9384b7d xfs: reduce kvmalloc overhead for CIL shadow buffers
Oh, let me count the ways that the kvmalloc API sucks dog eggs.

The problem is when we are logging lots of large objects, we hit
kvmalloc really damn hard with costly order allocations, and
behaviour utterly sucks:

     - 49.73% xlog_cil_commit
	 - 31.62% kvmalloc_node
	    - 29.96% __kmalloc_node
	       - 29.38% kmalloc_large_node
		  - 29.33% __alloc_pages
		     - 24.33% __alloc_pages_slowpath.constprop.0
			- 18.35% __alloc_pages_direct_compact
			   - 17.39% try_to_compact_pages
			      - compact_zone_order
				 - 15.26% compact_zone
				      5.29% __pageblock_pfn_to_page
				      3.71% PageHuge
				    - 1.44% isolate_migratepages_block
					 0.71% set_pfnblock_flags_mask
				   1.11% get_pfnblock_flags_mask
			   - 0.81% get_page_from_freelist
			      - 0.59% _raw_spin_lock_irqsave
				 - do_raw_spin_lock
				      __pv_queued_spin_lock_slowpath
			- 3.24% try_to_free_pages
			   - 3.14% shrink_node
			      - 2.94% shrink_slab.constprop.0
				 - 0.89% super_cache_count
				    - 0.66% xfs_fs_nr_cached_objects
				       - 0.65% xfs_reclaim_inodes_count
					    0.55% xfs_perag_get_tag
				   0.58% kfree_rcu_shrink_count
			- 2.09% get_page_from_freelist
			   - 1.03% _raw_spin_lock_irqsave
			      - do_raw_spin_lock
				   __pv_queued_spin_lock_slowpath
		     - 4.88% get_page_from_freelist
			- 3.66% _raw_spin_lock_irqsave
			   - do_raw_spin_lock
				__pv_queued_spin_lock_slowpath
	    - 1.63% __vmalloc_node
	       - __vmalloc_node_range
		  - 1.10% __alloc_pages_bulk
		     - 0.93% __alloc_pages
			- 0.92% get_page_from_freelist
			   - 0.89% rmqueue_bulk
			      - 0.69% _raw_spin_lock
				 - do_raw_spin_lock
				      __pv_queued_spin_lock_slowpath
	   13.73% memcpy_erms
	 - 2.22% kvfree

On this workload, that's almost a dozen CPUs all trying to compact
and reclaim memory inside kvmalloc_node at the same time. Yet it is
regularly falling back to vmalloc despite all that compaction, page
and shrinker reclaim that direct reclaim is doing. Copying all the
metadata is taking far less CPU time than allocating the storage!

Direct reclaim should be considered extremely harmful.

This is a high frequency, high throughput, CPU usage and latency
sensitive allocation. We've got memory there, and we're using
kvmalloc to allow memory allocation to avoid doing lots of work to
try to do contiguous allocations.

Except it still does *lots of costly work* that is unnecessary.

Worse: the only way to avoid the slowpath page allocation trying to
do compaction on costly allocations is to turn off direct reclaim
(i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags).

Unfortunately, the stupid kvmalloc API then says "oh, this isn't a
GFP_KERNEL allocation context, so you only get kmalloc!". This
cuts off the vmalloc fallback, and this leads to almost instant OOM
problems which ends up in filesystems deadlocks, shutdowns and/or
kernel crashes.

I want some basic kvmalloc behaviour:

- kmalloc for a contiguous range with fail fast semantics - no
  compaction direct reclaim if the allocation enters the slow path.
- run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails

The really, really stupid part about this is these kvmalloc() calls
are run under memalloc_nofs task context, so all the allocations are
always reduced to GFP_NOFS regardless of the fact that kvmalloc
requires GFP_KERNEL to be passed in. IOWs, we're already telling
kvmalloc to behave differently to the gfp flags we pass in, but it
still won't allow vmalloc to be run with anything other than
GFP_KERNEL.

So, this patch open codes the kvmalloc() in the commit path to have
the above described behaviour. The result is we more than halve the
CPU time spend doing kvmalloc() in this path and transaction commits
with 64kB objects in them more than doubles. i.e. we get ~5x
reduction in CPU usage per costly-sized kvmalloc() invocation and
the profile looks like this:

  - 37.60% xlog_cil_commit
	16.01% memcpy_erms
      - 8.45% __kmalloc
	 - 8.04% kmalloc_order_trace
	    - 8.03% kmalloc_order
	       - 7.93% alloc_pages
		  - 7.90% __alloc_pages
		     - 4.05% __alloc_pages_slowpath.constprop.0
			- 2.18% get_page_from_freelist
			- 1.77% wake_all_kswapds
....
				    - __wake_up_common_lock
				       - 0.94% _raw_spin_lock_irqsave
		     - 3.72% get_page_from_freelist
			- 2.43% _raw_spin_lock_irqsave
      - 5.72% vmalloc
	 - 5.72% __vmalloc_node_range
	    - 4.81% __get_vm_area_node.constprop.0
	       - 3.26% alloc_vmap_area
		  - 2.52% _raw_spin_lock
	       - 1.46% _raw_spin_lock
	      0.56% __alloc_pages_bulk
      - 4.66% kvfree
	 - 3.25% vfree
	    - __vfree
	       - 3.23% __vunmap
		  - 1.95% remove_vm_area
		     - 1.06% free_vmap_area_noflush
			- 0.82% _raw_spin_lock
		     - 0.68% _raw_spin_lock
		  - 0.92% _raw_spin_lock
	 - 1.40% kfree
	    - 1.36% __free_pages
	       - 1.35% __free_pages_ok
		  - 1.02% _raw_spin_lock_irqsave

It's worth noting that over 50% of the CPU time spent allocating
these shadow buffers is now spent on spinlocks. So the shadow buffer
allocation overhead is greatly reduced by getting rid of direct
reclaim from kmalloc, and could probably be made even less costly if
vmalloc() didn't use global spinlocks to protect it's structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06 10:43:30 -08:00
Greg Kroah-Hartman
219aac5d46 xfs: sysfs: use default_groups in kobj_type
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field.  Move the xfs sysfs code to use default_groups field which has
been the preferred way since aa30f47cf6 ("kobject: Add support for
default attribute groups to kobj_type") so that we can soon get rid of
the obsolete default_attrs field.

Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06 10:43:30 -08:00
Darrick J. Wong
983d8e60f5 xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate
The old ALLOCSP/FREESP ioctls in XFS can be used to preallocate space at
the end of files, just like fallocate and RESVSP.  Make the behavior
consistent with the other ioctls.

Reported-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2021-12-22 14:19:18 -08:00
Darrick J. Wong
f8d92a66e8 xfs: prevent UAF in xfs_log_item_in_current_chkpt
While I was running with KASAN and lockdep enabled, I stumbled upon an
KASAN report about a UAF to a freed CIL checkpoint.  Looking at the
comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me
that the original patch to xfs_defer_finish_noroll should have done
something to lock the CIL to prevent it from switching the CIL contexts
while the predicate runs.

For upper level code that needs to know if a given log item is new
enough not to need relogging, add a new wrapper that takes the CIL
context lock long enough to sample the current CIL context.  This is
kind of racy in that the CIL can switch the contexts immediately after
sampling, but that's ok because the consequence is that the defer ops
code is a little slow to relog items.

 ==================================================================
 BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs]
 Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999

 CPU: 1 PID: 527999 Comm: fsstress Tainted: G      D      5.16.0-rc4-xfsx #rc4
 Call Trace:
  <TASK>
  dump_stack_lvl+0x45/0x59
  print_address_description.constprop.0+0x1f/0x140
  kasan_report.cold+0x83/0xdf
  xfs_log_item_in_current_chkpt+0x139/0x160
  xfs_defer_finish_noroll+0x3bb/0x1e30
  __xfs_trans_commit+0x6c8/0xcf0
  xfs_reflink_remap_extent+0x66f/0x10e0
  xfs_reflink_remap_blocks+0x2dd/0xa90
  xfs_file_remap_range+0x27b/0xc30
  vfs_dedupe_file_range_one+0x368/0x420
  vfs_dedupe_file_range+0x37c/0x5d0
  do_vfs_ioctl+0x308/0x1260
  __x64_sys_ioctl+0xa1/0x170
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f2c71a2950b
 Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48
 RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b
 RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004
 RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003
 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a
 R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20
  </TASK>

 Allocated by task 464064:
  kasan_save_stack+0x1e/0x50
  __kasan_kmalloc+0x81/0xa0
  kmem_alloc+0xcd/0x2c0 [xfs]
  xlog_cil_ctx_alloc+0x17/0x1e0 [xfs]
  xlog_cil_push_work+0x141/0x13d0 [xfs]
  process_one_work+0x7f6/0x1380
  worker_thread+0x59d/0x1040
  kthread+0x3b0/0x490
  ret_from_fork+0x1f/0x30

 Freed by task 51:
  kasan_save_stack+0x1e/0x50
  kasan_set_track+0x21/0x30
  kasan_set_free_info+0x20/0x30
  __kasan_slab_free+0xed/0x130
  slab_free_freelist_hook+0x7f/0x160
  kfree+0xde/0x340
  xlog_cil_committed+0xbfd/0xfe0 [xfs]
  xlog_cil_process_committed+0x103/0x1c0 [xfs]
  xlog_state_do_callback+0x45d/0xbd0 [xfs]
  xlog_ioend_work+0x116/0x1c0 [xfs]
  process_one_work+0x7f6/0x1380
  worker_thread+0x59d/0x1040
  kthread+0x3b0/0x490
  ret_from_fork+0x1f/0x30

 Last potentially related work creation:
  kasan_save_stack+0x1e/0x50
  __kasan_record_aux_stack+0xb7/0xc0
  insert_work+0x48/0x2e0
  __queue_work+0x4e7/0xda0
  queue_work_on+0x69/0x80
  xlog_cil_push_now.isra.0+0x16b/0x210 [xfs]
  xlog_cil_force_seq+0x1b7/0x850 [xfs]
  xfs_log_force_seq+0x1c7/0x670 [xfs]
  xfs_file_fsync+0x7c1/0xa60 [xfs]
  __x64_sys_fsync+0x52/0x80
  do_syscall_64+0x35/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae

 The buggy address belongs to the object at ffff88804ea5f600
  which belongs to the cache kmalloc-256 of size 256
 The buggy address is located 8 bytes inside of
  256-byte region [ffff88804ea5f600, ffff88804ea5f700)
 The buggy address belongs to the page:
 page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e
 head:ffffea00013a9780 order:1 compound_mapcount:0
 flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff)
 raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40
 raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                       ^
  ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ==================================================================

Fixes: 4e919af782 ("xfs: periodically relog deferred intent items")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-22 14:17:55 -08:00
Dan Carpenter
6ed6356b07 xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()
The "bufsize" comes from the root user.  If "bufsize" is negative then,
because of type promotion, neither of the validation checks at the start
of the function are able to catch it:

	if (bufsize < sizeof(struct xfs_attrlist) ||
	    bufsize > XFS_XATTR_LIST_MAX)
		return -EINVAL;

This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in
kvmalloc_node().  Fix this by changing the type from int to size_t.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21 09:49:41 -08:00
Yang Xu
132c460e49 xfs: Fix comments mentioning xfs_ialloc
Since kernel commit 1abcf26101 ("xfs: move on-disk inode allocation out of xfs_ialloc()"),
xfs_ialloc has been renamed to xfs_init_new_inode. So update this in comments.

Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21 09:49:41 -08:00
Dave Chinner
09654ed8a1 xfs: check sb_meta_uuid for dabuf buffer recovery
Got a report that a repeated crash test of a container host would
eventually fail with a log recovery error preventing the system from
mounting the root filesystem. It manifested as a directory leaf node
corruption on writeback like so:

 XFS (loop0): Mounting V5 Filesystem
 XFS (loop0): Starting recovery (logdev: internal)
 XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158
 XFS (loop0): Unmount and run xfs_repair
 XFS (loop0): First 128 bytes of corrupted metadata buffer:
 00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b  ........=.......
 00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc  .......X...)....
 00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23  ..x..~J}.S...G.#
 00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00  .........C......
 00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a  ................
 00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50  .5y....0.......P
 00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4  .@.......A......
 00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c  .b.......P!A....
 XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514).  Shutting down.
 XFS (loop0): Please unmount the filesystem and rectify the problem(s)
 XFS (loop0): log mount/recovery failed: error -117
 XFS (loop0): log mount failed

Tracing indicated that we were recovering changes from a transaction
at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57.
That is, log recovery was overwriting a buffer with newer changes on
disk than was in the transaction. Tracing indicated that we were
hitting the "recovery immediately" case in
xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the
buffer.

The code was extracting the LSN correctly, then ignoring it because
the UUID in the buffer did not match the superblock UUID. The
problem arises because the UUID check uses the wrong UUID - it
should be checking the sb_meta_uuid, not sb_uuid. This filesystem
has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the
correct matching sb_meta_uuid in it, it's just the code checked it
against the wrong superblock uuid.

The is no corruption in the filesystem, and failing to recover the
buffer due to a write verifier failure means the recovery bug did
not propagate the corruption to disk. Hence there is no corruption
before or after this bug has manifested, the impact is limited
simply to an unmountable filesystem....

This was missed back in 2015 during an audit of incorrect sb_uuid
usage that resulted in commit fcfbe2c4ef ("xfs: log recovery needs
to validate against sb_meta_uuid") that fixed the magic32 buffers to
validate against sb_meta_uuid instead of sb_uuid. It missed the
magicda buffers....

Fixes: ce748eaa65 ("xfs: create new metadata UUID field and incompat flag")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21 09:49:41 -08:00
Darrick J. Wong
e5d1802c70 xfs: fix a bug in the online fsck directory leaf1 bestcount check
When xfs_scrub encounters a directory with a leaf1 block, it tries to
validate that the leaf1 block's bestcount (aka the best free count of
each directory data block) is the correct size.  Previously, this author
believed that comparing bestcount to the directory isize (since
directory data blocks are under isize, and leaf/bestfree blocks are
above it) was sufficient.

Unfortunately during testing of online repair, it was discovered that it
is possible to create a directory with a hole between the last directory
block and isize.  The directory code seems to handle this situation just
fine and xfs_repair doesn't complain, which effectively makes this quirk
part of the disk format.

Fix the check to work properly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21 09:49:41 -08:00
Darrick J. Wong
7993f1a431 xfs: only run COW extent recovery when there are no live extents
As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem>				# (0)
fsstress <run only readonly ops> &		# (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro			# (2)
	fsstress <run only readonly ops>
	mount -o remount,rw			# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.  In the first patch, we fixed the race condition in (2) so that
(A) will always flush the COW fork.  In this second patch, we move the
_recover_cow call to the initial mount call in (0) for safety.

As mentioned previously, xfs_reflink_recover_cow walks the refcount
btree looking for COW staging extents, and frees them.  This was
intended to be run at mount time (when we know there are no live inodes)
to clean up any leftover staging events that may have been left behind
during an unclean shutdown.  As a time "optimization" for readonly
mounts, we deferred this to the ro->rw transition, not realizing that
any failure to clean all COW forks during a rw->ro transition would
result in catastrophic corruption.

Therefore, remove this optimization and only run the recovery routine
when we're guaranteed not to have any COW staging extents anywhere,
which means we always run this at mount time.  While we're at it, move
the callsite to xfs_log_mount_finish because any refcount btree
expansion (however unlikely given that we're removing records from the
right side of the index) must be fed by a per-AG reservation, which
doesn't exist in its current location.

Fixes: 174edb0e46 ("xfs: store in-progress CoW allocations in the refcount btree")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21 09:49:41 -08:00
Darrick J. Wong
7b7820b83f xfs: don't expose internal symlink metadata buffers to the vfs
Ian Kent reported that for inline symlinks, it's possible for
vfs_readlink to hang on to the target buffer returned by
_vn_get_link_inline long after it's been freed by xfs inode reclaim.
This is a layering violation -- we should never expose XFS internals to
the VFS.

When the symlink has a remote target, we allocate a separate buffer,
copy the internal information, and let the VFS manage the new buffer's
lifetime.  Let's adapt the inline code paths to do this too.  It's
less efficient, but fixes the layering violation and avoids the need to
adapt the if_data lifetime to rcu rules.  Clearly I don't care about
readlink benchmarks.

As a side note, this fixes the minor locking violation where we can
access the inode data fork without taking any locks; proper locking (and
eliminating the possibility of having to switch inode_operations on a
live inode) is essential to online repair coordinating repairs
correctly.

Reported-by: Ian Kent <raven@themaw.net>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21 09:49:41 -08:00
Darrick J. Wong
59d7fab2df xfs: fix quotaoff mutex usage now that we don't support disabling it
Prior to commit 40b52225e5 ("xfs: remove support for disabling quota
accounting on a mounted file system"), we used the quotaoff mutex to
protect dquot operations against quotaoff trying to pull down dquots as
part of disabling quota.

Now that we only support turning off quota enforcement, the quotaoff
mutex only protects changes in m_qflags/sb_qflags.  We don't need it to
protect dquots, which means we can remove it from setqlimits and the
dquot scrub code.  While we're at it, fix the function that forces
quotacheck, since it should have been taking the quotaoff mutex.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21 09:49:41 -08:00
Darrick J. Wong
47a6df7cd3 xfs: shut down filesystem if we xfs_trans_cancel with deferred work items
While debugging some very strange rmap corruption reports in connection
with the online directory repair code.  I root-caused the error to the
following incorrect sequence:

<start repair transaction>
<expand directory, causing a deferred rmap to be queued>
<roll transaction>
<cancel transaction>

Obviously, we should have committed the transaction instead of
cancelling it.  Thinking more broadly, however, xfs_trans_cancel should
have warned us that we were throwing away work item that we already
committed to performing.  This is not correct, and we need to shut down
the filesystem.

Change xfs_trans_cancel to complain in the loudest manner if we're
cancelling any transaction with deferred work items attached.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21 09:49:41 -08:00
Matthew Wilcox (Oracle)
6795801366 xfs: Support large folios
Now that iomap has been converted, XFS is large folio safe.
Indicate to the VFS that it can now create large folios for XFS.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-18 00:06:08 -05:00
Matthew Wilcox (Oracle)
6e478521df iomap,xfs: Convert ->discard_page to ->discard_folio
XFS has the only implementation of ->discard_page today, so convert it
to use folios in the same patch as converting the API.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-12-18 00:06:07 -05:00
Darrick J. Wong
089558bc7b xfs: remove all COW fork extents when remounting readonly
As part of multiple customer escalations due to file data corruption
after copy on write operations, I wrote some fstests that use fsstress
to hammer on COW to shake things loose.  Regrettably, I caught some
filesystem shutdowns due to incorrect rmap operations with the following
loop:

mount <filesystem>				# (0)
fsstress <run only readonly ops> &		# (1)
while true; do
	fsstress <run all ops>
	mount -o remount,ro			# (2)
	fsstress <run only readonly ops>
	mount -o remount,rw			# (3)
done

When (2) happens, notice that (1) is still running.  xfs_remount_ro will
call xfs_blockgc_stop to walk the inode cache to free all the COW
extents, but the blockgc mechanism races with (1)'s reader threads to
take IOLOCKs and loses, which means that it doesn't clean them all out.
Call such a file (A).

When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
walks the ondisk refcount btree and frees any COW extent that it finds.
This function does not check the inode cache, which means that incore
COW forks of inode (A) is now inconsistent with the ondisk metadata.  If
one of those former COW extents are allocated and mapped into another
file (B) and someone triggers a COW to the stale reservation in (A), A's
dirty data will be written into (B) and once that's done, those blocks
will be transferred to (A)'s data fork without bumping the refcount.

The results are catastrophic -- file (B) and the refcount btree are now
corrupt.  Solve this race by forcing the xfs_blockgc_free_space to run
synchronously, which causes xfs_icwalk to return to inodes that were
skipped because the blockgc code couldn't take the IOLOCK.  This is safe
to do here because the VFS has already prohibited new writer threads.

Fixes: 10ddf64e42 ("xfs: remove leftover CoW reservations when remounting ro")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-12-07 10:17:29 -08:00
Christian Brauner
209188ce75
fs: port higher-level mapping helpers
Enable the mapped_fs{g,u}id() helpers to support filesystems mounted
with an idmapping. Apart from core mapping helpers that use
mapped_fs{g,u}id() to initialize struct inode's i_{g,u}id fields xfs is
the only place that uses these low-level helpers directly.

The patch only extends the helpers to be able to take the filesystem
idmapping into account. Since we don't actually yet pass the
filesystem's idmapping in no functional changes happen. This will happen
in a final patch.

Link: https://lore.kernel.org/r/20211123114227.3124056-9-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-9-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-9-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-12-05 10:28:57 +01:00
Christoph Hellwig
de20511477 fsdax: shift partition offset handling into the file systems
Remove the last user of ->bdev in dax.c by requiring the file system to
pass in an address that already includes the DAX offset.  As part of the
only set ->bdev or ->daxdev when actually required in the ->iomap_begin
methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-27-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:54 -08:00
Christoph Hellwig
cd913c76f4 dax: return the partition offset from fs_dax_get_by_bdev
Prepare for the removal of the block_device from the DAX I/O path by
returning the partition offset from fs_dax_get_by_bdev so that the file
systems have it at hand for use during I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-26-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:54 -08:00
Christoph Hellwig
952da06375 iomap: add a IOMAP_DAX flag
Add a flag so that the file system can easily detect DAX operations
based just on the iomap operation requested instead of looking at
inode state using IS_DAX.  This will be needed to apply the to be
added partition offset only for operations that actually use DAX,
but not things like fiemap that are based on the block device.
In the long run it should also allow turning the bdev, dax_dev
and inline_data into a union.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-25-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:53 -08:00
Christoph Hellwig
740fd671e0 xfs: pass the mapping flags to xfs_bmbt_to_iomap
To prepare for looking at the IOMAP_DAX flag in xfs_bmbt_to_iomap pass in
the input mapping flags to xfs_bmbt_to_iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-24-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:53 -08:00
Christoph Hellwig
a50f6ab3fd xfs: use xfs_direct_write_iomap_ops for DAX zeroing
While the buffered write iomap ops do work due to the fact that zeroing
never allocates blocks, the DAX zeroing should use the direct ops just
like actual DAX I/O.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-23-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:53 -08:00
Christoph Hellwig
5b5abbefec xfs: move dax device handling into xfs_{alloc,free}_buftarg
Hide the DAX device lookup from the xfs_super.c code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Link: https://lore.kernel.org/r/20211129102203.2243509-22-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:53 -08:00
Christoph Hellwig
c6f4046865 fsdax: decouple zeroing from the iomap buffered I/O code
Unshare the DAX and iomap buffered I/O page zeroing code.  This code
previously did a IS_DAX check deep inside the iomap code, which in
fact was the only DAX check in the code.  Instead move these checks
into the callers.  Most callers already have DAX special casing anyway
and XFS will need it for reflink support as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-19-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:53 -08:00
Shiyang Ruan
f1ba5fafba xfs: add xfs_zero_range and xfs_truncate_page helpers
Add helpers to prepare for using different DAX operations.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
[hch: split from a larger patch + slight cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-16-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:52 -08:00
Christoph Hellwig
7b0800d00d dax: remove dax_capable
Just open code the block size and dax_dev == NULL checks in the callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs]
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-9-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:51 -08:00
Christoph Hellwig
679a99495b xfs: factor out a xfs_setup_dax_always helper
Factor out another DAX setup helper to simplify future changes.  Also
move the experimental warning after the checks to not clutter the log
too much if the setup failed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20211129102203.2243509-8-hch@lst.de
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-04 08:58:51 -08:00
Christian Brauner
a793d79ea3
fs: move mapping helpers
The low-level mapping helpers were so far crammed into fs.h. They are
out of place there. The fs.h header should just contain the higher-level
mapping helpers that interact directly with vfs objects such as struct
super_block or struct inode and not the bare mapping helpers. Similarly,
only vfs and specific fs code shall interact with low-level mapping
helpers. And so they won't be made accessible automatically through
regular {g,u}id helpers.

Link: https://lore.kernel.org/r/20211123114227.3124056-3-brauner@kernel.org (v1)
Link: https://lore.kernel.org/r/20211130121032.3753852-3-brauner@kernel.org (v2)
Link: https://lore.kernel.org/r/20211203111707.3901969-3-brauner@kernel.org
Cc: Seth Forshee <sforshee@digitalocean.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
CC: linux-fsdevel@vger.kernel.org
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-12-03 18:50:17 +01:00
Eric Sandeen
e445976537 xfs: remove incorrect ASSERT in xfs_rename
This ASSERT in xfs_rename is a) incorrect, because
(RENAME_WHITEOUT|RENAME_NOREPLACE) is a valid combination, and
b) unnecessary, because actual invalid flag combinations are already
handled at the vfs level in do_renameat2() before we get called.
So, remove it.

Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-01 17:27:48 -08:00
Christoph Hellwig
1090427bf1 xfs: remove xfs_inew_wait
With the remove of xfs_dqrele_all_inodes, xfs_inew_wait and all the
infrastructure used to wake the XFS_INEW bit waitqueue is unused.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 777eb1fa85 ("xfs: remove xfs_dqrele_all_inodes")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-24 10:06:02 -08:00
Yang Xu
a1de97fe29 xfs: Fix the free logic of state in xfs_attr_node_hasname
When testing xfstests xfs/126 on lastest upstream kernel, it will hang on some machine.
Adding a getxattr operation after xattr corrupted, I can reproduce it 100%.

The deadlock as below:
[983.923403] task:setfattr        state:D stack:    0 pid:17639 ppid: 14687 flags:0x00000080
[  983.923405] Call Trace:
[  983.923410]  __schedule+0x2c4/0x700
[  983.923412]  schedule+0x37/0xa0
[  983.923414]  schedule_timeout+0x274/0x300
[  983.923416]  __down+0x9b/0xf0
[  983.923451]  ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
[  983.923453]  down+0x3b/0x50
[  983.923471]  xfs_buf_lock+0x33/0xf0 [xfs]
[  983.923490]  xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs]
[  983.923508]  xfs_buf_get_map+0x4c/0x320 [xfs]
[  983.923525]  xfs_buf_read_map+0x53/0x310 [xfs]
[  983.923541]  ? xfs_da_read_buf+0xcf/0x120 [xfs]
[  983.923560]  xfs_trans_read_buf_map+0x1cf/0x360 [xfs]
[  983.923575]  ? xfs_da_read_buf+0xcf/0x120 [xfs]
[  983.923590]  xfs_da_read_buf+0xcf/0x120 [xfs]
[  983.923606]  xfs_da3_node_read+0x1f/0x40 [xfs]
[  983.923621]  xfs_da3_node_lookup_int+0x69/0x4a0 [xfs]
[  983.923624]  ? kmem_cache_alloc+0x12e/0x270
[  983.923637]  xfs_attr_node_hasname+0x6e/0xa0 [xfs]
[  983.923651]  xfs_has_attr+0x6e/0xd0 [xfs]
[  983.923664]  xfs_attr_set+0x273/0x320 [xfs]
[  983.923683]  xfs_xattr_set+0x87/0xd0 [xfs]
[  983.923686]  __vfs_removexattr+0x4d/0x60
[  983.923688]  __vfs_removexattr_locked+0xac/0x130
[  983.923689]  vfs_removexattr+0x4e/0xf0
[  983.923690]  removexattr+0x4d/0x80
[  983.923693]  ? __check_object_size+0xa8/0x16b
[  983.923695]  ? strncpy_from_user+0x47/0x1a0
[  983.923696]  ? getname_flags+0x6a/0x1e0
[  983.923697]  ? _cond_resched+0x15/0x30
[  983.923699]  ? __sb_start_write+0x1e/0x70
[  983.923700]  ? mnt_want_write+0x28/0x50
[  983.923701]  path_removexattr+0x9b/0xb0
[  983.923702]  __x64_sys_removexattr+0x17/0x20
[  983.923704]  do_syscall_64+0x5b/0x1a0
[  983.923705]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  983.923707] RIP: 0033:0x7f080f10ee1b

When getxattr calls xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in
xfs_attr_node_hasname because we have use blocktrash to random it in xfs/126. So it
free state in internal and xfs_attr_node_get doesn't do xfs_buf_trans release job.

Then subsequent removexattr will hang because of it.

This bug was introduced by kernel commit 07120f1abd ("xfs: Add xfs_has_attr and subroutines").
It adds xfs_attr_node_hasname helper and said caller will be responsible for freeing the state
in this case. But xfs_attr_node_hasname will free state itself instead of caller if
xfs_da3_node_lookup_int fails.

Fix this bug by moving the step of free state into caller.

Also, use "goto error/out" instead of returning error directly in xfs_attr_node_addname_find_attr and
xfs_attr_node_removename_setup function because we should free state ourselves.

Fixes: 07120f1abd ("xfs: Add xfs_has_attr and subroutines")
Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-24 10:06:02 -08:00
Linus Torvalds
ce49bfc8d0 Minor tweaks for 5.16:
* Clean up open-coded swap() calls.
  * A little bit of #ifdef golf to complete the reunification of the
    kernel and userspace libxfs source code.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmGNT1IACgkQ+H93GTRK
 tOucKA//Qk2NX3QBm/8pCrFE5V+eqooPANhZmzeviJCN/6++jcNOy0f+YK6JXVRC
 U2WdotHFH5fF6lsDkzNtMPHZ8JMZmOfEiPx5CGFiWT5iUW7FbLkROHm7GFtbwMoH
 qm3Lt7PbdSzJqTuOTvaGCw1xWkjDXMLdsdFM7mx3JO5zT9a/fCqjjmyR2Kl0qcSP
 RzfruVe20wUka2BeaXfZzSasgfLswratkU4xsiNiwA37yQaldzhrg8fg6uP3OSYi
 dkWFXi6WdWwQzARnjWNPwigUwA3xVaYgV+I6+ME0DYsUBvywZzUg3pkowhRAHyA9
 kv86L5Zt5K7kQcVqyd+lIvIuAcGrOZ9hA18PIXnwahLBqmjcqAJoF9XhTTZDMD4J
 LfujGMrf7DSDcf0vH8G9wlQQthsPGUOoFia5rr8MhdVVNee/b1Qvwsh7kmyg0DOK
 9WuNQxGPd7s+X+kwdmGrK7E6fqyPwEfC43l8wtCiBIyGz6QcorwD7kH9DGzv5xGF
 NX7WQeKvcaoXn1XVfonb0YgdVOnbyqK4AiY3Po1Ood3IxGyiLGCgDnusvYu+C9/r
 T0rRMbljkX1lUKqfzGkg2egOKPR+8RFgFKrKNSXUkDxl8TFLRd3ZObowPPlohq1I
 9lIIirip5UFYRv+7srDU1oZPWkvwkpJmaMFgagD3w+OWdo6zwao=
 =wsFu
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs cleanups from Darrick Wong:
 "The most 'exciting' aspect of this branch is that the xfsprogs
  maintainer and I have worked through the last of the code
  discrepancies between kernel and userspace libxfs such that there are
  no code differences between the two except for #includes.

  IOWs, diff suffices to demonstrate that the userspace tools behave the
  same as the kernel, and kernel-only bits are clearly marked in the
  /kernel/ source code instead of just the userspace source.

  Summary:

   - Clean up open-coded swap() calls.

   - A little bit of #ifdef golf to complete the reunification of the
     kernel and userspace libxfs source code"

* tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: sync xfs_btree_split macros with userspace libxfs
  xfs: #ifdef out perag code for userspace
  xfs: use swap() to make dabtree code cleaner
2021-11-14 12:18:22 -08:00
Darrick J. Wong
4a6b35b3b3 xfs: sync xfs_btree_split macros with userspace libxfs
Sync this one last bit of discrepancy between kernel and userspace
libxfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2021-11-11 09:13:39 -08:00
Eric Sandeen
29f11fce21 xfs: #ifdef out perag code for userspace
The xfs_perag structure and initialization is unused in userspace,
so #ifdef it out with __KERNEL__ to facilitate the xfsprogs sync
and build.

Signed-off-by: Eric Sandeen <esandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-10 09:37:38 -08:00
Yang Guang
5b068aadf6 xfs: use swap() to make dabtree code cleaner
Use the macro 'swap()' defined in 'include/linux/minmax.h' to avoid
opencoding it.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Yang Guang <yang.guang5@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-08 11:23:25 -08:00
Linus Torvalds
bba7d68227 New code for 5.16:
* Bug fixes and cleanups for kernel memory allocation usage, this time
    without touching the mm code.
  * Refactor the log recovery mechanism that preserves held resources
    across a transaction roll so that it uses the exact same mechanism
    that we use for that during regular runtime.
  * Fix bugs and tighten checking around btree heights.
  * Remove more old typedefs.
  * Fix perag reference leaks when racing with growfs.
  * Remove unused fields from xfs_btree_cur.
  * Allocate various scrub structures on the heap to reduce stack usage.
  * Pack xfs_btree_cur fields and rearrange to support arbitrary heights.
  * Compute maximum possible heights for each btree height, and use that
    to set up slab caches for each btree type.
  * Finally remove kmem_zone_t, since these have always been struct
    kmem_cache on Linux.
  * Compact the structures used to coordinate work intent items.
  * Set up slab caches for each work intent item type.
  * Rename the "bmap_add_free" function to "free_extent_later", which
    more accurately describes what it does.
  * Fix corruption warning on unmount when a CoW preallocation covers a
    data fork delalloc reservation but then the CoW fails.
  * Add some more minor code improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmF9d8IACgkQ+H93GTRK
 tOug0w/+N/RqL2Yo3R67ByI6+Los7JCWixTZJJjDzwvnYAAmT/z04JWD0fSO0iQX
 2UlxsONrWxLswX4qYdzJNhfDfxMI+aso5D9IHNIchOa95JLq4FSnJ7Elw+AaYJW3
 EvSm3LLFwwZwreNk6EXCOfRnPK2EXTFkq4iCPMu8bW+wXlxth2fo2AiZzNzl4y6N
 tBc8lYhtMiu7n223wM+qqAnh6onB4iSu9HtyMcuIAL3FvQU4FYlHy+IurtwnJX+p
 fHhPUG6YaIxvRFDMMEfEd2a8CSRtBoOmAgDNbKKq+l/2NrXRW3ysrohRAKtUrCfC
 15RFKvoZu9a2JxOfXxdr78zFGykuMFesEOaK86c58C35Rp5fTZTmB0kntLrUvIsK
 6gdSslUM0Dsw8v6l6tbIHBMjg7mn4fatK8IM74U7omL+mBHnV2HEBNtaOqG4Fjd+
 YSG3g/F7AkAYDUbvYKLiedZC+spbt28thg25ED6ZWGA1qJ1I5OZ8G9O80v3i0Jcf
 8G16nXeBtxumvz/Ga01S/c7jdCrXm46+qzubf3nET2di+5o+THDorSeg3nySfOz5
 L7uPnm3CV5jYFQN/kVg2an4mO5uLipHzbZSYN1l44lTPt8IUDDEu4DIbpjKgdFGT
 jZE0uRSscsLsGmrui1Bg65KO2cG1rz18Gp+RVeJoNZn7WRW+tjM=
 =zRzv
 -----END PGP SIGNATURE-----

Merge tag 'xfs-5.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs updates from Darrick Wong:
 "This cycle we've worked on fixing bugs and improving XFS' memory
  footprint.

  The most notable fixes include: fixing a corruption warning (and free
  space accounting skew) if copy on write fails; fixing slab cache
  misuse if SLOB is enabled, which apparently was broken for years
  without anybody noticing; and fixing a potential race with online
  shrinkfs.

  Otherwise, the bulk of the changes here involve setting up separate
  slab caches for frequently used items such as btree cursors and log
  intent items, and compacting the structures to reduce memory usage of
  those items substantially. This also sets us up to support larger
  btrees in future kernels. We also switch parts of online fsck to
  allocate scrub context information from the heap instead of using
  stack space.

  Summary:

   - Bug fixes and cleanups for kernel memory allocation usage, this
     time without touching the mm code.

   - Refactor the log recovery mechanism that preserves held resources
     across a transaction roll so that it uses the exact same mechanism
     that we use for that during regular runtime.

   - Fix bugs and tighten checking around btree heights.

   - Remove more old typedefs.

   - Fix perag reference leaks when racing with growfs.

   - Remove unused fields from xfs_btree_cur.

   - Allocate various scrub structures on the heap to reduce stack
     usage.

   - Pack xfs_btree_cur fields and rearrange to support arbitrary
     heights.

   - Compute maximum possible heights for each btree height, and use
     that to set up slab caches for each btree type.

   - Finally remove kmem_zone_t, since these have always been struct
     kmem_cache on Linux.

   - Compact the structures used to coordinate work intent items.

   - Set up slab caches for each work intent item type.

   - Rename the "bmap_add_free" function to "free_extent_later", which
     more accurately describes what it does.

   - Fix corruption warning on unmount when a CoW preallocation covers a
     data fork delalloc reservation but then the CoW fails.

   - Add some more minor code improvements"

* tag 'xfs-5.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits)
  xfs: use swap() to make code cleaner
  xfs: Remove duplicated include in xfs_super
  xfs: punch out data fork delalloc blocks on COW writeback failure
  xfs: remove unused parameter from refcount code
  xfs: reduce the size of struct xfs_extent_free_item
  xfs: rename xfs_bmap_add_free to xfs_free_extent_later
  xfs: create slab caches for frequently-used deferred items
  xfs: compact deferred intent item structures
  xfs: rename _zone variables to _cache
  xfs: remove kmem_zone typedef
  xfs: use separate btree cursor cache for each btree type
  xfs: compute absolute maximum nlevels for each btree type
  xfs: kill XFS_BTREE_MAXLEVELS
  xfs: compute the maximum height of the rmap btree when reflink enabled
  xfs: clean up xfs_btree_{calc_size,compute_maxlevels}
  xfs: compute maximum AG btree height for critical reservation calculation
  xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
  xfs: dynamically allocate cursors based on maxlevels
  xfs: encode the max btree height in the cursor
  xfs: refactor btree cursor allocation function
  ...
2021-11-02 12:42:56 -07:00
Linus Torvalds
c03098d4b9 gfs2: Fix mmap + page fault deadlocks
Functions gfs2_file_read_iter and gfs2_file_write_iter are both
 accessing the user buffer to write to or read from while holding the
 inode glock.  In the most basic scenario, that buffer will not be
 resident and it will be mapped to the same file.  Accessing the buffer
 will trigger a page fault, and gfs2 will deadlock trying to take the
 same inode glock again while trying to handle that fault.
 
 Fix that and similar, more complex scenarios by disabling page faults
 while accessing user buffers.  To make this work, introduce a small
 amount of new infrastructure and fix some bugs that didn't trigger so
 far, with page faults enabled.
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmGBPisUHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTpE6A/7BezUnGuNJxJrR8pC+vcLYA7xAgUU
 6STQ6IN7w5UHRlSkNzZxZ2XPxW4uVQ4SxSEeaLqBsHZihepjcLNFZ/8MhQ6UPSD0
 8noHOi7CoIcp6IuWQtCpxRM/xjjm2SlMt2XbVJZaiJcdzCV9gB6TU9EkBRq7Zm/X
 9WFBbv1xZF0skn9ISCJvNtiiI+VyWKgMDUKxJUiTQjmJcklyyqHcVGmQi9BjqPz4
 4s3F+WH6CoGbDKlmNk/6Y9wZ/2+sbvGswVscUxPwJVPoZWsR1xBBUdAeAmEMD1P4
 BgE/Y1J8JXyVPYtyvZKq70XUhKdQkxB7RfX87YasOk9mY4Kjd5rIIGEykh+o2vC9
 kDhCHvf2Mnw5I6Rum3B7UXyB1vemY+fECIHsXhgBnS+ztabRtcAdpCuWoqb43ymw
 yEX1KwXyU4FpRYbrRvdZT42Fmh6ty8TW+N4swg8S2TrffirvgAi5yrcHZ4mPupYv
 lyzvsCW7Wv8hPXn/twNObX+okRgJnsxcCdBXARdCnRXfA8tH23xmu88u8RA1Vdxh
 nzTvv6Dx2EowwojuDWMx29Mw3fA2IqIfbOV+4FaRU7NZ2ZKtknL8yGl27qQUsMoJ
 vYsHTmagasjQr+NDJ3vQRLCw+JQ6B1hENpdkmixFD9moo7X1ZFW3HBi/UL973Bv6
 5CmgeXto8FRUFjI=
 =WeNd
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher:
 "Functions gfs2_file_read_iter and gfs2_file_write_iter are both
  accessing the user buffer to write to or read from while holding the
  inode glock.

  In the most basic deadlock scenario, that buffer will not be resident
  and it will be mapped to the same file. Accessing the buffer will
  trigger a page fault, and gfs2 will deadlock trying to take the same
  inode glock again while trying to handle that fault.

  Fix that and similar, more complex scenarios by disabling page faults
  while accessing user buffers. To make this work, introduce a small
  amount of new infrastructure and fix some bugs that didn't trigger so
  far, with page faults enabled"

* tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix mmap + page fault deadlocks for direct I/O
  iov_iter: Introduce nofault flag to disable page faults
  gup: Introduce FOLL_NOFAULT flag to disable page faults
  iomap: Add done_before argument to iomap_dio_rw
  iomap: Support partial direct I/O on user copy failures
  iomap: Fix iomap_dio_rw return value for user copies
  gfs2: Fix mmap + page fault deadlocks for buffered I/O
  gfs2: Eliminate ip->i_gh
  gfs2: Move the inode glock locking to gfs2_file_buffered_write
  gfs2: Introduce flag for glock holder auto-demotion
  gfs2: Clean up function may_grant
  gfs2: Add wrapper for iomap_file_buffered_write
  iov_iter: Introduce fault_in_iov_iter_writeable
  iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
  gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
  powerpc/kvm: Fix kvm_use_magic_page
  iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
2021-11-02 12:25:03 -07:00
Linus Torvalds
bf953917be Various hardening fixes and cleanups for 5.16-rc1
Hi Linus,
 
 Please, pull the following hardening fixes and cleanups that I've
 been collecting during the last development cycle. All of them have
 been baking in linux-next.
 
 Fix -Wcast-function-type error:
 
 - firewire: Remove function callback casts (Oscar Carter)
 
 Fix application of sizeof operator:
 
 - firmware/psci: fix application of sizeof to pointer (jing yangyang)
 
 Replace open coded instances with size_t saturating arithmetic helpers:
 
 - assoc_array: Avoid open coded arithmetic in allocator arguments (Len Baker)
 - writeback: prefer struct_size over open coded arithmetic (Len Baker)
 - aio: Prefer struct_size over open coded arithmetic (Len Baker)
 - dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic (Len Baker)
 
 Flexible array transformation:
 
 - KVM: PPC: Replace zero-length array with flexible array member (Len Baker)
 
 Use 2-factor argument multiplication form:
 
 - nouveau/svm: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)
 - xfs: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)
 
 Thanks
 --
 Gustavo
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmGAWzsACgkQRwW0y0cG
 2zF55A/+PTBZKg0XLQkPZ7HFipobeZpfvM0dU4JutwN6Kts1RmMRftPn6ootY18v
 4tWR4jXcnblvEr7UTgYAl6QQdytFXZKOK+JKMWV8LXLqyNGF6sS2PmA6zk/iQoa5
 1q0IKUaaqLIXwmm3xoz+/uNHsb+kfjYOHZpHA6HhYZQFDyShW7+hhIeS1NauJo2X
 op3IWasMumrawPkCJZ0ZJJQLELtZNGt4gHnOjB1MAYhOTAokowgeeDNtyfoJ9j1L
 iL8kimphVLI35H/GERBozmqdqRGIIZLlQF4P66VfNNEXSDoKOemAKDSFrfmYoVwE
 kdh6fqeKPV/aRImrCtNthfpiEjqEpm8afQGMC5H5uPnZontUX9tcU1Qagg0vwYx0
 fLZ8mMuNQK5AZfugK+1+2ShfBYUlhvWRhQdtjC9nIAoO80NqouWB7QD0zIHC2WV7
 durdlhzxik70ISnXqKmTR6bQNcXB6kFLPR30RpcA3E6+AgwlkP0FmaD3e+sDttJ0
 vtxDMHqMMNNzOWlLW2eqEdKMEfoU0gLyRt5iM7EN6R8HUXwup5f9bu7V4LuCnR6y
 FAX4tEa8b5wg01zNfyWClCccU6tetSeXjdrhdIk7szQVsOsYXc4zxDrp6xvqsAh2
 B7GbGk5qeUzM/O7QWNIl+5s/NhUjEzQ3QiQebRDdjVyINU2OKsI=
 =Jk0U
 -----END PGP SIGNATURE-----

Merge tag 'kspp-misc-fixes-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull hardening fixes and cleanups from Gustavo A. R. Silva:
 "Various hardening fixes and cleanups that I've been collecting during
  the last development cycle:

  Fix -Wcast-function-type error:

   - firewire: Remove function callback casts (Oscar Carter)

  Fix application of sizeof operator:

   - firmware/psci: fix application of sizeof to pointer (jing yangyang)

  Replace open coded instances with size_t saturating arithmetic
  helpers:

   - assoc_array: Avoid open coded arithmetic in allocator arguments
     (Len Baker)

   - writeback: prefer struct_size over open coded arithmetic (Len
     Baker)

   - aio: Prefer struct_size over open coded arithmetic (Len Baker)

   - dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic
     (Len Baker)

  Flexible array transformation:

   - KVM: PPC: Replace zero-length array with flexible array member (Len
     Baker)

  Use 2-factor argument multiplication form:

   - nouveau/svm: Use kvcalloc() instead of kvzalloc() (Gustavo A. R.
     Silva)

   - xfs: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)"

* tag 'kspp-misc-fixes-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux:
  firewire: Remove function callback casts
  nouveau/svm: Use kvcalloc() instead of kvzalloc()
  firmware/psci: fix application of sizeof to pointer
  dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic
  KVM: PPC: Replace zero-length array with flexible array member
  aio: Prefer struct_size over open coded arithmetic
  writeback: prefer struct_size over open coded arithmetic
  xfs: Use kvcalloc() instead of kvzalloc()
  assoc_array: Avoid open coded arithmetic in allocator arguments
2021-11-01 17:29:10 -07:00
Changcheng Deng
2a09b57507 xfs: use swap() to make code cleaner
Use swap() in order to make code cleaner. Issue found by coccinelle.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-30 09:28:55 -07:00
Wan Jiabing
0b9007ec7b xfs: Remove duplicated include in xfs_super
Fix following checkincludes.pl warning:
./fs/xfs/xfs_super.c: xfs_btree.h is included more than once.

The include is in line 15. Remove the duplicated here.

Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-30 09:28:49 -07:00
Andreas Gruenbacher
4fdccaa0d1 iomap: Add done_before argument to iomap_dio_rw
Add a done_before argument to iomap_dio_rw that indicates how much of
the request has already been transferred.  When the request succeeds, we
report that done_before additional bytes were tranferred.  This is
useful for finishing a request asynchronously when part of the request
has already been completed synchronously.

We'll use that to allow iomap_dio_rw to be used with page faults
disabled: when a page fault occurs while submitting a request, we
synchronously complete the part of the request that has already been
submitted.  The caller can then take care of the page fault and call
iomap_dio_rw again for the rest of the request, passing in the number of
bytes already tranferred.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-10-24 15:26:05 +02:00
Brian Foster
5ca5916b6b xfs: punch out data fork delalloc blocks on COW writeback failure
If writeback I/O to a COW extent fails, the COW fork blocks are
punched out and the data fork blocks left alone. It is possible for
COW fork blocks to overlap non-shared data fork blocks (due to
cowextsz hint prealloc), however, and writeback unconditionally maps
to the COW fork whenever blocks exist at the corresponding offset of
the page undergoing writeback. This means it's quite possible for a
COW fork extent to overlap delalloc data fork blocks, writeback to
convert and map to the COW fork blocks, writeback to fail, and
finally for ioend completion to cancel the COW fork blocks and leave
stale data fork delalloc blocks around in the inode. The blocks are
effectively stale because writeback failure also discards dirty page
state.

If this occurs, it is likely to trigger assert failures, free space
accounting corruption and failures in unrelated file operations. For
example, a subsequent reflink attempt of the affected file to a new
target file will trip over the stale delalloc in the source file and
fail. Several of these issues are occasionally reproduced by
generic/648, but are reproducible on demand with the right sequence
of operations and timely I/O error injection.

To fix this problem, update the ioend failure path to also punch out
underlying data fork delalloc blocks on I/O error. This is analogous
to the writeback submission failure path in xfs_discard_page() where
we might fail to map data fork delalloc blocks and consistent with
the successful COW writeback completion path, which is responsible
for unmapping from the data fork and remapping in COW fork blocks.

Fixes: 787eb48550 ("xfs: fix and streamline error handling in xfs_end_io")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-22 16:04:36 -07:00
Darrick J. Wong
c04c51c524 xfs: remove unused parameter from refcount code
The owner info parameter is always NULL, so get rid of the parameter.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong
b3b5ff412a xfs: reduce the size of struct xfs_extent_free_item
We only use EFIs to free metadata blocks -- not regular data/attr fork
extents.  Remove all the fields that we never use, for a net reduction
of 16 bytes.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong
c201d9ca53 xfs: rename xfs_bmap_add_free to xfs_free_extent_later
xfs_bmap_add_free isn't a block mapping function; it schedules deferred
freeing operations for a later point in a compound transaction chain.
While it's primarily used by bunmapi, its use has expanded beyond that.
Move it to xfs_alloc.c and rename the function since it's now general
freeing functionality.  Bring the slab cache bits in line with the
way we handle the other intent items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong
f3c799c22c xfs: create slab caches for frequently-used deferred items
Create slab caches for the high-level structures that coordinate
deferred intent items, since they're used fairly heavily.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong
9e253954ac xfs: compact deferred intent item structures
Rearrange these structs to reduce the amount of unused padding bytes.
This saves eight bytes for each of the three structs changed here, which
means they're now all (rmap/bmap are 64 bytes, refc is 32 bytes) even
powers of two.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:36 -07:00
Darrick J. Wong
182696fb02 xfs: rename _zone variables to _cache
Now that we've gotten rid of the kmem_zone_t typedef, rename the
variables to _cache since that's what they are.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:04:20 -07:00
Darrick J. Wong
e7720afad0 xfs: remove kmem_zone typedef
Remove these typedefs by referencing kmem_cache directly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22 16:00:31 -07:00
Gustavo A. R. Silva
c2e4e3b756 xfs: Use kvcalloc() instead of kvzalloc()
Use 2-factor argument multiplication form kvcalloc() instead of
kvzalloc().

Link: https://github.com/KSPP/linux/issues/162
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-20 18:14:12 -05:00
Darrick J. Wong
9fa47bdcd3 xfs: use separate btree cursor cache for each btree type
Now that we have the infrastructure to track the max possible height of
each btree type, we can create a separate slab cache for cursors of each
type of btree.  For smaller indices like the free space btrees, this
means that we can pack more cursors into a slab page, improving slab
utilization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong
0ed5f7356d xfs: compute absolute maximum nlevels for each btree type
Add code for all five btree types so that we can compute the absolute
maximum possible btree height for each btree type.  This is a setup for
the next patch, which makes every btree type have its own cursor cache.

The functions are exported so that we can have xfs_db report the
absolute maximum btree heights for each btree type, rather than making
everyone run their own ad-hoc computations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong
bc8883eb77 xfs: kill XFS_BTREE_MAXLEVELS
Nobody uses this symbol anymore, so kill it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong
9ec691205e xfs: compute the maximum height of the rmap btree when reflink enabled
Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong
1b236ad7ba xfs: clean up xfs_btree_{calc_size,compute_maxlevels}
During review of the next patch, Dave remarked that he found these two
btree geometry calculation functions lacking in documentation and that
they performed more work than was really necessary.

These functions take the same parameters and have nearly the same logic;
the only real difference is in the return values.  Reword the function
comment to make it clearer what each function does, and move them to be
adjacent to reinforce their relation.

Clean up both of them to stop opencoding the howmany functions, stop
using the uint typedefs, and make them both support computations for
more than 2^32 leaf records, since we're going to need all of the above
for files with large data forks and large rmap btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:16 -07:00
Darrick J. Wong
b74e15d720 xfs: compute maximum AG btree height for critical reservation calculation
Compute the actual maximum AG btree height for deciding if a per-AG
block reservation is critically low.  This only affects the sanity check
condition, since we /generally/ will trigger on the 10% threshold.  This
is a long-winded way of saying that we're removing one more usage of
XFS_BTREE_MAXLEVELS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00
Darrick J. Wong
7cb3efb4cf xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
Years ago when XFS was thought to be much more simple, we introduced
m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
a given filesystem mount.  Then we observed that inode btrees don't
actually have the same height and split that off; and now we have rmap
and refcount btrees with much different geometries and separate
maxlevels variables.

The 'ag' part of the name doesn't make much sense anymore, so rename
this to m_alloc_maxlevels to reinforce that this is the maximum height
of the *free space* btrees.  This sets us up for the next patch, which
will add a variable to track the maximum height of all AG btrees.

(Also take the opportunity to improve adjacent comments and fix minor
style problems.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00
Darrick J. Wong
c940a0c54a xfs: dynamically allocate cursors based on maxlevels
To support future btree code, we need to be able to size btree cursors
dynamically for very large btrees.  Switch the maxlevels computation to
use the precomputed values in the superblock, and create cursors that
can handle a certain height.  For now, we retain the btree cursor cache
that can handle up to 9-level btrees, though a subsequent patch
introduces separate caches for each btree type, where each cache's
objects will be exactly tall enough to handle the specific btree type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19 11:45:15 -07:00