xfs: atomic file content exchanges [v30.3 03/16]

This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
 ranges of bytes between two files atomically.
 
 This new functionality enables data storage programs to stage and commit
 file updates such that reader programs will see either the old contents
 or the new contents in their entirety, with no chance of torn writes.  A
 successful call completion guarantees that the new contents will be seen
 even if the system fails.
 
 The ability to exchange file fork mappings between files in this manner
 is critical to supporting online filesystem repair, which is built upon
 the strategy of constructing a clean copy of a damaged structure and
 committing the new structure into the metadata file atomically.  The
 ioctls exist to facilitate testing of the new functionality and to
 enable future application program designs.
 
 User programs will be able to update files atomically by opening an
 O_TMPFILE, reflinking the source file to it, making whatever updates
 they want to make, and exchanging the relevant ranges of the temp file
 with the original file.  If the updates are aligned with the file block
 size, a new (since v2) flag provides for exchanging only the written
 areas.  Note that application software must quiesce writes to the file
 while it stages an atomic update.  This will be addressed by a
 subsequent series.
 
 This mechanism solves the clunkiness of two existing atomic file update
 mechanisms: for O_TRUNC + rewrite, this eliminates the brief period
 where other programs can see an empty file.  For create tempfile +
 rename, the need to copy file attributes and extended attributes for
 each file update is eliminated.
 
 However, this method introduces its own awkwardness -- any program
 initiating an exchange now needs to have a way to signal to other
 programs that the file contents have changed.  For file access mediated
 via read and write, fanotify or inotify are probably sufficient.  For
 mmaped files, that may not be fast enough.
 
 Here is the proposed manual page:
 
 IOCTL-XFS-EXCHANGE-RANGE(2)  System Calls Manual  IOCTL-XFS-EXCHANGE-RANGE(2)
 
 NAME
        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
        two files
 
 SYNOPSIS
        #include <sys/ioctl.h>
        #include <xfs/xfs_fs.h>
 
        int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE, struct  xfs_ex‐
        change_range *arg);
 
 DESCRIPTION
        Given  a  range  of bytes in a first file file1_fd and a second
        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
        changes the contents of the two ranges.
 
        Exchanges  are  atomic  with  regards to concurrent file opera‐
        tions.  Implementations must guarantee that readers see  either
        the old contents or the new contents in their entirety, even if
        the system fails.
 
        The system call parameters are conveyed in  structures  of  the
        following form:
 
            struct xfs_exchange_range {
                __s32    file1_fd;
                __u32    pad;
                __u64    file1_offset;
                __u64    file2_offset;
                __u64    length;
                __u64    flags;
            };
 
        The field pad must be zero.
 
        The  fields file1_fd, file1_offset, and length define the first
        range of bytes to be exchanged.
 
        The fields file2_fd, file2_offset, and length define the second
        range of bytes to be exchanged.
 
        Both  files must be from the same filesystem mount.  If the two
        file descriptors represent the same file, the byte ranges  must
        not  overlap.   Most  disk-based  filesystems  require that the
        starts of both ranges must be aligned to the file  block  size.
        If  this  is  the  case, the ends of the ranges must also be so
        aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set.
 
        The field flags control the behavior of the exchange operation.
 
            XFS_EXCHANGE_RANGE_TO_EOF
                   Ignore the length parameter.  All bytes in  file1_fd
                   from  file1_offset to EOF are moved to file2_fd, and
                   file2's size is set to  (file2_offset+(file1_length-
                   file1_offset)).   Meanwhile, all bytes in file2 from
                   file2_offset to EOF are moved to file1  and  file1's
                   size    is   set   to   (file1_offset+(file2_length-
                   file2_offset)).
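 
                    For example, if file1 is 3000 bytes long and both
                    file1_offset and file2_offset are 1000, everything
                    beyond offset 1000 trades places and file2's new
                    size is 1000 + (3000 - 1000) = 3000 bytes, reading
                    file1_length as the byte length of file1.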
 
            XFS_EXCHANGE_RANGE_DSYNC
                   Ensure that all modified in-core data in  both  file
                   ranges  and  all  metadata updates pertaining to the
                   exchange operation are flushed to persistent storage
                   before  the  call  returns.  Opening either file de‐
                   scriptor with O_SYNC or O_DSYNC will have  the  same
                   effect.
 
            XFS_EXCHANGE_RANGE_FILE1_WRITTEN
                   Only  exchange sub-ranges of file1_fd that are known
                   to contain data  written  by  application  software.
                   Each  sub-range  may  be  expanded (both upwards and
                   downwards) to align with the file  allocation  unit.
                   For files on the data device, this is one filesystem
                   block.  For files on the realtime  device,  this  is
                   the realtime extent size.  This facility can be used
                   to implement fast atomic  scatter-gather  writes  of
                   any  complexity for software-defined storage targets
                   if all writes are aligned  to  the  file  allocation
                   unit.
 
            XFS_EXCHANGE_RANGE_DRY_RUN
                   Check  the parameters and the feasibility of the op‐
                   eration, but do not change anything.
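 
                    For example, a caller might verify that a  commit
                    can  succeed before performing it (a sketch reus-
                    ing the descriptors from the  examples  below,
                    with error checking omitted):
 
                        struct xfs_exchange_range probe = {
                            .file1_fd = temp_fd,
                            .flags    = XFS_EXCHANGE_RANGE_TO_EOF |
                                        XFS_EXCHANGE_RANGE_DRY_RUN,
                        };
 
                        /* succeeds only if the real exchange would */
                        ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &probe);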
 
 RETURN VALUE
        On error, -1 is returned, and errno is set to indicate the  er‐
        ror.
 
 ERRORS
        Error  codes can be one of, but are not limited to, the follow‐
        ing:
 
        EBADF  file1_fd is not open for reading and writing or is  open
               for  append-only  writes;  or  file2_fd  is not open for
               reading and writing or is open for append-only writes.
 
        EINVAL The parameters are not correct for  these  files.   This
               error  can  also appear if either file descriptor repre‐
               sents a device, FIFO, or socket.  Disk filesystems  gen‐
               erally  require  the  offset  and length arguments to be
               aligned to the fundamental block sizes of both files.
 
        EIO    An I/O error occurred.
 
        EISDIR One of the files is a directory.
 
        ENOMEM The kernel was unable to allocate sufficient  memory  to
               perform the operation.
 
        ENOSPC There is not enough free space in  the  filesystem  to
               exchange the contents safely.
 
        EOPNOTSUPP
               The filesystem does not support exchanging bytes between
               the two files.
 
        EPERM  file1_fd or file2_fd are immutable.
 
        ETXTBSY
               One of the files is a swap file.
 
        EUCLEAN
               The filesystem is corrupt.
 
        EXDEV  file1_fd  and  file2_fd  are  not  on  the  same mounted
               filesystem.
 
 CONFORMING TO
        This API is XFS-specific.
 
 USE CASES
        Several use cases are imagined for this system  call.   In  all
        cases, application software must coordinate updates to the file
        because the exchange is performed unconditionally.
 
        The first is a data storage program that wants to  commit  non-
        contiguous  updates  to a file atomically and coordinates write
        access to that file.  This can be done by creating a  temporary
        file, calling FICLONE(2) to share the contents, and staging the
        updates into the temporary file.  The TO_EOF  flag  is  recom‐
        mended  for this purpose.  The temporary file can be deleted or
        punched out afterwards.
 
        An example program might look like this:
 
            int fd = open("/some/file", O_RDWR);
            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
 
            ioctl(temp_fd, FICLONE, fd);
 
            /* append 1MB of records */
            lseek(temp_fd, 0, SEEK_END);
            write(temp_fd, data1, 1000000);
 
            /* update record index */
            pwrite(temp_fd, data1, 600, 98765);
            pwrite(temp_fd, data2, 320, 54321);
            pwrite(temp_fd, data2, 15, 0);
 
            /* commit the entire update */
            struct xfs_exchange_range args = {
                .file1_fd = temp_fd,
                .flags = XFS_EXCHANGE_RANGE_TO_EOF,
            };
 
            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
 
        The second is a software-defined  storage  host  (e.g.  a  disk
        jukebox)  which  implements an atomic scatter-gather write com‐
        mand.  Provided the exported disk's logical block size  matches
        the file's allocation unit size, this can be done by creating a
        temporary file and writing the data at the appropriate offsets.
        It  is  recommended that the temporary file be truncated to the
        size of the regular file before any writes are  staged  to  the
        temporary  file  to avoid issues with zeroing during EOF exten‐
        sion.  Use this call with the FILE1_WRITTEN  flag  to  exchange
        only  the  file  allocation  units involved in the emulated de‐
        vice's write command.  The temporary file should  be  truncated
        or  punched out completely before being reused to stage another
        write.
 
        An example program might look like this:
 
            int fd = open("/some/file", O_RDWR);
            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
            struct stat sb;
            int blksz;
 
            fstat(fd, &sb);
            blksz = sb.st_blksize;
 
            /* land scatter gather writes between 100fsb and 500fsb */
            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
 
            /* commit the entire update */
            struct xfs_exchange_range args = {
                .file1_fd = temp_fd,
                .file1_offset = blksz * 100,
                .file2_offset = blksz * 100,
                .length       = blksz * 400,
                .flags        = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                                XFS_EXCHANGE_RANGE_DSYNC,
            };
 
            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
 
 NOTES
        Some filesystems may limit the amount of data or the number  of
        extents that can be exchanged in a single call.
 
 SEE ALSO
        ioctl(2)
 
 XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
 
 The reference implementation in XFS creates a new log incompat feature
 and log intent items to track high level progress of swapping ranges of
 two files and finish interrupted work if the system goes down.  Sample
 code can be found in the corresponding changes to xfs_io to exercise the
 use case mentioned above.
 
 Note that this function is /not/ the O_DIRECT atomic untorn file writes
 concept that has also been floating around for years.  It is also not
 the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
 entirely in software, which means that there are no limitations other
 than the general filesystem limits.
 
 As a side note, the original motivation behind the kernel functionality
 is online repair of file-based metadata.  The atomic file content
 exchange is implemented as an atomic exchange of file fork mappings,
 which means that we can implement online reconstruction of extended
 attributes and directories by building a new one in another inode and
 exchanging the contents.
 
 Subsequent patchsets adapt the online filesystem repair code to use
 atomic file exchanges.  This enables repair functions to construct a
 clean copy of a directory, xattr information, symbolic links, realtime
 bitmaps, and realtime summary information in a temporary inode.  If this
 completes successfully, the new contents can be committed atomically
 into the inode being repaired.  This is essential to avoid making
 corruption problems worse if the system goes down in the middle of
 running repair.
 
 For userspace, this series also includes the userspace pieces needed to
 test the new functionality, and a sample implementation of atomic file
 updates.
 
 This has been running on the djcloud for months with no problems.  Enjoy!
 
 Signed-off-by: Darrick J. Wong <djwong@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZh23UgAKCRBKO3ySh0YR
 pmYQAQCGwoAev/oRzIJrZmbpzNaU9w7XEPF+tW3vJSX6tlxG+wD8DIi4kTAplu/9
 i860EFqZp5MuwHyGVDCac0owigtt6wk=
 =Lsls
 -----END PGP SIGNATURE-----

Merge tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.10-mergeA

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>

* tag 'atomic-file-updates-6.10_2024-04-15' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux:
  xfs: enable logged file mapping exchange feature
  docs: update swapext -> exchmaps language
  xfs: capture inode generation numbers in the ondisk exchmaps log item
  xfs: support non-power-of-two rtextsize with exchange-range
  xfs: make file range exchange support realtime files
  xfs: condense symbolic links after a mapping exchange operation
  xfs: condense directories after a mapping exchange operation
  xfs: condense extended attributes after a mapping exchange operation
  xfs: add error injection to test file mapping exchange recovery
  xfs: bind together the front and back ends of the file range exchange code
  xfs: create deferred log items for file mapping exchanges
  xfs: introduce a file mapping exchange log intent item
  xfs: create a incompat flag for atomic file mapping exchanges
  xfs: introduce new file range exchange ioctl
  vfs: export remap and write check helpers
Chandan Babu R 2024-04-16 11:25:09 +05:30
commit 22d5a8e52d
30 changed files with 3615 additions and 187 deletions


@@ -2167,7 +2167,7 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
function frees them all because compaction is not needed.
The details of repairing directories and extended attributes will be discussed
-in a subsequent section about atomic extent swapping.
+in a subsequent section about atomic file content exchanges.
However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.
@@ -2802,7 +2802,8 @@ follows this format:
Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
-new structure attached to a temporary file and swapping the forks.
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.
@@ -3851,8 +3852,8 @@ Because file forks can consume as much space as the entire filesystem, repairs
cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata createas a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.
@@ -3862,10 +3863,11 @@ consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.

-Swapping metadata extents with a temporary file requires the owner field of the
-block headers to match the file being repaired and not the temporary file. The
-directory, extended attribute, and symbolic link functions were all modified to
-allow callers to specify owner numbers explicitly.
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.

There is a downside to the reaping process -- if the system crashes during the
reap phase and the fork extents are crosslinked, the iunlink processing will
@@ -3974,8 +3976,8 @@ The proposed patches are in the
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
series.

-Atomic Extent Swapping
-----------------------
+Logged File Content Exchanges
+-----------------------------

Once repair builds a temporary file with a new data structure written into
it, it must commit the new changes into the existing file.
@@ -4010,17 +4012,21 @@ e. Old blocks in the file may be cross-linked with another structure and must
These problems are overcome by creating a new deferred operation and a new type
of log intent item to track the progress of an operation to exchange two file
ranges.
-The new deferred operation type chains together the same transactions used by
-the reverse-mapping extent swap code.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermediate progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange (xfs_exchrange)
+code.
+The underlying implementation exchanges file fork mappings (xfs_exchmaps).
The new log item records the progress of the exchange to ensure that once an
exchange begins, it will always run to completion, even if there are
interruptions.
-The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
in the superblock protects these new log item records from being replayed on
old kernels.

The proposed patchset is the
-`atomic extent swap
+`file contents exchange
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
series.
@@ -4061,72 +4067,73 @@ series.
| The feature bit will not be cleared from the superblock until the log |
| becomes clean. |
| |
-| Log-assisted extended attribute updates and atomic extent swaps both use |
-| log incompat features and provide convenience wrappers around the |
+| Log-assisted extended attribute updates and file content exchanges both |
+| use log incompat features and provide convenience wrappers around the |
| functionality. |
+--------------------------------------------------------------------------+

-Mechanics of an Atomic Extent Swap
-``````````````````````````````````
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````

-Swapping entire file forks is a complex task.
+Exchanging contents between file forks is a complex task.
The goal is to exchange all file fork mappings between two file fork offset
ranges.
There are likely to be many extent mappings in each fork, and the edges of
the mappings aren't necessarily aligned.
-Furthermore, there may be other updates that need to happen after the swap,
+Furthermore, there may be other updates that need to happen after the exchange,
such as exchanging file sizes, inode flags, or conversion of fork data to local
format.

-This is roughly the format of the new deferred extent swap work item:
+This is roughly the format of the new deferred exchange-mapping work item:

.. code-block:: c

-    struct xfs_swapext_intent {
+    struct xfs_exchmaps_intent {
        /* Inodes participating in the operation. */
-        struct xfs_inode        *sxi_ip1;
-        struct xfs_inode        *sxi_ip2;
+        struct xfs_inode        *xmi_ip1;
+        struct xfs_inode        *xmi_ip2;
        /* File offset range information. */
-        xfs_fileoff_t           sxi_startoff1;
-        xfs_fileoff_t           sxi_startoff2;
-        xfs_filblks_t           sxi_blockcount;
+        xfs_fileoff_t           xmi_startoff1;
+        xfs_fileoff_t           xmi_startoff2;
+        xfs_filblks_t           xmi_blockcount;
        /* Set these file sizes after the operation, unless negative. */
-        xfs_fsize_t             sxi_isize1;
-        xfs_fsize_t             sxi_isize2;
+        xfs_fsize_t             xmi_isize1;
+        xfs_fsize_t             xmi_isize2;
-        /* XFS_SWAP_EXT_* log operation flags */
-        uint64_t                sxi_flags;
+        /* XFS_EXCHMAPS_* log operation flags */
+        uint64_t                xmi_flags;
    };

The new log intent item contains enough information to track two logical fork
offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
blockcount)``.
-Each step of a swap operation exchanges the largest file range mapping possible
-from one file to the other.
-After each step in the swap operation, the two startoff fields are incremented
-and the blockcount field is decremented to reflect the progress made.
-The flags field captures behavioral parameters such as swapping the attr fork
-instead of the data fork and other work to be done after the extent swap.
-The two isize fields are used to swap the file size at the end of the operation
-if the file data fork is the target of the swap operation.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.

-When the extent swap is initiated, the sequence of operations is as follows:
+When the exchange is initiated, the sequence of operations is as follows:

-1. Create a deferred work item for the extent swap.
-   At the start, it should contain the entirety of the file ranges to be
-   swapped.
+1. Create a deferred work item for the file mapping exchange.
+   At the start, it should contain the entirety of the file block ranges to be
+   exchanged.

2. Call ``xfs_defer_finish`` to process the exchange.
-   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+   This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
   This will log an extent swap intent item to the transaction for the deferred
-   extent swap work item.
+   mapping exchange work item.

-3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,

-   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
-      ``sxi_startoff2``, respectively, and compute the longest extent that can
-      be swapped in a single step.
+   a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+      ``xmi_startoff2``, respectively, and compute the longest extent that can
+      be exchanged in a single step.
      This is the minimum of the two ``br_blockcount`` s in the mappings.
      Keep advancing through the file forks until at least one of the mappings
      contains written blocks.
@@ -4148,20 +4155,20 @@ When the extent swap is initiated, the sequence of operations is as follows:
   g. Extend the ondisk size of either file if necessary.

-   h. Log an extent swap done log item for the extent swap intent log item
-      that was read at the start of step 3.
+   h. Log a mapping exchange done log item for the mapping exchange intent log
+      item that was read at the start of step 3.

   i. Compute the amount of file range that has just been covered.
      This quantity is ``(map1.br_startoff + map1.br_blockcount -
-      sxi_startoff1)``, because step 3a could have skipped holes.
+      xmi_startoff1)``, because step 3a could have skipped holes.

-   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+   j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
      by the number of blocks computed in the previous step, and decrease
-      ``sxi_blockcount`` by the same quantity.
+      ``xmi_blockcount`` by the same quantity.
      This advances the cursor.

-   k. Log a new extent swap intent log item reflecting the advanced state of
-      the work item.
+   k. Log a new mapping exchange intent log item reflecting the advanced state
+      of the work item.

   l. Return the proper error code (EAGAIN) to the deferred operation manager
      to inform it that there is more work to be done.
@@ -4172,22 +4179,23 @@ When the extent swap is initiated, the sequence of operations is as follows:
This will be discussed in more detail in subsequent sections.

If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mismash of both.
+find the most recent unfinished mapping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantee that an outside observer
+will either see the old broken structure or the new one, and never a mishmash
+of both.

-Preparation for Extent Swapping
-```````````````````````````````
+Preparation for File Content Exchanges
+``````````````````````````````````````

There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
+atomic file mapping exchange operation.
First, regular files require the page cache to be flushed to disk before the
operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
The preparation step scans the ranges of both files to estimate:

- Data device blocks needed to handle the repeated updates to the fork
@@ -4201,56 +4209,59 @@ The preparation step scans the ranges of both files to estimate:
  to different extents on the realtime volume, which could happen if the
  operation fails to run to completion.

-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
Regular users are required to abide the quota limits, though metadata repairs
may exceed quota to resolve inconsistent metadata elsewhere.

-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````

Extended attributes, symbolic links, and directories can set the fork format to
"local" and treat the fork as a literal area for data storage.
Metadata repairs must take extra steps to support these cases:

- If both forks are in local format and the fork areas are large enough, the
-  swap is performed by copying the incore fork contents, logging both forks,
-  and committing.
-  The atomic extent swap mechanism is not necessary, since this can be done
-  with a single transaction.
+  exchange is performed by copying the incore fork contents, logging both
+  forks, and committing.
+  The atomic file mapping exchange mechanism is not necessary, since this can
+  be done with a single transaction.

-- If both forks map blocks, then the regular atomic extent swap is used.
+- If both forks map blocks, then the regular atomic file mapping exchange is
+  used.

- Otherwise, only one fork is in local format.
  The contents of the local format fork are converted to a block to perform the
-  swap.
+  exchange.
  The conversion to block format must be done in the same transaction that
-  logs the initial extent swap intent log item.
-  The regular atomic extent swap is used to exchange the mappings.
-  Special flags are set on the swap operation so that the transaction can be
-  rolled one more time to convert the second file's fork back to local format
-  so that the second file will be ready to go as soon as the ILOCK is dropped.
+  logs the initial mapping exchange intent log item.
+  The regular atomic mapping exchange is used to exchange the metadata file
+  mappings.
+  Special flags are set on the exchange operation so that the transaction can
+  be rolled one more time to convert the second file's fork back to local
+  format so that the second file will be ready to go as soon as the ILOCK is
+  dropped.

Extended attributes and directories stamp the owning inode into every block,
but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain
-referential integrity, so prior to performing the extent swap, online repair
-builds every block in the new data structure with the owner field of the file
-being repaired.
+referential integrity, so prior to performing the mapping exchange, online
+repair builds every block in the new data structure with the owner field of the
+file being repaired.

-After a successful swap operation, the repair operation must reap the old fork
-blocks by processing each fork mapping through the standard :ref:`file extent
-reaping <reaping>` mechanism that is done post-repair.
+After a successful exchange operation, the repair operation must reap the old
+fork blocks by processing each fork mapping through the standard :ref:`file
+extent reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary file and
whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof.

-Swapping Temporary File Extents
-```````````````````````````````
+Exchanging Temporary File Contents
+``````````````````````````````````

To repair a metadata file, online repair proceeds as follows:
@@ -4260,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows:
   file.
   The same fork must be written to as is being repaired.

-3. Commit the scrub transaction, since the swap estimation step must be
-   completed before transaction reservations are made.
+3. Commit the scrub transaction, since the exchange resource estimation step
+   must be completed before transaction reservations are made.

-4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
   the appropriate resource reservations, locks, and fill out a ``struct
-   xfs_swapext_req`` with the details of the swap operation.
+   xfs_exchmaps_req`` with the details of the exchange operation.

-5. Call ``xrep_tempswap_contents`` to swap the contents.
+5. Call ``xrep_tempexch_contents`` to exchange the contents.

6. Commit the transaction to complete the repair.
@@ -4309,7 +4320,7 @@ To check the summary file against the bitmap:
3. Compare the contents of the xfile against the ondisk file.

To repair the summary file, write the xfile contents into the temporary file
-and use atomic extent swap to commit the new contents.
+and use atomic mapping exchange to commit the new contents.
The temporary file is then reaped.

The proposed patchset is the
@@ -4352,8 +4363,8 @@ Salvaging extended attributes is done as follows:
   memory or there are no more attr fork blocks to examine, unlock the file and
   add the staged extended attributes to the temporary file.

-3. Use atomic extent swapping to exchange the new and old extended attribute
-   structures.
+3. Use atomic file mapping exchange to exchange the new and old extended
+   attribute structures.
   The old attribute blocks are now attached to the temporary file.

4. Reap the temporary file.
@@ -4410,7 +4421,8 @@ salvaging directories is straightforward:
   directory and add the staged dirents into the temporary directory.
   Truncate the staging files.

-4. Use atomic extent swapping to exchange the new and old directory structures.
+4. Use atomic file mapping exchange to exchange the new and old directory
+   structures.
   The old directory blocks are now attached to the temporary file.

5. Reap the temporary file.
@@ -4542,7 +4554,7 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
   Instead, we stash updates in the xfarray and rely on the scanner thread
   to apply the stashed updates to the temporary directory.

-5. When the scan is complete, atomically swap the contents of the temporary
+5. When the scan is complete, atomically exchange the contents of the temporary
   directory and the directory being repaired.
   The temporary directory now contains the damaged directory structure.
@@ -4629,8 +4641,8 @@ directory reconstruction:
5. Copy all non-parent pointer extended attributes to the temporary file.

-6. When the scan is complete, atomically swap the attribute fork of the
-   temporary file and the file being repaired.
+6. When the scan is complete, atomically exchange the mappings of the attribute
+   forks of the temporary file and the file being repaired.
   The temporary file now contains the damaged extended attribute structure.

7. Reap the temporary file.
@@ -5105,18 +5117,18 @@ make it easier for code readers to understand what has been built, for whom it
has been built, and why.
Please feel free to contact the XFS mailing list with questions.

-FIEXCHANGE_RANGE
-----------------
+XFS_IOC_EXCHANGE_RANGE
+----------------------

-As discussed earlier, a second frontend to the atomic extent swap mechanism is
-a new ioctl call that userspace programs can use to commit updates to files
-atomically.
+As discussed earlier, a second frontend to the atomic file mapping exchange
+mechanism is a new ioctl call that userspace programs can use to commit updates
+to files atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.

-Extent Swapping with Regular User Files
-```````````````````````````````````````
+File Content Exchanges with Regular User Files
+``````````````````````````````````````````````

As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
@@ -5131,12 +5143,12 @@ the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
-the new tracking items, because the atomic extent swap mechanism is an
-iteration of an existing mechanism and not something totally novel.
+the new tracking items, because the atomic file mapping exchange mechanism is
+an iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.

-Atomic extent swapping is much more flexible than the existing swapext
+Atomic file content exchanges are much more flexible than the existing swapext
implementations because it can guarantee that the caller never sees a mix of
old and new contents even after a crash, and it can operate on two arbitrary
file fork ranges.
@@ -5147,11 +5159,11 @@ The extra flexibility enables several new use cases:
  Next, it opens a temporary file and calls the file clone operation to reflink
  the first file's contents into the temporary file.
  Writes to the original file should instead be written to the temporary file.
-  Finally, the process calls the atomic extent swap system call
-  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
-  of the updates to the original file, or none of them.
+  Finally, the process calls the atomic file mapping exchange system call
+  (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
+  committing all of the updates to the original file, or none of them.

-.. _swapext_if_unchanged:
+.. _exchrange_if_unchanged:

- **Transactional file updates**: The same mechanism as above, but the caller
  only wants the commit to occur if the original file's contents have not
@@ -5160,16 +5172,17 @@ The extra flexibility enables several new use cases:
  change timestamps of the original file before reflinking its data to the
  temporary file.
  When the program is ready to commit the changes, it passes the timestamps
-  into the kernel as arguments to the atomic extent swap system call.
+  into the kernel as arguments to the atomic file mapping exchange system call.
  The kernel only commits the changes if the provided timestamps match the
  original file.
+  A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.

- **Emulation of atomic block device writes**: Export a block device with a
  logical sector size matching the filesystem block size to force all writes
  to be aligned to the filesystem block size.
  Stage all writes to a temporary file, and when that is complete, call the
-  atomic extent swap system call with a flag to indicate that holes in the
-  temporary file should be ignored.
+  atomic file mapping exchange system call with a flag to indicate that holes
+  in the temporary file should be ignored.
  This emulates an atomic device write in software, and can support arbitrary
  scattered writes.
@@ -5251,8 +5264,8 @@ of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being
-cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap
-<swapext_if_unchanged>` feature) to change the target file's data extent
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
+<exchrange_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable.


@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
	return 0;
}
+EXPORT_SYMBOL_GPL(generic_write_check_limits);

/* Like generic_write_checks(), but takes size of write instead of iter. */
int generic_write_checks_count(struct kiocb *iocb, loff_t *count)


@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
	return 0;
}

-static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
-		bool write)
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write)
{
	int mask = write ? MAY_WRITE : MAY_READ;
	loff_t tmp;
@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
	return fsnotify_file_area_perm(file, mask, &pos, len);
}
+EXPORT_SYMBOL_GPL(remap_verify_area);

/*
 * Ensure that we don't remap a partial EOF block in the middle of something


@@ -34,6 +34,7 @@ xfs-y += $(addprefix libxfs/, \
	xfs_dir2_node.o \
	xfs_dir2_sf.o \
	xfs_dquot_buf.o \
+	xfs_exchmaps.o \
	xfs_ialloc.o \
	xfs_ialloc_btree.o \
	xfs_iext_tree.o \
@@ -67,6 +68,7 @@ xfs-y += xfs_aops.o \
	xfs_dir2_readdir.o \
	xfs_discard.o \
	xfs_error.o \
+	xfs_exchrange.o \
	xfs_export.o \
	xfs_extent_busy.o \
	xfs_file.o \
@@ -101,6 +103,7 @@ xfs-y += xfs_log.o \
	xfs_buf_item.o \
	xfs_buf_item_recover.o \
	xfs_dquot_item_recover.o \
+	xfs_exchmaps_item.o \
	xfs_extfree_item.o \
	xfs_attr_item.o \
	xfs_icreate_item.o \


@@ -27,6 +27,7 @@
#include "xfs_da_btree.h"
#include "xfs_attr.h"
#include "xfs_trans_priv.h"
+#include "xfs_exchmaps.h"

static struct kmem_cache	*xfs_defer_pending_cache;
@@ -1176,6 +1177,10 @@ xfs_defer_init_item_caches(void)
	error = xfs_attr_intent_init_cache();
	if (error)
		goto err;
+	error = xfs_exchmaps_intent_init_cache();
+	if (error)
+		goto err;

	return 0;
err:
	xfs_defer_destroy_item_caches();
@@ -1186,6 +1191,7 @@ err:
void
xfs_defer_destroy_item_caches(void)
{
+	xfs_exchmaps_intent_destroy_cache();
	xfs_attr_intent_destroy_cache();
	xfs_extfree_intent_destroy_cache();
	xfs_bmap_intent_destroy_cache();


@@ -72,7 +72,7 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
extern const struct xfs_defer_op_type xfs_attr_defer_type;
+extern const struct xfs_defer_op_type xfs_exchmaps_defer_type;

/*
 * Deferred operation item relogging limits.


@@ -63,7 +63,8 @@
#define XFS_ERRTAG_ATTR_LEAF_TO_NODE			41
#define XFS_ERRTAG_WB_DELAY_MS				42
#define XFS_ERRTAG_WRITE_DELAY_MS			43
-#define XFS_ERRTAG_MAX					44
+#define XFS_ERRTAG_EXCHMAPS_FINISH_ONE			44
+#define XFS_ERRTAG_MAX					45

/*
 * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -111,5 +112,6 @@
#define XFS_RANDOM_ATTR_LEAF_TO_NODE			1
#define XFS_RANDOM_WB_DELAY_MS				3000
#define XFS_RANDOM_WRITE_DELAY_MS			3000
+#define XFS_RANDOM_EXCHMAPS_FINISH_ONE			1

#endif /* __XFS_ERRORTAG_H_ */

fs/xfs/libxfs/xfs_exchmaps.c: new file, 1237 lines (diff too large to display)


@@ -0,0 +1,123 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
 * Copyright (c) 2020-2024 Oracle.  All Rights Reserved.
 * Author: Darrick J. Wong <djwong@kernel.org>
 */
#ifndef __XFS_EXCHMAPS_H__
#define __XFS_EXCHMAPS_H__

/* In-core deferred operation info about a file mapping exchange request. */
struct xfs_exchmaps_intent {
	/* List of other incore deferred work. */
	struct list_head	xmi_list;

	/* Inodes participating in the operation. */
	struct xfs_inode	*xmi_ip1;
	struct xfs_inode	*xmi_ip2;

	/* File offset range information. */
	xfs_fileoff_t		xmi_startoff1;
	xfs_fileoff_t		xmi_startoff2;
	xfs_filblks_t		xmi_blockcount;

	/* Set these file sizes after the operation, unless negative. */
	xfs_fsize_t		xmi_isize1;
	xfs_fsize_t		xmi_isize2;

	uint64_t		xmi_flags;	/* XFS_EXCHMAPS_* flags */
};

/* Try to convert inode2 from block to short format at the end, if possible. */
#define __XFS_EXCHMAPS_INO2_SHORTFORM	(1ULL << 63)

#define XFS_EXCHMAPS_INTERNAL_FLAGS	(__XFS_EXCHMAPS_INO2_SHORTFORM)

/* flags that can be passed to xfs_exchmaps_{estimate,mappings} */
#define XFS_EXCHMAPS_PARAMS		(XFS_EXCHMAPS_ATTR_FORK | \
					 XFS_EXCHMAPS_SET_SIZES | \
					 XFS_EXCHMAPS_INO1_WRITTEN)

static inline int
xfs_exchmaps_whichfork(const struct xfs_exchmaps_intent *xmi)
{
	if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)
		return XFS_ATTR_FORK;
	return XFS_DATA_FORK;
}

/* Parameters for a mapping exchange request. */
struct xfs_exchmaps_req {
	/* Inodes participating in the operation. */
	struct xfs_inode	*ip1;
	struct xfs_inode	*ip2;

	/* File offset range information. */
	xfs_fileoff_t		startoff1;
	xfs_fileoff_t		startoff2;
	xfs_filblks_t		blockcount;

	/* XFS_EXCHMAPS_* operation flags */
	uint64_t		flags;

	/*
	 * Fields below this line are filled out by xfs_exchmaps_estimate;
	 * callers should initialize this part of the struct to zero.
	 */

	/*
	 * Data device blocks to be moved out of ip1, and free space needed to
	 * handle the bmbt changes.
	 */
	xfs_filblks_t		ip1_bcount;

	/*
	 * Data device blocks to be moved out of ip2, and free space needed to
	 * handle the bmbt changes.
	 */
	xfs_filblks_t		ip2_bcount;

	/* rt blocks to be moved out of ip1. */
	xfs_filblks_t		ip1_rtbcount;

	/* rt blocks to be moved out of ip2. */
	xfs_filblks_t		ip2_rtbcount;

	/* Free space needed to handle the bmbt changes */
	unsigned long long	resblks;

	/* Number of exchanges needed to complete the operation */
	unsigned long long	nr_exchanges;
};

static inline int
xfs_exchmaps_reqfork(const struct xfs_exchmaps_req *req)
{
	if (req->flags & XFS_EXCHMAPS_ATTR_FORK)
		return XFS_ATTR_FORK;
	return XFS_DATA_FORK;
}

int xfs_exchmaps_estimate(struct xfs_exchmaps_req *req);

extern struct kmem_cache	*xfs_exchmaps_intent_cache;
int __init xfs_exchmaps_intent_init_cache(void);
void xfs_exchmaps_intent_destroy_cache(void);

struct xfs_exchmaps_intent *xfs_exchmaps_init_intent(
		const struct xfs_exchmaps_req *req);
void xfs_exchmaps_ensure_reflink(struct xfs_trans *tp,
		const struct xfs_exchmaps_intent *xmi);
void xfs_exchmaps_upgrade_extent_counts(struct xfs_trans *tp,
		const struct xfs_exchmaps_intent *xmi);

int xfs_exchmaps_finish_one(struct xfs_trans *tp,
		struct xfs_exchmaps_intent *xmi);

int xfs_exchmaps_check_forks(struct xfs_mount *mp,
		const struct xfs_exchmaps_req *req);
void xfs_exchange_mappings(struct xfs_trans *tp,
		const struct xfs_exchmaps_req *req);

#endif /* __XFS_EXCHMAPS_H__ */


@@ -367,19 +367,21 @@ xfs_sb_has_ro_compat_feature(
return (sbp->sb_features_ro_compat & feature) != 0;
}
#define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */
#define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */
#define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */
#define XFS_SB_FEAT_INCOMPAT_BIGTIME (1 << 3) /* large timestamps */
#define XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR (1 << 4) /* needs xfs_repair */
#define XFS_SB_FEAT_INCOMPAT_NREXT64 (1 << 5) /* large extent counters */
+#define XFS_SB_FEAT_INCOMPAT_EXCHRANGE (1 << 6) /* exchangerange supported */
#define XFS_SB_FEAT_INCOMPAT_ALL \
-(XFS_SB_FEAT_INCOMPAT_FTYPE| \
-XFS_SB_FEAT_INCOMPAT_SPINODES| \
-XFS_SB_FEAT_INCOMPAT_META_UUID| \
-XFS_SB_FEAT_INCOMPAT_BIGTIME| \
-XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR| \
-XFS_SB_FEAT_INCOMPAT_NREXT64)
+(XFS_SB_FEAT_INCOMPAT_FTYPE | \
+XFS_SB_FEAT_INCOMPAT_SPINODES | \
+XFS_SB_FEAT_INCOMPAT_META_UUID | \
+XFS_SB_FEAT_INCOMPAT_BIGTIME | \
+XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR | \
+XFS_SB_FEAT_INCOMPAT_NREXT64 | \
+XFS_SB_FEAT_INCOMPAT_EXCHRANGE)
#define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL
static inline bool


@@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks {
#define XFS_FSOP_GEOM_FLAGS_BIGTIME (1 << 21) /* 64-bit nsec timestamps */
#define XFS_FSOP_GEOM_FLAGS_INOBTCNT (1 << 22) /* inobt btree counter */
#define XFS_FSOP_GEOM_FLAGS_NREXT64 (1 << 23) /* large extent counters */
#define XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE (1 << 24) /* exchange range */
/*
* Minimum and maximum sizes needed for growth checks.
@@ -772,6 +773,46 @@ struct xfs_scrub_metadata {
# define XFS_XATTR_LIST_MAX 65536
#endif
/*
* Exchange part of file1 with part of the file that this ioctl is being
* called against (which we'll call file2). Filesystems must be able to
* restart and complete the operation even after the system goes down.
*/
struct xfs_exchange_range {
__s32 file1_fd;
__u32 pad; /* must be zeroes */
__u64 file1_offset; /* file1 offset, bytes */
__u64 file2_offset; /* file2 offset, bytes */
__u64 length; /* bytes to exchange */
__u64 flags; /* see XFS_EXCHANGE_RANGE_* below */
};
/*
* Exchange file data all the way to the ends of both files, and then exchange
* the file sizes. This flag can be used to replace a file's contents with a
* different amount of data. length will be ignored.
*/
#define XFS_EXCHANGE_RANGE_TO_EOF (1ULL << 0)
/* Flush all changes in file data and file metadata to disk before returning. */
#define XFS_EXCHANGE_RANGE_DSYNC (1ULL << 1)
/* Dry run; do all the parameter verification but do not change anything. */
#define XFS_EXCHANGE_RANGE_DRY_RUN (1ULL << 2)
/*
* Exchange only the parts of the two files where the file allocation units
* mapped to file1's range have been written to. This can accelerate
* scatter-gather atomic writes with a temp file if all writes are aligned to
* the file allocation unit.
*/
#define XFS_EXCHANGE_RANGE_FILE1_WRITTEN (1ULL << 3)
#define XFS_EXCHANGE_RANGE_ALL_FLAGS (XFS_EXCHANGE_RANGE_TO_EOF | \
XFS_EXCHANGE_RANGE_DSYNC | \
XFS_EXCHANGE_RANGE_DRY_RUN | \
XFS_EXCHANGE_RANGE_FILE1_WRITTEN)
/*
* ioctl commands that are used by Linux filesystems
@@ -843,6 +884,7 @@ struct xfs_scrub_metadata {
#define XFS_IOC_FSGEOMETRY _IOR ('X', 126, struct xfs_fsop_geom)
#define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req)
#define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req)
#define XFS_IOC_EXCHANGE_RANGE _IOWR('X', 129, struct xfs_exchange_range)
/* XFS_IOC_GETFSUUID ---------- deprecated 140 */
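
As a quick illustration of the userspace API defined above (an editor's sketch, not part of the patch set): the program below atomically replaces the contents of a file with fully staged contents from a second file. It assumes the xfsprogs copy of <xfs/xfs_fs.h> for the structure and flag definitions, and it abbreviates error handling.

    /* Editor's sketch: atomically swap staged contents into "data.db". */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <xfs/xfs_fs.h>

    int
    main(void)
    {
            struct xfs_exchange_range args;
            int fd1, fd2;

            /* file1 holds the staged update; file2 is what readers see. */
            fd1 = open("data.db.staged", O_RDWR);
            fd2 = open("data.db", O_RDWR);
            if (fd1 < 0 || fd2 < 0) {
                    perror("open");
                    exit(1);
            }

            /*
             * Exchange both files in their entirety and swap the file
             * sizes; TO_EOF makes the kernel ignore length.  DSYNC
             * persists the update before the ioctl returns.
             */
            memset(&args, 0, sizeof(args));
            args.file1_fd = fd1;
            args.flags = XFS_EXCHANGE_RANGE_TO_EOF | XFS_EXCHANGE_RANGE_DSYNC;

            if (ioctl(fd2, XFS_IOC_EXCHANGE_RANGE, &args) < 0) {
                    perror("XFS_IOC_EXCHANGE_RANGE");
                    exit(1);
            }
            return 0;
    }

On success, readers of data.db see the staged contents in their entirety; the previous contents land in data.db.staged.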


@@ -117,8 +117,9 @@ struct xfs_unmount_log_format {
#define XLOG_REG_TYPE_ATTRD_FORMAT 28
#define XLOG_REG_TYPE_ATTR_NAME 29
#define XLOG_REG_TYPE_ATTR_VALUE 30
-#define XLOG_REG_TYPE_MAX 30
+#define XLOG_REG_TYPE_XMI_FORMAT 31
+#define XLOG_REG_TYPE_XMD_FORMAT 32
+#define XLOG_REG_TYPE_MAX 32
/*
* Flags to log operation header
@@ -243,6 +244,8 @@ typedef struct xfs_trans_header {
#define XFS_LI_BUD 0x1245
#define XFS_LI_ATTRI 0x1246 /* attr set/remove intent*/
#define XFS_LI_ATTRD 0x1247 /* attr set/remove done */
#define XFS_LI_XMI 0x1248 /* mapping exchange intent */
#define XFS_LI_XMD 0x1249 /* mapping exchange done */
#define XFS_LI_TYPE_DESC \
{ XFS_LI_EFI, "XFS_LI_EFI" }, \
@@ -260,7 +263,9 @@ typedef struct xfs_trans_header {
{ XFS_LI_BUI, "XFS_LI_BUI" }, \
{ XFS_LI_BUD, "XFS_LI_BUD" }, \
{ XFS_LI_ATTRI, "XFS_LI_ATTRI" }, \
-{ XFS_LI_ATTRD, "XFS_LI_ATTRD" }
+{ XFS_LI_ATTRD, "XFS_LI_ATTRD" }, \
+{ XFS_LI_XMI, "XFS_LI_XMI" }, \
+{ XFS_LI_XMD, "XFS_LI_XMD" }
/*
* Inode Log Item Format definitions.
@@ -878,6 +883,61 @@ struct xfs_bud_log_format {
uint64_t bud_bui_id; /* id of corresponding bui */
};
/*
* XMI/XMD (file mapping exchange) log format definitions
*/
/* This is the structure used to lay out a mapping exchange log item. */
struct xfs_xmi_log_format {
uint16_t xmi_type; /* xmi log item type */
uint16_t xmi_size; /* size of this item */
uint32_t __pad; /* must be zero */
uint64_t xmi_id; /* xmi identifier */
uint64_t xmi_inode1; /* inumber of first file */
uint64_t xmi_inode2; /* inumber of second file */
uint32_t xmi_igen1; /* generation of first file */
uint32_t xmi_igen2; /* generation of second file */
uint64_t xmi_startoff1; /* block offset into file1 */
uint64_t xmi_startoff2; /* block offset into file2 */
uint64_t xmi_blockcount; /* number of blocks */
uint64_t xmi_flags; /* XFS_EXCHMAPS_* */
uint64_t xmi_isize1; /* intended file1 size */
uint64_t xmi_isize2; /* intended file2 size */
};
/* Exchange mappings between extended attribute forks instead of data forks. */
#define XFS_EXCHMAPS_ATTR_FORK (1ULL << 0)
/* Set the file sizes when finished. */
#define XFS_EXCHMAPS_SET_SIZES (1ULL << 1)
/*
* Exchange the mappings of the two files only if the file allocation units
* mapped to file1's range have been written.
*/
#define XFS_EXCHMAPS_INO1_WRITTEN (1ULL << 2)
/* Clear the reflink flag from inode1 after the operation. */
#define XFS_EXCHMAPS_CLEAR_INO1_REFLINK (1ULL << 3)
/* Clear the reflink flag from inode2 after the operation. */
#define XFS_EXCHMAPS_CLEAR_INO2_REFLINK (1ULL << 4)
#define XFS_EXCHMAPS_LOGGED_FLAGS (XFS_EXCHMAPS_ATTR_FORK | \
XFS_EXCHMAPS_SET_SIZES | \
XFS_EXCHMAPS_INO1_WRITTEN | \
XFS_EXCHMAPS_CLEAR_INO1_REFLINK | \
XFS_EXCHMAPS_CLEAR_INO2_REFLINK)
/* This is the structure used to lay out a mapping exchange done log item. */
struct xfs_xmd_log_format {
uint16_t xmd_type; /* xmd log item type */
uint16_t xmd_size; /* size of this item */
uint32_t __pad;
uint64_t xmd_xmi_id; /* id of corresponding xmi */
};
/*
* Dquot Log format definitions.
*


@@ -75,6 +75,8 @@ extern const struct xlog_recover_item_ops xlog_cui_item_ops;
extern const struct xlog_recover_item_ops xlog_cud_item_ops;
extern const struct xlog_recover_item_ops xlog_attri_item_ops;
extern const struct xlog_recover_item_ops xlog_attrd_item_ops;
extern const struct xlog_recover_item_ops xlog_xmi_item_ops;
extern const struct xlog_recover_item_ops xlog_xmd_item_ops;
/*
* Macros, structures, prototypes for internal log manager use.
@@ -121,6 +123,8 @@ bool xlog_is_buffer_cancelled(struct xlog *log, xfs_daddr_t blkno, uint len);
int xlog_recover_iget(struct xfs_mount *mp, xfs_ino_t ino,
struct xfs_inode **ipp);
int xlog_recover_iget_handle(struct xfs_mount *mp, xfs_ino_t ino, uint32_t gen,
struct xfs_inode **ipp);
void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
uint64_t intent_id);
int xlog_alloc_buf_cancel_table(struct xlog *log);


@@ -26,6 +26,7 @@
#include "xfs_health.h"
#include "xfs_ag.h"
#include "xfs_rtbitmap.h"
#include "xfs_exchrange.h"
/*
* Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -175,6 +176,8 @@ xfs_sb_version_to_features(
features |= XFS_FEAT_NEEDSREPAIR;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_NREXT64)
features |= XFS_FEAT_NREXT64;
if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_EXCHRANGE)
features |= XFS_FEAT_EXCHANGE_RANGE;
return features;
}
@@ -1259,6 +1262,8 @@ xfs_fs_geometry(
}
if (xfs_has_large_extent_counts(mp))
geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
if (xfs_has_exchange_range(mp))
geo->flags |= XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE;
geo->rtsectsize = sbp->sb_blocksize;
geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);


@@ -380,3 +380,50 @@ xfs_symlink_write_target(
ASSERT(pathlen == 0);
return 0;
}
/* Remove all the blocks from a symlink and invalidate buffers. */
int
xfs_symlink_remote_truncate(
struct xfs_trans *tp,
struct xfs_inode *ip)
{
struct xfs_bmbt_irec mval[XFS_SYMLINK_MAPS];
struct xfs_mount *mp = tp->t_mountp;
struct xfs_buf *bp;
int nmaps = XFS_SYMLINK_MAPS;
int done = 0;
int i;
int error;
/* Read mappings and invalidate buffers. */
error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0);
if (error)
return error;
for (i = 0; i < nmaps; i++) {
if (!xfs_bmap_is_real_extent(&mval[i]))
break;
error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
&bp);
if (error)
return error;
xfs_trans_binval(tp, bp);
}
/* Unmap the remote blocks. */
error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done);
if (error)
return error;
if (!done) {
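/* Debug kernels trip the assert; production marks the inode sick and bails. */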
ASSERT(done);
xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
return -EFSCORRUPTED;
}
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
return 0;
}


@@ -22,5 +22,6 @@ int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
uint resblks);
int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip);
#endif /* __XFS_SYMLINK_REMOTE_H */


@@ -10,6 +10,10 @@
* Components of space reservations.
*/
/* Worst case number of bmaps that can be held in a block. */
#define XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp) \
(((mp)->m_bmap_dmxr[0]) - ((mp)->m_bmap_dmnr[0]))
/* Worst case number of rmaps that can be held in a block. */
#define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) \
(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))


@@ -62,6 +62,7 @@ static unsigned int xfs_errortag_random_default[] = {
XFS_RANDOM_ATTR_LEAF_TO_NODE,
XFS_RANDOM_WB_DELAY_MS,
XFS_RANDOM_WRITE_DELAY_MS,
XFS_RANDOM_EXCHMAPS_FINISH_ONE,
};
struct xfs_errortag_attr {
@@ -179,6 +180,7 @@ XFS_ERRORTAG_ATTR_RW(da_leaf_split, XFS_ERRTAG_DA_LEAF_SPLIT);
XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node, XFS_ERRTAG_ATTR_LEAF_TO_NODE);
XFS_ERRORTAG_ATTR_RW(wb_delay_ms, XFS_ERRTAG_WB_DELAY_MS);
XFS_ERRORTAG_ATTR_RW(write_delay_ms, XFS_ERRTAG_WRITE_DELAY_MS);
XFS_ERRORTAG_ATTR_RW(exchmaps_finish_one, XFS_ERRTAG_EXCHMAPS_FINISH_ONE);
static struct attribute *xfs_errortag_attrs[] = {
XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -224,6 +226,7 @@ static struct attribute *xfs_errortag_attrs[] = {
XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
XFS_ERRORTAG_ATTR_LIST(exchmaps_finish_one),
NULL,
};
ATTRIBUTE_GROUPS(xfs_errortag);
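
On a CONFIG_XFS_DEBUG kernel these knobs surface under /sys/fs/xfs/<device>/errortag/. A sketch (editor's addition) of how a test might arm the new one; the device name and the exact failure mode are assumptions, and xfstests would derive the name from the scratch device:

    #include <stdio.h>

    /* Arm the injection knob; random factor 1 trips the error on every call. */
    static int
    inject_exchmaps_error(const char *disk /* e.g. "sdb1" */)
    {
            char path[256];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/fs/xfs/%s/errortag/exchmaps_finish_one", disk);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "1\n");
            return fclose(f);
    }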

fs/xfs/xfs_exchmaps_item.c: new file (614 lines)

@@ -0,0 +1,614 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#include "xfs.h"
#include "xfs_fs.h"
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
#include "xfs_bit.h"
#include "xfs_shared.h"
#include "xfs_mount.h"
#include "xfs_defer.h"
#include "xfs_inode.h"
#include "xfs_trans.h"
#include "xfs_trans_priv.h"
#include "xfs_exchmaps_item.h"
#include "xfs_exchmaps.h"
#include "xfs_log.h"
#include "xfs_bmap.h"
#include "xfs_icache.h"
#include "xfs_bmap_btree.h"
#include "xfs_trans_space.h"
#include "xfs_error.h"
#include "xfs_log_priv.h"
#include "xfs_log_recover.h"
#include "xfs_exchrange.h"
#include "xfs_trace.h"
struct kmem_cache *xfs_xmi_cache;
struct kmem_cache *xfs_xmd_cache;
static const struct xfs_item_ops xfs_xmi_item_ops;
static inline struct xfs_xmi_log_item *XMI_ITEM(struct xfs_log_item *lip)
{
return container_of(lip, struct xfs_xmi_log_item, xmi_item);
}
STATIC void
xfs_xmi_item_free(
struct xfs_xmi_log_item *xmi_lip)
{
kvfree(xmi_lip->xmi_item.li_lv_shadow);
kmem_cache_free(xfs_xmi_cache, xmi_lip);
}
/*
* Freeing the XMI requires that we remove it from the AIL if it has already
* been placed there. However, the XMI may not yet have been placed in the AIL
* when called by xfs_xmi_release() from XMD processing due to the ordering of
* committed vs unpin operations in bulk insert operations. Hence the reference
* count to ensure only the last caller frees the XMI.
*/
STATIC void
xfs_xmi_release(
struct xfs_xmi_log_item *xmi_lip)
{
ASSERT(atomic_read(&xmi_lip->xmi_refcount) > 0);
if (atomic_dec_and_test(&xmi_lip->xmi_refcount)) {
xfs_trans_ail_delete(&xmi_lip->xmi_item, 0);
xfs_xmi_item_free(xmi_lip);
}
}
STATIC void
xfs_xmi_item_size(
struct xfs_log_item *lip,
int *nvecs,
int *nbytes)
{
*nvecs += 1;
*nbytes += sizeof(struct xfs_xmi_log_format);
}
/*
* This is called to fill in the vector of log iovecs for the given xmi log
* item. We use only 1 iovec, and we point that at the xmi_log_format structure
* embedded in the xmi item.
*/
STATIC void
xfs_xmi_item_format(
struct xfs_log_item *lip,
struct xfs_log_vec *lv)
{
struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip);
struct xfs_log_iovec *vecp = NULL;
xmi_lip->xmi_format.xmi_type = XFS_LI_XMI;
xmi_lip->xmi_format.xmi_size = 1;
xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_XMI_FORMAT,
&xmi_lip->xmi_format,
sizeof(struct xfs_xmi_log_format));
}
/*
* The unpin operation is the last place an XMI is manipulated in the log. It
* is either inserted in the AIL or aborted in the event of a log I/O error. In
* either case, the XMI transaction has been successfully committed to make it
* this far. Therefore, we expect whoever committed the XMI to either construct
* and commit the XMD or drop the XMD's reference in the event of error. Simply
* drop the log's XMI reference now that the log is done with it.
*/
STATIC void
xfs_xmi_item_unpin(
struct xfs_log_item *lip,
int remove)
{
struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip);
xfs_xmi_release(xmi_lip);
}
/*
* The XMI has been either committed or aborted if the transaction has been
* cancelled. If the transaction was cancelled, an XMD isn't going to be
* constructed and thus we free the XMI here directly.
*/
STATIC void
xfs_xmi_item_release(
struct xfs_log_item *lip)
{
xfs_xmi_release(XMI_ITEM(lip));
}
/* Allocate and initialize an xmi item. */
STATIC struct xfs_xmi_log_item *
xfs_xmi_init(
struct xfs_mount *mp)
{
struct xfs_xmi_log_item *xmi_lip;
xmi_lip = kmem_cache_zalloc(xfs_xmi_cache, GFP_KERNEL | __GFP_NOFAIL);
xfs_log_item_init(mp, &xmi_lip->xmi_item, XFS_LI_XMI, &xfs_xmi_item_ops);
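/* The item's kernel address doubles as the log id that later ties each XMD back to this XMI. */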
xmi_lip->xmi_format.xmi_id = (uintptr_t)(void *)xmi_lip;
atomic_set(&xmi_lip->xmi_refcount, 2);
return xmi_lip;
}
static inline struct xfs_xmd_log_item *XMD_ITEM(struct xfs_log_item *lip)
{
return container_of(lip, struct xfs_xmd_log_item, xmd_item);
}
STATIC void
xfs_xmd_item_size(
struct xfs_log_item *lip,
int *nvecs,
int *nbytes)
{
*nvecs += 1;
*nbytes += sizeof(struct xfs_xmd_log_format);
}
/*
* This is called to fill in the vector of log iovecs for the given xmd log
* item. We use only 1 iovec, and we point that at the xmd_log_format structure
* embedded in the xmd item.
*/
STATIC void
xfs_xmd_item_format(
struct xfs_log_item *lip,
struct xfs_log_vec *lv)
{
struct xfs_xmd_log_item *xmd_lip = XMD_ITEM(lip);
struct xfs_log_iovec *vecp = NULL;
xmd_lip->xmd_format.xmd_type = XFS_LI_XMD;
xmd_lip->xmd_format.xmd_size = 1;
xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_XMD_FORMAT, &xmd_lip->xmd_format,
sizeof(struct xfs_xmd_log_format));
}
/*
* The XMD is either committed or aborted if the transaction is cancelled. If
* the transaction is cancelled, drop our reference to the XMI and free the
* XMD.
*/
STATIC void
xfs_xmd_item_release(
struct xfs_log_item *lip)
{
struct xfs_xmd_log_item *xmd_lip = XMD_ITEM(lip);
xfs_xmi_release(xmd_lip->xmd_intent_log_item);
kvfree(xmd_lip->xmd_item.li_lv_shadow);
kmem_cache_free(xfs_xmd_cache, xmd_lip);
}
static struct xfs_log_item *
xfs_xmd_item_intent(
struct xfs_log_item *lip)
{
return &XMD_ITEM(lip)->xmd_intent_log_item->xmi_item;
}
static const struct xfs_item_ops xfs_xmd_item_ops = {
.flags = XFS_ITEM_RELEASE_WHEN_COMMITTED |
XFS_ITEM_INTENT_DONE,
.iop_size = xfs_xmd_item_size,
.iop_format = xfs_xmd_item_format,
.iop_release = xfs_xmd_item_release,
.iop_intent = xfs_xmd_item_intent,
};
/* Log file mapping exchange information in the intent item. */
STATIC struct xfs_log_item *
xfs_exchmaps_create_intent(
struct xfs_trans *tp,
struct list_head *items,
unsigned int count,
bool sort)
{
struct xfs_xmi_log_item *xmi_lip;
struct xfs_exchmaps_intent *xmi;
struct xfs_xmi_log_format *xlf;
ASSERT(count == 1);
xmi = list_first_entry_or_null(items, struct xfs_exchmaps_intent,
xmi_list);
xmi_lip = xfs_xmi_init(tp->t_mountp);
xlf = &xmi_lip->xmi_format;
xlf->xmi_inode1 = xmi->xmi_ip1->i_ino;
xlf->xmi_igen1 = VFS_I(xmi->xmi_ip1)->i_generation;
xlf->xmi_inode2 = xmi->xmi_ip2->i_ino;
xlf->xmi_igen2 = VFS_I(xmi->xmi_ip2)->i_generation;
xlf->xmi_startoff1 = xmi->xmi_startoff1;
xlf->xmi_startoff2 = xmi->xmi_startoff2;
xlf->xmi_blockcount = xmi->xmi_blockcount;
xlf->xmi_isize1 = xmi->xmi_isize1;
xlf->xmi_isize2 = xmi->xmi_isize2;
xlf->xmi_flags = xmi->xmi_flags & XFS_EXCHMAPS_LOGGED_FLAGS;
return &xmi_lip->xmi_item;
}
STATIC struct xfs_log_item *
xfs_exchmaps_create_done(
struct xfs_trans *tp,
struct xfs_log_item *intent,
unsigned int count)
{
struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(intent);
struct xfs_xmd_log_item *xmd_lip;
xmd_lip = kmem_cache_zalloc(xfs_xmd_cache, GFP_KERNEL | __GFP_NOFAIL);
xfs_log_item_init(tp->t_mountp, &xmd_lip->xmd_item, XFS_LI_XMD,
&xfs_xmd_item_ops);
xmd_lip->xmd_intent_log_item = xmi_lip;
xmd_lip->xmd_format.xmd_xmi_id = xmi_lip->xmi_format.xmi_id;
return &xmd_lip->xmd_item;
}
/* Add this deferred XMI to the transaction. */
void
xfs_exchmaps_defer_add(
struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi)
{
trace_xfs_exchmaps_defer(tp->t_mountp, xmi);
xfs_defer_add(tp, &xmi->xmi_list, &xfs_exchmaps_defer_type);
}
static inline struct xfs_exchmaps_intent *xmi_entry(const struct list_head *e)
{
return list_entry(e, struct xfs_exchmaps_intent, xmi_list);
}
/* Cancel a deferred file mapping exchange. */
STATIC void
xfs_exchmaps_cancel_item(
struct list_head *item)
{
struct xfs_exchmaps_intent *xmi = xmi_entry(item);
kmem_cache_free(xfs_exchmaps_intent_cache, xmi);
}
/* Process a deferred file mapping exchange. */
STATIC int
xfs_exchmaps_finish_item(
struct xfs_trans *tp,
struct xfs_log_item *done,
struct list_head *item,
struct xfs_btree_cur **state)
{
struct xfs_exchmaps_intent *xmi = xmi_entry(item);
int error;
/*
* Exchange one more mapping between the two files. If there's still more
* work to do, we want to requeue ourselves after all other pending
* deferred operations have finished. This includes all of the dfops
* that we queued directly as well as any new ones created in the
* process of finishing the others. Doing so prevents us from queuing
* a large number of XMI log items in kernel memory, which in turn
* prevents us from pinning the tail of the log (while logging those
* new XMI items) until the first XMI items can be processed.
*/
error = xfs_exchmaps_finish_one(tp, xmi);
if (error != -EAGAIN)
xfs_exchmaps_cancel_item(item);
return error;
}
/* Abort all pending XMIs. */
STATIC void
xfs_exchmaps_abort_intent(
struct xfs_log_item *intent)
{
xfs_xmi_release(XMI_ITEM(intent));
}
/* Is this recovered XMI ok? */
static inline bool
xfs_xmi_validate(
struct xfs_mount *mp,
struct xfs_xmi_log_item *xmi_lip)
{
struct xfs_xmi_log_format *xlf = &xmi_lip->xmi_format;
if (!xfs_has_exchange_range(mp))
return false;
if (xmi_lip->xmi_format.__pad != 0)
return false;
if (xlf->xmi_flags & ~XFS_EXCHMAPS_LOGGED_FLAGS)
return false;
if (!xfs_verify_ino(mp, xlf->xmi_inode1) ||
!xfs_verify_ino(mp, xlf->xmi_inode2))
return false;
if (!xfs_verify_fileext(mp, xlf->xmi_startoff1, xlf->xmi_blockcount))
return false;
return xfs_verify_fileext(mp, xlf->xmi_startoff2, xlf->xmi_blockcount);
}
/*
* Use the recovered log state to create a new request, estimate resource
* requirements, and create a new incore intent state.
*/
STATIC struct xfs_exchmaps_intent *
xfs_xmi_item_recover_intent(
struct xfs_mount *mp,
struct xfs_defer_pending *dfp,
const struct xfs_xmi_log_format *xlf,
struct xfs_exchmaps_req *req,
struct xfs_inode **ipp1,
struct xfs_inode **ipp2)
{
struct xfs_inode *ip1, *ip2;
struct xfs_exchmaps_intent *xmi;
int error;
/*
* Grab both inodes and set IRECOVERY to prevent trimming of post-eof
* mappings and freeing of unlinked inodes until we're totally done
* processing files. The ondisk format of this new log item contains
* file handle information, which is why recovery for this item checks
* the inode generation numbers; recovery for older intent items cannot.
*/
error = xlog_recover_iget_handle(mp, xlf->xmi_inode1, xlf->xmi_igen1,
&ip1);
if (error) {
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, xlf,
sizeof(*xlf));
return ERR_PTR(error);
}
error = xlog_recover_iget_handle(mp, xlf->xmi_inode2, xlf->xmi_igen2,
&ip2);
if (error) {
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, xlf,
sizeof(*xlf));
goto err_rele1;
}
req->ip1 = ip1;
req->ip2 = ip2;
req->startoff1 = xlf->xmi_startoff1;
req->startoff2 = xlf->xmi_startoff2;
req->blockcount = xlf->xmi_blockcount;
req->flags = xlf->xmi_flags & XFS_EXCHMAPS_PARAMS;
xfs_exchrange_ilock(NULL, ip1, ip2);
error = xfs_exchmaps_estimate(req);
xfs_exchrange_iunlock(ip1, ip2);
if (error)
goto err_rele2;
*ipp1 = ip1;
*ipp2 = ip2;
xmi = xfs_exchmaps_init_intent(req);
xfs_defer_add_item(dfp, &xmi->xmi_list);
return xmi;
err_rele2:
xfs_irele(ip2);
err_rele1:
xfs_irele(ip1);
req->ip2 = req->ip1 = NULL;
return ERR_PTR(error);
}
/* Process a file mapping exchange item that was recovered from the log. */
STATIC int
xfs_exchmaps_recover_work(
struct xfs_defer_pending *dfp,
struct list_head *capture_list)
{
struct xfs_exchmaps_req req = { .flags = 0 };
struct xfs_trans_res resv;
struct xfs_exchmaps_intent *xmi;
struct xfs_log_item *lip = dfp->dfp_intent;
struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip);
struct xfs_mount *mp = lip->li_log->l_mp;
struct xfs_trans *tp;
struct xfs_inode *ip1, *ip2;
int error = 0;
if (!xfs_xmi_validate(mp, xmi_lip)) {
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
&xmi_lip->xmi_format,
sizeof(xmi_lip->xmi_format));
return -EFSCORRUPTED;
}
xmi = xfs_xmi_item_recover_intent(mp, dfp, &xmi_lip->xmi_format, &req,
&ip1, &ip2);
if (IS_ERR(xmi))
return PTR_ERR(xmi);
trace_xfs_exchmaps_recover(mp, xmi);
resv = xlog_recover_resv(&M_RES(mp)->tr_write);
error = xfs_trans_alloc(mp, &resv, req.resblks, 0, 0, &tp);
if (error)
goto err_rele;
xfs_exchrange_ilock(tp, ip1, ip2);
xfs_exchmaps_ensure_reflink(tp, xmi);
xfs_exchmaps_upgrade_extent_counts(tp, xmi);
error = xlog_recover_finish_intent(tp, dfp);
if (error == -EFSCORRUPTED)
XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
&xmi_lip->xmi_format,
sizeof(xmi_lip->xmi_format));
if (error)
goto err_cancel;
/*
* Commit transaction, which frees the transaction and saves the inodes
* for later replay activities.
*/
error = xfs_defer_ops_capture_and_commit(tp, capture_list);
goto err_unlock;
err_cancel:
xfs_trans_cancel(tp);
err_unlock:
xfs_exchrange_iunlock(ip1, ip2);
err_rele:
xfs_irele(ip2);
xfs_irele(ip1);
return error;
}
/* Relog an intent item to push the log tail forward. */
static struct xfs_log_item *
xfs_exchmaps_relog_intent(
struct xfs_trans *tp,
struct xfs_log_item *intent,
struct xfs_log_item *done_item)
{
struct xfs_xmi_log_item *xmi_lip;
struct xfs_xmi_log_format *old_xlf, *new_xlf;
old_xlf = &XMI_ITEM(intent)->xmi_format;
xmi_lip = xfs_xmi_init(tp->t_mountp);
new_xlf = &xmi_lip->xmi_format;
new_xlf->xmi_inode1 = old_xlf->xmi_inode1;
new_xlf->xmi_inode2 = old_xlf->xmi_inode2;
new_xlf->xmi_igen1 = old_xlf->xmi_igen1;
new_xlf->xmi_igen2 = old_xlf->xmi_igen2;
new_xlf->xmi_startoff1 = old_xlf->xmi_startoff1;
new_xlf->xmi_startoff2 = old_xlf->xmi_startoff2;
new_xlf->xmi_blockcount = old_xlf->xmi_blockcount;
new_xlf->xmi_flags = old_xlf->xmi_flags;
new_xlf->xmi_isize1 = old_xlf->xmi_isize1;
new_xlf->xmi_isize2 = old_xlf->xmi_isize2;
return &xmi_lip->xmi_item;
}
const struct xfs_defer_op_type xfs_exchmaps_defer_type = {
.name = "exchmaps",
.max_items = 1,
.create_intent = xfs_exchmaps_create_intent,
.abort_intent = xfs_exchmaps_abort_intent,
.create_done = xfs_exchmaps_create_done,
.finish_item = xfs_exchmaps_finish_item,
.cancel_item = xfs_exchmaps_cancel_item,
.recover_work = xfs_exchmaps_recover_work,
.relog_intent = xfs_exchmaps_relog_intent,
};
STATIC bool
xfs_xmi_item_match(
struct xfs_log_item *lip,
uint64_t intent_id)
{
return XMI_ITEM(lip)->xmi_format.xmi_id == intent_id;
}
static const struct xfs_item_ops xfs_xmi_item_ops = {
.flags = XFS_ITEM_INTENT,
.iop_size = xfs_xmi_item_size,
.iop_format = xfs_xmi_item_format,
.iop_unpin = xfs_xmi_item_unpin,
.iop_release = xfs_xmi_item_release,
.iop_match = xfs_xmi_item_match,
};
/*
* This routine is called to create an in-core file mapping exchange item from
* the xmi format structure which was logged on disk. It allocates an in-core
* xmi, copies the exchange information from the format structure into it, and
* adds the xmi to the AIL with the given LSN.
*/
STATIC int
xlog_recover_xmi_commit_pass2(
struct xlog *log,
struct list_head *buffer_list,
struct xlog_recover_item *item,
xfs_lsn_t lsn)
{
struct xfs_mount *mp = log->l_mp;
struct xfs_xmi_log_item *xmi_lip;
struct xfs_xmi_log_format *xmi_formatp;
size_t len;
len = sizeof(struct xfs_xmi_log_format);
if (item->ri_buf[0].i_len != len) {
XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
return -EFSCORRUPTED;
}
xmi_formatp = item->ri_buf[0].i_addr;
if (xmi_formatp->__pad != 0) {
XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
return -EFSCORRUPTED;
}
xmi_lip = xfs_xmi_init(mp);
memcpy(&xmi_lip->xmi_format, xmi_formatp, len);
xlog_recover_intent_item(log, &xmi_lip->xmi_item, lsn,
&xfs_exchmaps_defer_type);
return 0;
}
const struct xlog_recover_item_ops xlog_xmi_item_ops = {
.item_type = XFS_LI_XMI,
.commit_pass2 = xlog_recover_xmi_commit_pass2,
};
/*
* This routine is called when an XMD format structure is found in a committed
* transaction in the log. Its purpose is to cancel the corresponding XMI if it
* was still in the log. To do this it searches the AIL for the XMI with an id
* equal to that in the XMD format structure. If we find it we drop the XMD
* reference, which removes the XMI from the AIL and frees it.
*/
STATIC int
xlog_recover_xmd_commit_pass2(
struct xlog *log,
struct list_head *buffer_list,
struct xlog_recover_item *item,
xfs_lsn_t lsn)
{
struct xfs_xmd_log_format *xmd_formatp;
xmd_formatp = item->ri_buf[0].i_addr;
if (item->ri_buf[0].i_len != sizeof(struct xfs_xmd_log_format)) {
XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
return -EFSCORRUPTED;
}
xlog_recover_release_intent(log, XFS_LI_XMI, xmd_formatp->xmd_xmi_id);
return 0;
}
const struct xlog_recover_item_ops xlog_xmd_item_ops = {
.item_type = XFS_LI_XMD,
.commit_pass2 = xlog_recover_xmd_commit_pass2,
};


fs/xfs/xfs_exchmaps_item.h: new file (64 lines)
@@ -0,0 +1,64 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#ifndef __XFS_EXCHMAPS_ITEM_H__
#define __XFS_EXCHMAPS_ITEM_H__
/*
* The file mapping exchange intent item helps us exchange multiple file
* mappings between two inode forks. It does this by tracking the range of
* file block offsets that still need to be exchanged, and relogs as progress
* happens.
*
* *I items should be recorded in the *first* of a series of rolled
* transactions, and the *D items should be recorded in the same transaction
* that records the associated bmbt updates.
*
* Should the system crash after the commit of the first transaction but
* before the commit of the final transaction in a series, log recovery will
* use the redo information recorded by the intent items to replay the
* rest of the mapping exchanges.
*/
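/*
 * Editor's sketch of that timeline (not part of the patch):
 *
 *   tx 1: log XMI (intent to exchange N mappings)
 *   tx 2: exchange some mappings, log XMD for the old XMI,
 *         log a fresh XMI for the remaining range, roll
 *   ...
 *   tx n: exchange the last mappings, log the final XMD, commit
 *
 * After a crash, recovery finds the newest XMI without a matching XMD
 * and replays the remaining exchanges from that point.
 */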
/* kernel only XMI/XMD definitions */
struct xfs_mount;
struct kmem_cache;
/*
* This is the incore file mapping exchange intent log item. It is used to log
* the fact that we are exchanging mappings between two files. It is used in
* conjunction with the incore file mapping exchange done log item described
* below.
*
* These log items follow the same rules as struct xfs_efi_log_item; see the
* comments about that structure (in xfs_extfree_item.h) for more details.
*/
struct xfs_xmi_log_item {
struct xfs_log_item xmi_item;
atomic_t xmi_refcount;
struct xfs_xmi_log_format xmi_format;
};
/*
* This is the incore file mapping exchange done log item. It is used to log
* the fact that an exchange mentioned in an earlier xmi item has been
* performed.
*/
struct xfs_xmd_log_item {
struct xfs_log_item xmd_item;
struct xfs_xmi_log_item *xmd_intent_log_item;
struct xfs_xmd_log_format xmd_format;
};
extern struct kmem_cache *xfs_xmi_cache;
extern struct kmem_cache *xfs_xmd_cache;
struct xfs_exchmaps_intent;
void xfs_exchmaps_defer_add(struct xfs_trans *tp,
struct xfs_exchmaps_intent *xmi);
#endif /* __XFS_EXCHMAPS_ITEM_H__ */

fs/xfs/xfs_exchrange.c: new file (804 lines)

@@ -0,0 +1,804 @@
// SPDX-License-Identifier: GPL-2.0-or-later
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#include "xfs.h"
#include "xfs_shared.h"
#include "xfs_format.h"
#include "xfs_log_format.h"
#include "xfs_trans_resv.h"
#include "xfs_mount.h"
#include "xfs_defer.h"
#include "xfs_inode.h"
#include "xfs_trans.h"
#include "xfs_quota.h"
#include "xfs_bmap_util.h"
#include "xfs_reflink.h"
#include "xfs_trace.h"
#include "xfs_exchrange.h"
#include "xfs_exchmaps.h"
#include "xfs_sb.h"
#include "xfs_icache.h"
#include "xfs_log.h"
#include "xfs_rtbitmap.h"
#include <linux/fsnotify.h>
/* Lock (and optionally join) two inodes for a file range exchange. */
void
xfs_exchrange_ilock(
struct xfs_trans *tp,
struct xfs_inode *ip1,
struct xfs_inode *ip2)
{
if (ip1 != ip2)
xfs_lock_two_inodes(ip1, XFS_ILOCK_EXCL,
ip2, XFS_ILOCK_EXCL);
else
xfs_ilock(ip1, XFS_ILOCK_EXCL);
if (tp) {
xfs_trans_ijoin(tp, ip1, 0);
if (ip2 != ip1)
xfs_trans_ijoin(tp, ip2, 0);
}
}
/* Unlock two inodes after a file range exchange operation. */
void
xfs_exchrange_iunlock(
struct xfs_inode *ip1,
struct xfs_inode *ip2)
{
if (ip2 != ip1)
xfs_iunlock(ip2, XFS_ILOCK_EXCL);
xfs_iunlock(ip1, XFS_ILOCK_EXCL);
}
/*
* Estimate the resource requirements to exchange file contents between the two
* files. The caller is required to hold the IOLOCK and the MMAPLOCK and to
* have flushed both inodes' pagecache and active direct-ios.
*/
int
xfs_exchrange_estimate(
struct xfs_exchmaps_req *req)
{
int error;
xfs_exchrange_ilock(NULL, req->ip1, req->ip2);
error = xfs_exchmaps_estimate(req);
xfs_exchrange_iunlock(req->ip1, req->ip2);
return error;
}
#define QRETRY_IP1 (0x1)
#define QRETRY_IP2 (0x2)
/*
* Obtain a quota reservation to make sure we don't hit EDQUOT. We can skip
* this if quota enforcement is disabled or if both inodes' dquots are the
* same. The qretry structure must be initialized to zeroes before the first
* call to this function.
*/
STATIC int
xfs_exchrange_reserve_quota(
struct xfs_trans *tp,
const struct xfs_exchmaps_req *req,
unsigned int *qretry)
{
int64_t ddelta, rdelta;
int ip1_error = 0;
int error;
/*
* Don't bother with a quota reservation if we're not enforcing them
* or the two inodes have the same dquots.
*/
if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 ||
(req->ip1->i_udquot == req->ip2->i_udquot &&
req->ip1->i_gdquot == req->ip2->i_gdquot &&
req->ip1->i_pdquot == req->ip2->i_pdquot))
return 0;
*qretry = 0;
/*
* For each file, compute the net gain in the number of regular blocks
* that will be mapped into that file and reserve that much quota. The
* quota counts must be able to absorb at least that much space.
*/
ddelta = req->ip2_bcount - req->ip1_bcount;
rdelta = req->ip2_rtbcount - req->ip1_rtbcount;
if (ddelta > 0 || rdelta > 0) {
error = xfs_trans_reserve_quota_nblks(tp, req->ip1,
ddelta > 0 ? ddelta : 0,
rdelta > 0 ? rdelta : 0,
false);
if (error == -EDQUOT || error == -ENOSPC) {
/*
* Save this error and see what happens if we try to
* reserve quota for ip2. Then report both.
*/
*qretry |= QRETRY_IP1;
ip1_error = error;
error = 0;
}
if (error)
return error;
}
if (ddelta < 0 || rdelta < 0) {
error = xfs_trans_reserve_quota_nblks(tp, req->ip2,
ddelta < 0 ? -ddelta : 0,
rdelta < 0 ? -rdelta : 0,
false);
if (error == -EDQUOT || error == -ENOSPC)
*qretry |= QRETRY_IP2;
if (error)
return error;
}
if (ip1_error)
return ip1_error;
/*
* For each file, forcibly reserve the gross gain in mapped blocks so
* that we don't trip over any quota block reservation assertions.
* We must reserve the gross gain because the quota code subtracts from
* bcount the number of blocks that we unmap; it does not add that
* quantity back to the quota block reservation.
*/
error = xfs_trans_reserve_quota_nblks(tp, req->ip1, req->ip1_bcount,
req->ip1_rtbcount, true);
if (error)
return error;
return xfs_trans_reserve_quota_nblks(tp, req->ip2, req->ip2_bcount,
req->ip2_rtbcount, true);
}
/* Exchange the mappings (and hence the contents) of two files' forks. */
STATIC int
xfs_exchrange_mappings(
const struct xfs_exchrange *fxr,
struct xfs_inode *ip1,
struct xfs_inode *ip2)
{
struct xfs_mount *mp = ip1->i_mount;
struct xfs_exchmaps_req req = {
.ip1 = ip1,
.ip2 = ip2,
.startoff1 = XFS_B_TO_FSBT(mp, fxr->file1_offset),
.startoff2 = XFS_B_TO_FSBT(mp, fxr->file2_offset),
.blockcount = XFS_B_TO_FSB(mp, fxr->length),
};
struct xfs_trans *tp;
unsigned int qretry;
bool retried = false;
int error;
trace_xfs_exchrange_mappings(fxr, ip1, ip2);
if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF)
req.flags |= XFS_EXCHMAPS_SET_SIZES;
if (fxr->flags & XFS_EXCHANGE_RANGE_FILE1_WRITTEN)
req.flags |= XFS_EXCHMAPS_INO1_WRITTEN;
/*
* Round the request length up to the nearest file allocation unit.
* The prep function already checked that the request offsets and
* length in @fxr are safe to round up.
*/
if (xfs_inode_has_bigrtalloc(ip2))
req.blockcount = xfs_rtb_roundup_rtx(mp, req.blockcount);
error = xfs_exchrange_estimate(&req);
if (error)
return error;
retry:
/* Allocate the transaction, lock the inodes, and join them. */
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
XFS_TRANS_RES_FDBLKS, &tp);
if (error)
return error;
xfs_exchrange_ilock(tp, ip1, ip2);
trace_xfs_exchrange_before(ip2, 2);
trace_xfs_exchrange_before(ip1, 1);
error = xfs_exchmaps_check_forks(mp, &req);
if (error)
goto out_trans_cancel;
/*
* Reserve ourselves some quota if any of them are in enforcing mode.
* In theory we only need enough to satisfy the change in the number
* of blocks between the two ranges being remapped.
*/
error = xfs_exchrange_reserve_quota(tp, &req, &qretry);
if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
xfs_trans_cancel(tp);
xfs_exchrange_iunlock(ip1, ip2);
if (qretry & QRETRY_IP1)
xfs_blockgc_free_quota(ip1, 0);
if (qretry & QRETRY_IP2)
xfs_blockgc_free_quota(ip2, 0);
retried = true;
goto retry;
}
if (error)
goto out_trans_cancel;
/* If we got this far on a dry run, all parameters are ok. */
if (fxr->flags & XFS_EXCHANGE_RANGE_DRY_RUN)
goto out_trans_cancel;
/* Update the mtime and ctime of both files. */
if (fxr->flags & __XFS_EXCHANGE_RANGE_UPD_CMTIME1)
xfs_trans_ichgtime(tp, ip1, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
if (fxr->flags & __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
xfs_trans_ichgtime(tp, ip2, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
xfs_exchange_mappings(tp, &req);
/*
* Force the log to persist metadata updates if the caller or the
* administrator requires this. The generic prep function already
* flushed the relevant parts of the page cache.
*/
if (xfs_has_wsync(mp) || (fxr->flags & XFS_EXCHANGE_RANGE_DSYNC))
xfs_trans_set_sync(tp);
error = xfs_trans_commit(tp);
trace_xfs_exchrange_after(ip2, 2);
trace_xfs_exchrange_after(ip1, 1);
if (error)
goto out_unlock;
/*
* If the caller wanted us to exchange the contents of two complete
* files of unequal length, exchange the incore sizes now. This should
* be safe because we flushed both files' page caches, exchanged all
* the mappings, and updated the ondisk sizes.
*/
if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) {
loff_t temp;
temp = i_size_read(VFS_I(ip2));
i_size_write(VFS_I(ip2), i_size_read(VFS_I(ip1)));
i_size_write(VFS_I(ip1), temp);
}
out_unlock:
xfs_exchrange_iunlock(ip1, ip2);
return error;
out_trans_cancel:
xfs_trans_cancel(tp);
goto out_unlock;
}
/*
* Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE.
* This part deals with struct file objects and byte ranges and does not deal
* with XFS-specific data structures such as xfs_inodes and block ranges. This
* separation may some day facilitate porting to another filesystem.
*
* The goal is to exchange fxr.length bytes starting at fxr.file1_offset in
* file1 with the same number of bytes starting at fxr.file2_offset in file2.
* Implementations must call xfs_exchange_range_prep to prepare the two
* files prior to taking locks; and they must update the inode change and mod
* times of both files as part of the metadata update. The timestamp update
* and freshness checks must be done atomically as part of the data exchange
* operation to ensure correctness of the freshness check.
* xfs_exchange_range_finish must be called after the operation completes
* successfully but before locks are dropped.
*/
/* Verify that we have security clearance to perform this operation. */
static int
xfs_exchange_range_verify_area(
struct xfs_exchrange *fxr)
{
int ret;
ret = remap_verify_area(fxr->file1, fxr->file1_offset, fxr->length,
true);
if (ret)
return ret;
return remap_verify_area(fxr->file2, fxr->file2_offset, fxr->length,
true);
}
/*
* Performs necessary checks before doing a range exchange, having stabilized
* mutable inode attributes via i_rwsem.
*/
static inline int
xfs_exchange_range_checks(
struct xfs_exchrange *fxr,
unsigned int alloc_unit)
{
struct inode *inode1 = file_inode(fxr->file1);
struct inode *inode2 = file_inode(fxr->file2);
uint64_t allocmask = alloc_unit - 1;
int64_t test_len;
uint64_t blen;
loff_t size1, size2, tmp;
int error;
/* Don't touch certain kinds of inodes */
if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2))
return -EPERM;
if (IS_SWAPFILE(inode1) || IS_SWAPFILE(inode2))
return -ETXTBSY;
size1 = i_size_read(inode1);
size2 = i_size_read(inode2);
/* Ranges cannot start after EOF. */
if (fxr->file1_offset > size1 || fxr->file2_offset > size2)
return -EINVAL;
/*
* If the caller said to exchange to EOF, we set the length of the
* request large enough to cover everything to the end of both files.
*/
if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) {
fxr->length = max_t(int64_t, size1 - fxr->file1_offset,
size2 - fxr->file2_offset);
error = xfs_exchange_range_verify_area(fxr);
if (error)
return error;
}
/*
* The start of both ranges must be aligned to the file allocation
* unit.
*/
if (!IS_ALIGNED(fxr->file1_offset, alloc_unit) ||
!IS_ALIGNED(fxr->file2_offset, alloc_unit))
return -EINVAL;
/* Ensure offsets don't wrap. */
if (check_add_overflow(fxr->file1_offset, fxr->length, &tmp) ||
check_add_overflow(fxr->file2_offset, fxr->length, &tmp))
return -EINVAL;
/*
* We require both ranges to end within EOF, unless we're exchanging
* to EOF.
*/
if (!(fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) &&
(fxr->file1_offset + fxr->length > size1 ||
fxr->file2_offset + fxr->length > size2))
return -EINVAL;
/*
* Make sure we don't hit any file size limits. If we hit any size
* limits such that test_length was adjusted, we abort the whole
* operation.
*/
test_len = fxr->length;
error = generic_write_check_limits(fxr->file2, fxr->file2_offset,
&test_len);
if (error)
return error;
error = generic_write_check_limits(fxr->file1, fxr->file1_offset,
&test_len);
if (error)
return error;
if (test_len != fxr->length)
return -EINVAL;
/*
* If the user wanted us to exchange up to the infile's EOF, round up
* to the next allocation unit boundary for this check. Do the same
* for the outfile.
*
* Otherwise, reject the range length if it's not aligned to an
* allocation unit.
*/
if (fxr->file1_offset + fxr->length == size1)
blen = ALIGN(size1, alloc_unit) - fxr->file1_offset;
else if (fxr->file2_offset + fxr->length == size2)
blen = ALIGN(size2, alloc_unit) - fxr->file2_offset;
else if (!IS_ALIGNED(fxr->length, alloc_unit))
return -EINVAL;
else
blen = fxr->length;
/* Don't allow overlapped exchanges within the same file. */
if (inode1 == inode2 &&
fxr->file2_offset + blen > fxr->file1_offset &&
fxr->file1_offset + blen > fxr->file2_offset)
return -EINVAL;
/*
* Ensure that we don't exchange a partial EOF block into the middle of
* another file.
*/
if ((fxr->length & allocmask) == 0)
return 0;
blen = fxr->length;
if (fxr->file2_offset + blen < size2)
blen &= ~allocmask;
if (fxr->file1_offset + blen < size1)
blen &= ~allocmask;
return blen == fxr->length ? 0 : -EINVAL;
}
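/*
 * Editor's worked example of the partial-EOF rule above (not part of the
 * patch): with a 4096-byte allocation unit, size1 = 10000, size2 = 9000,
 * and an exchange to EOF from offset 8192 in both files, length becomes
 * max(10000 - 8192, 9000 - 8192) = 1808.  The tail is not block aligned,
 * but offset + length reaches EOF in both files, so blen is never rounded
 * down and the request is allowed.  Had file2 been 20000 bytes long, a
 * 1808-byte exchange at offset 8192 would drop a partial block into the
 * middle of file2: blen rounds down to 0, the final equality check fails,
 * and the request is rejected with -EINVAL.
 */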
/*
* Check that the two inodes are eligible for range exchanges, the ranges make
* sense, and then flush all dirty data. Caller must ensure that the inodes
* have been locked against any other modifications.
*/
static inline int
xfs_exchange_range_prep(
struct xfs_exchrange *fxr,
unsigned int alloc_unit)
{
struct inode *inode1 = file_inode(fxr->file1);
struct inode *inode2 = file_inode(fxr->file2);
bool same_inode = (inode1 == inode2);
int error;
/* Check that we don't violate system file offset limits. */
error = xfs_exchange_range_checks(fxr, alloc_unit);
if (error || fxr->length == 0)
return error;
/* Wait for the completion of any pending IOs on both files */
inode_dio_wait(inode1);
if (!same_inode)
inode_dio_wait(inode2);
error = filemap_write_and_wait_range(inode1->i_mapping,
fxr->file1_offset,
fxr->file1_offset + fxr->length - 1);
if (error)
return error;
error = filemap_write_and_wait_range(inode2->i_mapping,
fxr->file2_offset,
fxr->file2_offset + fxr->length - 1);
if (error)
return error;
/*
* If the files or inodes involved require synchronous writes, amend
* the request to force the filesystem to flush all data and metadata
* to disk after the operation completes.
*/
if (((fxr->file1->f_flags | fxr->file2->f_flags) & O_SYNC) ||
IS_SYNC(inode1) || IS_SYNC(inode2))
fxr->flags |= XFS_EXCHANGE_RANGE_DSYNC;
return 0;
}
/*
* Finish a range exchange operation, if it was successful. Caller must ensure
* that the inodes are still locked against any other modifications.
*/
static inline int
xfs_exchange_range_finish(
struct xfs_exchrange *fxr)
{
int error;
error = file_remove_privs(fxr->file1);
if (error)
return error;
if (file_inode(fxr->file1) == file_inode(fxr->file2))
return 0;
return file_remove_privs(fxr->file2);
}
/*
* Check the alignment of an exchange request when the allocation unit size
* isn't a power of two. The generic file-level helpers use (fast)
* bitmask-based alignment checks, but here we have to use slow long division.
*/
static int
xfs_exchrange_check_rtalign(
const struct xfs_exchrange *fxr,
struct xfs_inode *ip1,
struct xfs_inode *ip2,
unsigned int alloc_unit)
{
uint64_t length = fxr->length;
uint64_t blen;
loff_t size1, size2;
size1 = i_size_read(VFS_I(ip1));
size2 = i_size_read(VFS_I(ip2));
/* The start of both ranges must be aligned to a rt extent. */
if (!isaligned_64(fxr->file1_offset, alloc_unit) ||
!isaligned_64(fxr->file2_offset, alloc_unit))
return -EINVAL;
if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF)
length = max_t(int64_t, size1 - fxr->file1_offset,
size2 - fxr->file2_offset);
/*
* If the user wanted us to exchange up to the infile's EOF, round up
* to the next rt extent boundary for this check. Do the same for the
* outfile.
*
* Otherwise, reject the range length if it's not rt extent aligned.
* We already confirmed the starting offsets' rt extent block
* alignment.
*/
if (fxr->file1_offset + length == size1)
blen = roundup_64(size1, alloc_unit) - fxr->file1_offset;
else if (fxr->file2_offset + length == size2)
blen = roundup_64(size2, alloc_unit) - fxr->file2_offset;
else if (!isaligned_64(length, alloc_unit))
return -EINVAL;
else
blen = length;
/* Don't allow overlapped exchanges within the same file. */
if (ip1 == ip2 &&
fxr->file2_offset + blen > fxr->file1_offset &&
fxr->file1_offset + blen > fxr->file2_offset)
return -EINVAL;
/*
* Ensure that we don't exchange a partial EOF rt extent into the
* middle of another file.
*/
if (isaligned_64(length, alloc_unit))
return 0;
blen = length;
if (fxr->file2_offset + length < size2)
blen = rounddown_64(blen, alloc_unit);
if (fxr->file1_offset + blen < size1)
blen = rounddown_64(blen, alloc_unit);
return blen == length ? 0 : -EINVAL;
}
/* Prepare two files to have their data exchanged. */
STATIC int
xfs_exchrange_prep(
struct xfs_exchrange *fxr,
struct xfs_inode *ip1,
struct xfs_inode *ip2)
{
struct xfs_mount *mp = ip2->i_mount;
unsigned int alloc_unit = xfs_inode_alloc_unitsize(ip2);
int error;
trace_xfs_exchrange_prep(fxr, ip1, ip2);
/* Verify both files are either real-time or non-realtime */
if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
return -EINVAL;
/* Check non-power of two alignment issues, if necessary. */
if (!is_power_of_2(alloc_unit)) {
error = xfs_exchrange_check_rtalign(fxr, ip1, ip2, alloc_unit);
if (error)
return error;
/*
* Do the generic file-level checks with the regular block
* alignment.
*/
alloc_unit = mp->m_sb.sb_blocksize;
}
error = xfs_exchange_range_prep(fxr, alloc_unit);
if (error || fxr->length == 0)
return error;
/* Attach dquots to both inodes before changing block maps. */
error = xfs_qm_dqattach(ip2);
if (error)
return error;
error = xfs_qm_dqattach(ip1);
if (error)
return error;
trace_xfs_exchrange_flush(fxr, ip1, ip2);
/* Flush the relevant ranges of both files. */
error = xfs_flush_unmap_range(ip2, fxr->file2_offset, fxr->length);
if (error)
return error;
error = xfs_flush_unmap_range(ip1, fxr->file1_offset, fxr->length);
if (error)
return error;
/*
* Cancel CoW fork preallocations for the ranges of both files. The
* prep function should have flushed all the dirty data, so the only
* CoW mappings remaining should be speculative.
*/
if (xfs_inode_has_cow_data(ip1)) {
error = xfs_reflink_cancel_cow_range(ip1, fxr->file1_offset,
fxr->length, true);
if (error)
return error;
}
if (xfs_inode_has_cow_data(ip2)) {
error = xfs_reflink_cancel_cow_range(ip2, fxr->file2_offset,
fxr->length, true);
if (error)
return error;
}
return 0;
}
/*
* Exchange contents of files. This is the binding between the generic
* file-level concepts and the XFS inode-specific implementation.
*/
STATIC int
xfs_exchrange_contents(
struct xfs_exchrange *fxr)
{
struct inode *inode1 = file_inode(fxr->file1);
struct inode *inode2 = file_inode(fxr->file2);
struct xfs_inode *ip1 = XFS_I(inode1);
struct xfs_inode *ip2 = XFS_I(inode2);
struct xfs_mount *mp = ip1->i_mount;
int error;
if (!xfs_has_exchange_range(mp))
return -EOPNOTSUPP;
if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
XFS_EXCHANGE_RANGE_PRIV_FLAGS))
return -EINVAL;
if (xfs_is_shutdown(mp))
return -EIO;
/* Lock both files against IO */
error = xfs_ilock2_io_mmap(ip1, ip2);
if (error)
goto out_err;
/* Prepare and then exchange file contents. */
error = xfs_exchrange_prep(fxr, ip1, ip2);
if (error)
goto out_unlock;
error = xfs_exchrange_mappings(fxr, ip1, ip2);
if (error)
goto out_unlock;
/*
* Finish the exchange by removing special file privileges like any
* other file write would do. This may involve turning on support for
* logged xattrs if either file has security capabilities.
*/
error = xfs_exchange_range_finish(fxr);
if (error)
goto out_unlock;
out_unlock:
xfs_iunlock2_io_mmap(ip1, ip2);
out_err:
if (error)
trace_xfs_exchrange_error(ip2, error, _RET_IP_);
return error;
}
/* Exchange parts of two files. */
static int
xfs_exchange_range(
struct xfs_exchrange *fxr)
{
struct inode *inode1 = file_inode(fxr->file1);
struct inode *inode2 = file_inode(fxr->file2);
int ret;
BUILD_BUG_ON(XFS_EXCHANGE_RANGE_ALL_FLAGS &
XFS_EXCHANGE_RANGE_PRIV_FLAGS);
/* Both files must be on the same mount/filesystem. */
if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt)
return -EXDEV;
if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
return -EINVAL;
/* Userspace requests only honored for regular files. */
if (S_ISDIR(inode1->i_mode) || S_ISDIR(inode2->i_mode))
return -EISDIR;
if (!S_ISREG(inode1->i_mode) || !S_ISREG(inode2->i_mode))
return -EINVAL;
/* Both files must be opened for read and write. */
if (!(fxr->file1->f_mode & FMODE_READ) ||
!(fxr->file1->f_mode & FMODE_WRITE) ||
!(fxr->file2->f_mode & FMODE_READ) ||
!(fxr->file2->f_mode & FMODE_WRITE))
return -EBADF;
/* Neither file can be opened append-only. */
if ((fxr->file1->f_flags & O_APPEND) ||
(fxr->file2->f_flags & O_APPEND))
return -EBADF;
/*
* If we're not exchanging to EOF, we can check the areas before
* stabilizing both files' i_size.
*/
if (!(fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF)) {
ret = xfs_exchange_range_verify_area(fxr);
if (ret)
return ret;
}
/* Update cmtime if the fd/inode don't forbid it. */
if (!(fxr->file1->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode1))
fxr->flags |= __XFS_EXCHANGE_RANGE_UPD_CMTIME1;
if (!(fxr->file2->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode2))
fxr->flags |= __XFS_EXCHANGE_RANGE_UPD_CMTIME2;
file_start_write(fxr->file2);
ret = xfs_exchrange_contents(fxr);
file_end_write(fxr->file2);
if (ret)
return ret;
fsnotify_modify(fxr->file1);
if (fxr->file2 != fxr->file1)
fsnotify_modify(fxr->file2);
return 0;
}
/* Collect exchange-range arguments from userspace. */
long
xfs_ioc_exchange_range(
struct file *file,
struct xfs_exchange_range __user *argp)
{
struct xfs_exchrange fxr = {
.file2 = file,
};
struct xfs_exchange_range args;
struct fd file1;
int error;
if (copy_from_user(&args, argp, sizeof(args)))
return -EFAULT;
if (memchr_inv(&args.pad, 0, sizeof(args.pad)))
return -EINVAL;
if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS)
return -EINVAL;
fxr.file1_offset = args.file1_offset;
fxr.file2_offset = args.file2_offset;
fxr.length = args.length;
fxr.flags = args.flags;
file1 = fdget(args.file1_fd);
if (!file1.file)
return -EBADF;
fxr.file1 = file1.file;
error = xfs_exchange_range(&fxr);
fdput(file1);
return error;
}

fs/xfs/xfs_exchrange.h: new file (38 lines)

@@ -0,0 +1,38 @@
/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* Copyright (c) 2020-2024 Oracle. All Rights Reserved.
* Author: Darrick J. Wong <djwong@kernel.org>
*/
#ifndef __XFS_EXCHRANGE_H__
#define __XFS_EXCHRANGE_H__
/* Update the mtime/cmtime of file1 and file2 */
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63)
#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62)
#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \
__XFS_EXCHANGE_RANGE_UPD_CMTIME2)
struct xfs_exchrange {
struct file *file1;
struct file *file2;
loff_t file1_offset;
loff_t file2_offset;
u64 length;
u64 flags; /* XFS_EXCHANGE_RANGE flags */
};
long xfs_ioc_exchange_range(struct file *file,
struct xfs_exchange_range __user *argp);
struct xfs_exchmaps_req;
void xfs_exchrange_ilock(struct xfs_trans *tp, struct xfs_inode *ip1,
struct xfs_inode *ip2);
void xfs_exchrange_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
int xfs_exchrange_estimate(struct xfs_exchmaps_req *req);
#endif /* __XFS_EXCHRANGE_H__ */


@@ -40,6 +40,7 @@
#include "xfs_xattr.h"
#include "xfs_rtbitmap.h"
#include "xfs_file.h"
#include "xfs_exchrange.h"
#include <linux/mount.h>
#include <linux/namei.h>
@@ -2170,6 +2171,9 @@ xfs_file_ioctl(
return error;
}
case XFS_IOC_EXCHANGE_RANGE:
return xfs_ioc_exchange_range(filp, arg);
default:
return -ENOTTY;
}


@@ -1767,6 +1767,37 @@ xlog_recover_iget(
return 0;
}
/*
* Get an inode so that we can recover a log operation.
*
* Log intent items that target inodes effectively contain a file handle.
* Check that the generation number matches the intent item like we do for
* other file handles. Log intent items defined after this validation weakness
* was identified must use this function.
*/
int
xlog_recover_iget_handle(
struct xfs_mount *mp,
xfs_ino_t ino,
uint32_t gen,
struct xfs_inode **ipp)
{
struct xfs_inode *ip;
int error;
error = xlog_recover_iget(mp, ino, &ip);
if (error)
return error;
if (VFS_I(ip)->i_generation != gen) {
xfs_irele(ip);
return -EFSCORRUPTED;
}
*ipp = ip;
return 0;
}
/******************************************************************************
*
* Log recover routines
@@ -1789,6 +1820,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
&xlog_bud_item_ops,
&xlog_attri_item_ops,
&xlog_attrd_item_ops,
&xlog_xmi_item_ops,
&xlog_xmd_item_ops,
};
static const struct xlog_recover_item_ops *


@@ -292,6 +292,7 @@ typedef struct xfs_mount {
#define XFS_FEAT_BIGTIME (1ULL << 24) /* large timestamps */
#define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */
#define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */
#define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */
/* Mount features */
#define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */
@@ -355,6 +356,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT)
__XFS_HAS_FEAT(bigtime, BIGTIME)
__XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
__XFS_HAS_FEAT(large_extent_counts, NREXT64)
__XFS_HAS_FEAT(exchange_range, EXCHANGE_RANGE)
/*
* Mount features


@@ -43,6 +43,7 @@
#include "xfs_iunlink_item.h"
#include "xfs_dahash_test.h"
#include "xfs_rtbitmap.h"
#include "xfs_exchmaps_item.h"
#include "scrub/stats.h"
#include "scrub/rcbag_btree.h"
@@ -1727,6 +1728,10 @@ xfs_fs_fill_super(
goto out_filestream_unmount;
}
if (xfs_has_exchange_range(mp))
xfs_warn(mp,
"EXPERIMENTAL exchange-range feature enabled. Use at your own risk!");
error = xfs_mountfs(mp);
if (error)
goto out_filestream_unmount;
@@ -2185,8 +2190,24 @@ xfs_init_caches(void)
if (!xfs_iunlink_cache)
goto out_destroy_attri_cache;
xfs_xmd_cache = kmem_cache_create("xfs_xmd_item",
sizeof(struct xfs_xmd_log_item),
0, 0, NULL);
if (!xfs_xmd_cache)
goto out_destroy_iul_cache;
xfs_xmi_cache = kmem_cache_create("xfs_xmi_item",
sizeof(struct xfs_xmi_log_item),
0, 0, NULL);
if (!xfs_xmi_cache)
goto out_destroy_xmd_cache;
return 0;
out_destroy_xmd_cache:
kmem_cache_destroy(xfs_xmd_cache);
out_destroy_iul_cache:
kmem_cache_destroy(xfs_iunlink_cache);
out_destroy_attri_cache:
kmem_cache_destroy(xfs_attri_cache);
out_destroy_attrd_cache:
@@ -2243,6 +2264,8 @@ xfs_destroy_caches(void)
* destroy caches.
*/
rcu_barrier();
kmem_cache_destroy(xfs_xmd_cache);
kmem_cache_destroy(xfs_xmi_cache);
kmem_cache_destroy(xfs_iunlink_cache);
kmem_cache_destroy(xfs_attri_cache);
kmem_cache_destroy(xfs_attrd_cache);


@@ -250,19 +250,12 @@ out_release_dquots:
*/
STATIC int
xfs_inactive_symlink_rmt(
struct xfs_inode *ip)
{
-struct xfs_buf *bp;
-int done;
-int error;
-int i;
-xfs_mount_t *mp;
-xfs_bmbt_irec_t mval[XFS_SYMLINK_MAPS];
-int nmaps;
-int size;
-xfs_trans_t *tp;
+struct xfs_mount *mp = ip->i_mount;
+struct xfs_trans *tp;
+int error;
-mp = ip->i_mount;
ASSERT(!xfs_need_iread_extents(&ip->i_df));
/*
* We're freeing a symlink that has some
@@ -286,44 +279,14 @@ xfs_inactive_symlink_rmt(
* locked for the second transaction. In the error paths we need it
* held so the cancel won't rele it, see below.
*/
size = (int)ip->i_disk_size;
ip->i_disk_size = 0;
VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
/*
* Find the block(s) so we can inval and unmap them.
*/
done = 0;
nmaps = ARRAY_SIZE(mval);
error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
mval, &nmaps, 0);
if (error)
goto error_trans_cancel;
/*
* Invalidate the block(s). No validation is done.
*/
for (i = 0; i < nmaps; i++) {
error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
&bp);
if (error)
goto error_trans_cancel;
xfs_trans_binval(tp, bp);
}
/*
* Unmap the dead block(s) to the dfops.
*/
error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps, &done);
if (error)
goto error_trans_cancel;
ASSERT(done);
/*
* Commit the transaction. This first logs the EFI and the inode, then
* rolls and commits the transaction that frees the extents.
*/
xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
error = xfs_symlink_remote_truncate(tp, ip);
if (error)
goto error_trans_cancel;
error = xfs_trans_commit(tp);
if (error) {
ASSERT(xfs_is_shutdown(mp));

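For orientation, a plausible shape for the new helper can be had by folding the deleted logic above back together. This is a reconstruction, not the patch's code: the real xfs_symlink_remote_truncate() in libxfs may differ, and the XFS_MAX_FILEOFF bounds below are an assumption that lets the helper avoid depending on i_disk_size, which the removed code zeroed in the caller:

/* Reconstruction from the code removed above; details may differ. */
int
xfs_symlink_remote_truncate(
	struct xfs_trans	*tp,
	struct xfs_inode	*ip)
{
	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
	struct xfs_mount	*mp = tp->t_mountp;
	struct xfs_buf		*bp;
	int			nmaps = ARRAY_SIZE(mval);
	int			done = 0;
	int			i;
	int			error;

	/* Find the remote target block(s) so we can invalidate them. */
	error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0);
	if (error)
		return error;

	/* Invalidate the block(s); no validation is done. */
	for (i = 0; i < nmaps; i++) {
		error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
				XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
				XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
				&bp);
		if (error)
			return error;
		xfs_trans_binval(tp, bp);
	}

	/* Unmap the dead block(s) to the dfops. */
	error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done);
	if (error)
		return error;
	ASSERT(done);
	return 0;
}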

@@ -39,6 +39,8 @@
#include "xfs_buf_mem.h"
#include "xfs_btree_mem.h"
#include "xfs_bmap.h"
#include "xfs_exchmaps.h"
#include "xfs_exchrange.h"
/*
* We include this last to have the helpers above available for the trace


@@ -82,6 +82,9 @@ struct xfs_perag;
struct xfbtree;
struct xfs_btree_ops;
struct xfs_bmap_intent;
struct xfs_exchmaps_intent;
struct xfs_exchmaps_req;
struct xfs_exchrange;
#define XFS_ATTR_FILTER_FLAGS \
{ XFS_ATTR_ROOT, "ROOT" }, \
@@ -4770,6 +4773,330 @@ DEFINE_XFBTREE_FREESP_EVENT(xfbtree_alloc_block);
DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block);
#endif /* CONFIG_XFS_BTREE_IN_MEM */
/* exchmaps tracepoints */
#define XFS_EXCHMAPS_STRINGS \
{ XFS_EXCHMAPS_ATTR_FORK, "ATTRFORK" }, \
{ XFS_EXCHMAPS_SET_SIZES, "SETSIZES" }, \
{ XFS_EXCHMAPS_INO1_WRITTEN, "INO1_WRITTEN" }, \
{ XFS_EXCHMAPS_CLEAR_INO1_REFLINK, "CLEAR_INO1_REFLINK" }, \
{ XFS_EXCHMAPS_CLEAR_INO2_REFLINK, "CLEAR_INO2_REFLINK" }, \
{ __XFS_EXCHMAPS_INO2_SHORTFORM, "INO2_SF" }
DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1_skip);
DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1);
DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping2);
DEFINE_ITRUNC_EVENT(xfs_exchmaps_update_inode_size);
#define XFS_EXCHRANGE_INODES \
{ 1, "file1" }, \
{ 2, "file2" }
DECLARE_EVENT_CLASS(xfs_exchrange_inode_class,
TP_PROTO(struct xfs_inode *ip, int whichfile),
TP_ARGS(ip, whichfile),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(int, whichfile)
__field(xfs_ino_t, ino)
__field(int, format)
__field(xfs_extnum_t, nex)
__field(int, broot_size)
__field(int, fork_off)
),
TP_fast_assign(
__entry->dev = VFS_I(ip)->i_sb->s_dev;
__entry->whichfile = whichfile;
__entry->ino = ip->i_ino;
__entry->format = ip->i_df.if_format;
__entry->nex = ip->i_df.if_nextents;
__entry->broot_size = ip->i_df.if_broot_bytes;
__entry->fork_off = xfs_inode_fork_boff(ip);
),
TP_printk("dev %d:%d ino 0x%llx whichfile %s format %s num_extents %llu forkoff 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino,
__print_symbolic(__entry->whichfile, XFS_EXCHRANGE_INODES),
__print_symbolic(__entry->format, XFS_INODE_FORMAT_STR),
__entry->nex,
__entry->broot_size,
__entry->fork_off)
)
#define DEFINE_EXCHRANGE_INODE_EVENT(name) \
DEFINE_EVENT(xfs_exchrange_inode_class, name, \
TP_PROTO(struct xfs_inode *ip, int whichfile), \
TP_ARGS(ip, whichfile))
DEFINE_EXCHRANGE_INODE_EVENT(xfs_exchrange_before);
DEFINE_EXCHRANGE_INODE_EVENT(xfs_exchrange_after);
DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
#define XFS_EXCHANGE_RANGE_FLAGS_STRS \
{ XFS_EXCHANGE_RANGE_TO_EOF, "TO_EOF" }, \
{ XFS_EXCHANGE_RANGE_DSYNC, "DSYNC" }, \
{ XFS_EXCHANGE_RANGE_DRY_RUN, "DRY_RUN" }, \
{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN, "F1_WRITTEN" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1, "CMTIME1" }, \
{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2, "CMTIME2" }
/* file exchange-range tracepoint class */
DECLARE_EVENT_CLASS(xfs_exchrange_class,
TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip1,
struct xfs_inode *ip2),
TP_ARGS(fxr, ip1, ip2),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_ino_t, ip1_ino)
__field(loff_t, ip1_isize)
__field(loff_t, ip1_disize)
__field(xfs_ino_t, ip2_ino)
__field(loff_t, ip2_isize)
__field(loff_t, ip2_disize)
__field(loff_t, file1_offset)
__field(loff_t, file2_offset)
__field(unsigned long long, length)
__field(unsigned long long, flags)
),
TP_fast_assign(
__entry->dev = VFS_I(ip1)->i_sb->s_dev;
__entry->ip1_ino = ip1->i_ino;
__entry->ip1_isize = VFS_I(ip1)->i_size;
__entry->ip1_disize = ip1->i_disk_size;
__entry->ip2_ino = ip2->i_ino;
__entry->ip2_isize = VFS_I(ip2)->i_size;
__entry->ip2_disize = ip2->i_disk_size;
__entry->file1_offset = fxr->file1_offset;
__entry->file2_offset = fxr->file2_offset;
__entry->length = fxr->length;
__entry->flags = fxr->flags;
),
TP_printk("dev %d:%d flags %s bytecount 0x%llx "
"ino1 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx -> "
"ino2 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__print_flags_u64(__entry->flags, "|", XFS_EXCHANGE_RANGE_FLAGS_STRS),
__entry->length,
__entry->ip1_ino,
__entry->ip1_isize,
__entry->ip1_disize,
__entry->file1_offset,
__entry->ip2_ino,
__entry->ip2_isize,
__entry->ip2_disize,
__entry->file2_offset)
)
#define DEFINE_EXCHRANGE_EVENT(name) \
DEFINE_EVENT(xfs_exchrange_class, name, \
TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip1, \
struct xfs_inode *ip2), \
TP_ARGS(fxr, ip1, ip2))
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
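
As a usage note, each DEFINE_EXCHRANGE_EVENT() above generates a trace_<name>() call that the exchange path can fire at the matching stage. A hypothetical call sequence follows; the wrapper function is invented for illustration:

/* Invented wrapper; shows where the tracepoints above would fire. */
static int
xfs_exchrange_example(
	struct xfs_exchrange	*fxr,
	struct xfs_inode	*ip1,
	struct xfs_inode	*ip2)
{
	trace_xfs_exchrange_prep(fxr, ip1, ip2);
	/* ... validate offsets, lengths, and flags ... */

	trace_xfs_exchrange_flush(fxr, ip1, ip2);
	/* ... flush and invalidate pagecache for both ranges ... */

	trace_xfs_exchrange_mappings(fxr, ip1, ip2);
	/* ... exchange the file fork mappings ... */
	return 0;
}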
TRACE_EVENT(xfs_exchmaps_overhead,
TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
unsigned long long rmapbt_blocks),
TP_ARGS(mp, bmbt_blocks, rmapbt_blocks),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(unsigned long long, bmbt_blocks)
__field(unsigned long long, rmapbt_blocks)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->bmbt_blocks = bmbt_blocks;
__entry->rmapbt_blocks = rmapbt_blocks;
),
TP_printk("dev %d:%d bmbt_blocks 0x%llx rmapbt_blocks 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->bmbt_blocks,
__entry->rmapbt_blocks)
);
DECLARE_EVENT_CLASS(xfs_exchmaps_estimate_class,
TP_PROTO(const struct xfs_exchmaps_req *req),
TP_ARGS(req),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_ino_t, ino1)
__field(xfs_ino_t, ino2)
__field(xfs_fileoff_t, startoff1)
__field(xfs_fileoff_t, startoff2)
__field(xfs_filblks_t, blockcount)
__field(uint64_t, flags)
__field(xfs_filblks_t, ip1_bcount)
__field(xfs_filblks_t, ip2_bcount)
__field(xfs_filblks_t, ip1_rtbcount)
__field(xfs_filblks_t, ip2_rtbcount)
__field(unsigned long long, resblks)
__field(unsigned long long, nr_exchanges)
),
TP_fast_assign(
__entry->dev = req->ip1->i_mount->m_super->s_dev;
__entry->ino1 = req->ip1->i_ino;
__entry->ino2 = req->ip2->i_ino;
__entry->startoff1 = req->startoff1;
__entry->startoff2 = req->startoff2;
__entry->blockcount = req->blockcount;
__entry->flags = req->flags;
__entry->ip1_bcount = req->ip1_bcount;
__entry->ip2_bcount = req->ip2_bcount;
__entry->ip1_rtbcount = req->ip1_rtbcount;
__entry->ip2_rtbcount = req->ip2_rtbcount;
__entry->resblks = req->resblks;
__entry->nr_exchanges = req->nr_exchanges;
),
TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) bcount1 0x%llx rtbcount1 0x%llx bcount2 0x%llx rtbcount2 0x%llx resblks 0x%llx nr_exchanges %llu",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino1, __entry->startoff1,
__entry->ino2, __entry->startoff2,
__entry->blockcount,
__print_flags_u64(__entry->flags, "|", XFS_EXCHMAPS_STRINGS),
__entry->ip1_bcount,
__entry->ip1_rtbcount,
__entry->ip2_bcount,
__entry->ip2_rtbcount,
__entry->resblks,
__entry->nr_exchanges)
);
#define DEFINE_EXCHMAPS_ESTIMATE_EVENT(name) \
DEFINE_EVENT(xfs_exchmaps_estimate_class, name, \
TP_PROTO(const struct xfs_exchmaps_req *req), \
TP_ARGS(req))
DEFINE_EXCHMAPS_ESTIMATE_EVENT(xfs_exchmaps_initial_estimate);
DEFINE_EXCHMAPS_ESTIMATE_EVENT(xfs_exchmaps_final_estimate);
DECLARE_EVENT_CLASS(xfs_exchmaps_intent_class,
TP_PROTO(struct xfs_mount *mp, const struct xfs_exchmaps_intent *xmi),
TP_ARGS(mp, xmi),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_ino_t, ino1)
__field(xfs_ino_t, ino2)
__field(uint64_t, flags)
__field(xfs_fileoff_t, startoff1)
__field(xfs_fileoff_t, startoff2)
__field(xfs_filblks_t, blockcount)
__field(xfs_fsize_t, isize1)
__field(xfs_fsize_t, isize2)
__field(xfs_fsize_t, new_isize1)
__field(xfs_fsize_t, new_isize2)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->ino1 = xmi->xmi_ip1->i_ino;
__entry->ino2 = xmi->xmi_ip2->i_ino;
__entry->flags = xmi->xmi_flags;
__entry->startoff1 = xmi->xmi_startoff1;
__entry->startoff2 = xmi->xmi_startoff2;
__entry->blockcount = xmi->xmi_blockcount;
__entry->isize1 = xmi->xmi_ip1->i_disk_size;
__entry->isize2 = xmi->xmi_ip2->i_disk_size;
__entry->new_isize1 = xmi->xmi_isize1;
__entry->new_isize2 = xmi->xmi_isize2;
),
TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) isize1 0x%llx newisize1 0x%llx isize2 0x%llx newisize2 0x%llx",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino1, __entry->startoff1,
__entry->ino2, __entry->startoff2,
__entry->blockcount,
__print_flags_u64(__entry->flags, "|", XFS_EXCHMAPS_STRINGS),
__entry->isize1, __entry->new_isize1,
__entry->isize2, __entry->new_isize2)
);
#define DEFINE_EXCHMAPS_INTENT_EVENT(name) \
DEFINE_EVENT(xfs_exchmaps_intent_class, name, \
TP_PROTO(struct xfs_mount *mp, const struct xfs_exchmaps_intent *xmi), \
TP_ARGS(mp, xmi))
DEFINE_EXCHMAPS_INTENT_EVENT(xfs_exchmaps_defer);
DEFINE_EXCHMAPS_INTENT_EVENT(xfs_exchmaps_recover);
TRACE_EVENT(xfs_exchmaps_delta_nextents_step,
TP_PROTO(struct xfs_mount *mp,
const struct xfs_bmbt_irec *left,
const struct xfs_bmbt_irec *curr,
const struct xfs_bmbt_irec *new,
const struct xfs_bmbt_irec *right,
int delta, unsigned int state),
TP_ARGS(mp, left, curr, new, right, delta, state),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_fileoff_t, loff)
__field(xfs_fsblock_t, lstart)
__field(xfs_filblks_t, lcount)
__field(xfs_fileoff_t, coff)
__field(xfs_fsblock_t, cstart)
__field(xfs_filblks_t, ccount)
__field(xfs_fileoff_t, noff)
__field(xfs_fsblock_t, nstart)
__field(xfs_filblks_t, ncount)
__field(xfs_fileoff_t, roff)
__field(xfs_fsblock_t, rstart)
__field(xfs_filblks_t, rcount)
__field(int, delta)
__field(unsigned int, state)
),
TP_fast_assign(
__entry->dev = mp->m_super->s_dev;
__entry->loff = left->br_startoff;
__entry->lstart = left->br_startblock;
__entry->lcount = left->br_blockcount;
__entry->coff = curr->br_startoff;
__entry->cstart = curr->br_startblock;
__entry->ccount = curr->br_blockcount;
__entry->noff = new->br_startoff;
__entry->nstart = new->br_startblock;
__entry->ncount = new->br_blockcount;
__entry->roff = right->br_startoff;
__entry->rstart = right->br_startblock;
__entry->rcount = right->br_blockcount;
__entry->delta = delta;
__entry->state = state;
),
TP_printk("dev %d:%d left 0x%llx:0x%llx:0x%llx; curr 0x%llx:0x%llx:0x%llx <- new 0x%llx:0x%llx:0x%llx; right 0x%llx:0x%llx:0x%llx delta %d state 0x%x",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->loff, __entry->lstart, __entry->lcount,
__entry->coff, __entry->cstart, __entry->ccount,
__entry->noff, __entry->nstart, __entry->ncount,
__entry->roff, __entry->rstart, __entry->rcount,
__entry->delta, __entry->state)
);
TRACE_EVENT(xfs_exchmaps_delta_nextents,
TP_PROTO(const struct xfs_exchmaps_req *req, int64_t d_nexts1,
int64_t d_nexts2),
TP_ARGS(req, d_nexts1, d_nexts2),
TP_STRUCT__entry(
__field(dev_t, dev)
__field(xfs_ino_t, ino1)
__field(xfs_ino_t, ino2)
__field(xfs_extnum_t, nexts1)
__field(xfs_extnum_t, nexts2)
__field(int64_t, d_nexts1)
__field(int64_t, d_nexts2)
),
TP_fast_assign(
int whichfork = xfs_exchmaps_reqfork(req);
__entry->dev = req->ip1->i_mount->m_super->s_dev;
__entry->ino1 = req->ip1->i_ino;
__entry->ino2 = req->ip2->i_ino;
__entry->nexts1 = xfs_ifork_ptr(req->ip1, whichfork)->if_nextents;
__entry->nexts2 = xfs_ifork_ptr(req->ip2, whichfork)->if_nextents;
__entry->d_nexts1 = d_nexts1;
__entry->d_nexts2 = d_nexts2;
),
TP_printk("dev %d:%d ino1 0x%llx nexts %llu ino2 0x%llx nexts %llu delta1 %lld delta2 %lld",
MAJOR(__entry->dev), MINOR(__entry->dev),
__entry->ino1, __entry->nexts1,
__entry->ino2, __entry->nexts2,
__entry->d_nexts1, __entry->d_nexts2)
);
#endif /* _TRACE_XFS_H */
#undef TRACE_INCLUDE_PATH


@@ -2119,6 +2119,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t, struct file *,
loff_t, size_t, unsigned int);
int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *len, unsigned int remap_flags,