Commit Graph

76865 Commits

Author SHA1 Message Date
Jeffle Xu
93b856bb5f erofs: add fscache mode check helper
Until then erofs is exactly blockdev based filesystem.

A new fscache-based mode is going to be introduced for erofs to support
scenarios where on-demand read semantics is needed, e.g. container
image distribution. In this case, erofs could be mounted from data blobs
through fscache.

Add a helper checking which mode erofs works in, and twist the code in
preparation for the upcoming fscache mode.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-11-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:19 +08:00
Jeffle Xu
94d7894670 erofs: make erofs_map_blocks() generally available
... so that it can be used in the following introduced fscache mode.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-10-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:18 +08:00
Jeffle Xu
1519670e4f cachefiles: add tracepoints for on-demand read mode
Add tracepoints for on-demand read mode. Currently following tracepoints
are added:

	OPEN request / COPEN reply
	CLOSE request
	READ request / CREAD reply
	write through anonymous fd
	release of anonymous fd

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-8-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:18 +08:00
Jeffle Xu
4e4f1788af cachefiles: enable on-demand read mode
Enable on-demand read mode by adding an optional parameter to the "bind"
command.

On-demand mode will be turned on when this parameter is "ondemand", i.e.
"bind ondemand". Otherwise cachefiles will work in the original mode.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-7-jefflexu@linux.alibaba.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:18 +08:00
Jeffle Xu
9032b6e858 cachefiles: implement on-demand read
Implement the data plane of on-demand read mode.

The early implementation [1] place the entry to
cachefiles_ondemand_read() in fscache_read(). However, fscache_read()
can only detect if the requested file range is fully cache miss, whilst
we need to notify the user daemon as long as there's a hole inside the
requested file range.

Thus the entry is now placed in cachefiles_prepare_read(). When working
in on-demand read mode, once a hole detected, the read routine will send
a READ request to the user daemon. The user daemon needs to fetch the
data and write it to the cache file. After sending the READ request, the
read routine will hang there, until the READ request is handled by the
user daemon. Then it will retry to read from the same file range. If no
progress encountered, the read routine will fail then.

A new NETFS_SREQ_ONDEMAND flag is introduced to indicate that on-demand
read should be done when a cache miss encountered.

[1] https://lore.kernel.org/all/20220406075612.60298-6-jefflexu@linux.alibaba.com/ #v8

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:18 +08:00
Jeffle Xu
324b954ac8 cachefiles: notify the user daemon when withdrawing cookie
Notify the user daemon that cookie is going to be withdrawn, providing a
hint that the associated anonymous fd can be closed.

Be noted that this is only a hint. The user daemon may close the
associated anonymous fd when receiving the CLOSE request, then it will
receive another anonymous fd when the cookie gets looked up. Or it may
ignore the CLOSE request, and keep writing data through the anonymous
fd. However the next time the cookie gets looked up, the user daemon
will still receive another new anonymous fd.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-5-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:17 +08:00
Jeffle Xu
d11b0b043b cachefiles: unbind cachefiles gracefully in on-demand mode
Add a refcount to avoid the deadlock in on-demand read mode. The
on-demand read mode will pin the corresponding cachefiles object for
each anonymous fd. The cachefiles object is unpinned when the anonymous
fd gets closed. When the user daemon exits and the fd of
"/dev/cachefiles" device node gets closed, it will wait for all
cahcefiles objects getting withdrawn. Then if there's any anonymous fd
getting closed after the fd of the device node, the user daemon will
hang forever, waiting for all objects getting withdrawn.

To fix this, add a refcount indicating if there's any object pinned by
anonymous fds. The cachefiles cache gets unbound and withdrawn when the
refcount is decreased to 0. It won't change the behaviour of the
original mode, in which case the cachefiles cache gets unbound and
withdrawn as long as the fd of the device node gets closed.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-4-jefflexu@linux.alibaba.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:17 +08:00
Jeffle Xu
c838305450 cachefiles: notify the user daemon when looking up cookie
Fscache/CacheFiles used to serve as a local cache for a remote
networking fs. A new on-demand read mode will be introduced for
CacheFiles, which can boost the scenario where on-demand read semantics
are needed, e.g. container image distribution.

The essential difference between these two modes is seen when a cache
miss occurs: In the original mode, the netfs will fetch the data from
the remote server and then write it to the cache file; in on-demand
read mode, fetching the data and writing it into the cache is delegated
to a user daemon.

As the first step, notify the user daemon when looking up cookie. In
this case, an anonymous fd is sent to the user daemon, through which the
user daemon can write the fetched data to the cache file. Since the user
daemon may move the anonymous fd around, e.g. through dup(), an object
ID uniquely identifying the cache file is also attached.

Also add one advisory flag (FSCACHE_ADV_WANT_CACHE_SIZE) suggesting that
the cache file size shall be retrieved at runtime. This helps the
scenario where one cache file contains multiple netfs files, e.g. for
the purpose of deduplication. In this case, netfs itself has no idea the
size of the cache file, whilst the user daemon should give the hint on
it.

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220509074028.74954-3-jefflexu@linux.alibaba.com
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:17 +08:00
Jeffle Xu
a06fac1599 cachefiles: extract write routine
Extract the generic routine of writing data to cache files, and make it
generally available.

This will be used by the following patch implementing on-demand read
mode. Since it's called inside CacheFiles module, make the interface
generic and unrelated to netfs_cache_resources.

It is worth noting that, ki->inval_counter is not initialized after
this cleanup. It shall not make any visible difference, since
inval_counter is no longer used in the write completion routine, i.e.
cachefiles_write_complete().

Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/20220425122143.56815-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-18 00:11:17 +08:00
Trond Myklebust
c5e483b77c NFS: Don't report errors from nfs_pageio_complete() more than once
Since errors from nfs_pageio_complete() are already being reported
through nfs_async_write_error(), we should not be returning them to the
callers of do_writepages() as well. They will end up being reported
through the generic mechanism instead.

Fixes: 6fbda89b25 ("NFS: Replace custom error reporting mechanism with generic one")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:01:59 -04:00
Trond Myklebust
d95b26650e NFS: Do not report flush errors in nfs_write_end()
If we do flush cached writebacks in nfs_write_end() due to the imminent
expiration of an RPCSEC_GSS session, then we should defer reporting any
resulting errors until the calls to file_check_and_advance_wb_err() in
nfs_file_write() and nfs_file_fsync().

Fixes: 6fbda89b25 ("NFS: Replace custom error reporting mechanism with generic one")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:01:59 -04:00
Trond Myklebust
e6005436f6 NFS: Don't report ENOSPC write errors twice
Any errors reported by the write() system call need to be cleared from
the file descriptor's error tracking. The current call to nfs_wb_all()
causes the error to be reported, but since it doesn't call
file_check_and_advance_wb_err(), we can end up reporting the same error
a second time when the application calls fsync().

Note that since Linux 4.13, the rule is that EIO may be reported for
write(), but it must be reported by a subsequent fsync(), so let's just
drop reporting it in write.

The check for nfs_ctx_key_to_expire() is just a duplicate to the one
already in nfs_write_end(), so let's drop that too.

Reported-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Fixes: ce368536dd ("nfs: nfs_file_write() should check for writeback errors")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:01:59 -04:00
Trond Myklebust
9641d9bc9b NFS: fsync() should report filesystem errors over EINTR/ERESTARTSYS
If the commit to disk is interrupted, we should still first check for
filesystem errors so that we can report them in preference to the error
due to the signal.

Fixes: 2197e9b06c ("NFS: Fix up fsync() when the server rebooted")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:01:59 -04:00
Trond Myklebust
cea9ba7239 NFS: Do not report EINTR/ERESTARTSYS as mapping errors
If the attempt to flush data was interrupted due to a local signal, then
just requeue the writes back for I/O.

Fixes: 6fbda89b25 ("NFS: Replace custom error reporting mechanism with generic one")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2022-05-17 12:01:59 -04:00
Chao Yu
6c459b78d4 erofs: support idmapped mounts
This patch enables idmapped mounts for erofs, since all dedicated helpers
for this functionality existsm, so, in this patch we just pass down the
user_namespace argument from the VFS methods to the relevant helpers.

Simple idmap example on erofs image:

1. mkdir dir
2. touch dir/file
3. mkfs.erofs erofs.img dir
4. mount -t erofs -o loop erofs.img  /mnt/erofs/

5. ls -ln /mnt/erofs/
total 0
-rw-rw-r-- 1 1000 1000 0 May 17 15:26 file

6. mount-idmapped --map-mount b:1000:1001:1 /mnt/erofs/ /mnt/scratch_erofs/

7. ls -ln /mnt/scratch_erofs/
total 0
-rw-rw-r-- 1 1001 1001 0 May 17 15:26 file

Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Link: https://lore.kernel.org/r/20220517104103.3570721-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-17 23:56:20 +08:00
Hongnan Li
3e917cc305 erofs: make filesystem exportable
Implement export operations in order to make EROFS support accessing
inodes with filehandles so that it can be exported via NFS and used
by overlayfs.

Without this patch, 'exportfs -rv' will report:
exportfs: /root/erofs_mp does not support NFS export

Also tested with unionmount-testsuite and the testcase below passes now:
./run --ov --erofs --verify hard-link

For more details about the testcase, see:
https://github.com/amir73il/unionmount-testsuite/pull/6

Signed-off-by: Hongnan Li <hongnan.li@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220425040712.91685-1-hongnan.li@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-17 23:48:54 +08:00
Gao Xiang
dcbe6803ff erofs: fix buffer copy overflow of ztailpacking feature
I got some KASAN report as below:

[   46.959738] ==================================================================
[   46.960430] BUG: KASAN: use-after-free in z_erofs_shifted_transform+0x2bd/0x370
[   46.960430] Read of size 4074 at addr ffff8880300c2f8e by task fssum/188
...
[   46.960430] Call Trace:
[   46.960430]  <TASK>
[   46.960430]  dump_stack_lvl+0x41/0x5e
[   46.960430]  print_report.cold+0xb2/0x6b7
[   46.960430]  ? z_erofs_shifted_transform+0x2bd/0x370
[   46.960430]  kasan_report+0x8a/0x140
[   46.960430]  ? z_erofs_shifted_transform+0x2bd/0x370
[   46.960430]  kasan_check_range+0x14d/0x1d0
[   46.960430]  memcpy+0x20/0x60
[   46.960430]  z_erofs_shifted_transform+0x2bd/0x370
[   46.960430]  z_erofs_decompress_pcluster+0xaae/0x1080

The root cause is that the tail pcluster won't be a complete filesystem
block anymore. So if ztailpacking is used, the second part of an
uncompressed tail pcluster may not be ``rq->pageofs_out``.

Fixes: ab749badf9 ("erofs: support unaligned data decompression")
Fixes: cecf864d3d ("erofs: support inline data decompression")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220512115833.24175-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-17 23:38:14 +08:00
Gao Xiang
2833f4bb46 erofs: refine on-disk definition comments
Fix some outdated comments and typos, hopefully helpful.

Link: https://lore.kernel.org/r/20220506194612.117120-3-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-17 23:38:13 +08:00
Gao Xiang
1f7aa6caef erofs: remove obsoleted comments
Some comments haven't been useful anymore since the code updated.
Let's drop them instead.

Link: https://lore.kernel.org/r/20220506194612.117120-2-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-17 23:38:13 +08:00
Yue Hu
1e59af07c7 erofs: do not prompt for risk any more when using big pcluster
The big pcluster feature has been merged for a year, it has been mostly
stable now.

Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220407050505.12683-1-huyue2@coolpad.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2022-05-17 23:38:02 +08:00
Darrick J. Wong
e9c3a8e820 iomap: don't invalidate folios after writeback errors
XFS has the unique behavior (as compared to the other Linux filesystems)
that on writeback errors it will completely invalidate the affected
folio and force the page cache to reread the contents from disk.  All
other filesystems leave the page mapped and up to date.

This is a rude awakening for user programs, since (in the case where
write fails but reread doesn't) file contents will appear to revert to
old disk contents with no notification other than an EIO on fsync.  This
might have been annoying back in the days when iomap dealt with one page
at a time, but with multipage folios, we can now throw away *megabytes*
worth of data for a single write error.

On *most* Linux filesystems, a program can respond to an EIO on write by
redirtying the entire file and scheduling it for writeback.  This isn't
foolproof, since the page that failed writeback is no longer dirty and
could be evicted, but programs that want to recover properly *also*
have to detect XFS and regenerate every write they've made to the file.

When running xfs/314 on arm64, I noticed a UAF when xfs_discard_folio
invalidates multipage folios that could be undergoing writeback.  If,
say, we have a 256K folio caching a mix of written and unwritten
extents, it's possible that we could start writeback of the first (say)
64K of the folio and then hit a writeback error on the next 64K.  We
then free the iop attached to the folio, which is really bad because
writeback completion on the first 64k will trip over the "blocks per
folio > 1 && !iop" assertion.

This can't be fixed by only invalidating the folio if writeback fails at
the start of the folio, since the folio is marked !uptodate, which trips
other assertions elsewhere.  Get rid of the whole behavior entirely.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-05-16 15:27:38 -07:00
Jane Chu
047218ec90 dax: add .recovery_write dax_operation
Introduce dax_recovery_write() operation. The function is used to
recover a dax range that contains poison. Typical use case is when
a user process receives a SIGBUS with si_code BUS_MCEERR_AR
indicating poison(s) in a dax range, in response, the user process
issues a pwrite() to the page-aligned dax range, thus clears the
poison and puts valid data in the range.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16 13:37:59 -07:00
Jane Chu
e511c4a3d2 dax: introduce DAX_RECOVERY_WRITE dax access mode
Up till now, dax_direct_access() is used implicitly for normal
access, but for the purpose of recovery write, dax range with
poison is requested.  To make the interface clear, introduce
	enum dax_access_mode {
		DAX_ACCESS,
		DAX_RECOVERY_WRITE,
	}
where DAX_ACCESS is used for normal dax access, and
DAX_RECOVERY_WRITE is used for dax recovery write.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2022-05-16 13:35:56 -07:00
Filipe Manana
521b6803f2 btrfs: send: keep the current inode open while processing it
Every time we send a write command, we open the inode, read some data to
a buffer and then close the inode. The amount of data we read for each
write command is at most 48K, returned by max_send_read_size(), and that
corresponds to: BTRFS_SEND_BUF_SIZE - 16K = 48K. In practice this does
not add any significant overhead, because the time elapsed between every
close (iput()) and open (btrfs_iget()) is very short, so the inode is kept
in the VFS's cache after the iput() and it's still there by the time we
do the next btrfs_iget().

As between processing extents of the current inode we don't do anything
else, it makes sense to keep the inode open after we process its first
extent that needs to be sent and keep it open until we start processing
the next inode. This serves to facilitate the next change, which aims
to avoid having send operations trash the page cache with data extents.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:33 +02:00
Christoph Hellwig
642c5d34da btrfs: allocate the btrfs_dio_private as part of the iomap dio bio
Create a new bio_set that contains all the per-bio private data needed
by btrfs for direct I/O and tell the iomap code to use that instead
of separately allocation the btrfs_dio_private structure.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:33 +02:00
Christoph Hellwig
a3e171a09c btrfs: move struct btrfs_dio_private to inode.c
The btrfs_dio_private structure is only used in inode.c, so move the
definition there.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Christoph Hellwig
acb8b52a15 btrfs: remove the disk_bytenr in struct btrfs_dio_private
This field is never used, so remove it. Last use was probably in
23ea8e5a07 ("Btrfs: load checksum data once when submitting a direct
read io").

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Christoph Hellwig
491a6d0118 btrfs: allocate dio_data on stack
Make use of the new iomap_iter->private field to avoid a memory
allocation per iomap range.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Christoph Hellwig
786f847f43 iomap: add per-iomap_iter private data
Allow the file system to keep state for all iterations.  For now only
wire it up for direct I/O as there is an immediate need for it there.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Christoph Hellwig
908c54909a iomap: allow the file system to provide a bio_set for direct I/O
Allow the file system to provide a specific bio_set for allocating
direct I/O bios.  This will allow file systems that use the
->submit_io hook to stash away additional information for file system
use.

To make use of this additional space for information in the completion
path, the file system needs to override the ->bi_end_io callback and
then call back into iomap, so export iomap_dio_bio_end_io for that.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Christoph Hellwig
36e8c62273 btrfs: add a btrfs_dio_rw wrapper
Add a wrapper around iomap_dio_rw that keeps the direct I/O internals
isolated in inode.c.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Naohiro Aota
74e91b12b1 btrfs: zoned: zone finish unused block group
While the active zones within an active block group are reset, and their
active resource is released, the block group itself is kept in the active
block group list and marked as active. As a result, the list will contain
more than max_active_zones block groups. That itself is not fatal for the
device as the zones are properly reset.

However, that inflated list is, of course, strange. Also, a to-appear
patch series, which deactivates an active block group on demand, gets
confused with the wrong list.

So, fix the issue by finishing the unused block group once it gets
read-only, so that we can release the active resource in an early stage.

Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Naohiro Aota
56fbb0a4e8 btrfs: zoned: properly finish block group on metadata write
Commit be1a1d7a5d ("btrfs: zoned: finish fully written block group")
introduced zone finishing code both for data and metadata end_io path.
However, the metadata side is not working as it should. First, it
compares logical address (eb->start + eb->len) with offset within a
block group (cache->zone_capacity) in submit_eb_page(). That essentially
disabled zone finishing on metadata end_io path.

Furthermore, fixing the issue above revealed we cannot call
btrfs_zone_finish_endio() in end_extent_buffer_writeback(). We cannot
call btrfs_lookup_block_group() which require spin lock inside end_io
context.

Introduce btrfs_schedule_zone_finish_bg() to wait for the extent buffer
writeback and do the zone finish IO in a workqueue.

Also, drop EXTENT_BUFFER_ZONE_FINISH as it is no longer used.

Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Naohiro Aota
8b8a53998c btrfs: zoned: finish block group when there are no more allocatable bytes left
Currently, btrfs_zone_finish_endio() finishes a block group only when the
written region reaches the end of the block group. We can also finish the
block group when no more allocation is possible.

Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Naohiro Aota
d70cbdda75 btrfs: zoned: consolidate zone finish functions
btrfs_zone_finish() and btrfs_zone_finish_endio() have similar code.
Introduce do_zone_finish() to factor out the common code.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Naohiro Aota
1bfd476754 btrfs: zoned: introduce btrfs_zoned_bg_is_full
Introduce a wrapper to check if all the space in a block group is
allocated or not.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
Nikolay Borisov
cf4f03c3be btrfs: improve error reporting in lookup_inline_extent_backref
When iterating the backrefs in an extent item if the ptr to the
'current' backref record goes beyond the extent item a warning is
generated and -ENOENT is returned. However what's more appropriate to
debug such cases would be to return EUCLEAN and also print identifying
information about the performed search as well as the current content of
the leaf containing the possibly corrupted extent item.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
David Sterba
0f07003b0f btrfs: rename bio_ctrl::bio_flags to compress_type
The bio_ctrl is the last use of bio_flags that has been converted to
compress type everywhere else.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:32 +02:00
David Sterba
cb3a12d988 btrfs: rename bio_flags in parameters and switch type
Several functions take parameter bio_flags that was simplified to just
compress type, unify it and change the type accordingly.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
0ff400135b btrfs: rename io_failure_record::bio_flags to compress_type
The bio_flags is now used to store unchanged compress type, so unify
that.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
7f6ca7f21d btrfs: open code extent_set_compress_type helpers
The helpers extent_set_compress_type and extent_compress_type have
become trivial after previous cleanups and can be removed.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
2a5232a8ce btrfs: simplify handling of bio_ctrl::bio_flags
The bio_flags are used only to encode the compression and there are no
other EXTENT_BIO_* flags, so the compress type can be stored directly.
The struct member name is left unchanged and will be cleaned in later
patches.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
572f3dad52 btrfs: remove trivial helper update_nr_written
The helper used to do more with the wbc state but now it's just one
subtraction, no need to have a special helper.

It became trivial in a91326679f ("Btrfs: make mapping->writeback_index
point to the last written page").

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
a6f5e39ee7 btrfs: remove unused parameter bio_flags from btrfs_wq_submit_bio
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
0e3696f80f btrfs: remove btrfs_delayed_extent_op::is_data
The value of btrfs_delayed_extent_op::is_data is always false, we can
cascade the change and simplify code that depends on it, removing the
structure member eventually.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
David Sterba
2fe6a5a1d2 btrfs: sink parameter is_data to btrfs_set_disk_extent_flags
The parameter has been added in 2009 in the infamous monster commit
5d4f98a28c ("Btrfs: Mixed back reference  (FORWARD ROLLING FORMAT
CHANGE)") but not used ever since. We can sink it and allow further
simplifications.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
Filipe Manana
f5585f4f0e btrfs: fix deadlock between concurrent dio writes when low on free data space
When reserving data space for a direct IO write we can end up deadlocking
if we have multiple tasks attempting a write to the same file range, there
are multiple extents covered by that file range, we are low on available
space for data and the writes don't expand the inode's i_size.

The deadlock can happen like this:

1) We have a file with an i_size of 1M, at offset 0 it has an extent with
   a size of 128K and at offset 128K it has another extent also with a
   size of 128K;

2) Task A does a direct IO write against file range [0, 256K), and because
   the write is within the i_size boundary, it takes the inode's lock (VFS
   level) in shared mode;

3) Task A locks the file range [0, 256K) at btrfs_dio_iomap_begin(), and
   then gets the extent map for the extent covering the range [0, 128K).
   At btrfs_get_blocks_direct_write(), it creates an ordered extent for
   that file range ([0, 128K));

4) Before returning from btrfs_dio_iomap_begin(), it unlocks the file
   range [0, 256K);

5) Task A executes btrfs_dio_iomap_begin() again, this time for the file
   range [128K, 256K), and locks the file range [128K, 256K);

6) Task B starts a direct IO write against file range [0, 256K) as well.
   It also locks the inode in shared mode, as it's within the i_size limit,
   and then tries to lock file range [0, 256K). It is able to lock the
   subrange [0, 128K) but then blocks waiting for the range [128K, 256K),
   as it is currently locked by task A;

7) Task A enters btrfs_get_blocks_direct_write() and tries to reserve data
   space. Because we are low on available free space, it triggers the
   async data reclaim task, and waits for it to reserve data space;

8) The async reclaim task decides to wait for all existing ordered extents
   to complete (through btrfs_wait_ordered_roots()).
   It finds the ordered extent previously created by task A for the file
   range [0, 128K) and waits for it to complete;

9) The ordered extent for the file range [0, 128K) can not complete
   because it blocks at btrfs_finish_ordered_io() when trying to lock the
   file range [0, 128K).

   This results in a deadlock, because:

   - task B is holding the file range [0, 128K) locked, waiting for the
     range [128K, 256K) to be unlocked by task A;

   - task A is holding the file range [128K, 256K) locked and it's waiting
     for the async data reclaim task to satisfy its space reservation
     request;

   - the async data reclaim task is waiting for ordered extent [0, 128K)
     to complete, but the ordered extent can not complete because the
     file range [0, 128K) is currently locked by task B, which is waiting
     on task A to unlock file range [128K, 256K) and task A waiting
     on the async data reclaim task.

   This results in a deadlock between 4 task: task A, task B, the async
   data reclaim task and the task doing ordered extent completion (a work
   queue task).

This type of deadlock can sporadically be triggered by the test case
generic/300 from fstests, and results in a stack trace like the following:

[12084.033689] INFO: task kworker/u16:7:123749 blocked for more than 241 seconds.
[12084.034877]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
[12084.035562] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12084.036548] task:kworker/u16:7   state:D stack:    0 pid:123749 ppid:     2 flags:0x00004000
[12084.036554] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[12084.036599] Call Trace:
[12084.036601]  <TASK>
[12084.036606]  __schedule+0x3cb/0xed0
[12084.036616]  schedule+0x4e/0xb0
[12084.036620]  btrfs_start_ordered_extent+0x109/0x1c0 [btrfs]
[12084.036651]  ? prepare_to_wait_exclusive+0xc0/0xc0
[12084.036659]  btrfs_run_ordered_extent_work+0x1a/0x30 [btrfs]
[12084.036688]  btrfs_work_helper+0xf8/0x400 [btrfs]
[12084.036719]  ? lock_is_held_type+0xe8/0x140
[12084.036727]  process_one_work+0x252/0x5a0
[12084.036736]  ? process_one_work+0x5a0/0x5a0
[12084.036738]  worker_thread+0x52/0x3b0
[12084.036743]  ? process_one_work+0x5a0/0x5a0
[12084.036745]  kthread+0xf2/0x120
[12084.036747]  ? kthread_complete_and_exit+0x20/0x20
[12084.036751]  ret_from_fork+0x22/0x30
[12084.036765]  </TASK>
[12084.036769] INFO: task kworker/u16:11:153787 blocked for more than 241 seconds.
[12084.037702]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
[12084.038540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12084.039506] task:kworker/u16:11  state:D stack:    0 pid:153787 ppid:     2 flags:0x00004000
[12084.039511] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
[12084.039551] Call Trace:
[12084.039553]  <TASK>
[12084.039557]  __schedule+0x3cb/0xed0
[12084.039566]  schedule+0x4e/0xb0
[12084.039569]  schedule_timeout+0xed/0x130
[12084.039573]  ? mark_held_locks+0x50/0x80
[12084.039578]  ? _raw_spin_unlock_irq+0x24/0x50
[12084.039580]  ? lockdep_hardirqs_on+0x7d/0x100
[12084.039585]  __wait_for_common+0xaf/0x1f0
[12084.039587]  ? usleep_range_state+0xb0/0xb0
[12084.039596]  btrfs_wait_ordered_extents+0x3d6/0x470 [btrfs]
[12084.039636]  btrfs_wait_ordered_roots+0x175/0x240 [btrfs]
[12084.039670]  flush_space+0x25b/0x630 [btrfs]
[12084.039712]  btrfs_async_reclaim_data_space+0x108/0x1b0 [btrfs]
[12084.039747]  process_one_work+0x252/0x5a0
[12084.039756]  ? process_one_work+0x5a0/0x5a0
[12084.039758]  worker_thread+0x52/0x3b0
[12084.039762]  ? process_one_work+0x5a0/0x5a0
[12084.039765]  kthread+0xf2/0x120
[12084.039766]  ? kthread_complete_and_exit+0x20/0x20
[12084.039770]  ret_from_fork+0x22/0x30
[12084.039783]  </TASK>
[12084.039800] INFO: task kworker/u16:17:217907 blocked for more than 241 seconds.
[12084.040709]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
[12084.041398] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12084.042404] task:kworker/u16:17  state:D stack:    0 pid:217907 ppid:     2 flags:0x00004000
[12084.042411] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[12084.042461] Call Trace:
[12084.042463]  <TASK>
[12084.042471]  __schedule+0x3cb/0xed0
[12084.042485]  schedule+0x4e/0xb0
[12084.042490]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[12084.042539]  ? prepare_to_wait_exclusive+0xc0/0xc0
[12084.042551]  lock_extent_bits+0x37/0x90 [btrfs]
[12084.042601]  btrfs_finish_ordered_io.isra.0+0x3fd/0x960 [btrfs]
[12084.042656]  ? lock_is_held_type+0xe8/0x140
[12084.042667]  btrfs_work_helper+0xf8/0x400 [btrfs]
[12084.042716]  ? lock_is_held_type+0xe8/0x140
[12084.042727]  process_one_work+0x252/0x5a0
[12084.042742]  worker_thread+0x52/0x3b0
[12084.042750]  ? process_one_work+0x5a0/0x5a0
[12084.042754]  kthread+0xf2/0x120
[12084.042757]  ? kthread_complete_and_exit+0x20/0x20
[12084.042763]  ret_from_fork+0x22/0x30
[12084.042783]  </TASK>
[12084.042798] INFO: task fio:234517 blocked for more than 241 seconds.
[12084.043598]       Not tainted 5.18.0-rc2-btrfs-next-115 #1
[12084.044282] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12084.045244] task:fio             state:D stack:    0 pid:234517 ppid:234515 flags:0x00004000
[12084.045248] Call Trace:
[12084.045250]  <TASK>
[12084.045254]  __schedule+0x3cb/0xed0
[12084.045263]  schedule+0x4e/0xb0
[12084.045266]  wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[12084.045298]  ? prepare_to_wait_exclusive+0xc0/0xc0
[12084.045306]  lock_extent_bits+0x37/0x90 [btrfs]
[12084.045336]  btrfs_dio_iomap_begin+0x336/0xc60 [btrfs]
[12084.045370]  ? lock_is_held_type+0xe8/0x140
[12084.045378]  iomap_iter+0x184/0x4c0
[12084.045383]  __iomap_dio_rw+0x2c6/0x8a0
[12084.045406]  iomap_dio_rw+0xa/0x30
[12084.045408]  btrfs_do_write_iter+0x370/0x5e0 [btrfs]
[12084.045440]  aio_write+0xfa/0x2c0
[12084.045448]  ? __might_fault+0x2a/0x70
[12084.045451]  ? kvm_sched_clock_read+0x14/0x40
[12084.045455]  ? lock_release+0x153/0x4a0
[12084.045463]  io_submit_one+0x615/0x9f0
[12084.045467]  ? __might_fault+0x2a/0x70
[12084.045469]  ? kvm_sched_clock_read+0x14/0x40
[12084.045478]  __x64_sys_io_submit+0x83/0x160
[12084.045483]  ? syscall_enter_from_user_mode+0x1d/0x50
[12084.045489]  do_syscall_64+0x3b/0x90
[12084.045517]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[12084.045521] RIP: 0033:0x7fa76511af79
[12084.045525] RSP: 002b:00007ffd6d6b9058 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[12084.045530] RAX: ffffffffffffffda RBX: 00007fa75ba6e760 RCX: 00007fa76511af79
[12084.045532] RDX: 0000557b304ff3f0 RSI: 0000000000000001 RDI: 00007fa75ba4c000
[12084.045535] RBP: 00007fa75ba4c000 R08: 00007fa751b76000 R09: 0000000000000330
[12084.045537] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[12084.045540] R13: 0000000000000000 R14: 0000557b304ff3f0 R15: 0000557b30521eb0
[12084.045561]  </TASK>

Fix this issue by always reserving data space before locking a file range
at btrfs_dio_iomap_begin(). If we can't reserve the space, then we don't
error out immediately - instead after locking the file range, check if we
can do a NOCOW write, and if we can we don't error out since we don't need
to allocate a data extent, however if we can't NOCOW then error out with
-ENOSPC. This also implies that we may end up reserving space when it's
not needed because the write will end up being done in NOCOW mode - in that
case we just release the space after we noticed we did a NOCOW write - this
is the same type of logic that is done in the path for buffered IO writes.

Fixes: f0bfa76a11 ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
CC: stable@vger.kernel.org # 5.17+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
Goldwyn Rodrigues
1d8fa2e29b btrfs: derive compression type from extent map during reads
Derive the compression type from extent map as opposed to the bio flags
passed. This makes it more precise and not reliant on function
parameters.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
Qu Wenruo
a13467ee7a btrfs: scrub: move scrub_remap_extent() call into scrub_extent()
[SUSPICIOUS CODE]
When refactoring scrub code, I noticed a very strange behavior around
scrub_remap_extent():

	if (sctx->is_dev_replace)
		scrub_remap_extent(fs_info, cur_logical, scrub_len,
				   &cur_physical, &target_dev, &cur_mirror);

As replace target is a 1:1 copy of the source device, thus physical
offset inside the target should be the same as physical inside source,
thus this remap call makes no sense to me.

[REAL FUNCTIONALITY]
After more investigation, the function name scrub_remap_extent()
doesn't tell anything of the truth, nor does its if () condition.

The real story behind this function is that, for scrub_pages() we never
expect missing device, even for replacing missing device.

What scrub_remap_extent() is really doing is to find a live mirror, and
make later scrub_pages() to read data from the good copy, other than
from the missing device and increase error counters unnecessarily.

[IMPROVEMENT]
We have no need to bother scrub_remap_extent() in scrub_simple_mirror()
at all, we only need to call it before we call scrub_pages().

And rename the function to scrub_find_live_copy(), add extra comments on
them.

By this we can remove one parameter from scrub_extent(), and reduce the
unnecessary calls to scrub_remap_extent() for regular replace.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00
Qu Wenruo
d483bfd27a btrfs: scrub: use find_first_extent_item to for extent item search
Since we have find_first_extent_item() to iterate the extent items of a
certain range, there is no need to use the open-coded version.

Replace the final scrub call site with find_first_extent_item().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:17:31 +02:00