Let's introduce multi-reference pclusters at runtime. In details,
if one pcluster is requested by multiple extents at almost the same
time (even belong to different files), the longest extent will be
decompressed as representative and the other extents are actually
copied from the longest one in one round.
After this patch, fully-referenced extents can be correctly handled
and the full decoding check needs to be bypassed for
partial-referenced extents.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-17-hsiangkao@linux.alibaba.com
Currently, `pcl->length' records the longest decompressed length
as long as the pcluster itself isn't reclaimed. However, such
number is unneeded for the general cases since it doesn't indicate
the exact decompressed size in this round.
Instead, let's record the decompressed size for this round instead,
thus `pcl->nr_pages' can be completely dropped and pageofs_out is
also designed to be kept in sync with `pcl->length'.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-16-hsiangkao@linux.alibaba.com
Since all decompressed offsets have been integrated to bvecs[], this
patch avoids all sub-indexes so that page->private only includes a
part count and an eio flag, thus in the future folio->private can have
the same meaning.
In addition, PG_error will not be used anymore after this patch and
we're heading to use page->private (later folio->private) and
page->mapping (later folio->mapping) only.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-9-hsiangkao@linux.alibaba.com
For each pcluster, the total compressed buffers are determined in
advance, yet the number of decompressed buffers actually vary. Too
many decompressed pages can be recorded if one pcluster is highly
compressed or its pcluster size is large. That takes extra memory
footprints compared to uncompressed filesystems, especially a lot of
I/O in flight on low-ended devices.
Therefore, similar to inplace I/O, pagevec was introduced to reuse
page cache to store these pointers in the time-sharing way since
these pages are actually unused before decompressing.
In order to make it more flexable, a cleaner bufvec is used to
replace the old pagevec stuffs so that
- Decompressed offsets can be stored inline, thus it can be used
for the upcoming feature like compressed data deduplication.
It's calculated by `page_offset(page) - map->m_la';
- Towards supporting large folios for compressed inodes since
our final goal is to completely avoid page->private but use
folio->private only for all page cache pages.
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220715154203.48093-5-hsiangkao@linux.alibaba.com
Currently, vmap()s are avoided if physical addresses are
consecutive for decompressed buffers.
I observed that is very common for 4KiB pclusters since the
numbers of decompressed pages are almost 2 or 3.
However, such detection doesn't work for Highmem pages on
32-bit machines, let's fix it now.
Reported-by: Liu Jinbao <liujinbao1@xiaomi.com>
Fixes: 7fc45dbc93 ("staging: erofs: introduce generic decompression backend")
Link: https://lore.kernel.org/r/20220708101001.21242-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
When the user mounts the erofs second times, the decompression thread
may hung. The problem happens due to a sequence of steps like the
following:
1) Task A called z_erofs_load_lzma_config which obtain all of the node
from the z_erofs_lzma_head.
2) At this time, task B called the z_erofs_lzma_decompress and wanted to
get a node. But the z_erofs_lzma_head was empty, the Task B had to
sleep.
3) Task A release nodes and push nodes into the z_erofs_lzma_head. But
task B was still sleeping.
One example report when the hung happens:
task:kworker/u3:1 state:D stack:14384 pid: 86 ppid: 2 flags:0x00004000
Workqueue: erofs_unzipd z_erofs_decompressqueue_work
Call Trace:
<TASK>
__schedule+0x281/0x760
schedule+0x49/0xb0
z_erofs_lzma_decompress+0x4bc/0x580
? cpu_core_flags+0x10/0x10
z_erofs_decompress_pcluster+0x49b/0xba0
? __update_load_avg_se+0x2b0/0x330
? __update_load_avg_se+0x2b0/0x330
? update_load_avg+0x5f/0x690
? update_load_avg+0x5f/0x690
? set_next_entity+0xbd/0x110
? _raw_spin_unlock+0xd/0x20
z_erofs_decompress_queue.isra.0+0x2e/0x50
z_erofs_decompressqueue_work+0x30/0x60
process_one_work+0x1d3/0x3a0
worker_thread+0x45/0x3a0
? process_one_work+0x3a0/0x3a0
kthread+0xe2/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
Signed-off-by: Yuwen Chen <chenyuwen1@meizu.com>
Fixes: 622ceaddb7 ("erofs: lzma compression support")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220626224041.4288-1-chenyuwen1@meizu.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull more erofs updates from Gao Xiang:
"This is a follow-up to the main updates, including some fixes of
fscache mode related to compressed inodes and a cachefiles tracepoint.
There is also a patch to fix an unexpected decompression strategy
change due to a cleanup in the past. All the fixes are quite small.
Apart from these, documentation is also updated for a better
description of recent new features.
In addition, this has some trivial cleanups without actual code logic
changes, so I could have a more recent codebase to work on folios and
avoiding the PG_error page flag for the next cycle.
Summary:
- Leave compressed inodes unsupported in fscache mode for now
- Avoid crash when using tracepoint cachefiles_prep_read
- Fix `backmost' behavior due to a recent cleanup
- Update documentation for better description of recent new features
- Several decompression cleanups w/o logical change"
* tag 'erofs-for-5.19-rc1-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix 'backmost' member of z_erofs_decompress_frontend
erofs: simplify z_erofs_pcluster_readmore()
erofs: get rid of label `restart_now'
erofs: get rid of `struct z_erofs_collection'
erofs: update documentation
erofs: fix crash when enable tracepoint cachefiles_prep_read
erofs: leave compressed inodes unsupported in fscache mode for now
It was incompletely introduced for deduplication between different
logical extents backed with the same pcluster.
We will have a better in-memory representation in the next release
cycle for this, as well as partial memory folios support. So get rid
of it instead.
No logic changes.
Link: https://lore.kernel.org/r/20220529055425.226363-2-xiang@kernel.org
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
erofs over fscache doesn't support the compressed layout yet. It will
cause NULL crash if there are compressed inodes contained when working
in fscache mode.
So far in the erofs based container image distribution scenarios
(RAFS v6), the compressed RAFS v6 images are downloaded and then
decompressed on demand as an uncompressed erofs image. Then the erofs
image is mounted in fscache mode for containers to use. IOWs, currently
compressed data is decompressed on the userspace side instead and
uncompressed erofs images will be finally cached.
The fscache support for the compressed layout is still under
development and it will be used for runtime decompression feature.
Anyway, to avoid the potential crash, let's leave the compressed inodes
unsupported in fscache mode until we support it later.
Fixes: 1442b02b66 ("erofs: implement fscache-based data read for non-inline layout")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220526010344.118493-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Pull page cache updates from Matthew Wilcox:
- Appoint myself page cache maintainer
- Fix how scsicam uses the page cache
- Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
- Remove the AOP flags entirely
- Remove pagecache_write_begin() and pagecache_write_end()
- Documentation updates
- Convert several address_space operations to use folios:
- is_dirty_writeback
- readpage becomes read_folio
- releasepage becomes release_folio
- freepage becomes free_folio
- Change filler_t to require a struct file pointer be the first
argument like ->read_folio
* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
nilfs2: Fix some kernel-doc comments
Appoint myself page cache maintainer
fs: Remove aops->freepage
secretmem: Convert to free_folio
nfs: Convert to free_folio
orangefs: Convert to free_folio
fs: Add free_folio address space operation
fs: Convert drop_buffers() to use a folio
fs: Change try_to_free_buffers() to take a folio
jbd2: Convert release_buffer_page() to use a folio
jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
reiserfs: Convert release_buffer_page() to use a folio
fs: Remove last vestiges of releasepage
ubifs: Convert to release_folio
reiserfs: Convert to release_folio
orangefs: Convert to release_folio
ocfs2: Convert to release_folio
nilfs2: Remove comment about releasepage
nfs: Convert to release_folio
jfs: Convert to release_folio
...
Pull btrfs updates from David Sterba:
"Features:
- subpage:
- support for PAGE_SIZE > 4K (previously only 64K)
- make it work with raid56
- repair super block num_devices automatically if it does not match
the number of device items
- defrag can convert inline extents to regular extents, up to now
inline files were skipped but the setting of mount option
max_inline could affect the decision logic
- zoned:
- minimal accepted zone size is explicitly set to 4MiB
- make zone reclaim less aggressive and don't reclaim if there are
enough free zones
- add per-profile sysfs tunable of the reclaim threshold
- allow automatic block group reclaim for non-zoned filesystems, with
sysfs tunables
- tree-checker: new check, compare extent buffer owner against owner
rootid
Performance:
- avoid blocking on space reservation when doing nowait direct io
writes (+7% throughput for reads and writes)
- NOCOW write throughput improvement due to refined locking (+3%)
- send: reduce pressure to page cache by dropping extent pages right
after they're processed
Core:
- convert all radix trees to xarray
- add iterators for b-tree node items
- support printk message index
- user bulk page allocation for extent buffers
- switch to bio_alloc API, use on-stack bios where convenient, other
bio cleanups
- use rw lock for block groups to favor concurrent reads
- simplify workques, don't allocate high priority threads for all
normal queues as we need only one
- refactor scrub, process chunks based on their constraints and
similarity
- allocate direct io structures on stack and pass around only
pointers, avoids allocation and reduces potential error handling
Fixes:
- fix count of reserved transaction items for various inode
operations
- fix deadlock between concurrent dio writes when low on free data
space
- fix a few cases when zones need to be finished
VFS, iomap:
- add helper to check if sb write has started (usable for assertions)
- new helper iomap_dio_alloc_bio, export iomap_dio_bio_end_io"
* tag 'for-5.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (173 commits)
btrfs: zoned: introduce a minimal zone size 4M and reject mount
btrfs: allow defrag to convert inline extents to regular extents
btrfs: add "0x" prefix for unsupported optional features
btrfs: do not account twice for inode ref when reserving metadata units
btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer
btrfs: send: avoid trashing the page cache
btrfs: send: keep the current inode open while processing it
btrfs: allocate the btrfs_dio_private as part of the iomap dio bio
btrfs: move struct btrfs_dio_private to inode.c
btrfs: remove the disk_bytenr in struct btrfs_dio_private
btrfs: allocate dio_data on stack
iomap: add per-iomap_iter private data
iomap: allow the file system to provide a bio_set for direct I/O
btrfs: add a btrfs_dio_rw wrapper
btrfs: zoned: zone finish unused block group
btrfs: zoned: properly finish block group on metadata write
btrfs: zoned: finish block group when there are no more allocatable bytes left
btrfs: zoned: consolidate zone finish functions
btrfs: zoned: introduce btrfs_zoned_bg_is_full
btrfs: improve error reporting in lookup_inline_extent_backref
...
Registers fscache context for primary data blob. Also move the
initialization of s_op and related fields forward, since anonymous
inode will be allocated under the super block when registering the
fscache context.
Something worth mentioning about the cleanup routine.
1. The fscache context will instantiate anonymous inodes under the super
block. Release these anonymous inodes when .put_super() is called, or
we'll get "VFS: Busy inodes after unmount." warning.
2. The fscache context is initialized prior to the root inode. If
.kill_sb() is called when mount failed, .put_super() won't be called
when root inode has not been initialized yet. Thus .kill_sb() shall
also contain the cleanup routine.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-16-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
A new fscache based mode is going to be introduced for erofs, in which
case on-demand read semantics is implemented through fscache.
As the first step, register fscache volume for each erofs filesystem.
That means, data blobs can not be shared among erofs filesystems. In the
following iteration, we are going to introduce the domain semantics, in
which case several erofs filesystems can belong to one domain, and data
blobs can be shared among these erofs filesystems of one domain.
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220425122143.56815-12-jefflexu@linux.alibaba.com
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
This patch enables idmapped mounts for erofs, since all dedicated helpers
for this functionality existsm, so, in this patch we just pass down the
user_namespace argument from the VFS methods to the relevant helpers.
Simple idmap example on erofs image:
1. mkdir dir
2. touch dir/file
3. mkfs.erofs erofs.img dir
4. mount -t erofs -o loop erofs.img /mnt/erofs/
5. ls -ln /mnt/erofs/
total 0
-rw-rw-r-- 1 1000 1000 0 May 17 15:26 file
6. mount-idmapped --map-mount b:1000:1001:1 /mnt/erofs/ /mnt/scratch_erofs/
7. ls -ln /mnt/scratch_erofs/
total 0
-rw-rw-r-- 1 1001 1001 0 May 17 15:26 file
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Chao Yu <chao.yu@oppo.com>
Link: https://lore.kernel.org/r/20220517104103.3570721-1-chao@kernel.org
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
I got some KASAN report as below:
[ 46.959738] ==================================================================
[ 46.960430] BUG: KASAN: use-after-free in z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] Read of size 4074 at addr ffff8880300c2f8e by task fssum/188
...
[ 46.960430] Call Trace:
[ 46.960430] <TASK>
[ 46.960430] dump_stack_lvl+0x41/0x5e
[ 46.960430] print_report.cold+0xb2/0x6b7
[ 46.960430] ? z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] kasan_report+0x8a/0x140
[ 46.960430] ? z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] kasan_check_range+0x14d/0x1d0
[ 46.960430] memcpy+0x20/0x60
[ 46.960430] z_erofs_shifted_transform+0x2bd/0x370
[ 46.960430] z_erofs_decompress_pcluster+0xaae/0x1080
The root cause is that the tail pcluster won't be a complete filesystem
block anymore. So if ztailpacking is used, the second part of an
uncompressed tail pcluster may not be ``rq->pageofs_out``.
Fixes: ab749badf9 ("erofs: support unaligned data decompression")
Fixes: cecf864d3d ("erofs: support inline data decompression")
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20220512115833.24175-1-hsiangkao@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>