License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.
By default all files without license information are under the default
license of the kernel, which is GPL version 2.
Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boiler plate text.
This patch is based on work done by Thomas Gleixner and Kate Stewart and
Philippe Ombredanne.
How this work was done:
Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
- file had no licensing information it it.
- file was a */uapi/* one with no licensing information in it,
- file was a */uapi/* one with existing licensing information,
Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.
The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results from of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few 1000 files.
The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.
Criteria used to select files for SPDX license identifier tagging was:
- Files considered eligible had to be source code files.
- Make and config files were included as candidates if they contained >5
lines of source
- File already had some variant of a license header in it (even if <5
lines).
All documentation files were explicitly excluded.
The following heuristics were used to determine which SPDX license
identifiers to apply.
- when both scanners couldn't find any license traces, file was
considered to have no license information in it, and the top level
COPYING file license applied.
For non */uapi/* files that summary was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 11139
and resulted in the first patch in this series.
If that file was a */uapi/* path one, it was "GPL-2.0 WITH
Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:
SPDX license identifier # files
---------------------------------------------------|-------
GPL-2.0 WITH Linux-syscall-note 930
and resulted in the second patch in this series.
- if a file had some form of licensing information in it, and was one
of the */uapi/* ones, it was denoted with the Linux-syscall-note if
any GPL family license was found in the file or had no licensing in
it (per prior point). Results summary:
SPDX license identifier # files
---------------------------------------------------|------
GPL-2.0 WITH Linux-syscall-note 270
GPL-2.0+ WITH Linux-syscall-note 169
((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21
((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17
LGPL-2.1+ WITH Linux-syscall-note 15
GPL-1.0+ WITH Linux-syscall-note 14
((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5
LGPL-2.0+ WITH Linux-syscall-note 4
LGPL-2.1 WITH Linux-syscall-note 3
((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3
((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1
and that resulted in the third patch in this series.
- when the two scanners agreed on the detected license(s), that became
the concluded license(s).
- when there was disagreement between the two scanners (one detected a
license but the other didn't, or they both detected different
licenses) a manual inspection of the file occurred.
- In most cases a manual inspection of the information in the file
resulted in a clear resolution of the license that should apply (and
which scanner probably needed to revisit its heuristics).
- When it was not immediately clear, the license identifier was
confirmed with lawyers working with the Linux Foundation.
- If there was any question as to the appropriate license identifier,
the file was flagged for further research and to be revisited later
in time.
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.
Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there was new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.
Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.
In initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.
Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
- a full scancode scan run, collecting the matched texts, detected
license ids and scores
- reviewing anything where there was a license detected (about 500+
files) to ensure that the applied SPDX license was correct
- reviewing anything where there was no detection but the patch license
was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
SPDX license was correct
This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.
These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types.) Finally Greg ran the script using the .csv files to
generate the patches.
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
|
|
|
// SPDX-License-Identifier: GPL-2.0
|
2018-04-03 17:23:33 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
#include <linux/bitops.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/bio.h>
|
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/pagemap.h>
|
|
|
|
#include <linux/page-flags.h>
|
2022-04-06 18:24:18 +00:00
|
|
|
#include <linux/sched/mm.h>
|
2008-01-24 21:13:08 +00:00
|
|
|
#include <linux/spinlock.h>
|
|
|
|
#include <linux/blkdev.h>
|
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/writeback.h>
|
|
|
|
#include <linux/pagevec.h>
|
2011-05-20 19:50:29 +00:00
|
|
|
#include <linux/prefetch.h>
|
2021-06-30 20:01:49 +00:00
|
|
|
#include <linux/fsverity.h>
|
2008-01-24 21:13:08 +00:00
|
|
|
#include "extent_io.h"
|
2019-09-23 14:05:19 +00:00
|
|
|
#include "extent-io-tree.h"
|
2008-01-24 21:13:08 +00:00
|
|
|
#include "extent_map.h"
|
2008-08-20 12:51:49 +00:00
|
|
|
#include "ctree.h"
|
|
|
|
#include "btrfs_inode.h"
|
2022-11-15 09:44:05 +00:00
|
|
|
#include "bio.h"
|
2012-03-13 13:38:00 +00:00
|
|
|
#include "locking.h"
|
2013-09-22 04:54:23 +00:00
|
|
|
#include "backref.h"
|
2017-06-23 02:09:57 +00:00
|
|
|
#include "disk-io.h"
|
2021-01-26 08:33:48 +00:00
|
|
|
#include "subpage.h"
|
2021-02-04 10:21:54 +00:00
|
|
|
#include "zoned.h"
|
2021-02-04 10:22:08 +00:00
|
|
|
#include "block-group.h"
|
2021-07-27 12:47:09 +00:00
|
|
|
#include "compression.h"
|
2022-10-19 14:50:51 +00:00
|
|
|
#include "fs.h"
|
2022-10-19 14:51:00 +00:00
|
|
|
#include "accessors.h"
|
2022-10-26 19:08:27 +00:00
|
|
|
#include "file-item.h"
|
2022-10-26 19:08:30 +00:00
|
|
|
#include "file.h"
|
2022-10-26 19:08:36 +00:00
|
|
|
#include "dev-replace.h"
|
2022-10-26 19:08:40 +00:00
|
|
|
#include "super.h"
|
2023-01-26 21:00:59 +00:00
|
|
|
#include "transaction.h"
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
static struct kmem_cache *extent_buffer_cache;
|
|
|
|
|
2013-04-22 16:12:31 +00:00
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
2022-09-09 21:53:19 +00:00
|
|
|
static inline void btrfs_leak_debug_add_eb(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&fs_info->eb_leak_lock, flags);
|
|
|
|
list_add(&eb->leak_list, &fs_info->allocated_ebs);
|
|
|
|
spin_unlock_irqrestore(&fs_info->eb_leak_lock, flags);
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_leak_debug_del_eb(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
spin_lock_irqsave(&fs_info->eb_leak_lock, flags);
|
|
|
|
list_del(&eb->leak_list);
|
|
|
|
spin_unlock_irqrestore(&fs_info->eb_leak_lock, flags);
|
2013-04-22 16:12:31 +00:00
|
|
|
}
|
|
|
|
|
2020-02-14 21:11:40 +00:00
|
|
|
void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
|
2013-04-22 16:12:31 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb;
|
2020-02-14 21:11:40 +00:00
|
|
|
unsigned long flags;
|
2013-04-22 16:12:31 +00:00
|
|
|
|
2020-02-14 21:11:42 +00:00
|
|
|
/*
|
|
|
|
* If we didn't get into open_ctree our allocated_ebs will not be
|
|
|
|
* initialized, so just skip this.
|
|
|
|
*/
|
|
|
|
if (!fs_info->allocated_ebs.next)
|
|
|
|
return;
|
|
|
|
|
2022-03-15 10:01:33 +00:00
|
|
|
WARN_ON(!list_empty(&fs_info->allocated_ebs));
|
2020-02-14 21:11:40 +00:00
|
|
|
spin_lock_irqsave(&fs_info->eb_leak_lock, flags);
|
|
|
|
while (!list_empty(&fs_info->allocated_ebs)) {
|
|
|
|
eb = list_first_entry(&fs_info->allocated_ebs,
|
|
|
|
struct extent_buffer, leak_list);
|
2020-02-14 21:11:42 +00:00
|
|
|
pr_err(
|
2024-01-05 05:35:55 +00:00
|
|
|
"BTRFS: buffer leak start %llu len %u refs %d bflags %lu owner %llu\n",
|
2020-02-14 21:11:42 +00:00
|
|
|
eb->start, eb->len, atomic_read(&eb->refs), eb->bflags,
|
|
|
|
btrfs_header_owner(eb));
|
2019-09-23 14:05:17 +00:00
|
|
|
list_del(&eb->leak_list);
|
2024-01-02 20:18:07 +00:00
|
|
|
WARN_ON_ONCE(1);
|
2019-09-23 14:05:17 +00:00
|
|
|
kmem_cache_free(extent_buffer_cache, eb);
|
|
|
|
}
|
2020-02-14 21:11:40 +00:00
|
|
|
spin_unlock_irqrestore(&fs_info->eb_leak_lock, flags);
|
2019-09-23 14:05:17 +00:00
|
|
|
}
|
2013-04-22 16:12:31 +00:00
|
|
|
#else
|
2022-09-09 21:53:19 +00:00
|
|
|
#define btrfs_leak_debug_add_eb(eb) do {} while (0)
|
|
|
|
#define btrfs_leak_debug_del_eb(eb) do {} while (0)
|
2008-09-08 15:18:08 +00:00
|
|
|
#endif
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2022-04-15 14:33:24 +00:00
|
|
|
/*
|
|
|
|
* Structure to record info about the bio being assembled, and other info like
|
|
|
|
* how many bytes are there before stripe/ordered extent boundary.
|
|
|
|
*/
|
|
|
|
struct btrfs_bio_ctrl {
|
2023-03-07 16:39:43 +00:00
|
|
|
struct btrfs_bio *bbio;
|
2021-07-27 13:11:53 +00:00
|
|
|
enum btrfs_compression_type compress_type;
|
2022-04-15 14:33:24 +00:00
|
|
|
u32 len_to_oe_boundary;
|
2023-02-27 15:16:55 +00:00
|
|
|
blk_opf_t opf;
|
2022-09-13 05:31:14 +00:00
|
|
|
btrfs_bio_end_io_t end_io_func;
|
2023-02-27 15:16:57 +00:00
|
|
|
struct writeback_control *wbc;
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The sectors of the page which are going to be submitted by
|
|
|
|
* extent_writepage_io().
|
|
|
|
* This is to avoid touching ranges covered by compression/inline.
|
|
|
|
*/
|
|
|
|
unsigned long submit_bitmap;
|
2008-01-24 21:13:08 +00:00
|
|
|
};
|
|
|
|
|
2022-06-03 07:11:03 +00:00
|
|
|
static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
|
2019-01-25 05:09:15 +00:00
|
|
|
{
|
2023-03-07 16:39:43 +00:00
|
|
|
struct btrfs_bio *bbio = bio_ctrl->bbio;
|
2022-06-03 07:11:03 +00:00
|
|
|
|
2023-03-07 16:39:43 +00:00
|
|
|
if (!bbio)
|
2022-06-03 07:11:03 +00:00
|
|
|
return;
|
2019-01-25 05:09:15 +00:00
|
|
|
|
btrfs: subpage: allow submit_extent_page() to do bio split
Current submit_extent_page() just checks if the current page range can
be fitted into current bio, and if not, submit then re-add.
But this behavior can't handle subpage case at all.
For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.
This means, if we can't fit a full 64K into a bio, due to stripe limit,
then it won't fit into next bio without crossing stripe either.
The proper way to handle it is:
- Check how many bytes we can be put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio
Refactor submit_extent_page() so that it does the above iteration.
The main loop inside submit_extent_page() will look like this:
cur = pg_offset;
while (cur < pg_offset + size) {
u32 offset = cur - pg_offset;
int added;
if (!bio_ctrl->bio) {
/* Allocate new bio if needed */
}
/* Add as many bytes into the bio */
added = btrfs_bio_add_page();
if (added < size - offset) {
/* The current bio is full, submit it */
}
cur += added;
}
Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 06:35:00 +00:00
|
|
|
/* Caller should ensure the bio has at least some range added */
|
2023-03-07 16:39:43 +00:00
|
|
|
ASSERT(bbio->bio.bi_iter.bi_size);
|
btrfs: avoid double clean up when submit_one_bio() failed
[BUG]
When running generic/475 with 64K page size and 4K sector size, it has a
very high chance (almost 100%) to hang, with mostly data page locked but
no one is going to unlock it.
[CAUSE]
With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads"), if we failed to lookup checksum due to metadata IO error, we
will return error for btrfs_submit_data_bio().
This will cause the page to be unlocked twice in btrfs_do_readpage():
btrfs_do_readpage()
|- submit_extent_page()
| |- submit_one_bio()
| |- btrfs_submit_data_bio()
| |- if (ret) {
| |- bio->bi_status = ret;
| |- bio_endio(bio); }
| In the endio function, we will call end_page_read()
| and unlock_extent() to cleanup the subpage range.
|
|- if (ret) {
|- unlock_extent(); end_page_read() }
Here we unlock the extent and cleanup the subpage range
again.
For unlock_extent(), it's mostly double unlock safe.
But for end_page_read(), it's not, especially for subpage case,
as for subpage case we will call btrfs_subpage_end_reader() to reduce
the reader number, and use that to number to determine if we need to
unlock the full page.
If double accounted, it can underflow the number and leave the page
locked without anyone to unlock it.
[FIX]
The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads") itself is completely fine, it's our existing code not properly
handling the error from bio submission hook properly.
This patch will make submit_one_bio() to return void so that the callers
will never be able to do cleanup when bio submission hook fails.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-12 12:30:13 +00:00
|
|
|
|
2023-03-07 16:39:43 +00:00
|
|
|
if (btrfs_op(&bbio->bio) == BTRFS_MAP_READ &&
|
2023-01-21 06:50:28 +00:00
|
|
|
bio_ctrl->compress_type != BTRFS_COMPRESS_NONE)
|
2023-05-03 15:24:27 +00:00
|
|
|
btrfs_submit_compressed_read(bbio);
|
2023-01-21 06:50:28 +00:00
|
|
|
else
|
2024-08-27 01:40:11 +00:00
|
|
|
btrfs_submit_bbio(bbio, 0);
|
2023-01-21 06:50:28 +00:00
|
|
|
|
2023-03-07 16:39:43 +00:00
|
|
|
/* The bbio is owned by the end_io handler now */
|
|
|
|
bio_ctrl->bbio = NULL;
|
2019-03-20 06:27:42 +00:00
|
|
|
}
|
|
|
|
|
2019-03-20 06:27:41 +00:00
|
|
|
/*
|
2022-10-27 11:07:05 +00:00
|
|
|
* Submit or fail the current bio in the bio_ctrl structure.
|
2019-03-20 06:27:41 +00:00
|
|
|
*/
|
2022-10-27 11:07:05 +00:00
|
|
|
static void submit_write_bio(struct btrfs_bio_ctrl *bio_ctrl, int ret)
|
2019-01-25 05:09:15 +00:00
|
|
|
{
|
2023-03-07 16:39:43 +00:00
|
|
|
struct btrfs_bio *bbio = bio_ctrl->bbio;
|
2019-01-25 05:09:15 +00:00
|
|
|
|
2023-03-07 16:39:43 +00:00
|
|
|
if (!bbio)
|
2022-06-03 07:11:02 +00:00
|
|
|
return;
|
|
|
|
|
|
|
|
if (ret) {
|
|
|
|
ASSERT(ret < 0);
|
2023-03-07 16:39:43 +00:00
|
|
|
btrfs_bio_end_io(bbio, errno_to_blk_status(ret));
|
2022-08-06 08:03:26 +00:00
|
|
|
/* The bio is owned by the end_io handler now */
|
2023-03-07 16:39:43 +00:00
|
|
|
bio_ctrl->bbio = NULL;
|
2022-06-03 07:11:02 +00:00
|
|
|
} else {
|
2022-10-27 11:07:05 +00:00
|
|
|
submit_one_bio(bio_ctrl);
|
2019-01-25 05:09:15 +00:00
|
|
|
}
|
|
|
|
}
|
2017-06-23 02:16:17 +00:00
|
|
|
|
2022-09-09 21:53:18 +00:00
|
|
|
int __init extent_buffer_init_cachep(void)
|
|
|
|
{
|
2012-09-07 09:00:48 +00:00
|
|
|
extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer",
|
2024-02-24 13:47:09 +00:00
|
|
|
sizeof(struct extent_buffer), 0, 0,
|
|
|
|
NULL);
|
2022-09-09 21:53:18 +00:00
|
|
|
if (!extent_buffer_cache)
|
2019-09-23 14:05:18 +00:00
|
|
|
return -ENOMEM;
|
2013-09-20 03:37:07 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2022-09-09 21:53:18 +00:00
|
|
|
void __cold extent_buffer_free_cachep(void)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2012-09-26 01:33:07 +00:00
|
|
|
/*
|
|
|
|
* Make sure all delayed rcu free are flushed before we
|
|
|
|
* destroy caches.
|
|
|
|
*/
|
|
|
|
rcu_barrier();
|
2016-01-29 13:36:35 +00:00
|
|
|
kmem_cache_destroy(extent_buffer_cache);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2024-07-24 20:22:15 +00:00
|
|
|
static void process_one_folio(struct btrfs_fs_info *fs_info,
|
|
|
|
struct folio *folio, const struct folio *locked_folio,
|
|
|
|
unsigned long page_ops, u64 start, u64 end)
|
2021-05-31 08:50:38 +00:00
|
|
|
{
|
2021-05-31 08:50:42 +00:00
|
|
|
u32 len;
|
|
|
|
|
|
|
|
ASSERT(end + 1 - start != 0 && end + 1 - start < U32_MAX);
|
|
|
|
len = end + 1 - start;
|
|
|
|
|
2021-05-31 08:50:38 +00:00
|
|
|
if (page_ops & PAGE_SET_ORDERED)
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_folio_clamp_set_ordered(fs_info, folio, start, len);
|
2021-05-31 08:50:38 +00:00
|
|
|
if (page_ops & PAGE_START_WRITEBACK) {
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_folio_clamp_clear_dirty(fs_info, folio, start, len);
|
|
|
|
btrfs_folio_clamp_set_writeback(fs_info, folio, start, len);
|
2021-05-31 08:50:38 +00:00
|
|
|
}
|
|
|
|
if (page_ops & PAGE_END_WRITEBACK)
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_folio_clamp_clear_writeback(fs_info, folio, start, len);
|
2021-05-31 08:50:47 +00:00
|
|
|
|
2024-07-24 20:22:15 +00:00
|
|
|
if (folio != locked_folio && (page_ops & PAGE_UNLOCK))
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_folio_end_writer_lock(fs_info, folio, start, len);
|
2021-05-31 08:50:38 +00:00
|
|
|
}
|
|
|
|
|
2024-07-24 20:20:02 +00:00
|
|
|
static void __process_folios_contig(struct address_space *mapping,
|
|
|
|
const struct folio *locked_folio, u64 start,
|
|
|
|
u64 end, unsigned long page_ops)
|
2021-05-31 08:50:38 +00:00
|
|
|
{
|
2023-09-14 14:45:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode_to_fs_info(mapping->host);
|
2021-05-31 08:50:38 +00:00
|
|
|
pgoff_t start_index = start >> PAGE_SHIFT;
|
|
|
|
pgoff_t end_index = end >> PAGE_SHIFT;
|
|
|
|
pgoff_t index = start_index;
|
2022-08-24 00:40:18 +00:00
|
|
|
struct folio_batch fbatch;
|
2021-05-31 08:50:38 +00:00
|
|
|
int i;
|
|
|
|
|
2022-08-24 00:40:18 +00:00
|
|
|
folio_batch_init(&fbatch);
|
|
|
|
while (index <= end_index) {
|
|
|
|
int found_folios;
|
|
|
|
|
|
|
|
found_folios = filemap_get_folios_contig(mapping, &index,
|
|
|
|
end_index, &fbatch);
|
|
|
|
for (i = 0; i < found_folios; i++) {
|
|
|
|
struct folio *folio = fbatch.folios[i];
|
2023-06-28 15:31:24 +00:00
|
|
|
|
2024-07-24 20:22:15 +00:00
|
|
|
process_one_folio(fs_info, folio, locked_folio,
|
|
|
|
page_ops, start, end);
|
2021-05-31 08:50:38 +00:00
|
|
|
}
|
2022-08-24 00:40:18 +00:00
|
|
|
folio_batch_release(&fbatch);
|
2021-05-31 08:50:38 +00:00
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
}
|
2017-02-10 15:41:05 +00:00
|
|
|
|
2024-05-30 17:14:12 +00:00
|
|
|
static noinline void __unlock_for_delalloc(const struct inode *inode,
|
2024-07-24 20:17:43 +00:00
|
|
|
const struct folio *locked_folio,
|
2012-03-01 13:56:26 +00:00
|
|
|
u64 start, u64 end)
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
unsigned long index = start >> PAGE_SHIFT;
|
|
|
|
unsigned long end_index = end >> PAGE_SHIFT;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
|
2024-07-24 20:17:43 +00:00
|
|
|
ASSERT(locked_folio);
|
|
|
|
if (index == locked_folio->index && end_index == index)
|
2012-03-01 13:56:26 +00:00
|
|
|
return;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
|
2024-07-24 20:20:02 +00:00
|
|
|
__process_folios_contig(inode->i_mapping, locked_folio, start, end,
|
|
|
|
PAGE_UNLOCK);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
}
|
|
|
|
|
2024-07-24 20:15:13 +00:00
|
|
|
static noinline int lock_delalloc_folios(struct inode *inode,
|
|
|
|
const struct folio *locked_folio,
|
|
|
|
u64 start, u64 end)
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
{
|
2023-09-14 14:45:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
|
2023-06-28 15:31:24 +00:00
|
|
|
struct address_space *mapping = inode->i_mapping;
|
|
|
|
pgoff_t start_index = start >> PAGE_SHIFT;
|
|
|
|
pgoff_t end_index = end >> PAGE_SHIFT;
|
|
|
|
pgoff_t index = start_index;
|
|
|
|
u64 processed_end = start;
|
|
|
|
struct folio_batch fbatch;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
|
2024-07-24 20:15:13 +00:00
|
|
|
if (index == locked_folio->index && index == end_index)
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
return 0;
|
|
|
|
|
2023-06-28 15:31:24 +00:00
|
|
|
folio_batch_init(&fbatch);
|
|
|
|
while (index <= end_index) {
|
|
|
|
unsigned int found_folios, i;
|
|
|
|
|
|
|
|
found_folios = filemap_get_folios_contig(mapping, &index,
|
|
|
|
end_index, &fbatch);
|
|
|
|
if (found_folios == 0)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
for (i = 0; i < found_folios; i++) {
|
2023-12-12 02:28:37 +00:00
|
|
|
struct folio *folio = fbatch.folios[i];
|
2023-06-28 15:31:24 +00:00
|
|
|
u32 len = end + 1 - start;
|
|
|
|
|
2024-07-24 20:15:13 +00:00
|
|
|
if (folio == locked_folio)
|
2023-06-28 15:31:24 +00:00
|
|
|
continue;
|
|
|
|
|
2023-12-12 02:28:37 +00:00
|
|
|
if (btrfs_folio_start_writer_lock(fs_info, folio, start,
|
|
|
|
len))
|
2023-06-28 15:31:24 +00:00
|
|
|
goto out;
|
|
|
|
|
2024-07-24 20:15:13 +00:00
|
|
|
if (!folio_test_dirty(folio) || folio->mapping != mapping) {
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_folio_end_writer_lock(fs_info, folio, start,
|
|
|
|
len);
|
2023-06-28 15:31:24 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2024-07-24 20:15:13 +00:00
|
|
|
processed_end = folio_pos(folio) + folio_size(folio) - 1;
|
2023-06-28 15:31:24 +00:00
|
|
|
}
|
|
|
|
folio_batch_release(&fbatch);
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
out:
|
|
|
|
folio_batch_release(&fbatch);
|
|
|
|
if (processed_end > start)
|
2024-07-24 20:17:43 +00:00
|
|
|
__unlock_for_delalloc(inode, locked_folio, start,
|
2024-07-24 20:15:13 +00:00
|
|
|
processed_end);
|
2023-06-28 15:31:24 +00:00
|
|
|
return -EAGAIN;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2018-11-29 03:33:38 +00:00
|
|
|
* Find and lock a contiguous range of bytes in the file marked as delalloc, no
|
btrfs: subpage: avoid potential deadlock with compression and delalloc
[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
[CAUSE]
If we have a file layout looks like below:
0 32K 64K 96K 128K
|//| |///////////////|
4K
Then we run delalloc range for the inode, it will:
- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).
This range will be COWed.
- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).
This range meets the condition for subpage compression, will go
through async COW path.
And async COW path will return @page_started.
But that @page_started is now for range [64K, 128K), not for range
[0, 64K).
- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.
Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).
This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not
Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.
[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.
So that we will never get incorrect @page_started.
And to prevent such problem from happening again:
- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.
Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().
This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().
- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.
- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.
- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:22:07 +00:00
|
|
|
* more than @max_bytes.
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
*
|
btrfs: subpage: avoid potential deadlock with compression and delalloc
[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
[CAUSE]
If we have a file layout looks like below:
0 32K 64K 96K 128K
|//| |///////////////|
4K
Then we run delalloc range for the inode, it will:
- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).
This range will be COWed.
- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).
This range meets the condition for subpage compression, will go
through async COW path.
And async COW path will return @page_started.
But that @page_started is now for range [64K, 128K), not for range
[0, 64K).
- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.
Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).
This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not
Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.
[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.
So that we will never get incorrect @page_started.
And to prevent such problem from happening again:
- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.
Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().
This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().
- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.
- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.
- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:22:07 +00:00
|
|
|
* @start: The original start bytenr to search.
|
|
|
|
* Will store the extent range start bytenr.
|
|
|
|
* @end: The original end bytenr of the search range
|
|
|
|
* Will store the extent range end bytenr.
|
|
|
|
*
|
|
|
|
* Return true if we find a delalloc range which starts inside the original
|
|
|
|
* range, and @start/@end will store the delalloc range start/end.
|
|
|
|
*
|
|
|
|
* Return false if we can't find any delalloc range which starts inside the
|
|
|
|
* original range, and @start/@end will be the non-delalloc range start/end.
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
*/
|
2018-11-19 09:38:17 +00:00
|
|
|
EXPORT_FOR_TESTS
|
2018-11-29 03:33:38 +00:00
|
|
|
noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
|
2024-07-24 20:08:13 +00:00
|
|
|
struct folio *locked_folio,
|
|
|
|
u64 *start, u64 *end)
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
{
|
2023-09-14 14:45:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
|
2019-06-21 15:02:54 +00:00
|
|
|
struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
|
btrfs: subpage: avoid potential deadlock with compression and delalloc
[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
[CAUSE]
If we have a file layout looks like below:
0 32K 64K 96K 128K
|//| |///////////////|
4K
Then we run delalloc range for the inode, it will:
- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).
This range will be COWed.
- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).
This range meets the condition for subpage compression, will go
through async COW path.
And async COW path will return @page_started.
But that @page_started is now for range [64K, 128K), not for range
[0, 64K).
- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.
Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).
This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not
Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.
[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.
So that we will never get incorrect @page_started.
And to prevent such problem from happening again:
- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.
Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().
This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().
- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.
- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.
- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:22:07 +00:00
|
|
|
const u64 orig_start = *start;
|
|
|
|
const u64 orig_end = *end;
|
btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
On zoned filesystem, data write out is limited by max_zone_append_size,
and a large ordered extent is split according the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.
The metadata reservation is done at e.g, btrfs_buffered_write() and then
released according to the estimation changes. Thus, if the number of extent
increases massively, the reserved metadata can run out.
The increase of the number of extents easily occurs on zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
following warning on a small RAM environment with disabling metadata
over-commit (in the following patch).
[75721.498492] ------------[ cut here ]------------
[75721.505624] BTRFS: block rsv 1 returned -28
[75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G W 5.18.0-rc2-BTRFS-ZNS+ #109
[75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
[75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
[75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
[75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
[75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
[75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
[75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
[75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
[75721.701878] FS: 0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
[75721.712601] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
[75721.730499] Call Trace:
[75721.735166] <TASK>
[75721.739886] btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
[75721.747545] ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
[75721.756145] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.762852] ? btrfs_get_32+0xea/0x2d0 [btrfs]
[75721.769520] ? push_leaf_left+0x420/0x620 [btrfs]
[75721.776431] ? memcpy+0x4e/0x60
[75721.781931] split_leaf+0x433/0x12d0 [btrfs]
[75721.788392] ? btrfs_get_token_32+0x580/0x580 [btrfs]
[75721.795636] ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
[75721.803759] ? leaf_space_used+0x15d/0x1a0 [btrfs]
[75721.811156] btrfs_search_slot+0x1bc3/0x2790 [btrfs]
[75721.818300] ? lock_downgrade+0x7c0/0x7c0
[75721.824411] ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
[75721.832456] ? split_leaf+0x12d0/0x12d0 [btrfs]
[75721.839149] ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
[75721.846945] ? free_extent_buffer+0x13/0x20 [btrfs]
[75721.853960] ? btrfs_release_path+0x4b/0x190 [btrfs]
[75721.861429] btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
[75721.869313] ? rcu_read_lock_sched_held+0x16/0x80
[75721.876085] ? lock_release+0x552/0xf80
[75721.881957] ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
[75721.888886] ? __kasan_check_write+0x14/0x20
[75721.895152] ? do_raw_read_unlock+0x44/0x80
[75721.901323] ? _raw_write_lock_irq+0x60/0x80
[75721.907983] ? btrfs_global_root+0xb9/0xe0 [btrfs]
[75721.915166] ? btrfs_csum_root+0x12b/0x180 [btrfs]
[75721.921918] ? btrfs_get_global_root+0x820/0x820 [btrfs]
[75721.929166] ? _raw_write_unlock+0x23/0x40
[75721.935116] ? unpin_extent_cache+0x1e3/0x390 [btrfs]
[75721.942041] btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
[75721.949906] ? try_to_wake_up+0x30/0x14a0
[75721.955700] ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
[75721.962661] ? rcu_read_lock_sched_held+0x16/0x80
[75721.969111] ? lock_acquire+0x41b/0x4c0
[75721.974982] finish_ordered_fn+0x15/0x20 [btrfs]
[75721.981639] btrfs_work_helper+0x1af/0xa80 [btrfs]
[75721.988184] ? _raw_spin_unlock_irq+0x28/0x50
[75721.994643] process_one_work+0x815/0x1460
[75722.000444] ? pwq_dec_nr_in_flight+0x250/0x250
[75722.006643] ? do_raw_spin_trylock+0xbb/0x190
[75722.013086] worker_thread+0x59a/0xeb0
[75722.018511] kthread+0x2ac/0x360
[75722.023428] ? process_one_work+0x1460/0x1460
[75722.029431] ? kthread_complete_and_exit+0x30/0x30
[75722.036044] ret_from_fork+0x22/0x30
[75722.041255] </TASK>
[75722.045047] irq event stamp: 0
[75722.049703] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
[75722.067533] softirqs last enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
[75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
[75722.085335] ---[ end trace 0000000000000000 ]---
To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
regular vs zoned filesystem.
Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
filesystem, it is set to fs_info->max_zone_append_size.
CC: stable@vger.kernel.org # 5.12+
Fixes: d8e3fb106f39 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-08 23:18:40 +00:00
|
|
|
/* The sanity tests may not set a valid fs_info. */
|
|
|
|
u64 max_bytes = fs_info ? fs_info->max_extent_size : BTRFS_MAX_EXTENT_SIZE;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
u64 delalloc_start;
|
|
|
|
u64 delalloc_end;
|
2018-11-29 03:33:38 +00:00
|
|
|
bool found;
|
2009-09-02 19:22:30 +00:00
|
|
|
struct extent_state *cached_state = NULL;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
int ret;
|
|
|
|
int loops = 0;
|
|
|
|
|
btrfs: subpage: avoid potential deadlock with compression and delalloc
[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
[CAUSE]
If we have a file layout looks like below:
0 32K 64K 96K 128K
|//| |///////////////|
4K
Then we run delalloc range for the inode, it will:
- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).
This range will be COWed.
- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).
This range meets the condition for subpage compression, will go
through async COW path.
And async COW path will return @page_started.
But that @page_started is now for range [64K, 128K), not for range
[0, 64K).
- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.
Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).
This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not
Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.
[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.
So that we will never get incorrect @page_started.
And to prevent such problem from happening again:
- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.
Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().
This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().
- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.
- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.
- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:22:07 +00:00
|
|
|
/* Caller should pass a valid @end to indicate the search range end */
|
|
|
|
ASSERT(orig_end > orig_start);
|
|
|
|
|
2024-07-24 20:08:13 +00:00
|
|
|
/* The range should at least cover part of the folio */
|
|
|
|
ASSERT(!(orig_start >= folio_pos(locked_folio) + folio_size(locked_folio) ||
|
|
|
|
orig_end <= folio_pos(locked_folio)));
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
again:
|
|
|
|
/* step one, find a bunch of delalloc bytes starting at start */
|
|
|
|
delalloc_start = *start;
|
|
|
|
delalloc_end = 0;
|
2019-09-23 14:05:20 +00:00
|
|
|
found = btrfs_find_delalloc_range(tree, &delalloc_start, &delalloc_end,
|
|
|
|
max_bytes, &cached_state);
|
btrfs: subpage: avoid potential deadlock with compression and delalloc
[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
[CAUSE]
If we have a file layout looks like below:
0 32K 64K 96K 128K
|//| |///////////////|
4K
Then we run delalloc range for the inode, it will:
- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).
This range will be COWed.
- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).
This range meets the condition for subpage compression, will go
through async COW path.
And async COW path will return @page_started.
But that @page_started is now for range [64K, 128K), not for range
[0, 64K).
- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.
Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).
This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not
Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.
[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.
So that we will never get incorrect @page_started.
And to prevent such problem from happening again:
- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.
Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().
This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().
- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.
- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.
- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:22:07 +00:00
|
|
|
if (!found || delalloc_end <= *start || delalloc_start > orig_end) {
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
*start = delalloc_start;
|
btrfs: subpage: avoid potential deadlock with compression and delalloc
[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
[CAUSE]
If we have a file layout looks like below:
0 32K 64K 96K 128K
|//| |///////////////|
4K
Then we run delalloc range for the inode, it will:
- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).
This range will be COWed.
- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).
This range meets the condition for subpage compression, will go
through async COW path.
And async COW path will return @page_started.
But that @page_started is now for range [64K, 128K), not for range
[0, 64K).
- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.
Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).
This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not
Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.
[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.
So that we will never get incorrect @page_started.
And to prevent such problem from happening again:
- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.
Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().
This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().
- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.
- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.
- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:22:07 +00:00
|
|
|
|
|
|
|
/* @delalloc_end can be -1, never go beyond @orig_end */
|
|
|
|
*end = min(delalloc_end, orig_end);
|
2010-02-02 21:19:11 +00:00
|
|
|
free_extent_state(cached_state);
|
2018-11-29 03:33:38 +00:00
|
|
|
return false;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
}
|
|
|
|
|
2008-10-31 16:46:39 +00:00
|
|
|
/*
|
2024-07-24 20:08:13 +00:00
|
|
|
* start comes from the offset of locked_folio. We have to lock
|
|
|
|
* folios in order, so we can't process delalloc bytes before
|
|
|
|
* locked_folio
|
2008-10-31 16:46:39 +00:00
|
|
|
*/
|
2009-01-06 02:25:51 +00:00
|
|
|
if (delalloc_start < *start)
|
2008-10-31 16:46:39 +00:00
|
|
|
delalloc_start = *start;
|
|
|
|
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
/*
|
2024-07-24 20:08:13 +00:00
|
|
|
* make sure to limit the number of folios we try to lock down
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
*/
|
2013-10-08 02:11:09 +00:00
|
|
|
if (delalloc_end + 1 - delalloc_start > max_bytes)
|
|
|
|
delalloc_end = delalloc_start + max_bytes - 1;
|
2009-01-06 02:25:51 +00:00
|
|
|
|
2024-07-24 20:08:13 +00:00
|
|
|
/* step two, lock all the folioss after the folios that has start */
|
2024-07-24 20:15:13 +00:00
|
|
|
ret = lock_delalloc_folios(inode, locked_folio, delalloc_start,
|
|
|
|
delalloc_end);
|
2018-10-26 11:43:21 +00:00
|
|
|
ASSERT(!ret || ret == -EAGAIN);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
if (ret == -EAGAIN) {
|
2024-07-24 20:08:13 +00:00
|
|
|
/* some of the folios are gone, lets avoid looping by
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
* shortening the size of the delalloc range we're searching
|
|
|
|
*/
|
2009-09-02 19:22:30 +00:00
|
|
|
free_extent_state(cached_state);
|
2014-05-21 12:49:54 +00:00
|
|
|
cached_state = NULL;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
if (!loops) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
max_bytes = PAGE_SIZE;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
loops = 1;
|
|
|
|
goto again;
|
|
|
|
} else {
|
2018-11-29 03:33:38 +00:00
|
|
|
found = false;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
goto out_failed;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/* step three, lock the state bits for the whole range */
|
2022-09-09 21:53:43 +00:00
|
|
|
lock_extent(tree, delalloc_start, delalloc_end, &cached_state);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
|
|
|
|
/* then test to make sure it is all still delalloc */
|
|
|
|
ret = test_range_bit(tree, delalloc_start, delalloc_end,
|
2020-08-14 09:35:16 +00:00
|
|
|
EXTENT_DELALLOC, cached_state);
|
2024-02-12 21:10:44 +00:00
|
|
|
|
|
|
|
unlock_extent(tree, delalloc_start, delalloc_end, &cached_state);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
if (!ret) {
|
2024-07-24 20:17:43 +00:00
|
|
|
__unlock_for_delalloc(inode, locked_folio, delalloc_start,
|
|
|
|
delalloc_end);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
cond_resched();
|
|
|
|
goto again;
|
|
|
|
}
|
|
|
|
*start = delalloc_start;
|
|
|
|
*end = delalloc_end;
|
|
|
|
out_failed:
|
|
|
|
return found;
|
|
|
|
}
|
|
|
|
|
2020-06-03 05:55:06 +00:00
|
|
|
void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
|
2024-07-24 20:29:18 +00:00
|
|
|
const struct folio *locked_folio,
|
2024-04-03 21:29:40 +00:00
|
|
|
struct extent_state **cached,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 clear_bits, unsigned long page_ops)
|
2017-02-03 01:49:22 +00:00
|
|
|
{
|
2024-04-03 21:29:40 +00:00
|
|
|
clear_extent_bit(&inode->io_tree, start, end, clear_bits, cached);
|
2017-02-03 01:49:22 +00:00
|
|
|
|
2024-07-24 20:29:18 +00:00
|
|
|
__process_folios_contig(inode->vfs_inode.i_mapping, locked_folio, start,
|
|
|
|
end, page_ops);
|
2017-02-03 01:49:22 +00:00
|
|
|
}
|
|
|
|
|
2024-07-23 20:16:20 +00:00
|
|
|
static bool btrfs_verify_folio(struct folio *folio, u64 start, u32 len)
|
2023-05-31 06:04:51 +00:00
|
|
|
{
|
2024-07-23 20:16:20 +00:00
|
|
|
struct btrfs_fs_info *fs_info = folio_to_fs_info(folio);
|
|
|
|
|
|
|
|
if (!fsverity_active(folio->mapping->host) ||
|
|
|
|
btrfs_folio_test_uptodate(fs_info, folio, start, len) ||
|
|
|
|
start >= i_size_read(folio->mapping->host))
|
2023-05-31 06:04:51 +00:00
|
|
|
return true;
|
2024-07-23 20:16:20 +00:00
|
|
|
return fsverity_verify_folio(folio);
|
2023-05-31 06:04:51 +00:00
|
|
|
}
|
|
|
|
|
2024-07-23 20:16:20 +00:00
|
|
|
static void end_folio_read(struct folio *folio, bool uptodate, u64 start, u32 len)
|
btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-03 02:08:55 +00:00
|
|
|
{
|
2024-07-23 20:16:20 +00:00
|
|
|
struct btrfs_fs_info *fs_info = folio_to_fs_info(folio);
|
btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-03 02:08:55 +00:00
|
|
|
|
2024-07-23 20:16:20 +00:00
|
|
|
ASSERT(folio_pos(folio) <= start &&
|
|
|
|
start + len <= folio_pos(folio) + PAGE_SIZE);
|
btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-03 02:08:55 +00:00
|
|
|
|
2024-07-23 20:16:20 +00:00
|
|
|
if (uptodate && btrfs_verify_folio(folio, start, len))
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_folio_set_uptodate(fs_info, folio, start, len);
|
2023-05-31 06:04:57 +00:00
|
|
|
else
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_folio_clear_uptodate(fs_info, folio, start, len);
|
btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-03 02:08:55 +00:00
|
|
|
|
2024-07-23 20:16:20 +00:00
|
|
|
if (!btrfs_is_subpage(fs_info, folio->mapping))
|
|
|
|
folio_unlock(folio);
|
btrfs: subpage: fix a rare race between metadata endio and eb freeing
[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.
No other reproducer so far.
The ASSERT() gets triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().
[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:
T1 | T2
------------------------------------+-----------------------------
end_bio_extent_readpage() |
|- btrfs_validate_metadata_buffer() |
| |- free_extent_buffer() |
| Still have 2 refs |
|- end_page_read() |
|- if (unlikely(PagePrivate()) |
| The page still has Private |
| | free_extent_buffer()
| | | Only one ref 1, will be
| | | released
| | |- detach_extent_buffer_page()
| | |- btrfs_detach_subpage()
|- btrfs_set_page_uptodate() |
The page no longer has Private|
>>> ASSERT() triggered <<< |
This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.
But the race window is still there, we have to go another way to solve
it other than relying on random PagePrivate() check.
Data path is not affected, as it will lock the page before reading,
while unlocking the page after the last read has finished, thus no race
window.
[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.
Now btrfs_subpage::readers will be a member shared by both metadata and
data.
For metadata path, we don't do the page unlock as metadata only relies
on extent locking.
At the same time, teach page_range_has_eb() to take
btrfs_subpage::readers into consideration.
So that even if the last eb of a page gets freed, page::private won't be
detached as long as there still are pending end_page_read() calls.
By this we eliminate the race window, this will slight increase the
metadata memory usage, as the page may not be released as frequently as
usual. But it should not be a big deal.
The code got introduced in ("btrfs: submit read time repair only for
each corrupted sector"), but the fix is in a separate patch to keep the
problem description and the crash is rare so it should not hurt
bisectability.
Signed-off-by: Qu Wegruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-07 09:02:58 +00:00
|
|
|
else
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_subpage_end_reader(fs_info, folio, start, len);
|
btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-03 02:08:55 +00:00
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
2023-12-12 02:28:38 +00:00
|
|
|
* After a write IO is done, we need to:
|
|
|
|
*
|
|
|
|
* - clear the uptodate bits on error
|
|
|
|
* - clear the writeback bits in the extent tree for the range
|
|
|
|
* - filio_end_writeback() if there is no more pending io for the folio
|
2008-01-24 21:13:08 +00:00
|
|
|
*
|
|
|
|
* Scheduling is not allowed, so the extent state tree is expected
|
|
|
|
* to have one and only one object corresponding to this IO.
|
|
|
|
*/
|
2023-12-12 02:28:38 +00:00
|
|
|
static void end_bbio_data_write(struct btrfs_bio *bbio)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-01-24 22:24:03 +00:00
|
|
|
struct btrfs_fs_info *fs_info = bbio->fs_info;
|
2022-08-06 08:03:26 +00:00
|
|
|
struct bio *bio = &bbio->bio;
|
2017-06-03 07:38:06 +00:00
|
|
|
int error = blk_status_to_errno(bio->bi_status);
|
2023-12-12 02:28:38 +00:00
|
|
|
struct folio_iter fi;
|
2024-01-24 22:24:03 +00:00
|
|
|
const u32 sectorsize = fs_info->sectorsize;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2017-07-13 16:10:07 +00:00
|
|
|
ASSERT(!bio_flagged(bio, BIO_CLONED));
|
2023-12-12 02:28:38 +00:00
|
|
|
bio_for_each_folio_all(fi, bio) {
|
|
|
|
struct folio *folio = fi.folio;
|
|
|
|
u64 start = folio_pos(folio) + fi.offset;
|
|
|
|
u32 len = fi.length;
|
|
|
|
|
|
|
|
/* Only order 0 (single page) folios are allowed for data. */
|
|
|
|
ASSERT(folio_order(folio) == 0);
|
2021-05-31 08:50:40 +00:00
|
|
|
|
|
|
|
/* Our read/write should always be sector aligned. */
|
2023-12-12 02:28:38 +00:00
|
|
|
if (!IS_ALIGNED(fi.offset, sectorsize))
|
2021-05-31 08:50:40 +00:00
|
|
|
btrfs_err(fs_info,
|
2023-12-12 02:28:38 +00:00
|
|
|
"partial page write in btrfs with offset %zu and length %zu",
|
|
|
|
fi.offset, fi.length);
|
|
|
|
else if (!IS_ALIGNED(fi.length, sectorsize))
|
2021-05-31 08:50:40 +00:00
|
|
|
btrfs_info(fs_info,
|
2023-12-12 02:28:38 +00:00
|
|
|
"incomplete page write with offset %zu and length %zu",
|
|
|
|
fi.offset, fi.length);
|
2021-05-31 08:50:40 +00:00
|
|
|
|
2024-07-24 19:53:18 +00:00
|
|
|
btrfs_finish_ordered_extent(bbio->ordered, folio, start, len,
|
|
|
|
!error);
|
btrfs: don't clear uptodate on write errors
We have been consistently seeing hangs with generic/648 in our subpage
GitHub CI setup. This is a classic deadlock, we are calling
btrfs_read_folio() on a folio, which requires holding the folio lock on
the folio, and then finding a ordered extent that overlaps that range
and calling btrfs_start_ordered_extent(), which then tries to write out
the dirty page, which requires taking the folio lock and then we
deadlock.
The hang happens because we're writing to range [1271750656, 1271767040),
page index [77621, 77622], and page 77621 is !Uptodate. It is also Dirty,
so we call btrfs_read_folio() for 77621 and which does
btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered
extent which is [1271644160, 1271746560), page index [77615, 77621].
The page indexes overlap, but the actual bytes don't overlap. We're
holding the page lock for 77621, then call
btrfs_lock_and_flush_ordered_range() which tries to flush the dirty
page, and tries to lock 77621 again and then we deadlock.
The byte ranges do not overlap, but with subpage support if we clear
uptodate on any portion of the page we mark the entire thing as not
uptodate.
We have been clearing page uptodate on write errors, but no other file
system does this, and is in fact incorrect. This doesn't hurt us in the
!subpage case because we can't end up with overlapped ranges that don't
also overlap on the page.
Fix this by not clearing uptodate when we have a write error. The only
thing we should be doing in this case is setting the mapping error and
carrying on. This makes it so we would no longer call
btrfs_read_folio() on the page as it's uptodate and eliminates the
deadlock.
With this patch we're now able to make it through a full fstests run on
our subpage blocksize VMs.
Note for stable backports: this probably goes beyond 6.1 but the code
has been cleaned up and clearing the uptodate bit must be verified on
each version independently.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-09-08 19:31:39 +00:00
|
|
|
if (error)
|
2023-12-12 02:28:38 +00:00
|
|
|
mapping_set_error(folio->mapping, error);
|
|
|
|
btrfs_folio_clear_writeback(fs_info, folio, start, len);
|
2013-11-07 20:20:26 +00:00
|
|
|
}
|
2008-09-24 15:48:04 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
bio_put(bio);
|
|
|
|
}
|
|
|
|
|
2024-07-23 20:19:12 +00:00
|
|
|
static void begin_folio_read(struct btrfs_fs_info *fs_info, struct folio *folio)
|
btrfs: integrate page status update for data read path into begin/end_page_read
In btrfs data page read path, the page status update are handled in two
different locations:
btrfs_do_read_page()
{
while (cur <= end) {
/* No need to read from disk */
if (HOLE/PREALLOC/INLINE){
memset();
set_extent_uptodate();
continue;
}
/* Read from disk */
ret = submit_extent_page(end_bio_extent_readpage);
}
end_bio_extent_readpage()
{
endio_readpage_uptodate_page_status();
}
This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.
But for subpage, there is more work to be done in page status update:
- Page Unlock condition
Unlike regular page size == sectorsize case, we can no longer just
unlock a page.
Only the last reader of the page can unlock the page.
This means, we can unlock the page either in the while() loop, or in
the endio function.
- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.
To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:
- begin_page_read()
For regular case, it does nothing.
For subpage case, it updates the reader counters so that later
end_page_read() can know who is the last one to unlock the page.
- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.
The new thing added is the condition for page unlock.
Now for subpage data, we unlock the page if we're the last reader.
This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.
Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-02 02:28:36 +00:00
|
|
|
{
|
2023-12-12 02:28:37 +00:00
|
|
|
ASSERT(folio_test_locked(folio));
|
|
|
|
if (!btrfs_is_subpage(fs_info, folio->mapping))
|
btrfs: integrate page status update for data read path into begin/end_page_read
In btrfs data page read path, the page status update are handled in two
different locations:
btrfs_do_read_page()
{
while (cur <= end) {
/* No need to read from disk */
if (HOLE/PREALLOC/INLINE){
memset();
set_extent_uptodate();
continue;
}
/* Read from disk */
ret = submit_extent_page(end_bio_extent_readpage);
}
end_bio_extent_readpage()
{
endio_readpage_uptodate_page_status();
}
This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.
But for subpage, there is more work to be done in page status update:
- Page Unlock condition
Unlike regular page size == sectorsize case, we can no longer just
unlock a page.
Only the last reader of the page can unlock the page.
This means, we can unlock the page either in the while() loop, or in
the endio function.
- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.
To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:
- begin_page_read()
For regular case, it does nothing.
For subpage case, it updates the reader counters so that later
end_page_read() can know who is the last one to unlock the page.
- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.
The new thing added is the condition for page unlock.
Now for subpage data, we unlock the page if we're the last reader.
This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.
Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-02 02:28:36 +00:00
|
|
|
return;
|
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
ASSERT(folio_test_private(folio));
|
2024-07-23 20:19:12 +00:00
|
|
|
btrfs_subpage_start_reader(fs_info, folio, folio_pos(folio), PAGE_SIZE);
|
btrfs: integrate page status update for data read path into begin/end_page_read
In btrfs data page read path, the page status update are handled in two
different locations:
btrfs_do_read_page()
{
while (cur <= end) {
/* No need to read from disk */
if (HOLE/PREALLOC/INLINE){
memset();
set_extent_uptodate();
continue;
}
/* Read from disk */
ret = submit_extent_page(end_bio_extent_readpage);
}
end_bio_extent_readpage()
{
endio_readpage_uptodate_page_status();
}
This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.
But for subpage, there is more work to be done in page status update:
- Page Unlock condition
Unlike regular page size == sectorsize case, we can no longer just
unlock a page.
Only the last reader of the page can unlock the page.
This means, we can unlock the page either in the while() loop, or in
the endio function.
- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.
To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:
- begin_page_read()
For regular case, it does nothing.
For subpage case, it updates the reader counters so that later
end_page_read() can know who is the last one to unlock the page.
- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.
The new thing added is the condition for page unlock.
Now for subpage data, we unlock the page if we're the last reader.
This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.
Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-02 02:28:36 +00:00
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
2023-12-12 02:28:38 +00:00
|
|
|
* After a data read IO is done, we need to:
|
|
|
|
*
|
|
|
|
* - clear the uptodate bits on error
|
|
|
|
* - set the uptodate bits if things worked
|
|
|
|
* - set the folio up to date if all extents in the tree are uptodate
|
|
|
|
* - clear the lock bit in the extent tree
|
|
|
|
* - unlock the folio if there are no other extents locked for it
|
2008-01-24 21:13:08 +00:00
|
|
|
*
|
|
|
|
* Scheduling is not allowed, so the extent state tree is expected
|
|
|
|
* to have one and only one object corresponding to this IO.
|
|
|
|
*/
|
2023-12-12 02:28:38 +00:00
|
|
|
static void end_bbio_data_read(struct btrfs_bio *bbio)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-01-24 22:24:03 +00:00
|
|
|
struct btrfs_fs_info *fs_info = bbio->fs_info;
|
2022-08-06 08:03:26 +00:00
|
|
|
struct bio *bio = &bbio->bio;
|
2023-12-12 02:28:38 +00:00
|
|
|
struct folio_iter fi;
|
2024-01-24 22:24:03 +00:00
|
|
|
const u32 sectorsize = fs_info->sectorsize;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2017-07-13 16:10:07 +00:00
|
|
|
ASSERT(!bio_flagged(bio, BIO_CLONED));
|
2023-12-12 02:28:38 +00:00
|
|
|
bio_for_each_folio_all(fi, &bbio->bio) {
|
btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-05-03 02:08:55 +00:00
|
|
|
bool uptodate = !bio->bi_status;
|
2023-12-12 02:28:38 +00:00
|
|
|
struct folio *folio = fi.folio;
|
|
|
|
struct inode *inode = folio->mapping->host;
|
2020-12-02 06:47:58 +00:00
|
|
|
u64 start;
|
|
|
|
u64 end;
|
|
|
|
u32 len;
|
2011-04-06 10:02:20 +00:00
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
/* For now only order 0 folios are supported for data. */
|
|
|
|
ASSERT(folio_order(folio) == 0);
|
2016-09-20 14:05:02 +00:00
|
|
|
btrfs_debug(fs_info,
|
2023-12-12 02:28:38 +00:00
|
|
|
"%s: bi_sector=%llu, err=%d, mirror=%u",
|
|
|
|
__func__, bio->bi_iter.bi_sector, bio->bi_status,
|
2021-09-15 07:17:18 +00:00
|
|
|
bbio->mirror_num);
|
2008-08-20 12:51:49 +00:00
|
|
|
|
2020-10-21 06:24:58 +00:00
|
|
|
/*
|
|
|
|
* We always issue full-sector reads, but if some block in a
|
2023-12-12 02:28:38 +00:00
|
|
|
* folio fails to read, blk_update_request() will advance
|
2020-10-21 06:24:58 +00:00
|
|
|
* bv_offset and adjust bv_len to compensate. Print a warning
|
|
|
|
* for unaligned offsets, and an error if they don't add up to
|
|
|
|
* a full sector.
|
|
|
|
*/
|
2023-12-12 02:28:38 +00:00
|
|
|
if (!IS_ALIGNED(fi.offset, sectorsize))
|
2020-10-21 06:24:58 +00:00
|
|
|
btrfs_err(fs_info,
|
2023-12-12 02:28:38 +00:00
|
|
|
"partial page read in btrfs with offset %zu and length %zu",
|
|
|
|
fi.offset, fi.length);
|
|
|
|
else if (!IS_ALIGNED(fi.offset + fi.length, sectorsize))
|
2020-10-21 06:24:58 +00:00
|
|
|
btrfs_info(fs_info,
|
2023-12-12 02:28:38 +00:00
|
|
|
"incomplete page read with offset %zu and length %zu",
|
|
|
|
fi.offset, fi.length);
|
2020-10-21 06:24:58 +00:00
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
start = folio_pos(folio) + fi.offset;
|
|
|
|
end = start + fi.length - 1;
|
|
|
|
len = fi.length;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2013-07-25 11:22:35 +00:00
|
|
|
if (likely(uptodate)) {
|
2013-06-17 21:14:39 +00:00
|
|
|
loff_t i_size = i_size_read(inode);
|
2023-12-12 02:28:38 +00:00
|
|
|
pgoff_t end_index = i_size >> folio_shift(folio);
|
2013-06-17 21:14:39 +00:00
|
|
|
|
btrfs: subpage: fix the false data csum mismatch error
[BUG]
When running fstresss, we can hit strange data csum mismatch where the
on-disk data is in fact correct (passes scrub).
With some extra debug info added, we have the following traces:
0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0
[CAUSE]
Normally we expect all submitted bio reads to only touch the range we
specified, and under subpage context, it means we should only touch the
range specified in each bvec.
But in data read path, inside end_bio_extent_readpage(), we have page
zeroing which only takes regular page size into consideration.
This means for subpage if we have an inode whose content looks like below:
0 16K 32K 48K 64K
|///////| |///////| |
|//| = data needs to be read from disk
| | = hole
And i_size is 64K initially.
Then the following race can happen:
T1 | T2
--------------------------------+--------------------------------
btrfs_do_readpage() |
|- isize = 64K; |
| At this time, the isize is |
| 64K |
| |
|- submit_extent_page() |
| submit previous assembled bio|
| assemble bio for [0, 16K) |
| |
|- submit_extent_page() |
submit read bio for [0, 16K) |
assemble read bio for |
[32K, 48K) |
|
| btrfs_setsize()
| |- i_size_write(, 16K);
| Now i_size is only 16K
end_io() for [0K, 16K) |
|- end_bio_extent_readpage() |
|- btrfs_verify_data_csum() |
| No csum error |
|- i_size = 16K; |
|- zero_user_segment(16K, |
PAGE_SIZE); |
!!! We zeroed range |
!!! [32K, 48K) |
| end_io for [32K, 48K)
| |- end_bio_extent_readpage()
| |- btrfs_verify_data_csum()
| ! CSUM MISMATCH !
| ! As the range is zeroed now !
[FIX]
To fix the problem, make end_bio_extent_readpage() to only zero the
range of bvec.
The bug only affects subpage read-write support, as for full read-only
mount we can't change i_size thus won't hit the race condition.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 08:44:22 +00:00
|
|
|
/*
|
|
|
|
* Zero out the remaining part if this range straddles
|
|
|
|
* i_size.
|
|
|
|
*
|
2023-12-12 02:28:38 +00:00
|
|
|
* Here we should only zero the range inside the folio,
|
btrfs: subpage: fix the false data csum mismatch error
[BUG]
When running fstresss, we can hit strange data csum mismatch where the
on-disk data is in fact correct (passes scrub).
With some extra debug info added, we have the following traces:
0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0
[CAUSE]
Normally we expect all submitted bio reads to only touch the range we
specified, and under subpage context, it means we should only touch the
range specified in each bvec.
But in data read path, inside end_bio_extent_readpage(), we have page
zeroing which only takes regular page size into consideration.
This means for subpage if we have an inode whose content looks like below:
0 16K 32K 48K 64K
|///////| |///////| |
|//| = data needs to be read from disk
| | = hole
And i_size is 64K initially.
Then the following race can happen:
T1 | T2
--------------------------------+--------------------------------
btrfs_do_readpage() |
|- isize = 64K; |
| At this time, the isize is |
| 64K |
| |
|- submit_extent_page() |
| submit previous assembled bio|
| assemble bio for [0, 16K) |
| |
|- submit_extent_page() |
submit read bio for [0, 16K) |
assemble read bio for |
[32K, 48K) |
|
| btrfs_setsize()
| |- i_size_write(, 16K);
| Now i_size is only 16K
end_io() for [0K, 16K) |
|- end_bio_extent_readpage() |
|- btrfs_verify_data_csum() |
| No csum error |
|- i_size = 16K; |
|- zero_user_segment(16K, |
PAGE_SIZE); |
!!! We zeroed range |
!!! [32K, 48K) |
| end_io for [32K, 48K)
| |- end_bio_extent_readpage()
| |- btrfs_verify_data_csum()
| ! CSUM MISMATCH !
| ! As the range is zeroed now !
[FIX]
To fix the problem, make end_bio_extent_readpage() to only zero the
range of bvec.
The bug only affects subpage read-write support, as for full read-only
mount we can't change i_size thus won't hit the race condition.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 08:44:22 +00:00
|
|
|
* not touch anything else.
|
|
|
|
*
|
|
|
|
* NOTE: i_size is exclusive while end is inclusive.
|
|
|
|
*/
|
2023-12-12 02:28:38 +00:00
|
|
|
if (folio_index(folio) == end_index && i_size <= end) {
|
|
|
|
u32 zero_start = max(offset_in_folio(folio, i_size),
|
|
|
|
offset_in_folio(folio, start));
|
|
|
|
u32 zero_len = offset_in_folio(folio, end) + 1 -
|
|
|
|
zero_start;
|
btrfs: subpage: fix the false data csum mismatch error
[BUG]
When running fstresss, we can hit strange data csum mismatch where the
on-disk data is in fact correct (passes scrub).
With some extra debug info added, we have the following traces:
0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0
[CAUSE]
Normally we expect all submitted bio reads to only touch the range we
specified, and under subpage context, it means we should only touch the
range specified in each bvec.
But in data read path, inside end_bio_extent_readpage(), we have page
zeroing which only takes regular page size into consideration.
This means for subpage if we have an inode whose content looks like below:
0 16K 32K 48K 64K
|///////| |///////| |
|//| = data needs to be read from disk
| | = hole
And i_size is 64K initially.
Then the following race can happen:
T1 | T2
--------------------------------+--------------------------------
btrfs_do_readpage() |
|- isize = 64K; |
| At this time, the isize is |
| 64K |
| |
|- submit_extent_page() |
| submit previous assembled bio|
| assemble bio for [0, 16K) |
| |
|- submit_extent_page() |
submit read bio for [0, 16K) |
assemble read bio for |
[32K, 48K) |
|
| btrfs_setsize()
| |- i_size_write(, 16K);
| Now i_size is only 16K
end_io() for [0K, 16K) |
|- end_bio_extent_readpage() |
|- btrfs_verify_data_csum() |
| No csum error |
|- i_size = 16K; |
|- zero_user_segment(16K, |
PAGE_SIZE); |
!!! We zeroed range |
!!! [32K, 48K) |
| end_io for [32K, 48K)
| |- end_bio_extent_readpage()
| |- btrfs_verify_data_csum()
| ! CSUM MISMATCH !
| ! As the range is zeroed now !
[FIX]
To fix the problem, make end_bio_extent_readpage() to only zero the
range of bvec.
The bug only affects subpage read-write support, as for full read-only
mount we can't change i_size thus won't hit the race condition.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 08:44:22 +00:00
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
folio_zero_range(folio, zero_start, zero_len);
|
btrfs: subpage: fix the false data csum mismatch error
[BUG]
When running fstresss, we can hit strange data csum mismatch where the
on-disk data is in fact correct (passes scrub).
With some extra debug info added, we have the following traces:
0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0
[CAUSE]
Normally we expect all submitted bio reads to only touch the range we
specified, and under subpage context, it means we should only touch the
range specified in each bvec.
But in data read path, inside end_bio_extent_readpage(), we have page
zeroing which only takes regular page size into consideration.
This means for subpage if we have an inode whose content looks like below:
0 16K 32K 48K 64K
|///////| |///////| |
|//| = data needs to be read from disk
| | = hole
And i_size is 64K initially.
Then the following race can happen:
T1 | T2
--------------------------------+--------------------------------
btrfs_do_readpage() |
|- isize = 64K; |
| At this time, the isize is |
| 64K |
| |
|- submit_extent_page() |
| submit previous assembled bio|
| assemble bio for [0, 16K) |
| |
|- submit_extent_page() |
submit read bio for [0, 16K) |
assemble read bio for |
[32K, 48K) |
|
| btrfs_setsize()
| |- i_size_write(, 16K);
| Now i_size is only 16K
end_io() for [0K, 16K) |
|- end_bio_extent_readpage() |
|- btrfs_verify_data_csum() |
| No csum error |
|- i_size = 16K; |
|- zero_user_segment(16K, |
PAGE_SIZE); |
!!! We zeroed range |
!!! [32K, 48K) |
| end_io for [32K, 48K)
| |- end_bio_extent_readpage()
| |- btrfs_verify_data_csum()
| ! CSUM MISMATCH !
| ! As the range is zeroed now !
[FIX]
To fix the problem, make end_bio_extent_readpage() to only zero the
range of bvec.
The bug only affects subpage read-write support, as for full read-only
mount we can't change i_size thus won't hit the race condition.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-01 08:44:22 +00:00
|
|
|
}
|
2008-01-29 14:59:12 +00:00
|
|
|
}
|
2022-05-22 11:47:51 +00:00
|
|
|
|
2023-01-21 06:50:07 +00:00
|
|
|
/* Update page status and unlock. */
|
2024-07-23 20:16:20 +00:00
|
|
|
end_folio_read(folio, uptodate, start, len);
|
2013-11-07 20:20:26 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
bio_put(bio);
|
|
|
|
}
|
|
|
|
|
2024-01-29 09:46:10 +00:00
|
|
|
/*
|
2024-06-27 03:45:55 +00:00
|
|
|
* Populate every free slot in a provided array with folios using GFP_NOFS.
|
2024-01-29 09:46:10 +00:00
|
|
|
*
|
|
|
|
* @nr_folios: number of folios to allocate
|
|
|
|
* @folio_array: the array to fill with folios; any existing non-NULL entries in
|
|
|
|
* the array will be skipped
|
|
|
|
*
|
|
|
|
* Return: 0 if all folios were able to be allocated;
|
|
|
|
* -ENOMEM otherwise, the partially allocated folios would be freed and
|
|
|
|
* the array slots zeroed
|
|
|
|
*/
|
2024-06-27 03:45:55 +00:00
|
|
|
int btrfs_alloc_folio_array(unsigned int nr_folios, struct folio **folio_array)
|
2024-01-29 09:46:10 +00:00
|
|
|
{
|
|
|
|
for (int i = 0; i < nr_folios; i++) {
|
|
|
|
if (folio_array[i])
|
|
|
|
continue;
|
2024-06-27 03:45:55 +00:00
|
|
|
folio_array[i] = folio_alloc(GFP_NOFS, 0);
|
2024-01-29 09:46:10 +00:00
|
|
|
if (!folio_array[i])
|
|
|
|
goto error;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
error:
|
|
|
|
for (int i = 0; i < nr_folios; i++) {
|
|
|
|
if (folio_array[i])
|
|
|
|
folio_put(folio_array[i]);
|
|
|
|
}
|
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
2024-06-27 03:52:16 +00:00
|
|
|
* Populate every free slot in a provided array with pages, using GFP_NOFS.
|
2022-03-30 20:11:22 +00:00
|
|
|
*
|
|
|
|
* @nr_pages: number of pages to allocate
|
|
|
|
* @page_array: the array to fill with pages; any existing non-null entries in
|
2024-06-27 03:52:16 +00:00
|
|
|
* the array will be skipped
|
|
|
|
* @nofail: whether using __GFP_NOFAIL flag
|
2022-03-30 20:11:22 +00:00
|
|
|
*
|
|
|
|
* Return: 0 if all pages were able to be allocated;
|
btrfs: free the allocated memory if btrfs_alloc_page_array() fails
[BUG]
If btrfs_alloc_page_array() fail to allocate all pages but part of the
slots, then the partially allocated pages would be leaked in function
btrfs_submit_compressed_read().
[CAUSE]
As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM,
caller is responsible to free the partially allocated pages.
For the existing call sites, most of them are fine:
- btrfs_raid_bio::stripe_pages
Handled by free_raid_bio().
- extent_buffer::pages[]
Handled btrfs_release_extent_buffer_pages().
- scrub_stripe::pages[]
Handled by release_scrub_stripe().
But there is one exception in btrfs_submit_compressed_read(), if
btrfs_alloc_page_array() failed, we didn't cleanup the array and freed
the array pointer directly.
Initially there is still the error handling in commit dd137dd1f2d7
("btrfs: factor out allocating an array of pages"), but later in commit
544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio"),
the error handling is removed, leading to the possible memory leak.
[FIX]
This patch would add back the error handling first, then to prevent such
situation from happening again, also
Make btrfs_alloc_page_array() to free the allocated pages as a extra
safety net, then we don't need to add the error handling to
btrfs_submit_compressed_read().
Fixes: 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24 04:23:50 +00:00
|
|
|
* -ENOMEM otherwise, the partially allocated pages would be freed and
|
|
|
|
* the array slots zeroed
|
2022-03-30 20:11:22 +00:00
|
|
|
*/
|
2023-11-29 22:32:08 +00:00
|
|
|
int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
|
2024-06-27 03:52:16 +00:00
|
|
|
bool nofail)
|
2022-03-30 20:11:22 +00:00
|
|
|
{
|
2024-06-27 03:52:16 +00:00
|
|
|
const gfp_t gfp = nofail ? (GFP_NOFS | __GFP_NOFAIL) : GFP_NOFS;
|
2022-03-30 20:11:23 +00:00
|
|
|
unsigned int allocated;
|
2022-03-30 20:11:22 +00:00
|
|
|
|
2022-03-30 20:11:23 +00:00
|
|
|
for (allocated = 0; allocated < nr_pages;) {
|
|
|
|
unsigned int last = allocated;
|
2022-03-30 20:11:22 +00:00
|
|
|
|
2024-03-25 22:46:46 +00:00
|
|
|
allocated = alloc_pages_bulk_array(gfp, nr_pages, page_array);
|
|
|
|
if (unlikely(allocated == last)) {
|
|
|
|
/* No progress, fail and do cleanup. */
|
btrfs: free the allocated memory if btrfs_alloc_page_array() fails
[BUG]
If btrfs_alloc_page_array() fail to allocate all pages but part of the
slots, then the partially allocated pages would be leaked in function
btrfs_submit_compressed_read().
[CAUSE]
As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM,
caller is responsible to free the partially allocated pages.
For the existing call sites, most of them are fine:
- btrfs_raid_bio::stripe_pages
Handled by free_raid_bio().
- extent_buffer::pages[]
Handled btrfs_release_extent_buffer_pages().
- scrub_stripe::pages[]
Handled by release_scrub_stripe().
But there is one exception in btrfs_submit_compressed_read(), if
btrfs_alloc_page_array() failed, we didn't cleanup the array and freed
the array pointer directly.
Initially there is still the error handling in commit dd137dd1f2d7
("btrfs: factor out allocating an array of pages"), but later in commit
544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio"),
the error handling is removed, leading to the possible memory leak.
[FIX]
This patch would add back the error handling first, then to prevent such
situation from happening again, also
Make btrfs_alloc_page_array() to free the allocated pages as a extra
safety net, then we don't need to add the error handling to
btrfs_submit_compressed_read().
Fixes: 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24 04:23:50 +00:00
|
|
|
for (int i = 0; i < allocated; i++) {
|
|
|
|
__free_page(page_array[i]);
|
|
|
|
page_array[i] = NULL;
|
|
|
|
}
|
2022-03-30 20:11:22 +00:00
|
|
|
return -ENOMEM;
|
btrfs: free the allocated memory if btrfs_alloc_page_array() fails
[BUG]
If btrfs_alloc_page_array() fail to allocate all pages but part of the
slots, then the partially allocated pages would be leaked in function
btrfs_submit_compressed_read().
[CAUSE]
As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM,
caller is responsible to free the partially allocated pages.
For the existing call sites, most of them are fine:
- btrfs_raid_bio::stripe_pages
Handled by free_raid_bio().
- extent_buffer::pages[]
Handled btrfs_release_extent_buffer_pages().
- scrub_stripe::pages[]
Handled by release_scrub_stripe().
But there is one exception in btrfs_submit_compressed_read(), if
btrfs_alloc_page_array() failed, we didn't cleanup the array and freed
the array pointer directly.
Initially there is still the error handling in commit dd137dd1f2d7
("btrfs: factor out allocating an array of pages"), but later in commit
544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio"),
the error handling is removed, leading to the possible memory leak.
[FIX]
This patch would add back the error handling first, then to prevent such
situation from happening again, also
Make btrfs_alloc_page_array() to free the allocated pages as a extra
safety net, then we don't need to add the error handling to
btrfs_submit_compressed_read().
Fixes: 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-11-24 04:23:50 +00:00
|
|
|
}
|
2022-03-30 20:11:22 +00:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:27 +00:00
|
|
|
/*
|
|
|
|
* Populate needed folios for the extent buffer.
|
|
|
|
*
|
|
|
|
* For now, the folios populated are always in order 0 (aka, single page).
|
|
|
|
*/
|
2024-06-27 03:52:16 +00:00
|
|
|
static int alloc_eb_folio_array(struct extent_buffer *eb, bool nofail)
|
2023-12-06 23:09:27 +00:00
|
|
|
{
|
|
|
|
struct page *page_array[INLINE_EXTENT_BUFFER_PAGES] = { 0 };
|
|
|
|
int num_pages = num_extent_pages(eb);
|
|
|
|
int ret;
|
|
|
|
|
2024-06-27 03:52:16 +00:00
|
|
|
ret = btrfs_alloc_page_array(num_pages, page_array, nofail);
|
2023-12-06 23:09:27 +00:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
for (int i = 0; i < num_pages; i++)
|
|
|
|
eb->folios[i] = page_folio(page_array[i]);
|
2024-01-05 05:35:55 +00:00
|
|
|
eb->folio_size = PAGE_SIZE;
|
|
|
|
eb->folio_shift = PAGE_SHIFT;
|
2023-12-06 23:09:27 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-02-27 15:17:03 +00:00
|
|
|
static bool btrfs_bio_is_contig(struct btrfs_bio_ctrl *bio_ctrl,
|
2024-07-23 20:32:29 +00:00
|
|
|
struct folio *folio, u64 disk_bytenr,
|
2023-02-27 15:17:03 +00:00
|
|
|
unsigned int pg_offset)
|
|
|
|
{
|
2023-03-07 16:39:43 +00:00
|
|
|
struct bio *bio = &bio_ctrl->bbio->bio;
|
2023-02-27 15:17:03 +00:00
|
|
|
struct bio_vec *bvec = bio_last_bvec_all(bio);
|
|
|
|
const sector_t sector = disk_bytenr >> SECTOR_SHIFT;
|
2024-07-23 20:32:29 +00:00
|
|
|
struct folio *bv_folio = page_folio(bvec->bv_page);
|
2023-02-27 15:17:03 +00:00
|
|
|
|
|
|
|
if (bio_ctrl->compress_type != BTRFS_COMPRESS_NONE) {
|
|
|
|
/*
|
|
|
|
* For compression, all IO should have its logical bytenr set
|
|
|
|
* to the starting bytenr of the compressed extent.
|
|
|
|
*/
|
|
|
|
return bio->bi_iter.bi_sector == sector;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The contig check requires the following conditions to be met:
|
|
|
|
*
|
2024-07-23 20:32:29 +00:00
|
|
|
* 1) The folios are belonging to the same inode
|
2023-02-27 15:17:03 +00:00
|
|
|
* This is implied by the call chain.
|
|
|
|
*
|
|
|
|
* 2) The range has adjacent logical bytenr
|
|
|
|
*
|
|
|
|
* 3) The range has adjacent file offset
|
|
|
|
* This is required for the usage of btrfs_bio->file_offset.
|
|
|
|
*/
|
|
|
|
return bio_end_sector(bio) == sector &&
|
2024-07-23 20:32:29 +00:00
|
|
|
folio_pos(bv_folio) + bvec->bv_offset + bvec->bv_len ==
|
|
|
|
folio_pos(folio) + pg_offset;
|
2023-02-27 15:17:03 +00:00
|
|
|
}
|
|
|
|
|
2023-02-21 16:21:04 +00:00
|
|
|
static void alloc_new_bio(struct btrfs_inode *inode,
|
|
|
|
struct btrfs_bio_ctrl *bio_ctrl,
|
|
|
|
u64 disk_bytenr, u64 file_offset)
|
btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
There is a lot of code inside extent_io.c needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with some
parameters like "unsigned long bio_flags".
Such strange parameters are here for bio assembly.
For example, we have such inode page layout:
0 4K 8K 12K
|<-- Extent A-->|<- EB->|
Then what we do is:
- Page [0, 4K)
*bio_ret = NULL
So we allocate a new bio to bio_ret,
Add page [0, 4K) to *bio_ret.
- Page [4K, 8K)
*bio_ret != NULL
We found this page is continuous to *bio_ret,
and if we're not at stripe boundary, we
add page [4K, 8K) to *bio_ret.
- Page [8K, 12K)
*bio_ret != NULL
But we found this page is not continuous, so
we submit *bio_ret, then allocate a new bio,
and add page [8K, 12K) to the new bio.
This means we need to record both the bio and its bio_flag, but we
record them manually using those strange parameter list, other than
encapsulating them into their own structure.
So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.
Also, in above case, for all pages added to the bio, we need to check if
the new page crosses stripe boundary. This check itself can be time
consuming, and we don't really need to do that for each page.
This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundary is
also calculated, so no matter how large the bio will be, we only
calculate the boundaries once, to save some CPU time.
The following functions/structures are affected:
- struct extent_page_data
Replace its bio pointer with structure btrfs_bio_ctrl (embedded
structure, not pointer)
- end_write_bio()
- flush_write_bio()
Just change how bio is fetched
- btrfs_bio_add_page()
Use pre-calculated boundaries instead of re-calculating them.
And use @bio_ctrl to replace @bio and @prev_bio_flags.
- calc_bio_boundaries()
New function
- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab
bio.
- btrfs_bio_fits_in_ordered_extent()
Removed, as now the ordered extent size limit is done at bio
allocation time, no need to check for each page range.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-14 08:42:15 +00:00
|
|
|
{
|
2023-02-21 16:21:04 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode->root->fs_info;
|
2023-03-07 16:39:44 +00:00
|
|
|
struct btrfs_bio *bbio;
|
2023-02-21 16:21:04 +00:00
|
|
|
|
2023-03-23 09:01:20 +00:00
|
|
|
bbio = btrfs_bio_alloc(BIO_MAX_VECS, bio_ctrl->opf, fs_info,
|
2023-03-07 16:39:44 +00:00
|
|
|
bio_ctrl->end_io_func, NULL);
|
|
|
|
bbio->bio.bi_iter.bi_sector = disk_bytenr >> SECTOR_SHIFT;
|
2023-03-23 09:01:20 +00:00
|
|
|
bbio->inode = inode;
|
2023-03-07 16:39:44 +00:00
|
|
|
bbio->file_offset = file_offset;
|
|
|
|
bio_ctrl->bbio = bbio;
|
2023-02-21 16:21:04 +00:00
|
|
|
bio_ctrl->len_to_oe_boundary = U32_MAX;
|
btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
There is a lot of code inside extent_io.c needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with some
parameters like "unsigned long bio_flags".
Such strange parameters are here for bio assembly.
For example, we have such inode page layout:
0 4K 8K 12K
|<-- Extent A-->|<- EB->|
Then what we do is:
- Page [0, 4K)
*bio_ret = NULL
So we allocate a new bio to bio_ret,
Add page [0, 4K) to *bio_ret.
- Page [4K, 8K)
*bio_ret != NULL
We found this page is continuous to *bio_ret,
and if we're not at stripe boundary, we
add page [4K, 8K) to *bio_ret.
- Page [8K, 12K)
*bio_ret != NULL
But we found this page is not continuous, so
we submit *bio_ret, then allocate a new bio,
and add page [8K, 12K) to the new bio.
This means we need to record both the bio and its bio_flag, but we
record them manually using those strange parameter list, other than
encapsulating them into their own structure.
So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.
Also, in above case, for all pages added to the bio, we need to check if
the new page crosses stripe boundary. This check itself can be time
consuming, and we don't really need to do that for each page.
This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundary is
also calculated, so no matter how large the bio will be, we only
calculate the boundaries once, to save some CPU time.
The following functions/structures are affected:
- struct extent_page_data
Replace its bio pointer with structure btrfs_bio_ctrl (embedded
structure, not pointer)
- end_write_bio()
- flush_write_bio()
Just change how bio is fetched
- btrfs_bio_add_page()
Use pre-calculated boundaries instead of re-calculating them.
And use @bio_ctrl to replace @bio and @prev_bio_flags.
- calc_bio_boundaries()
New function
- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab
bio.
- btrfs_bio_fits_in_ordered_extent()
Removed, as now the ordered extent size limit is done at bio
allocation time, no need to check for each page range.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-14 08:42:15 +00:00
|
|
|
|
2023-05-31 07:53:55 +00:00
|
|
|
/* Limit data write bios to the ordered boundary. */
|
|
|
|
if (bio_ctrl->wbc) {
|
2023-02-21 16:21:04 +00:00
|
|
|
struct btrfs_ordered_extent *ordered;
|
|
|
|
|
2023-01-21 06:50:22 +00:00
|
|
|
ordered = btrfs_lookup_ordered_extent(inode, file_offset);
|
|
|
|
if (ordered) {
|
|
|
|
bio_ctrl->len_to_oe_boundary = min_t(u32, U32_MAX,
|
2022-12-12 07:37:18 +00:00
|
|
|
ordered->file_offset +
|
|
|
|
ordered->disk_num_bytes - file_offset);
|
2023-05-31 07:54:02 +00:00
|
|
|
bbio->ordered = ordered;
|
2023-01-21 06:50:22 +00:00
|
|
|
}
|
btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
There is a lot of code inside extent_io.c needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with some
parameters like "unsigned long bio_flags".
Such strange parameters are here for bio assembly.
For example, we have such inode page layout:
0 4K 8K 12K
|<-- Extent A-->|<- EB->|
Then what we do is:
- Page [0, 4K)
*bio_ret = NULL
So we allocate a new bio to bio_ret,
Add page [0, 4K) to *bio_ret.
- Page [4K, 8K)
*bio_ret != NULL
We found this page is continuous to *bio_ret,
and if we're not at stripe boundary, we
add page [4K, 8K) to *bio_ret.
- Page [8K, 12K)
*bio_ret != NULL
But we found this page is not continuous, so
we submit *bio_ret, then allocate a new bio,
and add page [8K, 12K) to the new bio.
This means we need to record both the bio and its bio_flag, but we
record them manually using those strange parameter list, other than
encapsulating them into their own structure.
So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.
Also, in above case, for all pages added to the bio, we need to check if
the new page crosses stripe boundary. This check itself can be time
consuming, and we don't really need to do that for each page.
This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundary is
also calculated, so no matter how large the bio will be, we only
calculate the boundaries once, to save some CPU time.
The following functions/structures are affected:
- struct extent_page_data
Replace its bio pointer with structure btrfs_bio_ctrl (embedded
structure, not pointer)
- end_write_bio()
- flush_write_bio()
Just change how bio is fetched
- btrfs_bio_add_page()
Use pre-calculated boundaries instead of re-calculating them.
And use @bio_ctrl to replace @bio and @prev_bio_flags.
- calc_bio_boundaries()
New function
- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab
bio.
- btrfs_bio_fits_in_ordered_extent()
Removed, as now the ordered extent size limit is done at bio
allocation time, no need to check for each page range.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-04-14 08:42:15 +00:00
|
|
|
|
2022-03-24 16:52:10 +00:00
|
|
|
/*
|
2023-01-21 06:50:30 +00:00
|
|
|
* Pick the last added device to support cgroup writeback. For
|
|
|
|
* multi-device file systems this means blk-cgroup policies have
|
|
|
|
* to always be set on the last added/replaced device.
|
|
|
|
* This is a bit odd but has been like that for a long time.
|
2022-03-24 16:52:10 +00:00
|
|
|
*/
|
2023-03-07 16:39:44 +00:00
|
|
|
bio_set_dev(&bbio->bio, fs_info->fs_devices->latest_dev->bdev);
|
|
|
|
wbc_init_bio(bio_ctrl->wbc, &bbio->bio);
|
btrfs: subpage: allow submit_extent_page() to do bio split
Current submit_extent_page() just checks if the current page range can
be fitted into current bio, and if not, submit then re-add.
But this behavior can't handle subpage case at all.
For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.
This means, if we can't fit a full 64K into a bio, due to stripe limit,
then it won't fit into next bio without crossing stripe either.
The proper way to handle it is:
- Check how many bytes we can be put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio
Refactor submit_extent_page() so that it does the above iteration.
The main loop inside submit_extent_page() will look like this:
cur = pg_offset;
while (cur < pg_offset + size) {
u32 offset = cur - pg_offset;
int added;
if (!bio_ctrl->bio) {
/* Allocate new bio if needed */
}
/* Add as many bytes into the bio */
added = btrfs_bio_add_page();
if (added < size - offset) {
/* The current bio is full, submit it */
}
cur += added;
}
Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 06:35:00 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-06-06 17:14:26 +00:00
|
|
|
/*
|
2021-01-06 01:01:40 +00:00
|
|
|
* @disk_bytenr: logical bytenr where the write will be
|
btrfs: switch page and disk_bytenr argument position for submit_extent_page()
Normally we put (page, pg_len, pg_offset) arguments together, just like
what __bio_add_page() does.
But in submit_extent_page(), what we got is, (page, disk_bytenr, pg_len,
pg_offset), which sometimes can be confusing.
Change the order to (disk_bytenr, page, pg_len, pg_offset) to make it
to follow the common schema.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-13 05:31:13 +00:00
|
|
|
* @page: page to add to the bio
|
2021-01-06 01:01:40 +00:00
|
|
|
* @size: portion of page that we want to write to
|
2017-06-12 17:50:41 +00:00
|
|
|
* @pg_offset: offset of the new bio or to check whether we are adding
|
|
|
|
* a contiguous page to the previous one
|
2022-09-13 05:31:12 +00:00
|
|
|
*
|
2023-03-07 16:39:43 +00:00
|
|
|
* The will either add the page into the existing @bio_ctrl->bbio, or allocate a
|
|
|
|
* new one in @bio_ctrl->bbio.
|
2022-09-13 05:31:12 +00:00
|
|
|
* The mirror number for this IO should already be initizlied in
|
|
|
|
* @bio_ctrl->mirror_num.
|
2017-06-06 17:14:26 +00:00
|
|
|
*/
|
2024-07-23 20:32:29 +00:00
|
|
|
static void submit_extent_folio(struct btrfs_bio_ctrl *bio_ctrl,
|
|
|
|
u64 disk_bytenr, struct folio *folio,
|
2023-02-27 15:17:01 +00:00
|
|
|
size_t size, unsigned long pg_offset)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-07-23 20:32:29 +00:00
|
|
|
struct btrfs_inode *inode = folio_to_inode(folio);
|
2022-09-13 05:31:14 +00:00
|
|
|
|
2023-02-27 15:17:04 +00:00
|
|
|
ASSERT(pg_offset + size <= PAGE_SIZE);
|
2022-09-13 05:31:14 +00:00
|
|
|
ASSERT(bio_ctrl->end_io_func);
|
|
|
|
|
2023-03-07 16:39:43 +00:00
|
|
|
if (bio_ctrl->bbio &&
|
2024-07-23 20:32:29 +00:00
|
|
|
!btrfs_bio_is_contig(bio_ctrl, folio, disk_bytenr, pg_offset))
|
2023-02-27 15:17:03 +00:00
|
|
|
submit_one_bio(bio_ctrl);
|
|
|
|
|
2023-02-27 15:17:04 +00:00
|
|
|
do {
|
|
|
|
u32 len = size;
|
btrfs: subpage: allow submit_extent_page() to do bio split
Current submit_extent_page() just checks if the current page range can
be fitted into current bio, and if not, submit then re-add.
But this behavior can't handle subpage case at all.
For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.
This means, if we can't fit a full 64K into a bio, due to stripe limit,
then it won't fit into next bio without crossing stripe either.
The proper way to handle it is:
- Check how many bytes we can be put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio
Refactor submit_extent_page() so that it does the above iteration.
The main loop inside submit_extent_page() will look like this:
cur = pg_offset;
while (cur < pg_offset + size) {
u32 offset = cur - pg_offset;
int added;
if (!bio_ctrl->bio) {
/* Allocate new bio if needed */
}
/* Add as many bytes into the bio */
added = btrfs_bio_add_page();
if (added < size - offset) {
/* The current bio is full, submit it */
}
cur += added;
}
Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 06:35:00 +00:00
|
|
|
|
|
|
|
/* Allocate new bio if needed */
|
2023-03-07 16:39:43 +00:00
|
|
|
if (!bio_ctrl->bbio) {
|
2023-02-27 15:16:57 +00:00
|
|
|
alloc_new_bio(inode, bio_ctrl, disk_bytenr,
|
2024-07-23 20:32:29 +00:00
|
|
|
folio_pos(folio) + pg_offset);
|
btrfs: subpage: allow submit_extent_page() to do bio split
Current submit_extent_page() just checks if the current page range can
be fitted into current bio, and if not, submit then re-add.
But this behavior can't handle subpage case at all.
For subpage case, the problem is in the page size, 64K, which is also
the same size as stripe size.
This means, if we can't fit a full 64K into a bio, due to stripe limit,
then it won't fit into next bio without crossing stripe either.
The proper way to handle it is:
- Check how many bytes we can be put into current bio
- Put as many bytes as possible into current bio first
- Submit current bio
- Create a new bio
- Add the remaining bytes into the new bio
Refactor submit_extent_page() so that it does the above iteration.
The main loop inside submit_extent_page() will look like this:
cur = pg_offset;
while (cur < pg_offset + size) {
u32 offset = cur - pg_offset;
int added;
if (!bio_ctrl->bio) {
/* Allocate new bio if needed */
}
/* Add as many bytes into the bio */
added = btrfs_bio_add_page();
if (added < size - offset) {
/* The current bio is full, submit it */
}
cur += added;
}
Also, since we're doing new bio allocation deep inside the main loop,
extract that code into a new helper, alloc_new_bio().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 06:35:00 +00:00
|
|
|
}
|
2023-02-27 15:17:04 +00:00
|
|
|
|
|
|
|
/* Cap to the current ordered extent boundary if there is one. */
|
|
|
|
if (len > bio_ctrl->len_to_oe_boundary) {
|
|
|
|
ASSERT(bio_ctrl->compress_type == BTRFS_COMPRESS_NONE);
|
2022-06-15 13:03:11 +00:00
|
|
|
ASSERT(is_data_inode(inode));
|
2023-02-27 15:17:04 +00:00
|
|
|
len = bio_ctrl->len_to_oe_boundary;
|
|
|
|
}
|
|
|
|
|
2024-07-23 20:32:29 +00:00
|
|
|
if (!bio_add_folio(&bio_ctrl->bbio->bio, folio, len, pg_offset)) {
|
2023-02-27 15:17:04 +00:00
|
|
|
/* bio full: move on to a new one */
|
2022-06-03 07:11:03 +00:00
|
|
|
submit_one_bio(bio_ctrl);
|
2023-02-27 15:17:04 +00:00
|
|
|
continue;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2023-02-27 15:17:04 +00:00
|
|
|
|
|
|
|
if (bio_ctrl->wbc)
|
2024-07-23 20:32:29 +00:00
|
|
|
wbc_account_cgroup_owner(bio_ctrl->wbc, &folio->page,
|
|
|
|
len);
|
2023-02-27 15:17:04 +00:00
|
|
|
|
|
|
|
size -= len;
|
|
|
|
pg_offset += len;
|
|
|
|
disk_bytenr += len;
|
btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.
Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:
- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?
The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.
This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.
The code has been clean up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think the commit
24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.
The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.
Sample reproducer, note you'll need to change the path to the bdi and
device:
SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs
btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync
echo 4 > /proc/sys/vm/drop_caches
echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
while true; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done
Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-01 16:28:28 +00:00
|
|
|
|
|
|
|
/*
|
2024-07-23 20:32:29 +00:00
|
|
|
* len_to_oe_boundary defaults to U32_MAX, which isn't folio or
|
btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.
Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:
- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?
The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.
This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.
The code has been clean up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think the commit
24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.
The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.
Sample reproducer, note you'll need to change the path to the bdi and
device:
SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs
btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync
echo 4 > /proc/sys/vm/drop_caches
echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
while true; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done
Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-01 16:28:28 +00:00
|
|
|
* sector aligned. alloc_new_bio() then sets it to the end of
|
|
|
|
* our ordered extent for writes into zoned devices.
|
|
|
|
*
|
|
|
|
* When len_to_oe_boundary is tracking an ordered extent, we
|
|
|
|
* trust the ordered extent code to align things properly, and
|
|
|
|
* the check above to cap our write to the ordered extent
|
|
|
|
* boundary is correct.
|
|
|
|
*
|
|
|
|
* When len_to_oe_boundary is U32_MAX, the cap above would
|
2024-07-23 20:32:29 +00:00
|
|
|
* result in a 4095 byte IO for the last folio right before
|
|
|
|
* we hit the bio limit of UINT_MAX. bio_add_folio() has all
|
btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.
Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:
- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?
The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.
This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.
The code has been clean up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think the commit
24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.
The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.
Sample reproducer, note you'll need to change the path to the bdi and
device:
SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs
btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync
echo 4 > /proc/sys/vm/drop_caches
echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
while true; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done
Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-01 16:28:28 +00:00
|
|
|
* the checks required to make sure we don't overflow the bio,
|
|
|
|
* and we should just ignore len_to_oe_boundary completely
|
|
|
|
* unless we're using it to track an ordered extent.
|
|
|
|
*
|
|
|
|
* It's pretty hard to make a bio sized U32_MAX, but it can
|
|
|
|
* happen when the page cache is able to feed us contiguous
|
2024-07-23 20:32:29 +00:00
|
|
|
* folios for large extents.
|
btrfs: only subtract from len_to_oe_boundary when it is tracking an extent
bio_ctrl->len_to_oe_boundary is used to make sure we stay inside a zone
as we submit bios for writes. Every time we add a page to the bio, we
decrement those bytes from len_to_oe_boundary, and then we submit the
bio if we happen to hit zero.
Most of the time, len_to_oe_boundary gets set to U32_MAX.
submit_extent_page() adds pages into our bio, and the size of the bio
ends up limited by:
- Are we contiguous on disk?
- Does bio_add_page() allow us to stuff more in?
- is len_to_oe_boundary > 0?
The len_to_oe_boundary math starts with U32_MAX, which isn't page or
sector aligned, and subtracts from it until it hits zero. In the
non-zoned case, the last IO we submit before we hit zero is going to be
unaligned, triggering BUGs.
This is hard to trigger because bio_add_page() isn't going to make a bio
of U32_MAX size unless you give it a perfect set of pages and fully
contiguous extents on disk. We can hit it pretty reliably while making
large swapfiles during provisioning because the machine is freshly
booted, mostly idle, and the disk is freshly formatted. It's also
possible to trigger with reads when read_ahead_kb is set to 4GB.
The code has been clean up and shifted around a few times, but this flaw
has been lurking since the counter was added. I think the commit
24e6c8082208 ("btrfs: simplify main loop in submit_extent_page") ended
up exposing the bug.
The fix used here is to skip doing math on len_to_oe_boundary unless
we've changed it from the default U32_MAX value. bio_add_page() is the
real limit we want, and there's no reason to do extra math when block
layer is doing it for us.
Sample reproducer, note you'll need to change the path to the bdi and
device:
SUBVOL=/btrfs/swapvol
SWAPFILE=$SUBVOL/swapfile
SZMB=8192
mkfs.btrfs -f /dev/vdb
mount /dev/vdb /btrfs
btrfs subvol create $SUBVOL
chattr +C $SUBVOL
dd if=/dev/zero of=$SWAPFILE bs=1M count=$SZMB
sync
echo 4 > /proc/sys/vm/drop_caches
echo 4194304 > /sys/class/bdi/btrfs-2/read_ahead_kb
while true; do
echo 1 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/drop_caches
dd of=/dev/zero if=$SWAPFILE bs=4096M count=2 iflag=fullblock
done
Fixes: 24e6c8082208 ("btrfs: simplify main loop in submit_extent_page")
CC: stable@vger.kernel.org # 6.4+
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-01 16:28:28 +00:00
|
|
|
*/
|
|
|
|
if (bio_ctrl->len_to_oe_boundary != U32_MAX)
|
|
|
|
bio_ctrl->len_to_oe_boundary -= len;
|
2023-02-27 15:17:04 +00:00
|
|
|
|
|
|
|
/* Ordered extent boundary: move on to a new bio. */
|
|
|
|
if (bio_ctrl->len_to_oe_boundary == 0)
|
|
|
|
submit_one_bio(bio_ctrl);
|
|
|
|
} while (size);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
static int attach_extent_buffer_folio(struct extent_buffer *eb,
|
|
|
|
struct folio *folio,
|
|
|
|
struct btrfs_subpage *prealloc)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2021-01-26 08:33:48 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
|
|
|
int ret = 0;
|
|
|
|
|
2020-10-21 06:25:02 +00:00
|
|
|
/*
|
|
|
|
* If the page is mapped to btree inode, we should hold the private
|
|
|
|
* lock to prevent race.
|
|
|
|
* For cloned or dummy extent buffers, their pages are not mapped and
|
|
|
|
* will not race with any other ebs.
|
|
|
|
*/
|
2023-12-06 23:09:28 +00:00
|
|
|
if (folio->mapping)
|
for-6.8-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
2024-01-10 17:27:40 +00:00
|
|
|
lockdep_assert_held(&folio->mapping->i_private_lock);
|
2020-10-21 06:25:02 +00:00
|
|
|
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
if (fs_info->nodesize >= PAGE_SIZE) {
|
2023-11-17 03:54:14 +00:00
|
|
|
if (!folio_test_private(folio))
|
|
|
|
folio_attach_private(folio, eb);
|
2021-01-26 08:33:48 +00:00
|
|
|
else
|
2023-11-17 03:54:14 +00:00
|
|
|
WARN_ON(folio_get_private(folio) != eb);
|
2021-01-26 08:33:48 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Already mapped, just free prealloc */
|
2023-11-17 03:54:14 +00:00
|
|
|
if (folio_test_private(folio)) {
|
2021-01-26 08:33:48 +00:00
|
|
|
btrfs_free_subpage(prealloc);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (prealloc)
|
|
|
|
/* Has preallocated memory for subpage */
|
2023-11-17 03:54:14 +00:00
|
|
|
folio_attach_private(folio, prealloc);
|
2020-06-02 04:47:45 +00:00
|
|
|
else
|
2021-01-26 08:33:48 +00:00
|
|
|
/* Do new allocation to attach subpage */
|
2023-12-12 02:28:37 +00:00
|
|
|
ret = btrfs_attach_subpage(fs_info, folio, BTRFS_SUBPAGE_METADATA);
|
2021-01-26 08:33:48 +00:00
|
|
|
return ret;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2021-01-26 08:34:00 +00:00
|
|
|
int set_page_extent_mapped(struct page *page)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2023-12-14 16:13:29 +00:00
|
|
|
return set_folio_extent_mapped(page_folio(page));
|
|
|
|
}
|
|
|
|
|
|
|
|
int set_folio_extent_mapped(struct folio *folio)
|
|
|
|
{
|
2021-01-26 08:34:00 +00:00
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
|
2023-12-14 16:13:29 +00:00
|
|
|
ASSERT(folio->mapping);
|
2021-01-26 08:34:00 +00:00
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
if (folio_test_private(folio))
|
2021-01-26 08:34:00 +00:00
|
|
|
return 0;
|
|
|
|
|
2023-09-14 14:24:43 +00:00
|
|
|
fs_info = folio_to_fs_info(folio);
|
2021-01-26 08:34:00 +00:00
|
|
|
|
2023-12-14 16:13:29 +00:00
|
|
|
if (btrfs_is_subpage(fs_info, folio->mapping))
|
2023-12-12 02:28:37 +00:00
|
|
|
return btrfs_attach_subpage(fs_info, folio, BTRFS_SUBPAGE_DATA);
|
2021-01-26 08:34:00 +00:00
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
folio_attach_private(folio, (void *)EXTENT_FOLIO_PRIVATE);
|
2021-01-26 08:34:00 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2024-08-28 18:28:55 +00:00
|
|
|
void clear_folio_extent_mapped(struct folio *folio)
|
2021-01-26 08:34:00 +00:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info;
|
|
|
|
|
2024-08-28 18:28:55 +00:00
|
|
|
ASSERT(folio->mapping);
|
2021-01-26 08:34:00 +00:00
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
if (!folio_test_private(folio))
|
2021-01-26 08:34:00 +00:00
|
|
|
return;
|
|
|
|
|
2024-08-28 18:28:55 +00:00
|
|
|
fs_info = folio_to_fs_info(folio);
|
|
|
|
if (btrfs_is_subpage(fs_info, folio->mapping))
|
2023-12-12 02:28:37 +00:00
|
|
|
return btrfs_detach_subpage(fs_info, folio);
|
2021-01-26 08:34:00 +00:00
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
folio_detach_private(folio);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2024-07-24 23:28:26 +00:00
|
|
|
static struct extent_map *__get_extent_map(struct inode *inode,
|
|
|
|
struct folio *folio, u64 start,
|
|
|
|
u64 len, struct extent_map **em_cached)
|
2013-07-25 11:22:37 +00:00
|
|
|
{
|
|
|
|
struct extent_map *em;
|
btrfs: do not hold the extent lock for entire read
Historically we've held the extent lock throughout the entire read.
There's been a few reasons for this, but it's mostly just caused us
problems. For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock. This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.
On the buffered side we are protected by the page lock, as long as we're
reading things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.
On the direct side we have the dio extent lock, which acts much like the
way the extent lock worked previously to this patch, however just for
direct reads. This protects direct reads from concurrent direct writes,
while we're protected from buffered writes via the inode lock.
Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation. Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map. This portion was contributed by Goldwyn.
Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-16 19:16:24 +00:00
|
|
|
struct extent_state *cached_state = NULL;
|
2013-07-25 11:22:37 +00:00
|
|
|
|
2024-02-06 22:45:09 +00:00
|
|
|
ASSERT(em_cached);
|
|
|
|
|
|
|
|
if (*em_cached) {
|
2013-07-25 11:22:37 +00:00
|
|
|
em = *em_cached;
|
2014-02-25 14:15:12 +00:00
|
|
|
if (extent_map_in_tree(em) && start >= em->start &&
|
2013-07-25 11:22:37 +00:00
|
|
|
start < extent_map_end(em)) {
|
2017-03-03 08:55:12 +00:00
|
|
|
refcount_inc(&em->refs);
|
2013-07-25 11:22:37 +00:00
|
|
|
return em;
|
|
|
|
}
|
|
|
|
|
|
|
|
free_extent_map(em);
|
|
|
|
*em_cached = NULL;
|
|
|
|
}
|
|
|
|
|
btrfs: do not hold the extent lock for entire read
Historically we've held the extent lock throughout the entire read.
There's been a few reasons for this, but it's mostly just caused us
problems. For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock. This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.
On the buffered side we are protected by the page lock, as long as we're
reading things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.
On the direct side we have the dio extent lock, which acts much like the
way the extent lock worked previously to this patch, however just for
direct reads. This protects direct reads from concurrent direct writes,
while we're protected from buffered writes via the inode lock.
Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation. Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map. This portion was contributed by Goldwyn.
Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-16 19:16:24 +00:00
|
|
|
btrfs_lock_and_flush_ordered_range(BTRFS_I(inode), start, start + len - 1, &cached_state);
|
2024-07-24 23:28:26 +00:00
|
|
|
em = btrfs_get_extent(BTRFS_I(inode), folio, start, len);
|
2024-02-06 22:45:09 +00:00
|
|
|
if (!IS_ERR(em)) {
|
2013-07-25 11:22:37 +00:00
|
|
|
BUG_ON(*em_cached);
|
2017-03-03 08:55:12 +00:00
|
|
|
refcount_inc(&em->refs);
|
2013-07-25 11:22:37 +00:00
|
|
|
*em_cached = em;
|
|
|
|
}
|
btrfs: do not hold the extent lock for entire read
Historically we've held the extent lock throughout the entire read.
There's been a few reasons for this, but it's mostly just caused us
problems. For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock. This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.
On the buffered side we are protected by the page lock, as long as we're
reading things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.
On the direct side we have the dio extent lock, which acts much like the
way the extent lock worked previously to this patch, however just for
direct reads. This protects direct reads from concurrent direct writes,
while we're protected from buffered writes via the inode lock.
Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation. Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map. This portion was contributed by Goldwyn.
Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-16 19:16:24 +00:00
|
|
|
unlock_extent(&BTRFS_I(inode)->io_tree, start, start + len - 1, &cached_state);
|
|
|
|
|
2013-07-25 11:22:37 +00:00
|
|
|
return em;
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
|
|
|
* basic readpage implementation. Locked extent state structs are inserted
|
|
|
|
* into the tree that are removed when the IO is done (by the end_io
|
|
|
|
* handlers)
|
2012-03-12 15:03:00 +00:00
|
|
|
* XXX JDM: This needs looking at to ensure proper page locking
|
2016-07-11 17:39:07 +00:00
|
|
|
* return 0 on success, otherwise return error
|
2008-01-24 21:13:08 +00:00
|
|
|
*/
|
2024-07-23 21:06:03 +00:00
|
|
|
static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
|
2023-02-27 15:16:55 +00:00
|
|
|
struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-07-23 21:06:03 +00:00
|
|
|
struct inode *inode = folio->mapping->host;
|
2023-09-14 14:45:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
|
2024-07-23 21:06:03 +00:00
|
|
|
u64 start = folio_pos(folio);
|
2017-06-06 17:50:13 +00:00
|
|
|
const u64 end = start + PAGE_SIZE - 1;
|
2008-01-24 21:13:08 +00:00
|
|
|
u64 cur = start;
|
|
|
|
u64 extent_offset;
|
|
|
|
u64 last_byte = i_size_read(inode);
|
|
|
|
u64 block_start;
|
|
|
|
struct extent_map *em;
|
2016-07-11 17:39:07 +00:00
|
|
|
int ret = 0;
|
2011-04-19 12:29:38 +00:00
|
|
|
size_t pg_offset = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
size_t iosize;
|
2024-01-16 16:33:20 +00:00
|
|
|
size_t blocksize = fs_info->sectorsize;
|
2020-02-05 18:09:30 +00:00
|
|
|
|
2024-07-23 21:06:03 +00:00
|
|
|
ret = set_folio_extent_mapped(folio);
|
2021-01-26 08:34:00 +00:00
|
|
|
if (ret < 0) {
|
2024-07-23 21:06:03 +00:00
|
|
|
folio_unlock(folio);
|
2023-02-27 15:17:01 +00:00
|
|
|
return ret;
|
2021-01-26 08:34:00 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2024-07-23 21:06:03 +00:00
|
|
|
if (folio->index == last_byte >> folio_shift(folio)) {
|
|
|
|
size_t zero_offset = offset_in_folio(folio, last_byte);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
|
|
|
|
if (zero_offset) {
|
2024-07-23 21:06:03 +00:00
|
|
|
iosize = folio_size(folio) - zero_offset;
|
|
|
|
folio_zero_range(folio, zero_offset, iosize);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
}
|
|
|
|
}
|
2023-12-12 02:28:38 +00:00
|
|
|
bio_ctrl->end_io_func = end_bbio_data_read;
|
2024-07-23 21:06:03 +00:00
|
|
|
begin_folio_read(fs_info, folio);
|
2008-01-24 21:13:08 +00:00
|
|
|
while (cur <= end) {
|
2023-02-27 15:16:59 +00:00
|
|
|
enum btrfs_compression_type compress_type = BTRFS_COMPRESS_NONE;
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
bool force_bio_submit = false;
|
2021-01-06 01:01:40 +00:00
|
|
|
u64 disk_bytenr;
|
2013-02-11 16:33:00 +00:00
|
|
|
|
btrfs: subpage: make add_ra_bio_pages() compatible
[BUG]
If we remove the subpage limitation in add_ra_bio_pages(), then read a
compressed extent which has part of its range in next page, like the
following inode layout:
0 32K 64K 96K 128K
|<--------------|-------------->|
Btrfs will trigger ASSERT() in endio function:
assertion failed: atomic_read(&subpage->readers) >= nbits
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3431!
Internal error: Oops - BUG: 0 [#1] SMP
Workqueue: btrfs-endio btrfs_work_helper [btrfs]
Call trace:
assertfail.constprop.0+0x28/0x2c [btrfs]
btrfs_subpage_end_reader+0x148/0x14c [btrfs]
end_page_read+0x8c/0x100 [btrfs]
end_bio_extent_readpage+0x320/0x6b0 [btrfs]
bio_endio+0x15c/0x1dc
end_workqueue_fn+0x44/0x64 [btrfs]
btrfs_work_helper+0x74/0x250 [btrfs]
process_one_work+0x1d4/0x47c
worker_thread+0x180/0x400
kthread+0x11c/0x120
ret_from_fork+0x10/0x30
---[ end trace c8b7b552d3bb408c ]---
[CAUSE]
When we read the page range [0, 64K), we find it's a compressed extent,
and we will try to add extra pages in add_ra_bio_pages() to avoid
reading the same compressed extent.
But when we add such page into the read bio, it doesn't follow the
behavior of btrfs_do_readpage() to properly set subpage::readers.
This means, for page [64K, 128K), its subpage::readers is still 0.
And when endio is executed on both pages, since page [64K, 128K) has 0
subpage::readers, it triggers above ASSERT()
[FIX]
Function add_ra_bio_pages() is far from subpage compatible, it always
assume PAGE_SIZE == sectorsize, thus when it skip to next range it
always just skip PAGE_SIZE.
Make it subpage compatible by:
- Skip to next page properly when needed
If we find there is already a page cache, we need to skip to next page.
For that case, we shouldn't just skip PAGE_SIZE bytes, but use
@pg_index to calculate the next bytenr and continue.
- Only add the page range covered by current extent map
We need to calculate which range is covered by current extent map and
only add that part into the read bio.
- Update subpage::readers before submitting the bio
- Use proper cursor other than confusing @last_offset
- Calculate the missed threshold based on sector size
It's no longer using missed pages, as for 64K page size, we have at
most 3 pages to skip. (If aligned only 2 pages)
- Add ASSERT() to make sure our bytenr is always aligned
- Add comment for the function
Add a special note for subpage case, as the function won't really
work well for subpage cases.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:21:47 +00:00
|
|
|
ASSERT(IS_ALIGNED(cur, fs_info->sectorsize));
|
2008-01-24 21:13:08 +00:00
|
|
|
if (cur >= last_byte) {
|
2024-07-23 21:06:03 +00:00
|
|
|
iosize = folio_size(folio) - pg_offset;
|
|
|
|
folio_zero_range(folio, pg_offset, iosize);
|
|
|
|
end_folio_read(folio, true, cur, iosize);
|
2008-01-24 21:13:08 +00:00
|
|
|
break;
|
|
|
|
}
|
2024-07-24 23:28:26 +00:00
|
|
|
em = __get_extent_map(inode, folio, cur, end - cur + 1,
|
|
|
|
em_cached);
|
2022-02-03 15:36:42 +00:00
|
|
|
if (IS_ERR(em)) {
|
2024-07-23 21:06:03 +00:00
|
|
|
end_folio_read(folio, false, cur, end + 1 - cur);
|
2023-02-27 15:17:01 +00:00
|
|
|
return PTR_ERR(em);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
extent_offset = cur - em->start;
|
|
|
|
BUG_ON(extent_map_end(em) <= cur);
|
|
|
|
BUG_ON(end < cur);
|
|
|
|
|
btrfs: use the flags of an extent map to identify the compression type
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
|
|
|
compress_type = extent_map_compression(em);
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
iosize = min(extent_map_end(em) - cur, end - cur + 1);
|
2013-02-26 08:10:22 +00:00
|
|
|
iosize = ALIGN(iosize, blocksize);
|
2023-02-27 15:16:59 +00:00
|
|
|
if (compress_type != BTRFS_COMPRESS_NONE)
|
2024-04-29 22:23:06 +00:00
|
|
|
disk_bytenr = em->disk_bytenr;
|
2020-09-15 15:41:40 +00:00
|
|
|
else
|
2024-04-29 22:23:06 +00:00
|
|
|
disk_bytenr = extent_map_block_start(em) + extent_offset;
|
|
|
|
block_start = extent_map_block_start(em);
|
btrfs: use the flags of an extent map to identify the compression type
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
|
|
|
if (em->flags & EXTENT_FLAG_PREALLOC)
|
2008-10-30 18:25:28 +00:00
|
|
|
block_start = EXTENT_MAP_HOLE;
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have a file range that points to a compressed extent
|
2020-08-05 02:48:34 +00:00
|
|
|
* and it's followed by a consecutive file range that points
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
* to the same compressed extent (possibly with a different
|
|
|
|
* offset and/or length, so it either points to the whole extent
|
|
|
|
* or only part of it), we must make sure we do not submit a
|
2024-07-23 21:06:03 +00:00
|
|
|
* single bio to populate the folios for the 2 ranges because
|
|
|
|
* this makes the compressed extent read zero out the folios
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
* belonging to the 2nd range. Imagine the following scenario:
|
|
|
|
*
|
|
|
|
* File layout
|
|
|
|
* [0 - 8K] [8K - 24K]
|
|
|
|
* | |
|
|
|
|
* | |
|
|
|
|
* points to extent X, points to extent X,
|
|
|
|
* offset 4K, length of 8K offset 0, length 16K
|
|
|
|
*
|
|
|
|
* [extent X, compressed length = 4K uncompressed length = 16K]
|
|
|
|
*
|
|
|
|
* If the bio to read the compressed extent covers both ranges,
|
2024-07-23 21:06:03 +00:00
|
|
|
* it will decompress extent X into the folios belonging to the
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
* first range and then it will stop, zeroing out the remaining
|
2024-07-23 21:06:03 +00:00
|
|
|
* folios that belong to the other range that points to extent X.
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
* So here we make sure we submit 2 bios, one for the first
|
|
|
|
* range and another one for the third range. Both will target
|
|
|
|
* the same physical extent from disk, but we can't currently
|
2024-07-23 21:06:03 +00:00
|
|
|
* make the compressed bio endio callback populate the folios
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
* for both ranges because each compressed bio is tightly
|
|
|
|
* coupled with a single extent map, and each range can have
|
|
|
|
* an extent map with a different offset value relative to the
|
|
|
|
* uncompressed data of our extent and different lengths. This
|
|
|
|
* is a corner case so we prioritize correctness over
|
|
|
|
* non-optimal behavior (submitting 2 bios for the same extent).
|
|
|
|
*/
|
btrfs: use the flags of an extent map to identify the compression type
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
|
|
|
if (compress_type != BTRFS_COMPRESS_NONE &&
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
prev_em_start && *prev_em_start != (u64)-1 &&
|
Btrfs: fix corruption reading shared and compressed extents after hole punching
In the past we had data corruption when reading compressed extents that
are shared within the same file and they are consecutive, this got fixed
by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and
shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read
corruption of compressed and shared extents"). However there was a case
that was missing in those fixes, which is when the shared and compressed
extents are referenced with a non-zero offset. The following shell script
creates a reproducer for this issue:
#!/bin/bash
mkfs.btrfs -f /dev/sdc &> /dev/null
mount -o compress /dev/sdc /mnt/sdc
# Create a file with 3 consecutive compressed extents, each has an
# uncompressed size of 128Kb and a compressed size of 4Kb.
for ((i = 1; i <= 3; i++)); do
head -c 4096 /dev/zero
for ((j = 1; j <= 31; j++)); do
head -c 4096 /dev/zero | tr '\0' "\377"
done
done > /mnt/sdc/foobar
sync
echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)"
# Clone the first extent into offsets 128K and 256K.
xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
sync
echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)"
# Punch holes into the regions that are already full of zeroes.
xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
sync
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
echo "Dropping page cache..."
sysctl -q vm.drop_caches=1
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
umount /dev/sdc
When running the script we get the following output:
Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
linked 131072/131072 bytes at offset 131072
128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
linked 131072/131072 bytes at offset 262144
128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Dropping page cache...
Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar
This happens because after reading all the pages of the extent in the
range from 128K to 256K for example, we read the hole at offset 256K
and then when reading the page at offset 260K we don't submit the
existing bio, which is responsible for filling all the page in the
range 128K to 256K only, therefore adding the pages from range 260K
to 384K to the existing bio and submitting it after iterating over the
entire range. Once the bio completes, the uncompressed data fills only
the pages in the range 128K to 256K because there's no more data read
from disk, leaving the pages in the range 260K to 384K unfilled. It is
just a slightly different variant of what was solved by commit
005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared
extents").
Fix this by forcing a bio submit, during readpages(), whenever we find a
compressed extent map for a page that is different from the extent map
for the previous page or has a different starting offset (in case it's
the same compressed extent), instead of the extent map's original start
offset.
A test case for fstests follows soon.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents")
Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents")
Cc: stable@vger.kernel.org # 4.3+
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-14 15:17:20 +00:00
|
|
|
*prev_em_start != em->start)
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
force_bio_submit = true;
|
|
|
|
|
|
|
|
if (prev_em_start)
|
Btrfs: fix corruption reading shared and compressed extents after hole punching
In the past we had data corruption when reading compressed extents that
are shared within the same file and they are consecutive, this got fixed
by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and
shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read
corruption of compressed and shared extents"). However there was a case
that was missing in those fixes, which is when the shared and compressed
extents are referenced with a non-zero offset. The following shell script
creates a reproducer for this issue:
#!/bin/bash
mkfs.btrfs -f /dev/sdc &> /dev/null
mount -o compress /dev/sdc /mnt/sdc
# Create a file with 3 consecutive compressed extents, each has an
# uncompressed size of 128Kb and a compressed size of 4Kb.
for ((i = 1; i <= 3; i++)); do
head -c 4096 /dev/zero
for ((j = 1; j <= 31; j++)); do
head -c 4096 /dev/zero | tr '\0' "\377"
done
done > /mnt/sdc/foobar
sync
echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)"
# Clone the first extent into offsets 128K and 256K.
xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
sync
echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)"
# Punch holes into the regions that are already full of zeroes.
xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
sync
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
echo "Dropping page cache..."
sysctl -q vm.drop_caches=1
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
umount /dev/sdc
When running the script we get the following output:
Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
linked 131072/131072 bytes at offset 131072
128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
linked 131072/131072 bytes at offset 262144
128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Dropping page cache...
Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar
This happens because after reading all the pages of the extent in the
range from 128K to 256K for example, we read the hole at offset 256K
and then when reading the page at offset 260K we don't submit the
existing bio, which is responsible for filling all the page in the
range 128K to 256K only, therefore adding the pages from range 260K
to 384K to the existing bio and submitting it after iterating over the
entire range. Once the bio completes, the uncompressed data fills only
the pages in the range 128K to 256K because there's no more data read
from disk, leaving the pages in the range 260K to 384K unfilled. It is
just a slightly different variant of what was solved by commit
005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared
extents").
Fix this by forcing a bio submit, during readpages(), whenever we find a
compressed extent map for a page that is different from the extent map
for the previous page or has a different starting offset (in case it's
the same compressed extent), instead of the extent map's original start
offset.
A test case for fstests follows soon.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents")
Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents")
Cc: stable@vger.kernel.org # 4.3+
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-14 15:17:20 +00:00
|
|
|
*prev_em_start = em->start;
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
File layout
[0 - 8K] [8K - 24K]
| |
| |
points to extent X, points to extent X,
offset 4K, length of 8K offset 0, length 16K
[extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. This has a consequence of the compressed read end io
handler (compression.c:end_compressed_bio_read()) finish once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
free_extent_map(em);
|
|
|
|
em = NULL;
|
|
|
|
|
|
|
|
/* we've found a hole, just zero and go on */
|
|
|
|
if (block_start == EXTENT_MAP_HOLE) {
|
2024-07-23 21:06:03 +00:00
|
|
|
folio_zero_range(folio, pg_offset, iosize);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2024-07-23 21:06:03 +00:00
|
|
|
end_folio_read(folio, true, cur, iosize);
|
2008-01-24 21:13:08 +00:00
|
|
|
cur = cur + iosize;
|
2011-04-19 12:29:38 +00:00
|
|
|
pg_offset += iosize;
|
2008-01-24 21:13:08 +00:00
|
|
|
continue;
|
|
|
|
}
|
2024-07-23 21:06:03 +00:00
|
|
|
/* the get_extent function already copied into the folio */
|
2008-01-29 14:59:12 +00:00
|
|
|
if (block_start == EXTENT_MAP_INLINE) {
|
2024-07-23 21:06:03 +00:00
|
|
|
end_folio_read(folio, true, cur, iosize);
|
2008-01-29 14:59:12 +00:00
|
|
|
cur = cur + iosize;
|
2011-04-19 12:29:38 +00:00
|
|
|
pg_offset += iosize;
|
2008-01-29 14:59:12 +00:00
|
|
|
continue;
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-02-27 15:17:00 +00:00
|
|
|
if (bio_ctrl->compress_type != compress_type) {
|
2023-02-27 15:16:58 +00:00
|
|
|
submit_one_bio(bio_ctrl);
|
2023-02-27 15:17:00 +00:00
|
|
|
bio_ctrl->compress_type = compress_type;
|
|
|
|
}
|
2023-02-27 15:16:58 +00:00
|
|
|
|
2023-02-27 15:16:54 +00:00
|
|
|
if (force_bio_submit)
|
|
|
|
submit_one_bio(bio_ctrl);
|
2024-07-23 21:06:03 +00:00
|
|
|
submit_extent_folio(bio_ctrl, disk_bytenr, folio, iosize,
|
|
|
|
pg_offset);
|
2008-01-24 21:13:08 +00:00
|
|
|
cur = cur + iosize;
|
2011-04-19 12:29:38 +00:00
|
|
|
pg_offset += iosize;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2023-02-27 15:17:01 +00:00
|
|
|
|
|
|
|
return 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2022-05-25 02:55:07 +00:00
|
|
|
int btrfs_read_folio(struct file *file, struct folio *folio)
|
2022-04-15 14:33:24 +00:00
|
|
|
{
|
2023-02-27 15:16:55 +00:00
|
|
|
struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
|
2024-02-06 22:45:09 +00:00
|
|
|
struct extent_map *em_cached = NULL;
|
2022-04-15 14:33:24 +00:00
|
|
|
int ret;
|
|
|
|
|
2024-07-23 21:06:03 +00:00
|
|
|
ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
|
2024-02-06 22:45:09 +00:00
|
|
|
free_extent_map(em_cached);
|
|
|
|
|
2022-04-15 14:33:24 +00:00
|
|
|
/*
|
|
|
|
* If btrfs_do_readpage() failed we will want to submit the assembled
|
|
|
|
* bio to do the cleanup.
|
|
|
|
*/
|
2022-06-03 07:11:03 +00:00
|
|
|
submit_one_bio(&bio_ctrl);
|
2022-04-15 14:33:24 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
2024-08-27 01:30:16 +00:00
|
|
|
* helper for extent_writepage(), doing all of the delayed allocation setup.
|
2014-05-21 20:35:51 +00:00
|
|
|
*
|
2018-11-01 12:09:46 +00:00
|
|
|
* This returns 1 if btrfs_run_delalloc_range function did all the work required
|
2014-05-21 20:35:51 +00:00
|
|
|
* to write the page (copy into inline extent). In this case the IO has
|
|
|
|
* been started and the page is already unlocked.
|
|
|
|
*
|
|
|
|
* This returns 0 if all went well (page still locked)
|
|
|
|
* This returns < 0 if there were errors (page still locked)
|
2008-01-24 21:13:08 +00:00
|
|
|
*/
|
2020-06-05 07:42:10 +00:00
|
|
|
static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
|
2024-07-24 20:03:04 +00:00
|
|
|
struct folio *folio,
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
struct btrfs_bio_ctrl *bio_ctrl)
|
2014-05-21 20:35:51 +00:00
|
|
|
{
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode_to_fs_info(&inode->vfs_inode);
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
struct writeback_control *wbc = bio_ctrl->wbc;
|
2024-07-24 20:03:04 +00:00
|
|
|
const bool is_subpage = btrfs_is_subpage(fs_info, folio->mapping);
|
|
|
|
const u64 page_start = folio_pos(folio);
|
|
|
|
const u64 page_end = page_start + folio_size(folio) - 1;
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
/*
|
|
|
|
* Save the last found delalloc end. As the delalloc end can go beyond
|
|
|
|
* page boundary, thus we cannot rely on subpage bitmap to locate the
|
|
|
|
* last delalloc end.
|
|
|
|
*/
|
|
|
|
u64 last_delalloc_end = 0;
|
2023-06-28 15:31:30 +00:00
|
|
|
u64 delalloc_start = page_start;
|
|
|
|
u64 delalloc_end = page_end;
|
2014-05-21 20:35:51 +00:00
|
|
|
u64 delalloc_to_write = 0;
|
2023-06-28 15:31:31 +00:00
|
|
|
int ret = 0;
|
2014-05-21 20:35:51 +00:00
|
|
|
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
/* Save the dirty bitmap as our submission bitmap will be a subset of it. */
|
|
|
|
if (btrfs_is_subpage(fs_info, inode->vfs_inode.i_mapping)) {
|
|
|
|
ASSERT(fs_info->sectors_per_page > 1);
|
|
|
|
btrfs_get_subpage_dirty_bitmap(fs_info, folio, &bio_ctrl->submit_bitmap);
|
|
|
|
} else {
|
|
|
|
bio_ctrl->submit_bitmap = 1;
|
|
|
|
}
|
|
|
|
|
2024-07-24 20:03:04 +00:00
|
|
|
/* Lock all (subpage) delalloc ranges inside the folio first. */
|
btrfs: subpage: avoid potential deadlock with compression and delalloc
[BUG]
With experimental subpage compression enabled, a simple fsstress can
lead to self deadlock on page 720896:
mkfs.btrfs -f -s 4k $dev > /dev/null
mount $dev -o compress $mnt
$fsstress -p 1 -n 100 -w -d $mnt -v -s 1625511156
[CAUSE]
If we have a file layout looks like below:
0 32K 64K 96K 128K
|//| |///////////////|
4K
Then we run delalloc range for the inode, it will:
- Call find_lock_delalloc_range() with @delalloc_start = 0
Then we got a delalloc range [0, 4K).
This range will be COWed.
- Call find_lock_delalloc_range() again with @delalloc_start = 4K
Since find_lock_delalloc_range() never cares whether the range
is still inside page range [0, 64K), it will return range [64K, 128K).
This range meets the condition for subpage compression, will go
through async COW path.
And async COW path will return @page_started.
But that @page_started is now for range [64K, 128K), not for range
[0, 64K).
- writepage_dellloc() returned 1 for page [0, 64K)
Thus page [0, 64K) will not be unlocked, nor its page dirty status
will be cleared.
Next time when we try to lock page [0, 64K) we will deadlock, as there
is no one to release page [0, 64K).
This problem will never happen for regular page size as one page only
contains one sector. After the first find_lock_delalloc_range() call,
the @delalloc_end will go beyond @page_end no matter if we found a
delalloc range or not
Thus this bug only happens for subpage, as now we need multiple runs to
exhaust the delalloc range of a page.
[FIX]
Fix the problem by ensuring the delalloc range we ran at least started
inside @locked_page.
So that we will never get incorrect @page_started.
And to prevent such problem from happening again:
- Make find_lock_delalloc_range() return false if the found range is
beyond @end value passed in.
Since @end will be utilized now, add an ASSERT() to ensure we pass
correct @end into find_lock_delalloc_range().
This also means, for selftests we needs to populate @end before calling
find_lock_delalloc_range().
- New ASSERT() in find_lock_delalloc_range()
Now we will make sure the @start/@end passed in at least covers part
of the page.
- New ASSERT() in run_delalloc_range()
To make sure the range at least starts inside @locked page.
- Use @delalloc_start as proper cursor, while @delalloc_end is always
reset to @page_end.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:22:07 +00:00
|
|
|
while (delalloc_start < page_end) {
|
2023-06-28 15:31:30 +00:00
|
|
|
delalloc_end = page_end;
|
2024-07-24 20:08:13 +00:00
|
|
|
if (!find_lock_delalloc_range(&inode->vfs_inode, folio,
|
2023-06-28 15:31:30 +00:00
|
|
|
&delalloc_start, &delalloc_end)) {
|
2014-05-21 20:35:51 +00:00
|
|
|
delalloc_start = delalloc_end + 1;
|
|
|
|
continue;
|
|
|
|
}
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
btrfs_folio_set_writer_lock(fs_info, folio, delalloc_start,
|
|
|
|
min(delalloc_end, page_end) + 1 -
|
|
|
|
delalloc_start);
|
|
|
|
last_delalloc_end = delalloc_end;
|
2014-05-21 20:35:51 +00:00
|
|
|
delalloc_start = delalloc_end + 1;
|
|
|
|
}
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
delalloc_start = page_start;
|
2023-06-28 15:31:30 +00:00
|
|
|
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
if (!last_delalloc_end)
|
|
|
|
goto out;
|
|
|
|
|
|
|
|
/* Run the delalloc ranges for the above locked ranges. */
|
|
|
|
while (delalloc_start < page_end) {
|
|
|
|
u64 found_start;
|
|
|
|
u32 found_len;
|
|
|
|
bool found;
|
|
|
|
|
|
|
|
if (!is_subpage) {
|
|
|
|
/*
|
|
|
|
* For non-subpage case, the found delalloc range must
|
2024-07-24 20:03:04 +00:00
|
|
|
* cover this folio and there must be only one locked
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
* delalloc range.
|
|
|
|
*/
|
|
|
|
found_start = page_start;
|
|
|
|
found_len = last_delalloc_end + 1 - found_start;
|
|
|
|
found = true;
|
|
|
|
} else {
|
|
|
|
found = btrfs_subpage_find_writer_locked(fs_info, folio,
|
|
|
|
delalloc_start, &found_start, &found_len);
|
|
|
|
}
|
|
|
|
if (!found)
|
|
|
|
break;
|
|
|
|
/*
|
|
|
|
* The subpage range covers the last sector, the delalloc range may
|
2024-07-24 20:03:04 +00:00
|
|
|
* end beyond the folio boundary, use the saved delalloc_end
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
* instead.
|
|
|
|
*/
|
|
|
|
if (found_start + found_len >= page_end)
|
|
|
|
found_len = last_delalloc_end + 1 - found_start;
|
|
|
|
|
|
|
|
if (ret >= 0) {
|
|
|
|
/* No errors hit so far, run the current delalloc range. */
|
2024-07-24 20:56:32 +00:00
|
|
|
ret = btrfs_run_delalloc_range(inode, folio,
|
2024-07-24 20:03:04 +00:00
|
|
|
found_start,
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
found_start + found_len - 1,
|
|
|
|
wbc);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* We've hit an error during previous delalloc range,
|
|
|
|
* have to cleanup the remaining locked ranges.
|
|
|
|
*/
|
|
|
|
unlock_extent(&inode->io_tree, found_start,
|
|
|
|
found_start + found_len - 1, NULL);
|
2024-07-24 20:17:43 +00:00
|
|
|
__unlock_for_delalloc(&inode->vfs_inode, folio,
|
2024-07-24 20:03:04 +00:00
|
|
|
found_start,
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
found_start + found_len - 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
* We have some ranges that's going to be submitted asynchronously
|
|
|
|
* (compression or inline). These range have their own control
|
|
|
|
* on when to unlock the pages. We should not touch them
|
|
|
|
* anymore, so clear the range from the submission bitmap.
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
*/
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
if (ret > 0) {
|
|
|
|
unsigned int start_bit = (found_start - page_start) >>
|
|
|
|
fs_info->sectorsize_bits;
|
|
|
|
unsigned int end_bit = (min(page_end + 1, found_start + found_len) -
|
|
|
|
page_start) >> fs_info->sectorsize_bits;
|
|
|
|
bitmap_clear(&bio_ctrl->submit_bitmap, start_bit, end_bit - start_bit);
|
|
|
|
}
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
/*
|
2024-07-24 20:03:04 +00:00
|
|
|
* Above btrfs_run_delalloc_range() may have unlocked the folio,
|
|
|
|
* thus for the last range, we cannot touch the folio anymore.
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
*/
|
|
|
|
if (found_start + found_len >= last_delalloc_end + 1)
|
|
|
|
break;
|
|
|
|
|
|
|
|
delalloc_start = found_start + found_len;
|
|
|
|
}
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
out:
|
|
|
|
if (last_delalloc_end)
|
|
|
|
delalloc_end = last_delalloc_end;
|
|
|
|
else
|
|
|
|
delalloc_end = page_end;
|
2023-06-28 15:31:30 +00:00
|
|
|
/*
|
|
|
|
* delalloc_end is already one less than the total length, so
|
|
|
|
* we don't subtract one from PAGE_SIZE
|
|
|
|
*/
|
|
|
|
delalloc_to_write +=
|
|
|
|
DIV_ROUND_UP(delalloc_end + 1 - page_start, PAGE_SIZE);
|
2023-06-28 15:31:31 +00:00
|
|
|
|
|
|
|
/*
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
* If all ranges are submitted asynchronously, we just need to account
|
|
|
|
* for them here.
|
2023-06-28 15:31:31 +00:00
|
|
|
*/
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
if (bitmap_empty(&bio_ctrl->submit_bitmap, fs_info->sectors_per_page)) {
|
2023-06-28 15:31:31 +00:00
|
|
|
wbc->nr_to_write -= delalloc_to_write;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2014-05-21 20:35:51 +00:00
|
|
|
if (wbc->nr_to_write < delalloc_to_write) {
|
|
|
|
int thresh = 8192;
|
|
|
|
|
|
|
|
if (delalloc_to_write < thresh * 2)
|
|
|
|
thresh = delalloc_to_write;
|
|
|
|
wbc->nr_to_write = min_t(u64, delalloc_to_write,
|
|
|
|
thresh);
|
|
|
|
}
|
|
|
|
|
2020-07-16 15:17:19 +00:00
|
|
|
return 0;
|
2014-05-21 20:35:51 +00:00
|
|
|
}
|
|
|
|
|
2021-05-31 08:50:50 +00:00
|
|
|
/*
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
* Return 0 if we have submitted or queued the sector for submission.
|
|
|
|
* Return <0 for critical errors.
|
2021-05-31 08:50:50 +00:00
|
|
|
*
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
* Caller should make sure filepos < i_size and handle filepos >= i_size case.
|
2021-05-31 08:50:50 +00:00
|
|
|
*/
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
static int submit_one_sector(struct btrfs_inode *inode,
|
|
|
|
struct folio *folio,
|
|
|
|
u64 filepos, struct btrfs_bio_ctrl *bio_ctrl,
|
|
|
|
loff_t i_size)
|
2021-05-31 08:50:50 +00:00
|
|
|
{
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode->root->fs_info;
|
|
|
|
struct extent_map *em;
|
|
|
|
u64 block_start;
|
|
|
|
u64 disk_bytenr;
|
|
|
|
u64 extent_offset;
|
|
|
|
u64 em_end;
|
|
|
|
const u32 sectorsize = fs_info->sectorsize;
|
2021-05-31 08:50:50 +00:00
|
|
|
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
ASSERT(IS_ALIGNED(filepos, sectorsize));
|
|
|
|
|
|
|
|
/* @filepos >= i_size case should be handled by the caller. */
|
|
|
|
ASSERT(filepos < i_size);
|
|
|
|
|
|
|
|
em = btrfs_get_extent(inode, NULL, filepos, sectorsize);
|
|
|
|
if (IS_ERR(em))
|
|
|
|
return PTR_ERR_OR_ZERO(em);
|
2021-05-31 08:50:50 +00:00
|
|
|
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
extent_offset = filepos - em->start;
|
|
|
|
em_end = extent_map_end(em);
|
|
|
|
ASSERT(filepos <= em_end);
|
|
|
|
ASSERT(IS_ALIGNED(em->start, sectorsize));
|
|
|
|
ASSERT(IS_ALIGNED(em->len, sectorsize));
|
2021-08-17 09:38:52 +00:00
|
|
|
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
block_start = extent_map_block_start(em);
|
|
|
|
disk_bytenr = extent_map_block_start(em) + extent_offset;
|
2021-05-31 08:50:50 +00:00
|
|
|
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
ASSERT(!extent_map_is_compressed(em));
|
|
|
|
ASSERT(block_start != EXTENT_MAP_HOLE);
|
|
|
|
ASSERT(block_start != EXTENT_MAP_INLINE);
|
2021-08-17 09:38:52 +00:00
|
|
|
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
free_extent_map(em);
|
|
|
|
em = NULL;
|
|
|
|
|
2024-10-04 04:53:35 +00:00
|
|
|
/*
|
|
|
|
* Although the PageDirty bit is cleared before entering this
|
|
|
|
* function, subpage dirty bit is not cleared.
|
|
|
|
* So clear subpage dirty bit here so next time we won't submit
|
|
|
|
* a folio for a range already written to disk.
|
|
|
|
*/
|
|
|
|
btrfs_folio_clear_dirty(fs_info, folio, filepos, sectorsize);
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
btrfs_set_range_writeback(inode, filepos, filepos + sectorsize - 1);
|
|
|
|
/*
|
|
|
|
* Above call should set the whole folio with writeback flag, even
|
|
|
|
* just for a single subpage sector.
|
|
|
|
* As long as the folio is properly locked and the range is correct,
|
|
|
|
* we should always get the folio with writeback flag.
|
|
|
|
*/
|
|
|
|
ASSERT(folio_test_writeback(folio));
|
|
|
|
|
|
|
|
submit_extent_folio(bio_ctrl, disk_bytenr, folio,
|
|
|
|
sectorsize, filepos - folio_pos(folio));
|
|
|
|
return 0;
|
2021-05-31 08:50:50 +00:00
|
|
|
}
|
|
|
|
|
2014-05-21 20:35:51 +00:00
|
|
|
/*
|
2024-08-27 01:30:16 +00:00
|
|
|
* Helper for extent_writepage(). This calls the writepage start hooks,
|
2014-05-21 20:35:51 +00:00
|
|
|
* and does the loop to map the page into extents and bios.
|
|
|
|
*
|
|
|
|
* We return 1 if the IO is started and the page is unlocked,
|
|
|
|
* 0 if all went well (page still locked)
|
|
|
|
* < 0 if there were errors (page still locked)
|
|
|
|
*/
|
2024-08-27 01:30:16 +00:00
|
|
|
static noinline_for_stack int extent_writepage_io(struct btrfs_inode *inode,
|
|
|
|
struct folio *folio,
|
|
|
|
u64 start, u32 len,
|
|
|
|
struct btrfs_bio_ctrl *bio_ctrl,
|
|
|
|
loff_t i_size)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2021-01-06 01:01:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode->root->fs_info;
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
unsigned long range_bitmap = 0;
|
btrfs: remove the nr_ret parameter from __extent_writepage_io()
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.
Then callers check @nr_ret value to determine if we need to manually
clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() called thus
PAGECACHE_TAG_DIRTY tag will not be cleared.
Remove this parameter by:
- Moving the btrfs_folio_clear_writeback() call into
__extent_writepage_io()
So that if we didn't submit any IO, then manually call
btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
the page is no longer dirty.
- Use a bool to record if we have submitted any sector
Instead of an int.
- Use subpage compatible helpers to end folio writeback.
This brings no change to the behavior, just for the sake of consistency.
As for the call site inside __extent_writepage(), we're always called
for the whole page, so the existing full page helper
folio_(start|end)_writeback() is totally fine.
For the call site inside extent_write_locked_range(), although we can
have subpage range, folio_start_writeback() will only clear
PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
will still be dirty if there is any subpage dirty range.
Only when the last dirty subpage sector is cleared, the
folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.
So no matter if we call the full page or subpage helper, the result
is still the same, then just use the subpage helpers for consistency.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-14 01:20:21 +00:00
|
|
|
bool submitted_io = false;
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
const u64 folio_start = folio_pos(folio);
|
|
|
|
u64 cur;
|
|
|
|
int bit;
|
2014-05-21 20:35:51 +00:00
|
|
|
int ret = 0;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
ASSERT(start >= folio_start &&
|
|
|
|
start + len <= folio_start + folio_size(folio));
|
btrfs: make __extent_writepage_io() to write specified range only
Function __extent_writepage_io() is designed to find all dirty ranges of
a page, and add the dirty ranges to the bio_ctrl for submission.
It requires all the dirtied ranges to be covered by an ordered extent.
It gets called in two locations, but one call site is not subpage aware:
- __extent_writepage()
It gets called when writepage_delalloc() returned 0, which means
writepage_delalloc() has handled delalloc for all subpage sectors
inside the page.
So this call site is OK.
- extent_write_locked_range()
This call site is utilized by zoned support, and in this case, we may
only run delalloc range for a subset of the page, like this: (64K page
size)
0 16K 32K 48K 64K
|/////| |///////| |
In the above case, if extent_write_locked_range() is only triggered for
range [0, 16K), __extent_writepage_io() would still try to submit
the dirty range of [32K, 48K), then it would not find any ordered
extent for it and triggers various ASSERT()s.
Fix this problem by:
- Introducing @start and @len parameters to specify the range
For the first call site, we just pass the whole page, and the behavior
is not touched, since run_delalloc_range() for the page should have
created all ordered extents for the page.
For the second call site, we avoid touching anything beyond the
range, thus avoiding the dirty range which is not yet covered by any
delalloc range.
- Making btrfs_folio_assert_not_dirty() subpage aware
The only caller is inside __extent_writepage_io(), and since that
caller now accepts a subpage range, we should also check the subpage
range other than the whole page.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-16 04:03:41 +00:00
|
|
|
|
2024-07-24 21:26:31 +00:00
|
|
|
ret = btrfs_writepage_cow_fixup(folio);
|
2018-11-01 12:09:47 +00:00
|
|
|
if (ret) {
|
|
|
|
/* Fixup worker will requeue */
|
2024-07-24 18:38:01 +00:00
|
|
|
folio_redirty_for_writepage(bio_ctrl->wbc, folio);
|
|
|
|
folio_unlock(folio);
|
2018-11-01 12:09:47 +00:00
|
|
|
return 1;
|
2008-07-17 16:53:51 +00:00
|
|
|
}
|
|
|
|
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
for (cur = start; cur < start + len; cur += fs_info->sectorsize)
|
|
|
|
set_bit((cur - folio_start) >> fs_info->sectorsize_bits, &range_bitmap);
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
bitmap_and(&bio_ctrl->submit_bitmap, &bio_ctrl->submit_bitmap, &range_bitmap,
|
|
|
|
fs_info->sectors_per_page);
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
bio_ctrl->end_io_func = end_bbio_data_write;
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
for_each_set_bit(bit, &bio_ctrl->submit_bitmap, fs_info->sectors_per_page) {
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
cur = folio_pos(folio) + (bit << fs_info->sectorsize_bits);
|
2016-05-04 09:46:10 +00:00
|
|
|
|
2014-05-21 20:35:51 +00:00
|
|
|
if (cur >= i_size) {
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
btrfs_mark_ordered_io_finished(inode, folio, cur,
|
|
|
|
start + len - cur, true);
|
btrfs: subpage: fix writeback which does not have ordered extent
[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:
kernel BUG at fs/btrfs/file-item.c:667!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
Hardware name: Radxa ROCK Pi 4B (DT)
Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
Call trace:
btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
btrfs_submit_bio_start+0x20/0x30 [btrfs]
run_one_async_start+0x28/0x44 [btrfs]
btrfs_work_helper+0x128/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30
[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().
Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.
This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.
In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.
But there is loophole, if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.
This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.
[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().
Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 06:34:58 +00:00
|
|
|
/*
|
|
|
|
* This range is beyond i_size, thus we don't need to
|
|
|
|
* bother writing back.
|
|
|
|
* But we still need to clear the dirty subpage bit, or
|
2024-07-24 18:38:01 +00:00
|
|
|
* the next time the folio gets dirtied, we will try to
|
btrfs: subpage: fix writeback which does not have ordered extent
[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:
kernel BUG at fs/btrfs/file-item.c:667!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
Hardware name: Radxa ROCK Pi 4B (DT)
Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
Call trace:
btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
btrfs_submit_bio_start+0x20/0x30 [btrfs]
run_one_async_start+0x28/0x44 [btrfs]
btrfs_work_helper+0x128/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30
[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().
Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.
This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.
In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.
But there is loophole, if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.
This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.
[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().
Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 06:34:58 +00:00
|
|
|
* writeback the sectors with subpage dirty bits,
|
|
|
|
* causing writeback without ordered extent.
|
|
|
|
*/
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
btrfs_folio_clear_dirty(fs_info, folio, cur,
|
|
|
|
start + len - cur);
|
2008-01-24 21:13:08 +00:00
|
|
|
break;
|
|
|
|
}
|
btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.
This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.
However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:
- Extract the sector submission code into submit_one_sector()
- Add the needed code to extract the dirty bitmap for subpage case
There is a small pitfall for non-subpage case, as we cleared page
dirty before starting writeback, so we have to manually set
the default dirty_bitmap to 1 for such case.
- Use bitmap_and() to calculate the target sectors we need to submit
This is done for both subpage and non-subpage cases, and will later
be expanded to skip inline/compression ranges.
For x86_64, the dirty bitmap will be fixed to 1, with the length of 1,
so we're still doing the same workload per sector.
For larger page sizes, the overhead will be a little larger, as previous
we only need to do one extent_map lookup per-dirty-range, but now it
will be one extent_map lookup per-sector.
But that is the same frequency as x86_64, so we're just aligning the
behavior to x86_64.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-07 05:01:54 +00:00
|
|
|
ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
|
|
|
|
if (ret < 0)
|
btrfs: remove the nr_ret parameter from __extent_writepage_io()
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.
Then callers check @nr_ret value to determine if we need to manually
clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() called thus
PAGECACHE_TAG_DIRTY tag will not be cleared.
Remove this parameter by:
- Moving the btrfs_folio_clear_writeback() call into
__extent_writepage_io()
So that if we didn't submit any IO, then manually call
btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
the page is no longer dirty.
- Use a bool to record if we have submitted any sector
Instead of an int.
- Use subpage compatible helpers to end folio writeback.
This brings no change to the behavior, just for the sake of consistency.
As for the call site inside __extent_writepage(), we're always called
for the whole page, so the existing full page helper
folio_(start|end)_writeback() is totally fine.
For the call site inside extent_write_locked_range(), although we can
have subpage range, folio_start_writeback() will only clear
PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
will still be dirty if there is any subpage dirty range.
Only when the last dirty subpage sector is cleared, the
folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.
So no matter if we call the full page or subpage helper, the result
is still the same, then just use the subpage helpers for consistency.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-14 01:20:21 +00:00
|
|
|
goto out;
|
|
|
|
submitted_io = true;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2023-02-27 15:17:02 +00:00
|
|
|
|
2024-07-24 18:38:01 +00:00
|
|
|
btrfs_folio_assert_not_dirty(fs_info, folio, start, len);
|
btrfs: remove the nr_ret parameter from __extent_writepage_io()
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.
Then callers check @nr_ret value to determine if we need to manually
clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() called thus
PAGECACHE_TAG_DIRTY tag will not be cleared.
Remove this parameter by:
- Moving the btrfs_folio_clear_writeback() call into
__extent_writepage_io()
So that if we didn't submit any IO, then manually call
btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
the page is no longer dirty.
- Use a bool to record if we have submitted any sector
Instead of an int.
- Use subpage compatible helpers to end folio writeback.
This brings no change to the behavior, just for the sake of consistency.
As for the call site inside __extent_writepage(), we're always called
for the whole page, so the existing full page helper
folio_(start|end)_writeback() is totally fine.
For the call site inside extent_write_locked_range(), although we can
have subpage range, folio_start_writeback() will only clear
PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
will still be dirty if there is any subpage dirty range.
Only when the last dirty subpage sector is cleared, the
folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.
So no matter if we call the full page or subpage helper, the result
is still the same, then just use the subpage helpers for consistency.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-14 01:20:21 +00:00
|
|
|
out:
|
btrfs: subpage: fix writeback which does not have ordered extent
[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:
kernel BUG at fs/btrfs/file-item.c:667!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
Hardware name: Radxa ROCK Pi 4B (DT)
Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
Call trace:
btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
btrfs_submit_bio_start+0x20/0x30 [btrfs]
run_one_async_start+0x28/0x44 [btrfs]
btrfs_work_helper+0x128/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30
[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().
Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.
This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.
In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.
But there is loophole, if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.
This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.
[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().
Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 06:34:58 +00:00
|
|
|
/*
|
btrfs: remove the nr_ret parameter from __extent_writepage_io()
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.
Then callers check @nr_ret value to determine if we need to manually
clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() called thus
PAGECACHE_TAG_DIRTY tag will not be cleared.
Remove this parameter by:
- Moving the btrfs_folio_clear_writeback() call into
__extent_writepage_io()
So that if we didn't submit any IO, then manually call
btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
the page is no longer dirty.
- Use a bool to record if we have submitted any sector
Instead of an int.
- Use subpage compatible helpers to end folio writeback.
This brings no change to the behavior, just for the sake of consistency.
As for the call site inside __extent_writepage(), we're always called
for the whole page, so the existing full page helper
folio_(start|end)_writeback() is totally fine.
For the call site inside extent_write_locked_range(), although we can
have subpage range, folio_start_writeback() will only clear
PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
will still be dirty if there is any subpage dirty range.
Only when the last dirty subpage sector is cleared, the
folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.
So no matter if we call the full page or subpage helper, the result
is still the same, then just use the subpage helpers for consistency.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-14 01:20:21 +00:00
|
|
|
* If we didn't submitted any sector (>= i_size), folio dirty get
|
|
|
|
* cleared but PAGECACHE_TAG_DIRTY is not cleared (only cleared
|
|
|
|
* by folio_start_writeback() if the folio is not dirty).
|
|
|
|
*
|
|
|
|
* Here we set writeback and clear for the range. If the full folio
|
|
|
|
* is no longer dirty then we clear the PAGECACHE_TAG_DIRTY tag.
|
btrfs: subpage: fix writeback which does not have ordered extent
[BUG]
When running fsstress with subpage RW support, there are random
BUG_ON()s triggered with the following trace:
kernel BUG at fs/btrfs/file-item.c:667!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 1 PID: 3486 Comm: kworker/u13:2 5.11.0-rc4-custom+ #43
Hardware name: Radxa ROCK Pi 4B (DT)
Workqueue: btrfs-worker-high btrfs_work_helper [btrfs]
pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
pc : btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
lr : btrfs_csum_one_bio+0x400/0x4e0 [btrfs]
Call trace:
btrfs_csum_one_bio+0x420/0x4e0 [btrfs]
btrfs_submit_bio_start+0x20/0x30 [btrfs]
run_one_async_start+0x28/0x44 [btrfs]
btrfs_work_helper+0x128/0x1b4 [btrfs]
process_one_work+0x22c/0x430
worker_thread+0x70/0x3a0
kthread+0x13c/0x140
ret_from_fork+0x10/0x30
[CAUSE]
Above BUG_ON() means there is some bio range which doesn't have ordered
extent, which indeed is worth a BUG_ON().
Unlike regular sectorsize == PAGE_SIZE case, in subpage we have extra
subpage dirty bitmap to record which range is dirty and should be
written back.
This means, if we submit bio for a subpage range, we do not only need to
clear page dirty, but also need to clear subpage dirty bits.
In __extent_writepage_io(), we will call btrfs_page_clear_dirty() for
any range we submit a bio.
But there is loophole, if we hit a range which is beyond i_size, we just
call btrfs_writepage_endio_finish_ordered() to finish the ordered io,
then break out, without clearing the subpage dirty.
This means, if we hit above branch, the subpage dirty bits are still
there, if other range of the page get dirtied and we need to writeback
that page again, we will submit bio for the old range, leaving a wild
bio range which doesn't have ordered extent.
[FIX]
Fix it by always calling btrfs_page_clear_dirty() in
__extent_writepage_io().
Also to avoid such problem from happening again, add a new assert,
btrfs_page_assert_not_dirty(), to make sure both page dirty and subpage
dirty bits are cleared before exiting __extent_writepage_io().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-26 06:34:58 +00:00
|
|
|
*/
|
btrfs: remove the nr_ret parameter from __extent_writepage_io()
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.
Then callers check @nr_ret value to determine if we need to manually
clear the PAGECACHE_TAG_DIRTY, as if we submitted no sector (e.g. all
sectors are beyond i_size) there is no folio_start_writeback() called thus
PAGECACHE_TAG_DIRTY tag will not be cleared.
Remove this parameter by:
- Moving the btrfs_folio_clear_writeback() call into
__extent_writepage_io()
So that if we didn't submit any IO, then manually call
btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
the page is no longer dirty.
- Use a bool to record if we have submitted any sector
Instead of an int.
- Use subpage compatible helpers to end folio writeback.
This brings no change to the behavior, just for the sake of consistency.
As for the call site inside __extent_writepage(), we're always called
for the whole page, so the existing full page helper
folio_(start|end)_writeback() is totally fine.
For the call site inside extent_write_locked_range(), although we can
have subpage range, folio_start_writeback() will only clear
PAGECACHE_TAG_DIRTY if the page is no longer dirty, and the full folio
will still be dirty if there is any subpage dirty range.
Only when the last dirty subpage sector is cleared, the
folio_start_writeback() will clear PAGECACHE_TAG_DIRTY.
So no matter if we call the full page or subpage helper, the result
is still the same, then just use the subpage helpers for consistency.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-14 01:20:21 +00:00
|
|
|
if (!submitted_io) {
|
|
|
|
btrfs_folio_set_writeback(fs_info, folio, start, len);
|
|
|
|
btrfs_folio_clear_writeback(fs_info, folio, start, len);
|
|
|
|
}
|
2014-05-21 20:35:51 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the writepage semantics are similar to regular writepage. extent
|
|
|
|
* records are inserted to lock ranges in the tree, and as dirty areas
|
|
|
|
* are found, they are marked writeback. Then the lock bits are removed
|
|
|
|
* and the end_io handler clears the writeback ranges
|
2019-03-20 06:27:42 +00:00
|
|
|
*
|
|
|
|
* Return 0 if everything goes well.
|
|
|
|
* Return <0 for error.
|
2014-05-21 20:35:51 +00:00
|
|
|
*/
|
2024-08-27 01:30:16 +00:00
|
|
|
static int extent_writepage(struct folio *folio, struct btrfs_bio_ctrl *bio_ctrl)
|
2014-05-21 20:35:51 +00:00
|
|
|
{
|
2024-07-24 18:58:22 +00:00
|
|
|
struct inode *inode = folio->mapping->host;
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
|
2024-07-24 18:58:22 +00:00
|
|
|
const u64 page_start = folio_pos(folio);
|
2014-05-21 20:35:51 +00:00
|
|
|
int ret;
|
2019-12-03 01:34:20 +00:00
|
|
|
size_t pg_offset;
|
2014-05-21 20:35:51 +00:00
|
|
|
loff_t i_size = i_size_read(inode);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
unsigned long end_index = i_size >> PAGE_SHIFT;
|
2014-05-21 20:35:51 +00:00
|
|
|
|
2024-08-27 01:30:16 +00:00
|
|
|
trace_extent_writepage(folio, inode, bio_ctrl->wbc);
|
2014-05-21 20:35:51 +00:00
|
|
|
|
2024-07-24 18:58:22 +00:00
|
|
|
WARN_ON(!folio_test_locked(folio));
|
2014-05-21 20:35:51 +00:00
|
|
|
|
2024-07-24 18:58:22 +00:00
|
|
|
pg_offset = offset_in_folio(folio, i_size);
|
|
|
|
if (folio->index > end_index ||
|
|
|
|
(folio->index == end_index && !pg_offset)) {
|
2022-02-09 20:21:29 +00:00
|
|
|
folio_invalidate(folio, 0, folio_size(folio));
|
|
|
|
folio_unlock(folio);
|
2014-05-21 20:35:51 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2024-07-24 18:58:22 +00:00
|
|
|
if (folio->index == end_index)
|
|
|
|
folio_zero_range(folio, pg_offset, folio_size(folio) - pg_offset);
|
2014-05-21 20:35:51 +00:00
|
|
|
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
/*
|
|
|
|
* Default to unlock the whole folio.
|
|
|
|
* The proper bitmap can only be initialized until writepage_delalloc().
|
|
|
|
*/
|
|
|
|
bio_ctrl->submit_bitmap = (unsigned long)-1;
|
2024-07-24 18:58:22 +00:00
|
|
|
ret = set_folio_extent_mapped(folio);
|
2023-05-31 06:04:57 +00:00
|
|
|
if (ret < 0)
|
2021-01-26 08:34:00 +00:00
|
|
|
goto done;
|
2014-05-21 20:35:51 +00:00
|
|
|
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
ret = writepage_delalloc(BTRFS_I(inode), folio, bio_ctrl);
|
2023-05-31 06:05:01 +00:00
|
|
|
if (ret == 1)
|
|
|
|
return 0;
|
|
|
|
if (ret)
|
|
|
|
goto done;
|
2014-05-21 20:35:51 +00:00
|
|
|
|
2024-08-27 01:30:16 +00:00
|
|
|
ret = extent_writepage_io(BTRFS_I(inode), folio, folio_pos(folio),
|
|
|
|
PAGE_SIZE, bio_ctrl, i_size);
|
2014-05-21 20:35:51 +00:00
|
|
|
if (ret == 1)
|
2019-12-03 01:34:21 +00:00
|
|
|
return 0;
|
2014-05-21 20:35:51 +00:00
|
|
|
|
2023-05-31 06:05:00 +00:00
|
|
|
bio_ctrl->wbc->nr_to_write--;
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
done:
|
2023-06-28 15:31:26 +00:00
|
|
|
if (ret) {
|
2024-07-24 19:57:10 +00:00
|
|
|
btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
|
2024-07-24 18:58:22 +00:00
|
|
|
page_start, PAGE_SIZE, !ret);
|
|
|
|
mapping_set_error(folio->mapping, ret);
|
2023-06-28 15:31:26 +00:00
|
|
|
}
|
btrfs: lock subpage ranges in one go for writepage_delalloc()
If we have a subpage range like this for a 16K page with 4K sectorsize:
0 4K 8K 12K 16K
|/////| |//////| |
|/////| = dirty range
Currently writepage_delalloc() would go through the following steps:
- lock range [0, 4K)
- run delalloc range for [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [8K 12K)
So far it's fine for regular subpage writeback, as
btrfs_run_delalloc_range() can only go into one of run_delalloc_nocow(),
cow_file_range() and run_delalloc_compressed().
But there is a special case for zoned subpage, where we will go
through run_delalloc_cow(), which would create the ordered extent for the
range and immediately submit the range.
This would unlock the whole page range, causing all kinds of different
ASSERT()s related to locked page.
Address the page unlocking problem of run_delalloc_cow(), by changing
the workflow to the following one:
- lock range [0, 4K)
- lock range [8K, 12K)
- run delalloc range for [0, 4K)
- run delalloc range for [8K, 12K)
So that run_delalloc_cow() can only unlock the full page until the
last lock user released.
To do that:
- Utilize subpage locked bitmap
So for every delalloc range we found, call
btrfs_folio_set_writer_lock() to populate the subpage locked bitmap,
and later btrfs_folio_end_all_writers() if the page is fully unlocked.
So we know there is a delalloc range that needs to be run later.
- Save the @delalloc_end as @last_delalloc_end inside writepage_delalloc()
Since subpage locked bitmap is only for ranges inside the page,
meanwhile we can have delalloc range ends beyond our page boundary,
we have to save the @last_delalloc_end just in case it's beyond our
page boundary.
Although there is one extra point to notice:
- We need to handle errors in previous iteration
Since we can have multiple locked delalloc ranges we have to call
run_delalloc_ranges() multiple times.
If we hit an error half way, we still need to unlock the remaining
ranges.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-02-18 06:39:32 +00:00
|
|
|
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
/*
|
|
|
|
* Only unlock ranges that are submitted. As there can be some async
|
|
|
|
* submitted ranges inside the folio.
|
|
|
|
*/
|
|
|
|
btrfs_folio_end_writer_lock_bitmap(fs_info, folio, bio_ctrl->submit_bitmap);
|
2019-03-20 06:27:42 +00:00
|
|
|
ASSERT(ret <= 0);
|
2014-05-21 20:35:51 +00:00
|
|
|
return ret;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2013-04-24 20:41:19 +00:00
|
|
|
void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
|
2012-03-13 13:38:00 +00:00
|
|
|
{
|
sched: Remove proliferation of wait_on_bit() action functions
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, on in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-07 05:16:04 +00:00
|
|
|
wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_WRITEBACK,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
2019-03-20 06:27:46 +00:00
|
|
|
/*
|
btrfs: fix the comment on lock_extent_buffer_for_io
The return value of that function is completely wrong.
That function only returns 0 if the extent buffer doesn't need to be
submitted. The "ret = 1" and "ret = 0" are determined by the return
value of "test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)".
And if we get ret == 1, it's because the extent buffer is dirty, and we
set its status to EXTENT_BUFFER_WRITE_BACK, and continue to page
locking.
While if we get ret == 0, it means the extent is not dirty from the
beginning, so we don't need to write it back.
The caller also follows this, in btree_write_cache_pages(), if
lock_extent_buffer_for_io() returns 0, we just skip the extent buffer
completely.
So the comment is completely wrong.
Since we're here, also change the description a little. The write bio
flushing won't be visible to the caller, thus it's not an major feature.
In the main description, only describe the locking part to make the
point more clear.
For reference, added in commit 2e3c25136adf ("btrfs: extent_io: add
proper error handling to lock_extent_buffer_for_io()")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-21 06:24:49 +00:00
|
|
|
* Lock extent buffer status and pages for writeback.
|
2019-03-20 06:27:46 +00:00
|
|
|
*
|
2023-05-03 15:24:30 +00:00
|
|
|
* Return %false if the extent buffer doesn't need to be submitted (e.g. the
|
|
|
|
* extent buffer is not dirty)
|
|
|
|
* Return %true is the extent buffer is submitted to bio.
|
2019-03-20 06:27:46 +00:00
|
|
|
*/
|
2023-05-03 15:24:30 +00:00
|
|
|
static noinline_for_stack bool lock_extent_buffer_for_io(struct extent_buffer *eb,
|
2023-05-03 15:24:31 +00:00
|
|
|
struct writeback_control *wbc)
|
2012-03-13 13:38:00 +00:00
|
|
|
{
|
2019-03-20 10:21:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2023-05-03 15:24:30 +00:00
|
|
|
bool ret = false;
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2023-05-03 15:24:31 +00:00
|
|
|
btrfs_tree_lock(eb);
|
|
|
|
while (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
|
2012-03-13 13:38:00 +00:00
|
|
|
btrfs_tree_unlock(eb);
|
2023-05-03 15:24:31 +00:00
|
|
|
if (wbc->sync_mode != WB_SYNC_ALL)
|
2023-05-03 15:24:30 +00:00
|
|
|
return false;
|
2023-05-03 15:24:31 +00:00
|
|
|
wait_on_extent_buffer_writeback(eb);
|
|
|
|
btrfs_tree_lock(eb);
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
2012-07-20 20:25:24 +00:00
|
|
|
/*
|
|
|
|
* We need to do this to prevent races in people who check if the eb is
|
|
|
|
* under IO since we can end up having no IO bits set for a short period
|
|
|
|
* of time.
|
|
|
|
*/
|
|
|
|
spin_lock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
|
|
|
|
set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
|
2012-07-20 20:25:24 +00:00
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
|
2017-06-20 18:01:20 +00:00
|
|
|
percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
|
|
|
|
-eb->len,
|
|
|
|
fs_info->dirty_metadata_batch);
|
2023-05-03 15:24:30 +00:00
|
|
|
ret = true;
|
2012-07-20 20:25:24 +00:00
|
|
|
} else {
|
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
btrfs_tree_unlock(eb);
|
2019-03-20 06:27:46 +00:00
|
|
|
return ret;
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
2023-05-03 15:24:34 +00:00
|
|
|
static void set_btree_ioerr(struct extent_buffer *eb)
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
{
|
2021-03-25 07:14:44 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
|
2023-05-03 15:24:34 +00:00
|
|
|
set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
|
2021-11-24 19:14:23 +00:00
|
|
|
/*
|
|
|
|
* A read may stumble upon this buffer later, make sure that it gets an
|
|
|
|
* error and knows there was an error.
|
|
|
|
*/
|
|
|
|
clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
|
|
|
|
|
2021-11-24 19:14:25 +00:00
|
|
|
/*
|
|
|
|
* We need to set the mapping with the io error as well because a write
|
|
|
|
* error will flip the file system readonly, and then syncfs() will
|
|
|
|
* return a 0 because we are readonly if we don't modify the err seq for
|
|
|
|
* the superblock.
|
|
|
|
*/
|
2023-05-03 15:24:34 +00:00
|
|
|
mapping_set_error(eb->fs_info->btree_inode->i_mapping, -EIO);
|
2021-11-24 19:14:25 +00:00
|
|
|
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
/*
|
|
|
|
* If writeback for a btree extent that doesn't belong to a log tree
|
|
|
|
* failed, increment the counter transaction->eb_write_errors.
|
|
|
|
* We do this because while the transaction is running and before it's
|
|
|
|
* committing (when we call filemap_fdata[write|wait]_range against
|
|
|
|
* the btree inode), we might have
|
|
|
|
* btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
|
|
|
|
* returns an error or an error happens during writeback, when we're
|
|
|
|
* committing the transaction we wouldn't know about it, since the pages
|
|
|
|
* can be no longer dirty nor marked anymore for writeback (if a
|
|
|
|
* subsequent modification to the extent buffer didn't happen before the
|
|
|
|
* transaction commit), which makes filemap_fdata[write|wait]_range not
|
2024-04-20 02:49:59 +00:00
|
|
|
* able to find the pages which contain errors at transaction
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
* commit time. So if this happens we must abort the transaction,
|
|
|
|
* otherwise we commit a super block with btree roots that point to
|
|
|
|
* btree nodes/leafs whose content on disk is invalid - either garbage
|
|
|
|
* or the content of some node/leaf from a past generation that got
|
|
|
|
* cowed or deleted and is no longer valid.
|
|
|
|
*
|
|
|
|
* Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
|
|
|
|
* not be enough - we need to distinguish between log tree extents vs
|
|
|
|
* non-log tree extents, and the next filemap_fdatawait_range() call
|
|
|
|
* will catch and clear such errors in the mapping - and that call might
|
|
|
|
* be from a log sync and not from a transaction commit. Also, checking
|
|
|
|
* for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
|
|
|
|
* not done and would not be reliable - the eb might have been released
|
|
|
|
* from memory and reading it back again means that flag would not be
|
|
|
|
* set (since it's a runtime flag, not persisted on disk).
|
|
|
|
*
|
|
|
|
* Using the flags below in the btree inode also makes us achieve the
|
|
|
|
* goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
|
|
|
|
* writeback for all dirty pages and before filemap_fdatawait_range()
|
|
|
|
* is called, the writeback for all dirty pages had already finished
|
|
|
|
* with errors - because we were not using AS_EIO/AS_ENOSPC,
|
|
|
|
* filemap_fdatawait_range() would return success, as it could not know
|
|
|
|
* that writeback errors happened (the pages were no longer tagged for
|
|
|
|
* writeback).
|
|
|
|
*/
|
|
|
|
switch (eb->log_index) {
|
|
|
|
case -1:
|
2021-03-25 07:14:44 +00:00
|
|
|
set_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags);
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
break;
|
|
|
|
case 0:
|
2021-03-25 07:14:44 +00:00
|
|
|
set_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
break;
|
|
|
|
case 1:
|
2021-03-25 07:14:44 +00:00
|
|
|
set_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this later case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the new 3 flags for the btree inode also makes us achieve the
goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
writeback for all dirty pages and before filemap_fdatawait_range() is
called, the writeback for all dirty pages had already finished with
errors - because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
BUG(); /* unexpected, logic error */
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2021-04-06 00:36:00 +00:00
|
|
|
/*
|
|
|
|
* The endio specific version which won't touch any unsafe spinlock in endio
|
|
|
|
* context.
|
|
|
|
*/
|
|
|
|
static struct extent_buffer *find_extent_buffer_nolock(
|
2024-05-30 17:14:12 +00:00
|
|
|
const struct btrfs_fs_info *fs_info, u64 start)
|
2021-04-06 00:36:00 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
|
|
|
|
rcu_read_lock();
|
2022-07-15 11:59:31 +00:00
|
|
|
eb = radix_tree_lookup(&fs_info->buffer_radix,
|
|
|
|
start >> fs_info->sectorsize_bits);
|
2021-04-06 00:36:00 +00:00
|
|
|
if (eb && atomic_inc_not_zero(&eb->refs)) {
|
|
|
|
rcu_read_unlock();
|
|
|
|
return eb;
|
|
|
|
}
|
|
|
|
rcu_read_unlock();
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
static void end_bbio_meta_write(struct btrfs_bio *bbio)
|
2021-04-06 00:36:00 +00:00
|
|
|
{
|
2023-05-03 15:24:34 +00:00
|
|
|
struct extent_buffer *eb = bbio->private;
|
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
|
|
|
bool uptodate = !bbio->bio.bi_status;
|
2023-12-12 02:28:38 +00:00
|
|
|
struct folio_iter fi;
|
2023-05-03 15:24:34 +00:00
|
|
|
u32 bio_offset = 0;
|
2021-04-06 00:36:00 +00:00
|
|
|
|
2023-05-03 15:24:34 +00:00
|
|
|
if (!uptodate)
|
|
|
|
set_btree_ioerr(eb);
|
2021-04-27 04:53:35 +00:00
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
bio_for_each_folio_all(fi, &bbio->bio) {
|
2023-05-03 15:24:34 +00:00
|
|
|
u64 start = eb->start + bio_offset;
|
2023-12-12 02:28:38 +00:00
|
|
|
struct folio *folio = fi.folio;
|
|
|
|
u32 len = fi.length;
|
2021-04-06 00:36:00 +00:00
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
btrfs_folio_clear_writeback(fs_info, folio, start, len);
|
2023-05-03 15:24:34 +00:00
|
|
|
bio_offset += len;
|
2021-04-06 00:36:00 +00:00
|
|
|
}
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2023-05-03 15:24:34 +00:00
|
|
|
clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
|
|
|
|
smp_mb__after_atomic();
|
|
|
|
wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2023-05-03 15:24:34 +00:00
|
|
|
bio_put(&bbio->bio);
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
2021-04-27 04:53:35 +00:00
|
|
|
static void prepare_eb_write(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
u32 nritems;
|
|
|
|
unsigned long start;
|
|
|
|
unsigned long end;
|
|
|
|
|
|
|
|
clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
|
|
|
|
|
|
|
|
/* Set btree blocks beyond nritems with 0 to avoid stale content */
|
|
|
|
nritems = btrfs_header_nritems(eb);
|
|
|
|
if (btrfs_header_level(eb) > 0) {
|
2022-11-15 16:16:16 +00:00
|
|
|
end = btrfs_node_key_ptr_offset(eb, nritems);
|
2021-04-27 04:53:35 +00:00
|
|
|
memzero_extent_buffer(eb, end, eb->len - end);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Leaf:
|
|
|
|
* header 0 1 2 .. N ... data_N .. data_2 data_1 data_0
|
|
|
|
*/
|
2022-11-15 16:16:15 +00:00
|
|
|
start = btrfs_item_nr_offset(eb, nritems);
|
2022-11-15 16:16:18 +00:00
|
|
|
end = btrfs_item_nr_offset(eb, 0);
|
2022-11-15 16:16:11 +00:00
|
|
|
if (nritems == 0)
|
|
|
|
end += BTRFS_LEAF_DATA_SIZE(eb->fs_info);
|
|
|
|
else
|
|
|
|
end += btrfs_item_offset(eb, nritems - 1);
|
2021-04-27 04:53:35 +00:00
|
|
|
memzero_extent_buffer(eb, start, end - start);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-02-27 15:17:01 +00:00
|
|
|
static noinline_for_stack void write_one_eb(struct extent_buffer *eb,
|
2023-05-03 15:24:31 +00:00
|
|
|
struct writeback_control *wbc)
|
2012-03-13 13:38:00 +00:00
|
|
|
{
|
2023-05-03 15:24:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2023-05-03 15:24:33 +00:00
|
|
|
struct btrfs_bio *bbio;
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2021-04-27 04:53:35 +00:00
|
|
|
prepare_eb_write(eb);
|
2021-04-06 00:36:01 +00:00
|
|
|
|
2023-05-03 15:24:33 +00:00
|
|
|
bbio = btrfs_bio_alloc(INLINE_EXTENT_BUFFER_PAGES,
|
|
|
|
REQ_OP_WRITE | REQ_META | wbc_to_write_flags(wbc),
|
2023-12-12 02:28:38 +00:00
|
|
|
eb->fs_info, end_bbio_meta_write, eb);
|
2023-05-03 15:24:33 +00:00
|
|
|
bbio->bio.bi_iter.bi_sector = eb->start >> SECTOR_SHIFT;
|
2023-05-03 15:24:41 +00:00
|
|
|
bio_set_dev(&bbio->bio, fs_info->fs_devices->latest_dev->bdev);
|
2023-05-03 15:24:33 +00:00
|
|
|
wbc_init_bio(wbc, &bbio->bio);
|
|
|
|
bbio->inode = BTRFS_I(eb->fs_info->btree_inode);
|
|
|
|
bbio->file_offset = eb->start;
|
2023-05-03 15:24:41 +00:00
|
|
|
if (fs_info->nodesize < PAGE_SIZE) {
|
2023-12-12 02:28:37 +00:00
|
|
|
struct folio *folio = eb->folios[0];
|
|
|
|
bool ret;
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2023-12-12 02:28:37 +00:00
|
|
|
folio_lock(folio);
|
|
|
|
btrfs_subpage_set_writeback(fs_info, folio, eb->start, eb->len);
|
|
|
|
if (btrfs_subpage_clear_and_test_dirty(fs_info, folio, eb->start,
|
2023-05-03 15:24:41 +00:00
|
|
|
eb->len)) {
|
2023-12-12 02:28:37 +00:00
|
|
|
folio_clear_dirty_for_io(folio);
|
2023-05-03 15:24:41 +00:00
|
|
|
wbc->nr_to_write--;
|
|
|
|
}
|
2023-12-12 02:28:37 +00:00
|
|
|
ret = bio_add_folio(&bbio->bio, folio, eb->len,
|
|
|
|
eb->start - folio_pos(folio));
|
|
|
|
ASSERT(ret);
|
|
|
|
wbc_account_cgroup_owner(wbc, folio_page(folio, 0), eb->len);
|
|
|
|
folio_unlock(folio);
|
2023-05-03 15:24:41 +00:00
|
|
|
} else {
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios = num_extent_folios(eb);
|
|
|
|
|
|
|
|
for (int i = 0; i < num_folios; i++) {
|
|
|
|
struct folio *folio = eb->folios[i];
|
|
|
|
bool ret;
|
|
|
|
|
|
|
|
folio_lock(folio);
|
|
|
|
folio_clear_dirty_for_io(folio);
|
|
|
|
folio_start_writeback(folio);
|
2024-01-05 05:35:55 +00:00
|
|
|
ret = bio_add_folio(&bbio->bio, folio, eb->folio_size, 0);
|
2023-12-06 23:09:28 +00:00
|
|
|
ASSERT(ret);
|
|
|
|
wbc_account_cgroup_owner(wbc, folio_page(folio, 0),
|
2024-01-05 05:35:55 +00:00
|
|
|
eb->folio_size);
|
2023-12-06 23:09:28 +00:00
|
|
|
wbc->nr_to_write -= folio_nr_pages(folio);
|
|
|
|
folio_unlock(folio);
|
2023-05-03 15:24:41 +00:00
|
|
|
}
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
2024-08-27 01:40:11 +00:00
|
|
|
btrfs_submit_bbio(bbio, 0);
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
2021-04-06 00:36:03 +00:00
|
|
|
/*
|
|
|
|
* Submit one subpage btree page.
|
|
|
|
*
|
|
|
|
* The main difference to submit_eb_page() is:
|
|
|
|
* - Page locking
|
|
|
|
* For subpage, we don't rely on page locking at all.
|
|
|
|
*
|
|
|
|
* - Flush write bio
|
|
|
|
* We only flush bio if we may be unable to fit current extent buffers into
|
|
|
|
* current bio.
|
|
|
|
*
|
|
|
|
* Return >=0 for the number of submitted extent buffers.
|
|
|
|
* Return <0 for fatal error.
|
|
|
|
*/
|
2024-08-28 18:29:00 +00:00
|
|
|
static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
|
2021-04-06 00:36:03 +00:00
|
|
|
{
|
2024-08-28 18:29:00 +00:00
|
|
|
struct btrfs_fs_info *fs_info = folio_to_fs_info(folio);
|
2021-04-06 00:36:03 +00:00
|
|
|
int submitted = 0;
|
2024-08-28 18:29:00 +00:00
|
|
|
u64 folio_start = folio_pos(folio);
|
2021-04-06 00:36:03 +00:00
|
|
|
int bit_start = 0;
|
|
|
|
int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits;
|
|
|
|
|
|
|
|
/* Lock and write each dirty extent buffers in the range */
|
2024-08-26 06:14:50 +00:00
|
|
|
while (bit_start < fs_info->sectors_per_page) {
|
2023-11-17 03:54:14 +00:00
|
|
|
struct btrfs_subpage *subpage = folio_get_private(folio);
|
2021-04-06 00:36:03 +00:00
|
|
|
struct extent_buffer *eb;
|
|
|
|
unsigned long flags;
|
|
|
|
u64 start;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Take private lock to ensure the subpage won't be detached
|
|
|
|
* in the meantime.
|
|
|
|
*/
|
2024-08-28 18:29:00 +00:00
|
|
|
spin_lock(&folio->mapping->i_private_lock);
|
2023-11-17 03:54:14 +00:00
|
|
|
if (!folio_test_private(folio)) {
|
2024-08-28 18:29:00 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
2021-04-06 00:36:03 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_lock_irqsave(&subpage->lock, flags);
|
2024-08-26 06:14:50 +00:00
|
|
|
if (!test_bit(bit_start + btrfs_bitmap_nr_dirty * fs_info->sectors_per_page,
|
2021-08-17 09:38:52 +00:00
|
|
|
subpage->bitmaps)) {
|
2021-04-06 00:36:03 +00:00
|
|
|
spin_unlock_irqrestore(&subpage->lock, flags);
|
2024-08-28 18:29:00 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
2021-04-06 00:36:03 +00:00
|
|
|
bit_start++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2024-08-28 18:29:00 +00:00
|
|
|
start = folio_start + bit_start * fs_info->sectorsize;
|
2021-04-06 00:36:03 +00:00
|
|
|
bit_start += sectors_per_node;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here we just want to grab the eb without touching extra
|
|
|
|
* spin locks, so call find_extent_buffer_nolock().
|
|
|
|
*/
|
|
|
|
eb = find_extent_buffer_nolock(fs_info, start);
|
|
|
|
spin_unlock_irqrestore(&subpage->lock, flags);
|
2024-08-28 18:29:00 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
2021-04-06 00:36:03 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The eb has already reached 0 refs thus find_extent_buffer()
|
|
|
|
* doesn't return it. We don't need to write back such eb
|
|
|
|
* anyway.
|
|
|
|
*/
|
|
|
|
if (!eb)
|
|
|
|
continue;
|
|
|
|
|
2023-05-03 15:24:31 +00:00
|
|
|
if (lock_extent_buffer_for_io(eb, wbc)) {
|
2023-05-03 15:24:41 +00:00
|
|
|
write_one_eb(eb, wbc);
|
2023-05-03 15:24:30 +00:00
|
|
|
submitted++;
|
2021-04-06 00:36:03 +00:00
|
|
|
}
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
}
|
|
|
|
return submitted;
|
|
|
|
}
|
|
|
|
|
2020-12-02 06:48:00 +00:00
|
|
|
/*
|
|
|
|
* Submit all page(s) of one extent buffer.
|
|
|
|
*
|
|
|
|
* @page: the page of one extent buffer
|
|
|
|
* @eb_context: to determine if we need to submit this page, if current page
|
|
|
|
* belongs to this eb, we don't need to submit
|
|
|
|
*
|
|
|
|
* The caller should pass each page in their bytenr order, and here we use
|
|
|
|
* @eb_context to determine if we have submitted pages of one extent buffer.
|
|
|
|
*
|
|
|
|
* If we have, we just skip until we hit a new page that doesn't belong to
|
|
|
|
* current @eb_context.
|
|
|
|
*
|
|
|
|
* If not, we submit all the page(s) of the extent buffer.
|
|
|
|
*
|
|
|
|
* Return >0 if we have submitted the extent buffer successfully.
|
|
|
|
* Return 0 if we don't need to submit the page, as it's already submitted by
|
|
|
|
* previous call.
|
|
|
|
* Return <0 for fatal error.
|
|
|
|
*/
|
2024-08-28 18:29:01 +00:00
|
|
|
static int submit_eb_page(struct folio *folio, struct btrfs_eb_write_context *ctx)
|
2020-12-02 06:48:00 +00:00
|
|
|
{
|
2023-08-07 16:12:31 +00:00
|
|
|
struct writeback_control *wbc = ctx->wbc;
|
2024-08-28 18:29:01 +00:00
|
|
|
struct address_space *mapping = folio->mapping;
|
2020-12-02 06:48:00 +00:00
|
|
|
struct extent_buffer *eb;
|
|
|
|
int ret;
|
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
if (!folio_test_private(folio))
|
2020-12-02 06:48:00 +00:00
|
|
|
return 0;
|
|
|
|
|
2024-08-28 18:29:01 +00:00
|
|
|
if (folio_to_fs_info(folio)->nodesize < PAGE_SIZE)
|
2024-08-28 18:29:00 +00:00
|
|
|
return submit_eb_subpage(folio, wbc);
|
2021-04-06 00:36:03 +00:00
|
|
|
|
2023-11-17 21:58:23 +00:00
|
|
|
spin_lock(&mapping->i_private_lock);
|
2023-11-17 03:54:14 +00:00
|
|
|
if (!folio_test_private(folio)) {
|
2023-11-17 21:58:23 +00:00
|
|
|
spin_unlock(&mapping->i_private_lock);
|
2020-12-02 06:48:00 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
eb = folio_get_private(folio);
|
2020-12-02 06:48:00 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Shouldn't happen and normally this would be a BUG_ON but no point
|
|
|
|
* crashing the machine for something we can survive anyway.
|
|
|
|
*/
|
|
|
|
if (WARN_ON(!eb)) {
|
2023-11-17 21:58:23 +00:00
|
|
|
spin_unlock(&mapping->i_private_lock);
|
2020-12-02 06:48:00 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-08-07 16:12:31 +00:00
|
|
|
if (eb == ctx->eb) {
|
2023-11-17 21:58:23 +00:00
|
|
|
spin_unlock(&mapping->i_private_lock);
|
2020-12-02 06:48:00 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
ret = atomic_inc_not_zero(&eb->refs);
|
2023-11-17 21:58:23 +00:00
|
|
|
spin_unlock(&mapping->i_private_lock);
|
2020-12-02 06:48:00 +00:00
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
|
2023-08-07 16:12:31 +00:00
|
|
|
ctx->eb = eb;
|
|
|
|
|
2023-08-07 16:12:33 +00:00
|
|
|
ret = btrfs_check_meta_write_pointer(eb->fs_info, ctx);
|
|
|
|
if (ret) {
|
|
|
|
if (ret == -EBUSY)
|
2021-02-04 10:22:08 +00:00
|
|
|
ret = 0;
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2023-05-03 15:24:31 +00:00
|
|
|
if (!lock_extent_buffer_for_io(eb, wbc)) {
|
2020-12-02 06:48:00 +00:00
|
|
|
free_extent_buffer(eb);
|
2023-05-03 15:24:31 +00:00
|
|
|
return 0;
|
2020-12-02 06:48:00 +00:00
|
|
|
}
|
2023-08-07 16:12:34 +00:00
|
|
|
/* Implies write in zoned mode. */
|
2023-08-07 16:12:32 +00:00
|
|
|
if (ctx->zoned_bg) {
|
2023-08-07 16:12:34 +00:00
|
|
|
/* Mark the last eb in the block group. */
|
2023-08-07 16:12:32 +00:00
|
|
|
btrfs_schedule_zone_finish_bg(ctx->zoned_bg, eb);
|
2023-08-07 16:12:34 +00:00
|
|
|
ctx->zoned_bg->meta_write_pointer += eb->len;
|
2021-08-19 12:19:23 +00:00
|
|
|
}
|
2023-05-03 15:24:31 +00:00
|
|
|
write_one_eb(eb, wbc);
|
2020-12-02 06:48:00 +00:00
|
|
|
free_extent_buffer(eb);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
int btree_write_cache_pages(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc)
|
|
|
|
{
|
2023-08-07 16:12:31 +00:00
|
|
|
struct btrfs_eb_write_context ctx = { .wbc = wbc };
|
2023-09-14 14:45:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode_to_fs_info(mapping->host);
|
2012-03-13 13:38:00 +00:00
|
|
|
int ret = 0;
|
|
|
|
int done = 0;
|
|
|
|
int nr_to_write_done = 0;
|
2023-01-04 21:14:31 +00:00
|
|
|
struct folio_batch fbatch;
|
|
|
|
unsigned int nr_folios;
|
2012-03-13 13:38:00 +00:00
|
|
|
pgoff_t index;
|
|
|
|
pgoff_t end; /* Inclusive */
|
|
|
|
int scanned = 0;
|
2017-12-05 22:30:38 +00:00
|
|
|
xa_mark_t tag;
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2023-01-04 21:14:31 +00:00
|
|
|
folio_batch_init(&fbatch);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (wbc->range_cyclic) {
|
|
|
|
index = mapping->writeback_index; /* Start from prev offset */
|
|
|
|
end = -1;
|
2020-01-03 15:38:44 +00:00
|
|
|
/*
|
|
|
|
* Start from the beginning does not need to cycle over the
|
|
|
|
* range, mark it as scanned.
|
|
|
|
*/
|
|
|
|
scanned = (index == 0);
|
2012-03-13 13:38:00 +00:00
|
|
|
} else {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
index = wbc->range_start >> PAGE_SHIFT;
|
|
|
|
end = wbc->range_end >> PAGE_SHIFT;
|
2012-03-13 13:38:00 +00:00
|
|
|
scanned = 1;
|
|
|
|
}
|
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL)
|
|
|
|
tag = PAGECACHE_TAG_TOWRITE;
|
|
|
|
else
|
|
|
|
tag = PAGECACHE_TAG_DIRTY;
|
2021-02-04 10:22:08 +00:00
|
|
|
btrfs_zoned_meta_io_lock(fs_info);
|
2012-03-13 13:38:00 +00:00
|
|
|
retry:
|
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL)
|
|
|
|
tag_pages_for_writeback(mapping, index, end);
|
|
|
|
while (!done && !nr_to_write_done && (index <= end) &&
|
2023-01-04 21:14:31 +00:00
|
|
|
(nr_folios = filemap_get_folios_tag(mapping, &index, end,
|
|
|
|
tag, &fbatch))) {
|
2012-03-13 13:38:00 +00:00
|
|
|
unsigned i;
|
|
|
|
|
2023-01-04 21:14:31 +00:00
|
|
|
for (i = 0; i < nr_folios; i++) {
|
|
|
|
struct folio *folio = fbatch.folios[i];
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2024-08-28 18:29:01 +00:00
|
|
|
ret = submit_eb_page(folio, &ctx);
|
2020-12-02 06:48:00 +00:00
|
|
|
if (ret == 0)
|
2012-03-13 13:38:00 +00:00
|
|
|
continue;
|
2020-12-02 06:48:00 +00:00
|
|
|
if (ret < 0) {
|
2012-03-13 13:38:00 +00:00
|
|
|
done = 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the filesystem may choose to bump up nr_to_write.
|
|
|
|
* We have to make sure to honor the new nr_to_write
|
|
|
|
* at any time
|
|
|
|
*/
|
|
|
|
nr_to_write_done = wbc->nr_to_write <= 0;
|
|
|
|
}
|
2023-01-04 21:14:31 +00:00
|
|
|
folio_batch_release(&fbatch);
|
2012-03-13 13:38:00 +00:00
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
if (!scanned && !done) {
|
|
|
|
/*
|
|
|
|
* We hit the last page and there is more work to be done: wrap
|
|
|
|
* back to the start of the file
|
|
|
|
*/
|
|
|
|
scanned = 1;
|
|
|
|
index = 0;
|
|
|
|
goto retry;
|
|
|
|
}
|
btrfs: Don't submit any btree write bio if the fs has errors
[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.
BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
Read of size 8 at addr ffff888067cf6848 by task umount/1922
CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5b/0x8b
print_address_description+0x70/0x280
kasan_report+0x13a/0x19b
btrfs_queue_work+0x2c1/0x390
btrfs_wq_submit_bio+0x1cd/0x240
btree_submit_bio_hook+0x18c/0x2a0
submit_one_bio+0x1be/0x320
flush_write_bio.isra.41+0x2c/0x70
btree_write_cache_pages+0x3bb/0x7f0
do_writepages+0x5c/0x130
__writeback_single_inode+0xa3/0x9a0
writeback_single_inode+0x23d/0x390
write_inode_now+0x1b5/0x280
iput+0x2ef/0x600
close_ctree+0x341/0x750
generic_shutdown_super+0x126/0x370
kill_anon_super+0x31/0x50
btrfs_kill_super+0x36/0x2b0
deactivate_locked_super+0x80/0xc0
deactivate_super+0x13c/0x150
cleanup_mnt+0x9a/0x130
task_work_run+0x11a/0x1b0
exit_to_usermode_loop+0x107/0x130
do_syscall_64+0x1e5/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[CAUSE]
The fuzzed image has a completely screwd up extent tree:
leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 0 count 1
item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 271 offset 0 count 1
item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
item 3 key (29360128 169 0) itemoff 3803 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 4 key (29368320 169 1) itemoff 3770 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 5 key (29372416 169 0) itemoff 3737 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.
In short, the bug is caused by:
- Existing tree block gets allocated to log tree
This got its generation bumped.
- Log tree balance cleaned dirty bit of offending tree block
It will not be written back to disk, thus no WRITTEN flag.
- Original owner of the tree block gets COWed
Since the tree block has higher transid, no WRITTEN flag, it's reused,
and not traced by transaction::dirty_pages.
- Transaction aborted
Tree blocks get cleaned according to transaction::dirty_pages. But the
offending tree block is not recorded at all.
- Filesystem unmount
All pages are assumed to be are clean, destroying all workqueue, then
call iput(btree_inode).
But offending tree block is still dirty, which triggers writeback, and
causes use-after-free bug.
The detailed sequence looks like this:
- Initial status
eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
not traced by any dirty extent_iot_tree.
- New tree block is allocated
Since there is no backref for 29421568, it's re-allocated as new tree
block.
Keep in mind that tree block 29421568 is still referred by extent
tree.
- Tree block 29421568 is filled for log tree
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
traced by btrfs_root::dirty_log_pages
- Some log tree operations
Since the fs is using node size 4096, the log tree can easily go a
level higher.
- Log tree needs balance
Tree block 29421568 gets all its content pushed to right, thus now
it is empty, and we don't need it.
btrfs_clean_tree_block() from __push_leaf_right() get called.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
traced by btrfs_root::dirty_log_pages
- Log tree write back
btree_write_cache_pages() goes through dirty pages ranges, but since
page of tree block 29421568 gets cleaned already, it's not written
back to disk. Thus it doesn't have WRITTEN bit set.
But ranges in dirty_log_pages are cleared.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
not traced by any dirty extent_iot_tree.
- Extent tree update when committing transaction
Since tree block 29421568 has transid equal to running trans, and has
no WRITTEN bit, should_cow_block() will use it directly without adding
it to btrfs_transaction::dirty_pages.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.
At this stage, we're doomed. We have a dirty eb not tracked by any
extent io tree.
- Transaction gets aborted due to corrupted extent tree
Btrfs cleans up dirty pages according to transaction::dirty_pages and
btrfs_root::dirty_log_pages.
But since tree block 29421568 is not tracked by neither of them, it's
still dirty.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_iot_tree.
- Filesystem unmount
Since all cleanup is assumed to be done, all workqueus are destroyed.
Then iput(btree_inode) is called, expecting no dirty pages.
But tree 29421568 is still dirty, thus triggering writeback.
Since all workqueues are already freed, we cause use-after-free.
This shows us that, log tree blocks + bad extent tree can cause wild
dirty pages.
[FIX]
To fix the problem, don't submit any btree write bio if the filesytem
has any error. This is the last safe net, just in case other cleanup
haven't caught catch it.
Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 06:12:44 +00:00
|
|
|
/*
|
|
|
|
* If something went wrong, don't allow any metadata write bio to be
|
|
|
|
* submitted.
|
|
|
|
*
|
|
|
|
* This would prevent use-after-free if we had dirty pages not
|
|
|
|
* cleaned up, which can still happen by fuzzed images.
|
|
|
|
*
|
|
|
|
* - Bad extent tree
|
|
|
|
* Allowing existing tree block to be allocated for other trees.
|
|
|
|
*
|
|
|
|
* - Log tree operations
|
|
|
|
* Exiting tree blocks get allocated to log tree, bumps its
|
|
|
|
* generation, then get cleaned in tree re-balance.
|
|
|
|
* Such tree block will not be written back, since it's clean,
|
|
|
|
* thus no WRITTEN flag set.
|
|
|
|
* And after log writes back, this tree block is not traced by
|
|
|
|
* any dirty extent_io_tree.
|
|
|
|
*
|
|
|
|
* - Offending tree block gets re-dirtied from its original owner
|
|
|
|
* Since it has bumped generation, no WRITTEN flag, it can be
|
|
|
|
* reused without COWing. This tree block will not be traced
|
|
|
|
* by btrfs_transaction::dirty_pages.
|
|
|
|
*
|
|
|
|
* Now such dirty tree block will not be cleaned by any dirty
|
|
|
|
* extent io tree. Thus we don't want to submit such wild eb
|
|
|
|
* if the fs already has error.
|
2022-06-03 07:11:02 +00:00
|
|
|
*
|
2024-07-23 20:32:29 +00:00
|
|
|
* We can get ret > 0 from submit_extent_folio() indicating how many ebs
|
btrfs: avoid double clean up when submit_one_bio() failed
[BUG]
When running generic/475 with 64K page size and 4K sector size, it has a
very high chance (almost 100%) to hang, with mostly data page locked but
no one is going to unlock it.
[CAUSE]
With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads"), if we failed to lookup checksum due to metadata IO error, we
will return error for btrfs_submit_data_bio().
This will cause the page to be unlocked twice in btrfs_do_readpage():
btrfs_do_readpage()
|- submit_extent_page()
| |- submit_one_bio()
| |- btrfs_submit_data_bio()
| |- if (ret) {
| |- bio->bi_status = ret;
| |- bio_endio(bio); }
| In the endio function, we will call end_page_read()
| and unlock_extent() to cleanup the subpage range.
|
|- if (ret) {
|- unlock_extent(); end_page_read() }
Here we unlock the extent and cleanup the subpage range
again.
For unlock_extent(), it's mostly double unlock safe.
But for end_page_read(), it's not, especially for subpage case,
as for subpage case we will call btrfs_subpage_end_reader() to reduce
the reader number, and use that to number to determine if we need to
unlock the full page.
If double accounted, it can underflow the number and leave the page
locked without anyone to unlock it.
[FIX]
The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads") itself is completely fine, it's our existing code not properly
handling the error from bio submission hook properly.
This patch will make submit_one_bio() to return void so that the callers
will never be able to do cleanup when bio submission hook fails.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-12 12:30:13 +00:00
|
|
|
* were submitted. Reset it to 0 to avoid false alerts for the caller.
|
|
|
|
*/
|
|
|
|
if (ret > 0)
|
|
|
|
ret = 0;
|
2022-06-03 07:11:02 +00:00
|
|
|
if (!ret && BTRFS_FS_ERROR(fs_info))
|
|
|
|
ret = -EROFS;
|
2023-08-07 16:12:32 +00:00
|
|
|
|
|
|
|
if (ctx.zoned_bg)
|
|
|
|
btrfs_put_block_group(ctx.zoned_bg);
|
2022-06-03 07:11:02 +00:00
|
|
|
btrfs_zoned_meta_io_unlock(fs_info);
|
2012-03-13 13:38:00 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
2021-01-22 09:58:03 +00:00
|
|
|
* Walk the list of dirty pages of the given address space and write all of them.
|
|
|
|
*
|
2022-10-27 11:07:05 +00:00
|
|
|
* @mapping: address space structure to write
|
|
|
|
* @wbc: subtract the number of written pages from *@wbc->nr_to_write
|
|
|
|
* @bio_ctrl: holds context for the write, namely the bio
|
2008-01-24 21:13:08 +00:00
|
|
|
*
|
|
|
|
* If a page is already under I/O, write_cache_pages() skips it, even
|
|
|
|
* if it's dirty. This is desirable behaviour for memory-cleaning writeback,
|
|
|
|
* but it is INCORRECT for data-integrity system calls such as fsync(). fsync()
|
|
|
|
* and msync() need to guarantee that all the data which was dirty at the time
|
|
|
|
* the call was made get new I/O started against them. If wbc->sync_mode is
|
|
|
|
* WB_SYNC_ALL then we were called for data integrity and we must wait for
|
|
|
|
* existing IO to complete.
|
|
|
|
*/
|
2017-02-10 18:38:24 +00:00
|
|
|
static int extent_write_cache_pages(struct address_space *mapping,
|
2022-10-27 11:07:05 +00:00
|
|
|
struct btrfs_bio_ctrl *bio_ctrl)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2023-02-27 15:16:57 +00:00
|
|
|
struct writeback_control *wbc = bio_ctrl->wbc;
|
2012-06-27 21:18:41 +00:00
|
|
|
struct inode *inode = mapping->host;
|
2008-01-24 21:13:08 +00:00
|
|
|
int ret = 0;
|
|
|
|
int done = 0;
|
2009-09-18 20:03:16 +00:00
|
|
|
int nr_to_write_done = 0;
|
2023-01-04 21:14:32 +00:00
|
|
|
struct folio_batch fbatch;
|
|
|
|
unsigned int nr_folios;
|
2008-01-24 21:13:08 +00:00
|
|
|
pgoff_t index;
|
|
|
|
pgoff_t end; /* Inclusive */
|
2016-03-08 00:56:21 +00:00
|
|
|
pgoff_t done_index;
|
|
|
|
int range_whole = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
int scanned = 0;
|
2017-12-05 22:30:38 +00:00
|
|
|
xa_mark_t tag;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2012-06-27 21:18:41 +00:00
|
|
|
/*
|
|
|
|
* We have to hold onto the inode so that ordered extents can do their
|
|
|
|
* work when the IO finishes. The alternative to this is failing to add
|
|
|
|
* an ordered extent if the igrab() fails there and that is a huge pain
|
|
|
|
* to deal with, so instead just hold onto the inode throughout the
|
|
|
|
* writepages operation. If it fails here we are freeing up the inode
|
|
|
|
* anyway and we'd rather not waste our time writing out stuff that is
|
|
|
|
* going to be truncated anyway.
|
|
|
|
*/
|
|
|
|
if (!igrab(inode))
|
|
|
|
return 0;
|
|
|
|
|
2023-01-04 21:14:32 +00:00
|
|
|
folio_batch_init(&fbatch);
|
2008-01-24 21:13:08 +00:00
|
|
|
if (wbc->range_cyclic) {
|
|
|
|
index = mapping->writeback_index; /* Start from prev offset */
|
|
|
|
end = -1;
|
2020-01-03 15:38:44 +00:00
|
|
|
/*
|
|
|
|
* Start from the beginning does not need to cycle over the
|
|
|
|
* range, mark it as scanned.
|
|
|
|
*/
|
|
|
|
scanned = (index == 0);
|
2008-01-24 21:13:08 +00:00
|
|
|
} else {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
index = wbc->range_start >> PAGE_SHIFT;
|
|
|
|
end = wbc->range_end >> PAGE_SHIFT;
|
2016-03-08 00:56:21 +00:00
|
|
|
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
|
|
|
|
range_whole = 1;
|
2008-01-24 21:13:08 +00:00
|
|
|
scanned = 1;
|
|
|
|
}
|
2018-11-01 06:49:03 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We do the tagged writepage as long as the snapshot flush bit is set
|
|
|
|
* and we are the first one who do the filemap_flush() on this inode.
|
|
|
|
*
|
|
|
|
* The nr_to_write == LONG_MAX is needed to make sure other flushers do
|
|
|
|
* not race in and drop the bit.
|
|
|
|
*/
|
|
|
|
if (range_whole && wbc->nr_to_write == LONG_MAX &&
|
|
|
|
test_and_clear_bit(BTRFS_INODE_SNAPSHOT_FLUSH,
|
|
|
|
&BTRFS_I(inode)->runtime_flags))
|
|
|
|
wbc->tagged_writepages = 1;
|
|
|
|
|
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
|
2011-07-15 21:26:38 +00:00
|
|
|
tag = PAGECACHE_TAG_TOWRITE;
|
|
|
|
else
|
|
|
|
tag = PAGECACHE_TAG_DIRTY;
|
2008-01-24 21:13:08 +00:00
|
|
|
retry:
|
2018-11-01 06:49:03 +00:00
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
|
2011-07-15 21:26:38 +00:00
|
|
|
tag_pages_for_writeback(mapping, index, end);
|
2016-03-08 00:56:21 +00:00
|
|
|
done_index = index;
|
2009-09-18 20:03:16 +00:00
|
|
|
while (!done && !nr_to_write_done && (index <= end) &&
|
2023-01-04 21:14:32 +00:00
|
|
|
(nr_folios = filemap_get_folios_tag(mapping, &index,
|
|
|
|
end, tag, &fbatch))) {
|
2008-01-24 21:13:08 +00:00
|
|
|
unsigned i;
|
|
|
|
|
2023-01-04 21:14:32 +00:00
|
|
|
for (i = 0; i < nr_folios; i++) {
|
|
|
|
struct folio *folio = fbatch.folios[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-07-17 07:16:22 +00:00
|
|
|
done_index = folio_next_index(folio);
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
2018-04-10 23:36:56 +00:00
|
|
|
* At this point we hold neither the i_pages lock nor
|
|
|
|
* the page lock: the page may be truncated or
|
|
|
|
* invalidated (changing page->mapping to NULL),
|
|
|
|
* or even swizzled back from swapper_space to
|
|
|
|
* tmpfs file mapping
|
2008-01-24 21:13:08 +00:00
|
|
|
*/
|
2023-01-04 21:14:32 +00:00
|
|
|
if (!folio_trylock(folio)) {
|
2022-10-27 11:07:05 +00:00
|
|
|
submit_write_bio(bio_ctrl, 0);
|
2023-01-04 21:14:32 +00:00
|
|
|
folio_lock(folio);
|
2011-11-01 14:08:06 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-01-04 21:14:32 +00:00
|
|
|
if (unlikely(folio->mapping != mapping)) {
|
|
|
|
folio_unlock(folio);
|
2008-01-24 21:13:08 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2023-07-24 13:26:54 +00:00
|
|
|
if (!folio_test_dirty(folio)) {
|
|
|
|
/* Someone wrote it for us. */
|
|
|
|
folio_unlock(folio);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2008-11-19 17:44:22 +00:00
|
|
|
if (wbc->sync_mode != WB_SYNC_NONE) {
|
2023-01-04 21:14:32 +00:00
|
|
|
if (folio_test_writeback(folio))
|
2022-10-27 11:07:05 +00:00
|
|
|
submit_write_bio(bio_ctrl, 0);
|
2023-01-04 21:14:32 +00:00
|
|
|
folio_wait_writeback(folio);
|
2008-11-19 17:44:22 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-01-04 21:14:32 +00:00
|
|
|
if (folio_test_writeback(folio) ||
|
|
|
|
!folio_clear_dirty_for_io(folio)) {
|
|
|
|
folio_unlock(folio);
|
2008-01-24 21:13:08 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2024-08-27 01:30:16 +00:00
|
|
|
ret = extent_writepage(folio, bio_ctrl);
|
2016-03-08 00:56:21 +00:00
|
|
|
if (ret < 0) {
|
|
|
|
done = 1;
|
|
|
|
break;
|
|
|
|
}
|
2009-09-18 20:03:16 +00:00
|
|
|
|
|
|
|
/*
|
2023-07-24 13:26:53 +00:00
|
|
|
* The filesystem may choose to bump up nr_to_write.
|
2009-09-18 20:03:16 +00:00
|
|
|
* We have to make sure to honor the new nr_to_write
|
2023-07-24 13:26:53 +00:00
|
|
|
* at any time.
|
2009-09-18 20:03:16 +00:00
|
|
|
*/
|
2023-07-24 13:26:53 +00:00
|
|
|
nr_to_write_done = (wbc->sync_mode == WB_SYNC_NONE &&
|
|
|
|
wbc->nr_to_write <= 0);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2023-01-04 21:14:32 +00:00
|
|
|
folio_batch_release(&fbatch);
|
2008-01-24 21:13:08 +00:00
|
|
|
cond_resched();
|
|
|
|
}
|
2016-03-08 00:56:22 +00:00
|
|
|
if (!scanned && !done) {
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
|
|
|
* We hit the last page and there is more work to be done: wrap
|
|
|
|
* back to the start of the file
|
|
|
|
*/
|
|
|
|
scanned = 1;
|
|
|
|
index = 0;
|
2020-01-23 20:33:02 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're looping we could run into a page that is locked by a
|
|
|
|
* writer and that writer could be waiting on writeback for a
|
|
|
|
* page in our current bio, and thus deadlock, so flush the
|
|
|
|
* write bio here.
|
|
|
|
*/
|
2022-10-27 11:07:05 +00:00
|
|
|
submit_write_bio(bio_ctrl, 0);
|
btrfs: avoid double clean up when submit_one_bio() failed
[BUG]
When running generic/475 with 64K page size and 4K sector size, it has a
very high chance (almost 100%) to hang, with mostly data page locked but
no one is going to unlock it.
[CAUSE]
With commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads"), if we failed to lookup checksum due to metadata IO error, we
will return error for btrfs_submit_data_bio().
This will cause the page to be unlocked twice in btrfs_do_readpage():
btrfs_do_readpage()
|- submit_extent_page()
| |- submit_one_bio()
| |- btrfs_submit_data_bio()
| |- if (ret) {
| |- bio->bi_status = ret;
| |- bio_endio(bio); }
| In the endio function, we will call end_page_read()
| and unlock_extent() to cleanup the subpage range.
|
|- if (ret) {
|- unlock_extent(); end_page_read() }
Here we unlock the extent and cleanup the subpage range
again.
For unlock_extent(), it's mostly double unlock safe.
But for end_page_read(), it's not, especially for subpage case,
as for subpage case we will call btrfs_subpage_end_reader() to reduce
the reader number, and use that to number to determine if we need to
unlock the full page.
If double accounted, it can underflow the number and leave the page
locked without anyone to unlock it.
[FIX]
The commit 1784b7d502a9 ("btrfs: handle csum lookup errors properly on
reads") itself is completely fine, it's our existing code not properly
handling the error from bio submission hook properly.
This patch will make submit_one_bio() to return void so that the callers
will never be able to do cleanup when bio submission hook fails.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-12 12:30:13 +00:00
|
|
|
goto retry;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2016-03-08 00:56:21 +00:00
|
|
|
|
|
|
|
if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole))
|
|
|
|
mapping->writeback_index = done_index;
|
|
|
|
|
2022-10-28 01:53:04 +00:00
|
|
|
btrfs_add_delayed_iput(BTRFS_I(inode));
|
2016-03-08 00:56:22 +00:00
|
|
|
return ret;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
btrfs: cleanup for extent_write_locked_range()
There are several cleanups for extent_write_locked_range(), most of them
are pure cleanups, but with some preparation for future subpage support.
- Add a proper comment for which call sites are suitable
Unlike regular synchronized extent write back, if async COW or zoned
COW happens, we have all pages in the range still locked.
Thus for those (only) two call sites, we need this function to submit
page content into bios and submit them.
- Remove @mode parameter
All the existing two call sites pass WB_SYNC_ALL. No need for @mode
parameter.
- Better error handling
Currently if we hit an error during the page iteration loop, we
overwrite @ret, causing only the last error can be recorded.
Here we add @found_error and @first_error variable to record if we hit
any error, and the first error we hit.
So the first error won't get lost.
- Don't reuse @start as the cursor
We reuse the parameter @start as the cursor to iterate the range, not
a big problem, but since we're here, introduce a proper @cur as the
cursor.
- Remove impossible branch
Since all pages are still locked after the ordered extent is inserted,
there is no way that pages can get its dirty bit cleared.
Remove the branch where page is not dirty and replace it with an
ASSERT().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:21:58 +00:00
|
|
|
/*
|
|
|
|
* Submit the pages in the range to bio for call sites which delalloc range has
|
|
|
|
* already been ran (aka, ordered extent inserted) and all pages are still
|
|
|
|
* locked.
|
|
|
|
*/
|
2024-07-24 20:31:44 +00:00
|
|
|
void extent_write_locked_range(struct inode *inode, const struct folio *locked_folio,
|
2023-06-28 15:31:42 +00:00
|
|
|
u64 start, u64 end, struct writeback_control *wbc,
|
|
|
|
bool pages_dirty)
|
2008-11-07 03:02:51 +00:00
|
|
|
{
|
btrfs: cleanup for extent_write_locked_range()
There are several cleanups for extent_write_locked_range(), most of them
are pure cleanups, but with some preparation for future subpage support.
- Add a proper comment for which call sites are suitable
Unlike regular synchronized extent write back, if async COW or zoned
COW happens, we have all pages in the range still locked.
Thus for those (only) two call sites, we need this function to submit
page content into bios and submit them.
- Remove @mode parameter
All the existing two call sites pass WB_SYNC_ALL. No need for @mode
parameter.
- Better error handling
Currently if we hit an error during the page iteration loop, we
overwrite @ret, causing only the last error can be recorded.
Here we add @found_error and @first_error variable to record if we hit
any error, and the first error we hit.
So the first error won't get lost.
- Don't reuse @start as the cursor
We reuse the parameter @start as the cursor to iterate the range, not
a big problem, but since we're here, introduce a proper @cur as the
cursor.
- Remove impossible branch
Since all pages are still locked after the ordered extent is inserted,
there is no way that pages can get its dirty bit cleared.
Remove the branch where page is not dirty and replace it with an
ASSERT().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:21:58 +00:00
|
|
|
bool found_error = false;
|
2008-11-07 03:02:51 +00:00
|
|
|
int ret = 0;
|
|
|
|
struct address_space *mapping = inode->i_mapping;
|
2023-09-14 14:45:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
|
2023-05-31 06:05:01 +00:00
|
|
|
const u32 sectorsize = fs_info->sectorsize;
|
|
|
|
loff_t i_size = i_size_read(inode);
|
btrfs: cleanup for extent_write_locked_range()
There are several cleanups for extent_write_locked_range(), most of them
are pure cleanups, but with some preparation for future subpage support.
- Add a proper comment for which call sites are suitable
Unlike regular synchronized extent write back, if async COW or zoned
COW happens, we have all pages in the range still locked.
Thus for those (only) two call sites, we need this function to submit
page content into bios and submit them.
- Remove @mode parameter
All the existing two call sites pass WB_SYNC_ALL. No need for @mode
parameter.
- Better error handling
Currently if we hit an error during the page iteration loop, we
overwrite @ret, causing only the last error can be recorded.
Here we add @found_error and @first_error variable to record if we hit
any error, and the first error we hit.
So the first error won't get lost.
- Don't reuse @start as the cursor
We reuse the parameter @start as the cursor to iterate the range, not
a big problem, but since we're here, introduce a proper @cur as the
cursor.
- Remove impossible branch
Since all pages are still locked after the ordered extent is inserted,
there is no way that pages can get its dirty bit cleared.
Remove the branch where page is not dirty and replace it with an
ASSERT().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:21:58 +00:00
|
|
|
u64 cur = start;
|
2023-02-27 15:16:55 +00:00
|
|
|
struct btrfs_bio_ctrl bio_ctrl = {
|
2023-05-31 06:05:02 +00:00
|
|
|
.wbc = wbc,
|
|
|
|
.opf = REQ_OP_WRITE | wbc_to_write_flags(wbc),
|
2023-02-27 15:16:55 +00:00
|
|
|
};
|
2008-11-07 03:02:51 +00:00
|
|
|
|
2023-05-31 06:05:02 +00:00
|
|
|
if (wbc->no_cgroup_owner)
|
|
|
|
bio_ctrl.opf |= REQ_BTRFS_CGROUP_PUNT;
|
|
|
|
|
2021-09-27 07:22:02 +00:00
|
|
|
ASSERT(IS_ALIGNED(start, sectorsize) && IS_ALIGNED(end + 1, sectorsize));
|
|
|
|
|
btrfs: cleanup for extent_write_locked_range()
There are several cleanups for extent_write_locked_range(), most of them
are pure cleanups, but with some preparation for future subpage support.
- Add a proper comment for which call sites are suitable
Unlike regular synchronized extent write back, if async COW or zoned
COW happens, we have all pages in the range still locked.
Thus for those (only) two call sites, we need this function to submit
page content into bios and submit them.
- Remove @mode parameter
All the existing two call sites pass WB_SYNC_ALL. No need for @mode
parameter.
- Better error handling
Currently if we hit an error during the page iteration loop, we
overwrite @ret, causing only the last error can be recorded.
Here we add @found_error and @first_error variable to record if we hit
any error, and the first error we hit.
So the first error won't get lost.
- Don't reuse @start as the cursor
We reuse the parameter @start as the cursor to iterate the range, not
a big problem, but since we're here, introduce a proper @cur as the
cursor.
- Remove impossible branch
Since all pages are still locked after the ordered extent is inserted,
there is no way that pages can get its dirty bit cleared.
Remove the branch where page is not dirty and replace it with an
ASSERT().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:21:58 +00:00
|
|
|
while (cur <= end) {
|
2021-09-27 07:22:02 +00:00
|
|
|
u64 cur_end = min(round_down(cur, PAGE_SIZE) + PAGE_SIZE - 1, end);
|
2023-06-28 15:31:26 +00:00
|
|
|
u32 cur_len = cur_end + 1 - cur;
|
2024-07-24 18:50:51 +00:00
|
|
|
struct folio *folio;
|
2021-09-27 07:22:02 +00:00
|
|
|
|
2024-07-24 18:50:51 +00:00
|
|
|
folio = __filemap_get_folio(mapping, cur >> PAGE_SHIFT, 0, 0);
|
2023-05-31 06:05:01 +00:00
|
|
|
|
2024-07-24 18:50:51 +00:00
|
|
|
/*
|
|
|
|
* This shouldn't happen, the pages are pinned and locked, this
|
|
|
|
* code is just in case, but shouldn't actually be run.
|
|
|
|
*/
|
|
|
|
if (IS_ERR(folio)) {
|
|
|
|
btrfs_mark_ordered_io_finished(BTRFS_I(inode), NULL,
|
|
|
|
cur, cur_len, false);
|
|
|
|
mapping_set_error(mapping, PTR_ERR(folio));
|
|
|
|
cur = cur_end + 1;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(folio_test_locked(folio));
|
2024-07-24 20:31:44 +00:00
|
|
|
if (pages_dirty && folio != locked_folio)
|
2024-07-24 18:50:51 +00:00
|
|
|
ASSERT(folio_test_dirty(folio));
|
|
|
|
|
btrfs: only unlock the to-be-submitted ranges inside a folio
[SUBPAGE COMPRESSION LIMITS]
Currently inside writepage_delalloc(), if a delalloc range is going to
be submitted asynchronously (inline or compression, the page
dirty/writeback/unlock are all handled in at different time, not at the
submission time), then we return 1 and extent_writepage() will skip the
submission.
This is fine if every sector matches page size, but if a sector is
smaller than page size (aka, subpage case), then it can be very
problematic, for example for the following 64K page:
0 16K 32K 48K 64K
|/| |///////| |/|
| |
4K 52K
Where |/| is the dirty range we need to submit.
In the above case, we need the following different handling for the 3
ranges:
- [0, 4K) needs to be submitted for regular write
A single sector cannot be compressed.
- [16K, 32K) needs to be submitted for compressed write
- [48K, 52K) needs to be submitted for regular write.
Above, if we try to submit [16K, 32K) for compressed write, we will
return 1 and immediately, and without submitting the remaining
[48K, 52K) range.
Furthermore, since extent_writepage() will exit without unlocking any
sectors, the submitted range [0, 4K) will not have sector unlocked.
That's the reason why for now subpage is only allowed for full page
range.
[ENHANCEMENT]
- Introduce a submission bitmap at btrfs_bio_ctrl::submit_bitmap
This records which sectors will be submitted by extent_writepage_io().
This allows us to track which sectors needs to be submitted thus later
to be properly unlocked.
For asynchronously submitted range (inline/compression), the
corresponding bits will be cleared from that bitmap.
- Only return 1 if no sector needs to be submitted in
writepage_delalloc()
- Only submit sectors marked by submission bitmap inside
extent_writepage_io()
So we won't touch the asynchronously submitted part.
- Introduce btrfs_folio_end_writer_lock_bitmap() helper
This will only unlock the involved sectors specified by @bitmap
parameter, to avoid touching the range asynchronously submitted.
Please note that, since subpage compression is still limited to page
aligned range, this change is only a preparation for future sector
perfect compression support for subpage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-02 04:59:06 +00:00
|
|
|
/*
|
|
|
|
* Set the submission bitmap to submit all sectors.
|
|
|
|
* extent_writepage_io() will do the truncation correctly.
|
|
|
|
*/
|
|
|
|
bio_ctrl.submit_bitmap = (unsigned long)-1;
|
2024-08-27 01:30:16 +00:00
|
|
|
ret = extent_writepage_io(BTRFS_I(inode), folio, cur, cur_len,
|
|
|
|
&bio_ctrl, i_size);
|
2023-05-31 06:05:01 +00:00
|
|
|
if (ret == 1)
|
|
|
|
goto next_page;
|
|
|
|
|
2023-06-28 15:31:26 +00:00
|
|
|
if (ret) {
|
2024-07-24 19:57:10 +00:00
|
|
|
btrfs_mark_ordered_io_finished(BTRFS_I(inode), folio,
|
2023-06-28 15:31:26 +00:00
|
|
|
cur, cur_len, !ret);
|
2024-07-24 18:50:51 +00:00
|
|
|
mapping_set_error(mapping, ret);
|
2023-06-28 15:31:26 +00:00
|
|
|
}
|
2024-09-02 04:27:08 +00:00
|
|
|
btrfs_folio_end_writer_lock(fs_info, folio, cur, cur_len);
|
2023-06-28 15:31:29 +00:00
|
|
|
if (ret < 0)
|
btrfs: cleanup for extent_write_locked_range()
There are several cleanups for extent_write_locked_range(), most of them
are pure cleanups, but with some preparation for future subpage support.
- Add a proper comment for which call sites are suitable
Unlike regular synchronized extent write back, if async COW or zoned
COW happens, we have all pages in the range still locked.
Thus for those (only) two call sites, we need this function to submit
page content into bios and submit them.
- Remove @mode parameter
All the existing two call sites pass WB_SYNC_ALL. No need for @mode
parameter.
- Better error handling
Currently if we hit an error during the page iteration loop, we
overwrite @ret, causing only the last error can be recorded.
Here we add @found_error and @first_error variable to record if we hit
any error, and the first error we hit.
So the first error won't get lost.
- Don't reuse @start as the cursor
We reuse the parameter @start as the cursor to iterate the range, not
a big problem, but since we're here, introduce a proper @cur as the
cursor.
- Remove impossible branch
Since all pages are still locked after the ordered extent is inserted,
there is no way that pages can get its dirty bit cleared.
Remove the branch where page is not dirty and replace it with an
ASSERT().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-09-27 07:21:58 +00:00
|
|
|
found_error = true;
|
2023-05-31 06:05:01 +00:00
|
|
|
next_page:
|
2024-07-24 18:50:51 +00:00
|
|
|
folio_put(folio);
|
2021-09-27 07:22:02 +00:00
|
|
|
cur = cur_end + 1;
|
2008-11-07 03:02:51 +00:00
|
|
|
}
|
|
|
|
|
2022-10-27 11:07:05 +00:00
|
|
|
submit_write_bio(&bio_ctrl, found_error ? ret : 0);
|
2008-11-07 03:02:51 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2024-03-18 11:58:28 +00:00
|
|
|
int btrfs_writepages(struct address_space *mapping, struct writeback_control *wbc)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2021-09-08 16:19:27 +00:00
|
|
|
struct inode *inode = mapping->host;
|
2008-01-24 21:13:08 +00:00
|
|
|
int ret = 0;
|
2022-10-27 11:07:05 +00:00
|
|
|
struct btrfs_bio_ctrl bio_ctrl = {
|
2023-02-27 15:16:57 +00:00
|
|
|
.wbc = wbc,
|
2023-02-27 15:16:55 +00:00
|
|
|
.opf = REQ_OP_WRITE | wbc_to_write_flags(wbc),
|
2008-01-24 21:13:08 +00:00
|
|
|
};
|
|
|
|
|
2021-09-08 16:19:27 +00:00
|
|
|
/*
|
|
|
|
* Allow only a single thread to do the reloc work in zoned mode to
|
|
|
|
* protect the write pointer updates.
|
|
|
|
*/
|
2021-12-07 14:28:34 +00:00
|
|
|
btrfs_zoned_data_reloc_lock(BTRFS_I(inode));
|
2023-02-27 15:16:57 +00:00
|
|
|
ret = extent_write_cache_pages(mapping, &bio_ctrl);
|
2022-10-27 11:07:05 +00:00
|
|
|
submit_write_bio(&bio_ctrl, ret);
|
2022-06-07 07:08:30 +00:00
|
|
|
btrfs_zoned_data_reloc_unlock(BTRFS_I(inode));
|
2008-01-24 21:13:08 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2024-03-18 11:52:00 +00:00
|
|
|
void btrfs_readahead(struct readahead_control *rac)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2023-02-27 15:16:55 +00:00
|
|
|
struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ | REQ_RAHEAD };
|
2024-07-23 19:55:33 +00:00
|
|
|
struct folio *folio;
|
2013-07-25 11:22:37 +00:00
|
|
|
struct extent_map *em_cached = NULL;
|
2015-09-28 08:56:26 +00:00
|
|
|
u64 prev_em_start = (u64)-1;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2024-07-23 19:55:33 +00:00
|
|
|
while ((folio = readahead_folio(rac)) != NULL)
|
2024-07-23 21:06:03 +00:00
|
|
|
btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
|
Btrfs: improve multi-thread buffer read
While testing with my buffer read fio jobs[1], I find that btrfs does not
perform well enough.
Here is a scenario in fio jobs:
We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
and all of them will race on add_to_page_cache_lru(), and if one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.
And what's more, reading a page needs a period of time to finish, in which
other threads can slide in and process rest pages:
t1 t2 t3 t4
add Page1
read Page1 add Page2
| read Page2 add Page3
| | read Page3 add Page4
| | | read Page4
-----|------------|-----------|-----------|--------
v v v v
bio bio bio bio
Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio. Thus, we can end up with far more bios
than we need.
Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.
With that said, we can make each bio hold more pages and reduce the number
of bios we need.
Here is some numbers taken from fio results:
w/o patch w patch
------------- -------- ---------------
READ: 745MB/s +25% 934MB/s
[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/
[READ]
filename=foobar
size=2000M
invalidate=1
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-21 03:43:09 +00:00
|
|
|
|
2013-07-25 11:22:37 +00:00
|
|
|
if (em_cached)
|
|
|
|
free_extent_map(em_cached);
|
2022-06-03 07:11:03 +00:00
|
|
|
submit_one_bio(&bio_ctrl);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2022-02-09 20:21:39 +00:00
|
|
|
* basic invalidate_folio code, this waits on any locked or writeback
|
|
|
|
* ranges corresponding to the folio, and then deletes any extent state
|
2008-01-24 21:13:08 +00:00
|
|
|
* records from the tree
|
|
|
|
*/
|
2022-02-09 20:21:39 +00:00
|
|
|
int extent_invalidate_folio(struct extent_io_tree *tree,
|
|
|
|
struct folio *folio, size_t offset)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2010-02-03 19:33:23 +00:00
|
|
|
struct extent_state *cached_state = NULL;
|
2022-02-09 20:21:39 +00:00
|
|
|
u64 start = folio_pos(folio);
|
|
|
|
u64 end = start + folio_size(folio) - 1;
|
2023-09-14 14:24:43 +00:00
|
|
|
size_t blocksize = folio_to_fs_info(folio)->sectorsize;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-11-13 12:51:39 +00:00
|
|
|
/* This function is only called for the btree inode */
|
|
|
|
ASSERT(tree->owner == IO_TREE_BTREE_INODE_IO);
|
|
|
|
|
2013-02-26 08:10:22 +00:00
|
|
|
start += ALIGN(offset, blocksize);
|
2008-01-24 21:13:08 +00:00
|
|
|
if (start > end)
|
|
|
|
return 0;
|
|
|
|
|
2022-09-09 21:53:43 +00:00
|
|
|
lock_extent(tree, start, end, &cached_state);
|
2022-02-09 20:21:39 +00:00
|
|
|
folio_wait_writeback(folio);
|
2020-11-13 12:51:39 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Currently for btree io tree, only EXTENT_LOCKED is utilized,
|
|
|
|
* so here we only need to unlock the extent range to free any
|
|
|
|
* existing extent state.
|
|
|
|
*/
|
2022-09-09 21:53:43 +00:00
|
|
|
unlock_extent(tree, start, end, &cached_state);
|
2008-01-24 21:13:08 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2008-04-18 14:29:50 +00:00
|
|
|
/*
|
2022-05-01 03:15:16 +00:00
|
|
|
* a helper for release_folio, this tests for areas of the page that
|
2008-04-18 14:29:50 +00:00
|
|
|
* are locked or under IO and drops the related state bits if it is safe
|
|
|
|
* to drop the page.
|
|
|
|
*/
|
2024-04-16 19:52:30 +00:00
|
|
|
static bool try_release_extent_state(struct extent_io_tree *tree,
|
2024-08-28 18:29:02 +00:00
|
|
|
struct folio *folio, gfp_t mask)
|
2008-04-18 14:29:50 +00:00
|
|
|
{
|
2024-08-28 18:29:02 +00:00
|
|
|
u64 start = folio_pos(folio);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
u64 end = start + PAGE_SIZE - 1;
|
2024-04-16 19:52:30 +00:00
|
|
|
bool ret;
|
2008-04-18 14:29:50 +00:00
|
|
|
|
2023-09-11 23:09:23 +00:00
|
|
|
if (test_range_bit_exists(tree, start, end, EXTENT_LOCKED)) {
|
2024-04-16 19:52:30 +00:00
|
|
|
ret = false;
|
2019-03-14 13:28:31 +00:00
|
|
|
} else {
|
2022-09-09 21:53:46 +00:00
|
|
|
u32 clear_bits = ~(EXTENT_LOCKED | EXTENT_NODATASUM |
|
2023-12-01 21:00:12 +00:00
|
|
|
EXTENT_DELALLOC_NEW | EXTENT_CTLBITS |
|
|
|
|
EXTENT_QGROUP_RESERVED);
|
2024-04-16 19:52:30 +00:00
|
|
|
int ret2;
|
2022-09-09 21:53:46 +00:00
|
|
|
|
2009-09-24 00:28:46 +00:00
|
|
|
/*
|
btrfs: update the number of bytes used by an inode atomically
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall to report
a value of used blocks that does not correspond to a valid value, that is,
a value that does not match neither what we had before the operation nor
what we get after the operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be, does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has much less extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progres,
# the number of used blocks/file size reported by stat should not change
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and locking the whole range (0 to (u64)-1) while it
it computes the number of blocks used. But that would mean blocking
stat(2), which is a very used syscall and expected to be fast, waiting
for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-04 11:07:34 +00:00
|
|
|
* At this point we can safely clear everything except the
|
|
|
|
* locked bit, the nodatasum bit and the delalloc new bit.
|
|
|
|
* The delalloc new bit will be cleared by ordered extent
|
|
|
|
* completion.
|
2009-09-24 00:28:46 +00:00
|
|
|
*/
|
2024-04-16 19:52:30 +00:00
|
|
|
ret2 = __clear_extent_bit(tree, start, end, clear_bits, NULL, NULL);
|
2011-02-14 17:52:08 +00:00
|
|
|
|
|
|
|
/* if clear_extent_bit failed for enomem reasons,
|
|
|
|
* we can't allow the release to continue.
|
|
|
|
*/
|
2024-04-16 19:52:30 +00:00
|
|
|
if (ret2 < 0)
|
|
|
|
ret = false;
|
2011-02-14 17:52:08 +00:00
|
|
|
else
|
2024-04-16 19:52:30 +00:00
|
|
|
ret = true;
|
2008-04-18 14:29:50 +00:00
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
2022-05-01 03:15:16 +00:00
|
|
|
* a helper for release_folio. As long as there are no locked extents
|
2008-01-24 21:13:08 +00:00
|
|
|
* in the range corresponding to the page, both state records and extent
|
|
|
|
* map records are removed
|
|
|
|
*/
|
2024-08-28 18:29:03 +00:00
|
|
|
bool try_release_extent_mapping(struct folio *folio, gfp_t mask)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-08-28 18:29:03 +00:00
|
|
|
u64 start = folio_pos(folio);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
u64 end = start + PAGE_SIZE - 1;
|
2024-08-28 18:29:03 +00:00
|
|
|
struct btrfs_inode *inode = folio_to_inode(folio);
|
2024-04-16 14:07:13 +00:00
|
|
|
struct extent_io_tree *io_tree = &inode->io_tree;
|
2024-04-16 14:34:51 +00:00
|
|
|
|
|
|
|
while (start <= end) {
|
|
|
|
const u64 cur_gen = btrfs_get_fs_generation(inode->root->fs_info);
|
|
|
|
const u64 len = end - start + 1;
|
|
|
|
struct extent_map_tree *extent_tree = &inode->extent_tree;
|
|
|
|
struct extent_map *em;
|
|
|
|
|
|
|
|
write_lock(&extent_tree->lock);
|
|
|
|
em = lookup_extent_mapping(extent_tree, start, len);
|
|
|
|
if (!em) {
|
|
|
|
write_unlock(&extent_tree->lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if ((em->flags & EXTENT_FLAG_PINNED) || em->start != start) {
|
|
|
|
write_unlock(&extent_tree->lock);
|
2020-07-22 11:28:52 +00:00
|
|
|
free_extent_map(em);
|
2024-04-16 14:34:51 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (test_range_bit_exists(io_tree, em->start,
|
|
|
|
extent_map_end(em) - 1, EXTENT_LOCKED))
|
|
|
|
goto next;
|
|
|
|
/*
|
|
|
|
* If it's not in the list of modified extents, used by a fast
|
|
|
|
* fsync, we can remove it. If it's being logged we can safely
|
|
|
|
* remove it since fsync took an extra reference on the em.
|
|
|
|
*/
|
|
|
|
if (list_empty(&em->list) || (em->flags & EXTENT_FLAG_LOGGING))
|
|
|
|
goto remove_em;
|
|
|
|
/*
|
|
|
|
* If it's in the list of modified extents, remove it only if
|
|
|
|
* its generation is older then the current one, in which case
|
|
|
|
* we don't need it for a fast fsync. Otherwise don't remove it,
|
|
|
|
* we could be racing with an ongoing fast fsync that could miss
|
|
|
|
* the new extent.
|
|
|
|
*/
|
|
|
|
if (em->generation >= cur_gen)
|
|
|
|
goto next;
|
|
|
|
remove_em:
|
|
|
|
/*
|
|
|
|
* We only remove extent maps that are not in the list of
|
|
|
|
* modified extents or that are in the list but with a
|
|
|
|
* generation lower then the current generation, so there is no
|
|
|
|
* need to set the full fsync flag on the inode (it hurts the
|
|
|
|
* fsync performance for workloads with a data size that exceeds
|
|
|
|
* or is close to the system's memory).
|
|
|
|
*/
|
|
|
|
remove_extent_mapping(inode, em);
|
|
|
|
/* Once for the inode's extent map tree. */
|
|
|
|
free_extent_map(em);
|
btrfs: fix race between page release and a fast fsync
When releasing an extent map, done through the page release callback, we
can race with an ongoing fast fsync and cause the fsync to miss a new
extent and not log it. The steps for this to happen are the following:
1) A page is dirtied for some inode I;
2) Writeback for that page is triggered by a path other than fsync, for
example by the system due to memory pressure;
3) When the ordered extent for the extent (a single 4K page) finishes,
we unpin the corresponding extent map and set its generation to N,
the current transaction's generation;
4) The btrfs_releasepage() callback is invoked by the system due to
memory pressure for that no longer dirty page of inode I;
5) At the same time, some task calls fsync on inode I, joins transaction
N, and at btrfs_log_inode() it sees that the inode does not have the
full sync flag set, so we proceed with a fast fsync. But before we get
into btrfs_log_changed_extents() and lock the inode's extent map tree:
6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
and we remove the extent map for the new 4Kb extent, because it is
neither pinned anymore nor locked. By calling remove_extent_mapping(),
we remove the extent map from the list of modified extents, since the
extent map does not have the logging flag set. We unlock the inode's
extent map tree;
7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
locks the inode's extent map tree and iterates its list of modified
extents, which no longer has the 4Kb extent in it, so it does not log
the extent;
8) The fsync finishes;
9) Before transaction N is committed, a power failure happens. After
replaying the log, the 4K extent of inode I will be missing, since
it was not logged due to the race with try_release_extent_mapping().
So fix this by teaching try_release_extent_mapping() to not remove an
extent map if it's still in the list of modified extents.
Fixes: ff44c6e36dc9dc ("Btrfs: do not hold the write_lock on the extent tree while logging")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-22 11:28:37 +00:00
|
|
|
next:
|
2024-04-16 14:34:51 +00:00
|
|
|
start = extent_map_end(em);
|
|
|
|
write_unlock(&extent_tree->lock);
|
2008-01-29 14:59:12 +00:00
|
|
|
|
2024-04-16 14:34:51 +00:00
|
|
|
/* Once for us, for the lookup_extent_mapping() reference. */
|
|
|
|
free_extent_map(em);
|
|
|
|
|
|
|
|
if (need_resched()) {
|
|
|
|
/*
|
|
|
|
* If we need to resched but we can't block just exit
|
|
|
|
* and leave any remaining extent maps.
|
|
|
|
*/
|
|
|
|
if (!gfpflags_allow_blocking(mask))
|
|
|
|
break;
|
2020-05-08 21:15:37 +00:00
|
|
|
|
2024-04-16 14:34:51 +00:00
|
|
|
cond_resched();
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
}
|
2024-08-28 18:29:02 +00:00
|
|
|
return try_release_extent_state(io_tree, folio, mask);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2010-08-06 17:21:20 +00:00
|
|
|
static void __free_extent_buffer(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
kmem_cache_free(extent_buffer_cache, eb);
|
|
|
|
}
|
|
|
|
|
2023-05-03 15:24:21 +00:00
|
|
|
static int extent_buffer_under_io(const struct extent_buffer *eb)
|
2013-08-07 18:54:37 +00:00
|
|
|
{
|
2023-05-03 15:24:36 +00:00
|
|
|
return (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
|
2013-08-07 18:54:37 +00:00
|
|
|
test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
|
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
static bool folio_range_has_eb(struct btrfs_fs_info *fs_info, struct folio *folio)
|
2013-08-07 18:54:37 +00:00
|
|
|
{
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
struct btrfs_subpage *subpage;
|
2013-08-07 18:54:37 +00:00
|
|
|
|
for-6.8-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
2024-01-10 17:27:40 +00:00
|
|
|
lockdep_assert_held(&folio->mapping->i_private_lock);
|
2013-08-07 18:54:37 +00:00
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
if (folio_test_private(folio)) {
|
|
|
|
subpage = folio_get_private(folio);
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
if (atomic_read(&subpage->eb_refs))
|
|
|
|
return true;
|
btrfs: subpage: fix a rare race between metadata endio and eb freeing
[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.
No other reproducer so far.
The ASSERT() gets triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().
[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:
T1 | T2
------------------------------------+-----------------------------
end_bio_extent_readpage() |
|- btrfs_validate_metadata_buffer() |
| |- free_extent_buffer() |
| Still have 2 refs |
|- end_page_read() |
|- if (unlikely(PagePrivate()) |
| The page still has Private |
| | free_extent_buffer()
| | | Only one ref 1, will be
| | | released
| | |- detach_extent_buffer_page()
| | |- btrfs_detach_subpage()
|- btrfs_set_page_uptodate() |
The page no longer has Private|
>>> ASSERT() triggered <<< |
This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.
But the race window is still there, we have to go another way to solve
it other than relying on random PagePrivate() check.
Data path is not affected, as it will lock the page before reading,
while unlocking the page after the last read has finished, thus no race
window.
[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.
Now btrfs_subpage::readers will be a member shared by both metadata and
data.
For metadata path, we don't do the page unlock as metadata only relies
on extent locking.
At the same time, teach page_range_has_eb() to take
btrfs_subpage::readers into consideration.
So that even if the last eb of a page gets freed, page::private won't be
detached as long as there still are pending end_page_read() calls.
By this we eliminate the race window, this will slight increase the
metadata memory usage, as the page may not be released as frequently as
usual. But it should not be a big deal.
The code got introduced in ("btrfs: submit read time repair only for
each corrupted sector"), but the fix is in a separate patch to keep the
problem description and the crash is rare so it should not hurt
bisectability.
Signed-off-by: Qu Wegruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-07 09:02:58 +00:00
|
|
|
/*
|
|
|
|
* Even there is no eb refs here, we may still have
|
2024-07-23 20:16:20 +00:00
|
|
|
* end_folio_read() call relying on page::private.
|
btrfs: subpage: fix a rare race between metadata endio and eb freeing
[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.
No other reproducer so far.
The ASSERT() gets triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().
[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:
T1 | T2
------------------------------------+-----------------------------
end_bio_extent_readpage() |
|- btrfs_validate_metadata_buffer() |
| |- free_extent_buffer() |
| Still have 2 refs |
|- end_page_read() |
|- if (unlikely(PagePrivate()) |
| The page still has Private |
| | free_extent_buffer()
| | | Only one ref 1, will be
| | | released
| | |- detach_extent_buffer_page()
| | |- btrfs_detach_subpage()
|- btrfs_set_page_uptodate() |
The page no longer has Private|
>>> ASSERT() triggered <<< |
This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.
But the race window is still there, we have to go another way to solve
it other than relying on random PagePrivate() check.
Data path is not affected, as it will lock the page before reading,
while unlocking the page after the last read has finished, thus no race
window.
[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.
Now btrfs_subpage::readers will be a member shared by both metadata and
data.
For metadata path, we don't do the page unlock as metadata only relies
on extent locking.
At the same time, teach page_range_has_eb() to take
btrfs_subpage::readers into consideration.
So that even if the last eb of a page gets freed, page::private won't be
detached as long as there still are pending end_page_read() calls.
By this we eliminate the race window, this will slight increase the
metadata memory usage, as the page may not be released as frequently as
usual. But it should not be a big deal.
The code got introduced in ("btrfs: submit read time repair only for
each corrupted sector"), but the fix is in a separate patch to keep the
problem description and the crash is rare so it should not hurt
bisectability.
Signed-off-by: Qu Wegruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-07 09:02:58 +00:00
|
|
|
*/
|
|
|
|
if (atomic_read(&subpage->readers))
|
|
|
|
return true;
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
2013-08-07 18:54:37 +00:00
|
|
|
|
2024-05-30 17:14:12 +00:00
|
|
|
static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct folio *folio)
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
|
|
|
const bool mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
|
|
|
|
|
|
|
|
/*
|
2023-11-17 03:54:14 +00:00
|
|
|
* For mapped eb, we're going to change the folio private, which should
|
2023-11-17 21:58:23 +00:00
|
|
|
* be done under the i_private_lock.
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
*/
|
|
|
|
if (mapped)
|
for-6.8-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
2024-01-10 17:27:40 +00:00
|
|
|
spin_lock(&folio->mapping->i_private_lock);
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
if (!folio_test_private(folio)) {
|
2015-02-09 09:31:45 +00:00
|
|
|
if (mapped)
|
for-6.8-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
2024-01-10 17:27:40 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
if (fs_info->nodesize >= PAGE_SIZE) {
|
2015-02-09 09:31:45 +00:00
|
|
|
/*
|
|
|
|
* We do this since we'll remove the pages after we've
|
|
|
|
* removed the eb from the radix tree, so we could race
|
|
|
|
* and have this page now attached to the new eb. So
|
2023-11-17 03:54:14 +00:00
|
|
|
* only clear folio if it's still connected to
|
2015-02-09 09:31:45 +00:00
|
|
|
* this eb.
|
|
|
|
*/
|
2023-11-17 03:54:14 +00:00
|
|
|
if (folio_test_private(folio) && folio_get_private(folio) == eb) {
|
2015-02-09 09:31:45 +00:00
|
|
|
BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
|
2023-12-06 23:09:28 +00:00
|
|
|
BUG_ON(folio_test_dirty(folio));
|
|
|
|
BUG_ON(folio_test_writeback(folio));
|
2023-11-17 03:54:14 +00:00
|
|
|
/* We need to make sure we haven't be attached to a new eb. */
|
|
|
|
folio_detach_private(folio);
|
2013-08-07 18:54:37 +00:00
|
|
|
}
|
2015-02-09 09:31:45 +00:00
|
|
|
if (mapped)
|
for-6.8-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
2024-01-10 17:27:40 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2023-11-17 03:54:14 +00:00
|
|
|
* For subpage, we can have dummy eb with folio private attached. In
|
|
|
|
* this case, we can directly detach the private as such folio is only
|
|
|
|
* attached to one dummy eb, no sharing.
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
*/
|
|
|
|
if (!mapped) {
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_detach_subpage(fs_info, folio);
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
btrfs_folio_dec_eb_refs(fs_info, folio);
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
|
|
|
|
/*
|
2023-11-17 03:54:14 +00:00
|
|
|
* We can only detach the folio private if there are no other ebs in the
|
btrfs: subpage: fix a rare race between metadata endio and eb freeing
[BUG]
There is a very rare ASSERT() triggering during full fstests run for
subpage rw support.
No other reproducer so far.
The ASSERT() gets triggered for metadata read in
btrfs_page_set_uptodate() inside end_page_read().
[CAUSE]
There is still a small race window for metadata only, the race could
happen like this:
T1 | T2
------------------------------------+-----------------------------
end_bio_extent_readpage() |
|- btrfs_validate_metadata_buffer() |
| |- free_extent_buffer() |
| Still have 2 refs |
|- end_page_read() |
|- if (unlikely(PagePrivate()) |
| The page still has Private |
| | free_extent_buffer()
| | | Only one ref 1, will be
| | | released
| | |- detach_extent_buffer_page()
| | |- btrfs_detach_subpage()
|- btrfs_set_page_uptodate() |
The page no longer has Private|
>>> ASSERT() triggered <<< |
This race window is super small, thus pretty hard to hit, even with so
many runs of fstests.
But the race window is still there, we have to go another way to solve
it other than relying on random PagePrivate() check.
Data path is not affected, as it will lock the page before reading,
while unlocking the page after the last read has finished, thus no race
window.
[FIX]
This patch will fix the bug by repurposing btrfs_subpage::readers.
Now btrfs_subpage::readers will be a member shared by both metadata and
data.
For metadata path, we don't do the page unlock as metadata only relies
on extent locking.
At the same time, teach page_range_has_eb() to take
btrfs_subpage::readers into consideration.
So that even if the last eb of a page gets freed, page::private won't be
detached as long as there still are pending end_page_read() calls.
By this we eliminate the race window, this will slight increase the
metadata memory usage, as the page may not be released as frequently as
usual. But it should not be a big deal.
The code got introduced in ("btrfs: submit read time repair only for
each corrupted sector"), but the fix is in a separate patch to keep the
problem description and the crash is rare so it should not hurt
bisectability.
Signed-off-by: Qu Wegruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-06-07 09:02:58 +00:00
|
|
|
* page range and no unfinished IO.
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
*/
|
2023-12-06 23:09:28 +00:00
|
|
|
if (!folio_range_has_eb(fs_info, folio))
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_detach_subpage(fs_info, folio);
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
|
for-6.8-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
2024-01-10 17:27:40 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Release all pages attached to the extent buffer */
|
2024-05-30 17:14:12 +00:00
|
|
|
static void btrfs_release_extent_buffer_pages(const struct extent_buffer *eb)
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
{
|
|
|
|
ASSERT(!extent_buffer_under_io(eb));
|
|
|
|
|
2023-12-14 22:39:38 +00:00
|
|
|
for (int i = 0; i < INLINE_EXTENT_BUFFER_PAGES; i++) {
|
2023-12-06 23:09:28 +00:00
|
|
|
struct folio *folio = eb->folios[i];
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
if (!folio)
|
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-01-26 08:33:50 +00:00
|
|
|
continue;
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
detach_extent_buffer_folio(eb, folio);
|
2015-02-09 09:31:45 +00:00
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
/* One for when we allocated the folio. */
|
|
|
|
folio_put(folio);
|
2018-06-27 13:38:22 +00:00
|
|
|
}
|
2013-08-07 18:54:37 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper for releasing the extent buffer.
|
|
|
|
*/
|
|
|
|
static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
|
|
|
|
{
|
2018-07-19 15:24:32 +00:00
|
|
|
btrfs_release_extent_buffer_pages(eb);
|
2022-09-09 21:53:19 +00:00
|
|
|
btrfs_leak_debug_del_eb(eb);
|
2013-08-07 18:54:37 +00:00
|
|
|
__free_extent_buffer(eb);
|
|
|
|
}
|
|
|
|
|
2013-12-16 18:24:27 +00:00
|
|
|
static struct extent_buffer *
|
|
|
|
__alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
|
2014-06-15 00:55:29 +00:00
|
|
|
unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb = NULL;
|
|
|
|
|
2015-08-19 12:17:40 +00:00
|
|
|
eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
|
2008-01-24 21:13:08 +00:00
|
|
|
eb->start = start;
|
|
|
|
eb->len = len;
|
2013-12-16 18:24:27 +00:00
|
|
|
eb->fs_info = fs_info;
|
btrfs: switch extent buffer tree lock to rw_semaphore
Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree. For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon. At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.
However now the opposite is the case. There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order. This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases. The rw_semaphores actually do this properly, and also
have adaptive spinning that works relatively well.
The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.
So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything. This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions. The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree. However, on my test system which is an 80 core single
socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
following results (with all debug options off):
dbench 200 baseline
Throughput 216.056 MB/sec 200 clients 200 procs max_latency=1471.197 ms
dbench 200 with patch
Throughput 737.188 MB/sec 200 clients 200 procs max_latency=714.346 ms
Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there's not enough
tasks to really stress the locking
fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16
baseline
Average Files/sec: 160166.7
p50 Files/sec: 165832
p90 Files/sec: 123886
p99 Files/sec: 123495
real 3m26.527s
user 2m19.223s
sys 48m21.856s
patched
Average Files/sec: 164135.7
p50 Files/sec: 171095
p90 Files/sec: 122889
p99 Files/sec: 113819
real 3m29.660s
user 2m19.990s
sys 44m12.259s
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-20 15:46:09 +00:00
|
|
|
init_rwsem(&eb->lock);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 14:25:08 +00:00
|
|
|
|
2022-09-09 21:53:19 +00:00
|
|
|
btrfs_leak_debug_add_eb(eb);
|
2013-04-22 16:12:31 +00:00
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_lock_init(&eb->refs_lock);
|
2008-01-24 21:13:08 +00:00
|
|
|
atomic_set(&eb->refs, 1);
|
2010-08-06 17:21:20 +00:00
|
|
|
|
2020-12-02 06:48:01 +00:00
|
|
|
ASSERT(len <= BTRFS_MAX_METADATA_BLOCKSIZE);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
return eb;
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
|
2012-05-16 15:00:02 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *new;
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios = num_extent_folios(src);
|
2022-03-30 20:11:22 +00:00
|
|
|
int ret;
|
2012-05-16 15:00:02 +00:00
|
|
|
|
2014-06-15 01:20:26 +00:00
|
|
|
new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
|
2012-05-16 15:00:02 +00:00
|
|
|
if (new == NULL)
|
|
|
|
return NULL;
|
|
|
|
|
2021-01-26 08:33:46 +00:00
|
|
|
/*
|
|
|
|
* Set UNMAPPED before calling btrfs_release_extent_buffer(), as
|
|
|
|
* btrfs_release_extent_buffer() have different behavior for
|
|
|
|
* UNMAPPED subpage extent buffer.
|
|
|
|
*/
|
|
|
|
set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
|
|
|
|
|
2024-06-27 03:52:16 +00:00
|
|
|
ret = alloc_eb_folio_array(new, false);
|
2022-03-30 20:11:22 +00:00
|
|
|
if (ret) {
|
|
|
|
btrfs_release_extent_buffer(new);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++) {
|
|
|
|
struct folio *folio = new->folios[i];
|
2021-01-26 08:33:48 +00:00
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
ret = attach_extent_buffer_folio(new, folio, NULL);
|
2021-01-26 08:33:48 +00:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_release_extent_buffer(new);
|
|
|
|
return NULL;
|
|
|
|
}
|
2023-12-06 23:09:28 +00:00
|
|
|
WARN_ON(folio_test_dirty(folio));
|
2012-05-16 15:00:02 +00:00
|
|
|
}
|
2023-07-15 11:08:32 +00:00
|
|
|
copy_extent_buffer_full(new, src);
|
2021-01-26 08:33:55 +00:00
|
|
|
set_extent_buffer_uptodate(new);
|
2012-05-16 15:00:02 +00:00
|
|
|
|
|
|
|
return new;
|
|
|
|
}
|
|
|
|
|
2015-09-30 03:50:31 +00:00
|
|
|
struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 start, unsigned long len)
|
2012-05-16 15:00:02 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb;
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios = 0;
|
2022-03-30 20:11:22 +00:00
|
|
|
int ret;
|
2012-05-16 15:00:02 +00:00
|
|
|
|
2014-06-15 01:20:26 +00:00
|
|
|
eb = __alloc_extent_buffer(fs_info, start, len);
|
2012-05-16 15:00:02 +00:00
|
|
|
if (!eb)
|
|
|
|
return NULL;
|
|
|
|
|
2024-06-27 03:52:16 +00:00
|
|
|
ret = alloc_eb_folio_array(eb, false);
|
2022-03-30 20:11:22 +00:00
|
|
|
if (ret)
|
|
|
|
goto err;
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
num_folios = num_extent_folios(eb);
|
|
|
|
for (int i = 0; i < num_folios; i++) {
|
|
|
|
ret = attach_extent_buffer_folio(eb, eb->folios[i], NULL);
|
2021-01-26 08:33:51 +00:00
|
|
|
if (ret < 0)
|
|
|
|
goto err;
|
2012-05-16 15:00:02 +00:00
|
|
|
}
|
2022-03-30 20:11:22 +00:00
|
|
|
|
2012-05-16 15:00:02 +00:00
|
|
|
set_extent_buffer_uptodate(eb);
|
|
|
|
btrfs_set_header_nritems(eb, 0);
|
2018-06-27 13:38:24 +00:00
|
|
|
set_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
|
2012-05-16 15:00:02 +00:00
|
|
|
|
|
|
|
return eb;
|
|
|
|
err:
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++) {
|
2023-12-06 23:09:27 +00:00
|
|
|
if (eb->folios[i]) {
|
2023-12-06 23:09:28 +00:00
|
|
|
detach_extent_buffer_folio(eb, eb->folios[i]);
|
2024-07-02 14:31:14 +00:00
|
|
|
folio_put(eb->folios[i]);
|
2022-03-30 20:11:22 +00:00
|
|
|
}
|
2021-01-26 08:33:51 +00:00
|
|
|
}
|
2012-05-16 15:00:02 +00:00
|
|
|
__free_extent_buffer(eb);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2015-09-30 03:50:31 +00:00
|
|
|
struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
|
2016-06-15 13:22:56 +00:00
|
|
|
u64 start)
|
2015-09-30 03:50:31 +00:00
|
|
|
{
|
2016-06-15 13:22:56 +00:00
|
|
|
return __alloc_dummy_extent_buffer(fs_info, start, fs_info->nodesize);
|
2015-09-30 03:50:31 +00:00
|
|
|
}
|
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
static void check_buffer_tree_ref(struct extent_buffer *eb)
|
|
|
|
{
|
2013-01-29 22:49:37 +00:00
|
|
|
int refs;
|
btrfs: fix fatal extent_buffer readahead vs releasepage race
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-17 18:35:19 +00:00
|
|
|
/*
|
|
|
|
* The TREE_REF bit is first set when the extent_buffer is added
|
|
|
|
* to the radix tree. It is also reset, if unset, when a new reference
|
|
|
|
* is created by find_extent_buffer.
|
2012-03-13 13:38:00 +00:00
|
|
|
*
|
btrfs: fix fatal extent_buffer readahead vs releasepage race
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-17 18:35:19 +00:00
|
|
|
* It is only cleared in two cases: freeing the last non-tree
|
|
|
|
* reference to the extent_buffer when its STALE bit is set or
|
2022-05-01 03:15:16 +00:00
|
|
|
* calling release_folio when the tree reference is the only reference.
|
2012-03-13 13:38:00 +00:00
|
|
|
*
|
btrfs: fix fatal extent_buffer readahead vs releasepage race
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-17 18:35:19 +00:00
|
|
|
* In both cases, care is taken to ensure that the extent_buffer's
|
2022-05-01 03:15:16 +00:00
|
|
|
* pages are not under io. However, release_folio can be concurrently
|
btrfs: fix fatal extent_buffer readahead vs releasepage race
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-17 18:35:19 +00:00
|
|
|
* called with creating new references, which is prone to race
|
|
|
|
* conditions between the calls to check_buffer_tree_ref in those
|
|
|
|
* codepaths and clearing TREE_REF in try_release_extent_buffer.
|
2012-03-13 13:38:00 +00:00
|
|
|
*
|
btrfs: fix fatal extent_buffer readahead vs releasepage race
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing signficant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0 CPU1 CPU2
reada_for_search reada_for_search
readahead_tree_block readahead_tree_block
find_create_tree_block find_create_tree_block
alloc_extent_buffer alloc_extent_buffer
find_extent_buffer // not found
allocates eb
lock pages
associate pages to eb
insert eb into radix tree
set TREE_REF, refs == 2
unlock pages
read_extent_buffer_pages // WAIT_NONE
not uptodate (brand new eb)
lock_page
if !trylock_page
goto unlock_exit // not an error
free_extent_buffer
release_extent_buffer
atomic_dec_and_test refs to 1
find_extent_buffer // found
try_release_extent_buffer
take refs_lock
reads refs == 1; no io
atomic_inc_not_zero refs to 2
mark_buffer_accessed
check_buffer_tree_ref
// not STALE, won't take refs_lock
refs == 2; TREE_REF set // no action
read_extent_buffer_pages // WAIT_NONE
clear TREE_REF
release_extent_buffer
atomic_dec_and_test refs to 1
unlock_page
still not uptodate (CPU1 read failed on trylock_page)
locks pages
set io_pages > 0
submit io
return
free_extent_buffer
release_extent_buffer
dec refs to 0
delete from radix tree
btrfs_release_extent_buffer_pages
BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-17 18:35:19 +00:00
|
|
|
* The actual lifetime of the extent_buffer in the radix tree is
|
|
|
|
* adequately protected by the refcount, but the TREE_REF bit and
|
|
|
|
* its corresponding reference are not. To protect against this
|
|
|
|
* class of races, we call check_buffer_tree_ref from the codepaths
|
2023-05-03 15:24:36 +00:00
|
|
|
* which trigger io. Note that once io is initiated, TREE_REF can no
|
|
|
|
* longer be cleared, so that is the moment at which any such race is
|
|
|
|
* best fixed.
|
2012-03-13 13:38:00 +00:00
|
|
|
*/
|
2013-01-29 22:49:37 +00:00
|
|
|
refs = atomic_read(&eb->refs);
|
|
|
|
if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
|
|
|
|
return;
|
|
|
|
|
2012-07-20 20:11:08 +00:00
|
|
|
spin_lock(&eb->refs_lock);
|
|
|
|
if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
|
2012-03-13 13:38:00 +00:00
|
|
|
atomic_inc(&eb->refs);
|
2012-07-20 20:11:08 +00:00
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
static void mark_extent_buffer_accessed(struct extent_buffer *eb)
|
2012-03-15 22:24:42 +00:00
|
|
|
{
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios= num_extent_folios(eb);
|
2012-03-15 22:24:42 +00:00
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
check_buffer_tree_ref(eb);
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++)
|
|
|
|
folio_mark_accessed(eb->folios[i]);
|
2012-03-15 22:24:42 +00:00
|
|
|
}
|
|
|
|
|
2013-12-16 18:24:27 +00:00
|
|
|
struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 start)
|
2013-10-07 15:45:25 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
|
2021-04-06 00:36:00 +00:00
|
|
|
eb = find_extent_buffer_nolock(fs_info, start);
|
|
|
|
if (!eb)
|
|
|
|
return NULL;
|
|
|
|
/*
|
|
|
|
* Lock our eb's refs_lock to avoid races with free_extent_buffer().
|
|
|
|
* When we get our eb it might be flagged with EXTENT_BUFFER_STALE and
|
|
|
|
* another task running free_extent_buffer() might have seen that flag
|
|
|
|
* set, eb->refs == 2, that the buffer isn't under IO (dirty and
|
|
|
|
* writeback flags not set) and it's still in the tree (flag
|
|
|
|
* EXTENT_BUFFER_TREE_REF set), therefore being in the process of
|
|
|
|
* decrementing the extent buffer's reference count twice. So here we
|
|
|
|
* could race and increment the eb's reference count, clear its stale
|
|
|
|
* flag, mark it as dirty and drop our reference before the other task
|
|
|
|
* finishes executing free_extent_buffer, which would later result in
|
|
|
|
* an attempt to free an extent buffer that is dirty.
|
|
|
|
*/
|
|
|
|
if (test_bit(EXTENT_BUFFER_STALE, &eb->bflags)) {
|
|
|
|
spin_lock(&eb->refs_lock);
|
|
|
|
spin_unlock(&eb->refs_lock);
|
2013-10-07 15:45:25 +00:00
|
|
|
}
|
2023-12-06 23:09:28 +00:00
|
|
|
mark_extent_buffer_accessed(eb);
|
2021-04-06 00:36:00 +00:00
|
|
|
return eb;
|
2013-10-07 15:45:25 +00:00
|
|
|
}
|
|
|
|
|
2014-05-07 21:06:09 +00:00
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
|
|
|
struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
|
2016-06-15 13:22:56 +00:00
|
|
|
u64 start)
|
2014-05-07 21:06:09 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb, *exists = NULL;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
eb = find_extent_buffer(fs_info, start);
|
|
|
|
if (eb)
|
|
|
|
return eb;
|
2016-06-15 13:22:56 +00:00
|
|
|
eb = alloc_dummy_extent_buffer(fs_info, start);
|
2014-05-07 21:06:09 +00:00
|
|
|
if (!eb)
|
2019-12-03 11:24:58 +00:00
|
|
|
return ERR_PTR(-ENOMEM);
|
2014-05-07 21:06:09 +00:00
|
|
|
eb->fs_info = fs_info;
|
2022-07-15 11:59:31 +00:00
|
|
|
again:
|
|
|
|
ret = radix_tree_preload(GFP_NOFS);
|
|
|
|
if (ret) {
|
|
|
|
exists = ERR_PTR(ret);
|
|
|
|
goto free_eb;
|
|
|
|
}
|
|
|
|
spin_lock(&fs_info->buffer_lock);
|
|
|
|
ret = radix_tree_insert(&fs_info->buffer_radix,
|
|
|
|
start >> fs_info->sectorsize_bits, eb);
|
|
|
|
spin_unlock(&fs_info->buffer_lock);
|
|
|
|
radix_tree_preload_end();
|
|
|
|
if (ret == -EEXIST) {
|
|
|
|
exists = find_extent_buffer(fs_info, start);
|
|
|
|
if (exists)
|
2014-05-07 21:06:09 +00:00
|
|
|
goto free_eb;
|
2022-07-15 11:59:31 +00:00
|
|
|
else
|
|
|
|
goto again;
|
|
|
|
}
|
2014-05-07 21:06:09 +00:00
|
|
|
check_buffer_tree_ref(eb);
|
|
|
|
set_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags);
|
|
|
|
|
|
|
|
return eb;
|
|
|
|
free_eb:
|
|
|
|
btrfs_release_extent_buffer(eb);
|
|
|
|
return exists;
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2021-01-26 08:33:49 +00:00
|
|
|
static struct extent_buffer *grab_extent_buffer(
|
|
|
|
struct btrfs_fs_info *fs_info, struct page *page)
|
2021-01-06 01:01:45 +00:00
|
|
|
{
|
2023-11-17 03:54:14 +00:00
|
|
|
struct folio *folio = page_folio(page);
|
2021-01-06 01:01:45 +00:00
|
|
|
struct extent_buffer *exists;
|
|
|
|
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
lockdep_assert_held(&page->mapping->i_private_lock);
|
|
|
|
|
2021-01-26 08:33:49 +00:00
|
|
|
/*
|
|
|
|
* For subpage case, we completely rely on radix tree to ensure we
|
|
|
|
* don't try to insert two ebs for the same bytenr. So here we always
|
|
|
|
* return NULL and just continue.
|
|
|
|
*/
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
if (fs_info->nodesize < PAGE_SIZE)
|
2021-01-26 08:33:49 +00:00
|
|
|
return NULL;
|
|
|
|
|
2021-01-06 01:01:45 +00:00
|
|
|
/* Page not yet attached to an extent buffer */
|
2023-11-17 03:54:14 +00:00
|
|
|
if (!folio_test_private(folio))
|
2021-01-06 01:01:45 +00:00
|
|
|
return NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We could have already allocated an eb for this page and attached one
|
|
|
|
* so lets see if we can get a ref on the existing eb, and if we can we
|
|
|
|
* know it's good and we can just return that one, else we know we can
|
2023-11-17 03:54:14 +00:00
|
|
|
* just overwrite folio private.
|
2021-01-06 01:01:45 +00:00
|
|
|
*/
|
2023-11-17 03:54:14 +00:00
|
|
|
exists = folio_get_private(folio);
|
2021-01-06 01:01:45 +00:00
|
|
|
if (atomic_inc_not_zero(&exists->refs))
|
|
|
|
return exists;
|
|
|
|
|
|
|
|
WARN_ON(PageDirty(page));
|
2023-11-17 03:54:14 +00:00
|
|
|
folio_detach_private(folio);
|
2021-01-06 01:01:45 +00:00
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
static int check_eb_alignment(struct btrfs_fs_info *fs_info, u64 start)
|
|
|
|
{
|
|
|
|
if (!IS_ALIGNED(start, fs_info->sectorsize)) {
|
|
|
|
btrfs_err(fs_info, "bad tree block start %llu", start);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (fs_info->nodesize < PAGE_SIZE &&
|
|
|
|
offset_in_page(start) + fs_info->nodesize > PAGE_SIZE) {
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"tree block crosses page boundary, start %llu nodesize %u",
|
|
|
|
start, fs_info->nodesize);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
if (fs_info->nodesize >= PAGE_SIZE &&
|
2022-05-26 14:35:40 +00:00
|
|
|
!PAGE_ALIGNED(start)) {
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
btrfs_err(fs_info,
|
|
|
|
"tree block is not page aligned, start %llu nodesize %u",
|
|
|
|
start, fs_info->nodesize);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
2023-08-24 06:33:36 +00:00
|
|
|
if (!IS_ALIGNED(start, fs_info->nodesize) &&
|
|
|
|
!test_and_set_bit(BTRFS_FS_UNALIGNED_TREE_BLOCK, &fs_info->flags)) {
|
|
|
|
btrfs_warn(fs_info,
|
|
|
|
"tree block not nodesize aligned, start %llu nodesize %u, can be resolved by a full metadata balance",
|
|
|
|
start, fs_info->nodesize);
|
|
|
|
}
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-11-29 22:32:08 +00:00
|
|
|
|
|
|
|
/*
|
2023-12-06 23:09:27 +00:00
|
|
|
* Return 0 if eb->folios[i] is attached to btree inode successfully.
|
|
|
|
* Return >0 if there is already another extent buffer for the range,
|
2023-11-29 22:32:08 +00:00
|
|
|
* and @found_eb_ret would be updated.
|
2023-12-06 23:09:28 +00:00
|
|
|
* Return -EAGAIN if the filemap has an existing folio but with different size
|
|
|
|
* than @eb.
|
|
|
|
* The caller needs to free the existing folios and retry using the same order.
|
2023-11-29 22:32:08 +00:00
|
|
|
*/
|
2023-12-06 23:09:28 +00:00
|
|
|
static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
struct btrfs_subpage *prealloc,
|
2023-12-06 23:09:28 +00:00
|
|
|
struct extent_buffer **found_eb_ret)
|
2023-11-29 22:32:08 +00:00
|
|
|
{
|
|
|
|
|
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
|
|
|
struct address_space *mapping = fs_info->btree_inode->i_mapping;
|
|
|
|
const unsigned long index = eb->start >> PAGE_SHIFT;
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
struct folio *existing_folio = NULL;
|
2023-11-29 22:32:08 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
ASSERT(found_eb_ret);
|
|
|
|
|
2023-12-06 23:09:27 +00:00
|
|
|
/* Caller should ensure the folio exists. */
|
|
|
|
ASSERT(eb->folios[i]);
|
2023-11-29 22:32:08 +00:00
|
|
|
|
|
|
|
retry:
|
2023-12-06 23:09:27 +00:00
|
|
|
ret = filemap_add_folio(mapping, eb->folios[i], index + i,
|
2023-11-29 22:32:08 +00:00
|
|
|
GFP_NOFS | __GFP_NOFAIL);
|
|
|
|
if (!ret)
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
goto finish;
|
2023-11-29 22:32:08 +00:00
|
|
|
|
|
|
|
existing_folio = filemap_lock_folio(mapping, index + i);
|
|
|
|
/* The page cache only exists for a very short time, just retry. */
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
if (IS_ERR(existing_folio)) {
|
|
|
|
existing_folio = NULL;
|
2023-11-29 22:32:08 +00:00
|
|
|
goto retry;
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
}
|
2023-11-29 22:32:08 +00:00
|
|
|
|
|
|
|
/* For now, we should only have single-page folios for btree inode. */
|
|
|
|
ASSERT(folio_nr_pages(existing_folio) == 1);
|
|
|
|
|
2024-01-05 05:35:55 +00:00
|
|
|
if (folio_size(existing_folio) != eb->folio_size) {
|
2023-12-06 23:09:28 +00:00
|
|
|
folio_unlock(existing_folio);
|
|
|
|
folio_put(existing_folio);
|
|
|
|
return -EAGAIN;
|
|
|
|
}
|
|
|
|
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
finish:
|
|
|
|
spin_lock(&mapping->i_private_lock);
|
|
|
|
if (existing_folio && fs_info->nodesize < PAGE_SIZE) {
|
|
|
|
/* We're going to reuse the existing page, can drop our folio now. */
|
2023-12-06 23:09:27 +00:00
|
|
|
__free_page(folio_page(eb->folios[i], 0));
|
|
|
|
eb->folios[i] = existing_folio;
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
} else if (existing_folio) {
|
2023-11-29 22:32:08 +00:00
|
|
|
struct extent_buffer *existing_eb;
|
|
|
|
|
|
|
|
existing_eb = grab_extent_buffer(fs_info,
|
|
|
|
folio_page(existing_folio, 0));
|
|
|
|
if (existing_eb) {
|
|
|
|
/* The extent buffer still exists, we can use it directly. */
|
|
|
|
*found_eb_ret = existing_eb;
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
spin_unlock(&mapping->i_private_lock);
|
2023-11-29 22:32:08 +00:00
|
|
|
folio_unlock(existing_folio);
|
|
|
|
folio_put(existing_folio);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
/* The extent buffer no longer exists, we can reuse the folio. */
|
2023-12-06 23:09:27 +00:00
|
|
|
__free_page(folio_page(eb->folios[i], 0));
|
|
|
|
eb->folios[i] = existing_folio;
|
2023-11-29 22:32:08 +00:00
|
|
|
}
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
eb->folio_size = folio_size(eb->folios[i]);
|
|
|
|
eb->folio_shift = folio_shift(eb->folios[i]);
|
|
|
|
/* Should not fail, as we have preallocated the memory. */
|
|
|
|
ret = attach_extent_buffer_folio(eb, eb->folios[i], prealloc);
|
|
|
|
ASSERT(!ret);
|
|
|
|
/*
|
|
|
|
* To inform we have an extra eb under allocation, so that
|
|
|
|
* detach_extent_buffer_page() won't release the folio private when the
|
|
|
|
* eb hasn't been inserted into radix tree yet.
|
|
|
|
*
|
|
|
|
* The ref will be decreased when the eb releases the page, in
|
|
|
|
* detach_extent_buffer_page(). Thus needs no special handling in the
|
|
|
|
* error path.
|
|
|
|
*/
|
|
|
|
btrfs_folio_inc_eb_refs(fs_info, eb->folios[i]);
|
|
|
|
spin_unlock(&mapping->i_private_lock);
|
2023-11-29 22:32:08 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-12-16 18:24:27 +00:00
|
|
|
struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
|
2020-11-05 15:45:20 +00:00
|
|
|
u64 start, u64 owner_root, int level)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2016-06-15 13:22:56 +00:00
|
|
|
unsigned long len = fs_info->nodesize;
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios;
|
2023-11-29 22:32:08 +00:00
|
|
|
int attached = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
struct extent_buffer *eb;
|
2023-11-29 22:32:08 +00:00
|
|
|
struct extent_buffer *existing_eb = NULL;
|
2023-07-09 07:08:18 +00:00
|
|
|
struct btrfs_subpage *prealloc = NULL;
|
btrfs: fix lockdep splat with reloc root extent buffers
We have been hitting the following lockdep splat with btrfs/187 recently
WARNING: possible circular locking dependency detected
5.19.0-rc8+ #775 Not tainted
------------------------------------------------------
btrfs/752500 is trying to acquire lock:
ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
but task is already holding lock:
ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_init_new_buffer+0x7d/0x2c0
btrfs_alloc_tree_block+0x120/0x3b0
__btrfs_cow_block+0x136/0x600
btrfs_cow_block+0x10b/0x230
btrfs_search_slot+0x53b/0xb70
btrfs_lookup_inode+0x2a/0xa0
__btrfs_update_delayed_inode+0x5f/0x280
btrfs_async_run_delayed_root+0x24c/0x290
btrfs_work_helper+0xf2/0x3e0
process_one_work+0x271/0x590
worker_thread+0x52/0x3b0
kthread+0xf0/0x120
ret_from_fork+0x1f/0x30
-> #1 (btrfs-tree-01){++++}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_search_slot+0x3c3/0xb70
do_relocation+0x10c/0x6b0
relocate_tree_blocks+0x317/0x6d0
relocate_block_group+0x1f1/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-tree-01/1);
lock(btrfs-tree-01);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02#2);
*** DEADLOCK ***
7 locks held by btrfs/752500:
#0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
#1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
#2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
#3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
#4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
stack backtrace:
CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x73
check_noncircular+0xd6/0x100
? lock_is_held_type+0xe2/0x140
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
? __btrfs_tree_lock+0x24/0x110
down_write_nested+0x41/0x80
? __btrfs_tree_lock+0x24/0x110
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
? lock_release+0x137/0x2d0
? _raw_spin_unlock+0x29/0x50
? release_extent_buffer+0x128/0x180
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
? lock_is_held_type+0xe2/0x140
? lock_is_held_type+0xe2/0x140
? __x64_sys_ioctl+0x88/0xc0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This isn't necessarily new, it's just tricky to hit in practice. There
are two competing things going on here. With relocation we create a
snapshot of every fs tree with a reloc tree. Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set. This creates the lock dependency of
reloc tree -> normal tree
for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.
However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root. This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block. We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace. This creates the dependency of
normal tree -> reloc tree
which is why lockdep complains.
Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has a owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.
Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merged.
This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of
normal tree -> reloc tree
We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key. This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.
With this patch we no longer have the lockdep splat in btrfs/187.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-26 20:24:04 +00:00
|
|
|
u64 lockdep_owner = owner_root;
|
2023-11-16 05:19:06 +00:00
|
|
|
bool page_contig = true;
|
2008-01-24 21:13:08 +00:00
|
|
|
int uptodate = 1;
|
2010-10-27 00:57:29 +00:00
|
|
|
int ret;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
if (check_eb_alignment(fs_info, start))
|
2016-06-06 19:01:23 +00:00
|
|
|
return ERR_PTR(-EINVAL);
|
|
|
|
|
2021-02-25 01:18:14 +00:00
|
|
|
#if BITS_PER_LONG == 32
|
|
|
|
if (start >= MAX_LFS_FILESIZE) {
|
|
|
|
btrfs_err_rl(fs_info,
|
|
|
|
"extent buffer %llu is beyond 32bit page cache limit", start);
|
|
|
|
btrfs_err_32bit_limit(fs_info);
|
|
|
|
return ERR_PTR(-EOVERFLOW);
|
|
|
|
}
|
|
|
|
if (start >= BTRFS_32BIT_EARLY_WARN_THRESHOLD)
|
|
|
|
btrfs_warn_32bit_limit(fs_info);
|
|
|
|
#endif
|
|
|
|
|
2013-12-16 18:24:27 +00:00
|
|
|
eb = find_extent_buffer(fs_info, start);
|
2013-10-07 15:45:25 +00:00
|
|
|
if (eb)
|
2008-07-22 15:18:07 +00:00
|
|
|
return eb;
|
|
|
|
|
2014-06-15 00:55:29 +00:00
|
|
|
eb = __alloc_extent_buffer(fs_info, start, len);
|
2008-04-01 15:21:40 +00:00
|
|
|
if (!eb)
|
2016-06-06 19:01:23 +00:00
|
|
|
return ERR_PTR(-ENOMEM);
|
btrfs: fix lockdep splat with reloc root extent buffers
We have been hitting the following lockdep splat with btrfs/187 recently
WARNING: possible circular locking dependency detected
5.19.0-rc8+ #775 Not tainted
------------------------------------------------------
btrfs/752500 is trying to acquire lock:
ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
but task is already holding lock:
ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_init_new_buffer+0x7d/0x2c0
btrfs_alloc_tree_block+0x120/0x3b0
__btrfs_cow_block+0x136/0x600
btrfs_cow_block+0x10b/0x230
btrfs_search_slot+0x53b/0xb70
btrfs_lookup_inode+0x2a/0xa0
__btrfs_update_delayed_inode+0x5f/0x280
btrfs_async_run_delayed_root+0x24c/0x290
btrfs_work_helper+0xf2/0x3e0
process_one_work+0x271/0x590
worker_thread+0x52/0x3b0
kthread+0xf0/0x120
ret_from_fork+0x1f/0x30
-> #1 (btrfs-tree-01){++++}-{3:3}:
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_search_slot+0x3c3/0xb70
do_relocation+0x10c/0x6b0
relocate_tree_blocks+0x317/0x6d0
relocate_block_group+0x1f1/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
down_write_nested+0x41/0x80
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-tree-01/1);
lock(btrfs-tree-01);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02#2);
*** DEADLOCK ***
7 locks held by btrfs/752500:
#0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
#1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
#2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
#3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
#4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
#6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
stack backtrace:
CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x73
check_noncircular+0xd6/0x100
? lock_is_held_type+0xe2/0x140
__lock_acquire+0x1122/0x1e10
lock_acquire+0xc2/0x2d0
? __btrfs_tree_lock+0x24/0x110
down_write_nested+0x41/0x80
? __btrfs_tree_lock+0x24/0x110
__btrfs_tree_lock+0x24/0x110
btrfs_lock_root_node+0x31/0x50
btrfs_search_slot+0x1cb/0xb70
? lock_release+0x137/0x2d0
? _raw_spin_unlock+0x29/0x50
? release_extent_buffer+0x128/0x180
replace_path+0x541/0x9f0
merge_reloc_root+0x1d6/0x610
merge_reloc_roots+0xe2/0x260
relocate_block_group+0x2c8/0x560
btrfs_relocate_block_group+0x23e/0x400
btrfs_relocate_chunk+0x4c/0x140
btrfs_balance+0x755/0xe40
btrfs_ioctl+0x1ea2/0x2c90
? lock_is_held_type+0xe2/0x140
? lock_is_held_type+0xe2/0x140
? __x64_sys_ioctl+0x88/0xc0
__x64_sys_ioctl+0x88/0xc0
do_syscall_64+0x38/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
This isn't necessarily new, it's just tricky to hit in practice. There
are two competing things going on here. With relocation we create a
snapshot of every fs tree with a reloc tree. Any extent buffers that
get initialized here are initialized with the reloc root lockdep key.
However since it is a snapshot, any blocks that are currently in cache
that originally belonged to the fs tree will have the normal tree
lockdep key set. This creates the lock dependency of
reloc tree -> normal tree
for the extent buffer locking during the first phase of the relocation
as we walk down the reloc root to relocate blocks.
However this is problematic because the final phase of the relocation is
merging the reloc root into the original fs root. This involves
searching down to any keys that exist in the original fs root and then
swapping the relocated block and the original fs root block. We have to
search down to the fs root first, and then go search the reloc root for
the block we need to replace. This creates the dependency of
normal tree -> reloc tree
which is why lockdep complains.
Additionally even if we were to fix this particular mismatch with a
different nesting for the merge case, we're still slotting in a block
that has a owner of the reloc root objectid into a normal tree, so that
block will have its lockdep key set to the tree reloc root, and create a
lockdep splat later on when we wander into that block from the fs root.
Unfortunately the only solution here is to make sure we do not set the
lockdep key to the reloc tree lockdep key normally, and then reset any
blocks we wander into from the reloc root when we're doing the merged.
This solves the problem of having mixed tree reloc keys intermixed with
normal tree keys, and then allows us to make sure in the merge case we
maintain the lock order of
normal tree -> reloc tree
We handle this by setting a bit on the reloc root when we do the search
for the block we want to relocate, and any block we search into or COW
at that point gets set to the reloc tree key. This works correctly
because we only ever COW down to the parent node, so we aren't resetting
the key for the block we're linking into the fs root.
With this patch we no longer have the lockdep splat in btrfs/187.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-07-26 20:24:04 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The reloc trees are just snapshots, so we need them to appear to be
|
|
|
|
* just like any other fs tree WRT lockdep.
|
|
|
|
*/
|
|
|
|
if (lockdep_owner == BTRFS_TREE_RELOC_OBJECTID)
|
|
|
|
lockdep_owner = BTRFS_FS_TREE_OBJECTID;
|
|
|
|
|
|
|
|
btrfs_set_buffer_lockdep_class(lockdep_owner, eb, level);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-07-09 07:08:18 +00:00
|
|
|
/*
|
2023-11-17 03:54:14 +00:00
|
|
|
* Preallocate folio private for subpage case, so that we won't
|
2023-11-17 21:58:23 +00:00
|
|
|
* allocate memory with i_private_lock nor page lock hold.
|
2023-07-09 07:08:18 +00:00
|
|
|
*
|
|
|
|
* The memory will be freed by attach_extent_buffer_page() or freed
|
|
|
|
* manually if we exit earlier.
|
|
|
|
*/
|
|
|
|
if (fs_info->nodesize < PAGE_SIZE) {
|
|
|
|
prealloc = btrfs_alloc_subpage(fs_info, BTRFS_SUBPAGE_METADATA);
|
|
|
|
if (IS_ERR(prealloc)) {
|
2023-11-29 22:32:08 +00:00
|
|
|
ret = PTR_ERR(prealloc);
|
|
|
|
goto out;
|
2023-07-09 07:08:18 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
reallocate:
|
2023-11-29 22:32:08 +00:00
|
|
|
/* Allocate all pages first. */
|
2024-06-27 03:52:16 +00:00
|
|
|
ret = alloc_eb_folio_array(eb, true);
|
2023-11-29 22:32:08 +00:00
|
|
|
if (ret < 0) {
|
|
|
|
btrfs_free_subpage(prealloc);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
num_folios = num_extent_folios(eb);
|
2023-11-29 22:32:08 +00:00
|
|
|
/* Attach all pages to the filemap. */
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++) {
|
|
|
|
struct folio *folio;
|
2023-11-29 22:32:08 +00:00
|
|
|
|
btrfs: protect folio::private when attaching extent buffer folios
[BUG]
Since v6.8 there are rare kernel crashes reported by various people,
the common factor is bad page status error messages like this:
BUG: Bad page state in process kswapd0 pfn:d6e840
page: refcount:0 mapcount:0 mapping:000000007512f4f2 index:0x2796c2c7c
pfn:0xd6e840
aops:btree_aops ino:1
flags: 0x17ffffe0000008(uptodate|node=0|zone=2|lastcpupid=0x3fffff)
page_type: 0xffffffff()
raw: 0017ffffe0000008 dead000000000100 dead000000000122 ffff88826d0be4c0
raw: 00000002796c2c7c 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: non-NULL mapping
[CAUSE]
Commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") changes the sequence when allocating a new
extent buffer.
Previously we always called grab_extent_buffer() under
mapping->i_private_lock, to ensure the safety on modification on
folio::private (which is a pointer to extent buffer for regular
sectorsize).
This can lead to the following race:
Thread A is trying to allocate an extent buffer at bytenr X, with 4
4K pages, meanwhile thread B is trying to release the page at X + 4K
(the second page of the extent buffer at X).
Thread A | Thread B
-----------------------------------+-------------------------------------
| btree_release_folio()
| | This is for the page at X + 4K,
| | Not page X.
| |
alloc_extent_buffer() | |- release_extent_buffer()
|- filemap_add_folio() for the | | |- atomic_dec_and_test(eb->refs)
| page at bytenr X (the first | | |
| page). | | |
| Which returned -EEXIST. | | |
| | | |
|- filemap_lock_folio() | | |
| Returned the first page locked. | | |
| | | |
|- grab_extent_buffer() | | |
| |- atomic_inc_not_zero() | | |
| | Returned false | | |
| |- folio_detach_private() | | |- folio_detach_private() for X
| |- folio_test_private() | | |- folio_test_private()
| Returned true | | | Returned true
|- folio_put() | |- folio_put()
Now there are two puts on the same folio at folio X, leading to refcount
underflow of the folio X, and eventually causing the BUG_ON() on the
page->mapping.
The condition is not that easy to hit:
- The release must be triggered for the middle page of an eb
If the release is on the same first page of an eb, page lock would kick
in and prevent the race.
- folio_detach_private() has a very small race window
It's only between folio_test_private() and folio_clear_private().
That's exactly when mapping->i_private_lock is used to prevent such race,
and commit 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to
allocate-then-attach method") screwed that up.
At that time, I thought the page lock would kick in as
filemap_release_folio() also requires the page to be locked, but forgot
the filemap_release_folio() only locks one page, not all pages of an
extent buffer.
[FIX]
Move all the code requiring i_private_lock into
attach_eb_folio_to_filemap(), so that everything is done with proper
lock protection.
Furthermore to prevent future problems, add an extra
lockdep_assert_locked() to ensure we're holding the proper lock.
To reproducer that is able to hit the race (takes a few minutes with
instrumented code inserting delays to alloc_extent_buffer()):
#!/bin/sh
drop_caches () {
while(true); do
echo 3 > /proc/sys/vm/drop_caches
echo 1 > /proc/sys/vm/compact_memory
done
}
run_tar () {
while(true); do
for x in `seq 1 80` ; do
tar cf /dev/zero /mnt > /dev/null &
done
wait
done
}
mkfs.btrfs -f -d single -m single /dev/vda
mount -o noatime /dev/vda /mnt
# create 200,000 files, 1K each
./simoop -n 200000 -E -f 1k /mnt
drop_caches &
(run_tar)
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/linux-btrfs/CAHk-=wgt362nGfScVOOii8cgKn2LVVHeOvOA7OBwg1OwbuJQcw@mail.gmail.com/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Link: https://lore.kernel.org/lkml/CABXGCsPktcHQOvKTbPaTwegMExije=Gpgci5NW=hqORo-s7diA@mail.gmail.com/
Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Link: https://lore.kernel.org/linux-btrfs/e8b3311c-9a75-4903-907f-fc0f7a3fe423@gmx.de/
Reported-by: syzbot+f80b066392366b4af85e@syzkaller.appspotmail.com
Fixes: 09e6cef19c9f ("btrfs: refactor alloc_extent_buffer() to allocate-then-attach method")
CC: stable@vger.kernel.org # 6.8+
CC: Chris Mason <clm@fb.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-06-06 01:31:51 +00:00
|
|
|
ret = attach_eb_folio_to_filemap(eb, i, prealloc, &existing_eb);
|
2023-11-29 22:32:08 +00:00
|
|
|
if (ret > 0) {
|
|
|
|
ASSERT(existing_eb);
|
|
|
|
goto out;
|
2016-06-06 19:01:23 +00:00
|
|
|
}
|
2012-03-07 21:20:05 +00:00
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
/*
|
|
|
|
* TODO: Special handling for a corner case where the order of
|
|
|
|
* folios mismatch between the new eb and filemap.
|
|
|
|
*
|
|
|
|
* This happens when:
|
|
|
|
*
|
|
|
|
* - the new eb is using higher order folio
|
|
|
|
*
|
|
|
|
* - the filemap is still using 0-order folios for the range
|
|
|
|
* This can happen at the previous eb allocation, and we don't
|
|
|
|
* have higher order folio for the call.
|
|
|
|
*
|
|
|
|
* - the existing eb has already been freed
|
|
|
|
*
|
|
|
|
* In this case, we have to free the existing folios first, and
|
|
|
|
* re-allocate using the same order.
|
|
|
|
* Thankfully this is not going to happen yet, as we're still
|
|
|
|
* using 0-order folios.
|
|
|
|
*/
|
|
|
|
if (unlikely(ret == -EAGAIN)) {
|
|
|
|
ASSERT(0);
|
|
|
|
goto reallocate;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2023-11-29 22:32:08 +00:00
|
|
|
attached++;
|
2012-03-07 21:20:05 +00:00
|
|
|
|
2023-11-29 22:32:08 +00:00
|
|
|
/*
|
2023-12-06 23:09:28 +00:00
|
|
|
* Only after attach_eb_folio_to_filemap(), eb->folios[] is
|
2023-11-29 22:32:08 +00:00
|
|
|
* reliable, as we may choose to reuse the existing page cache
|
|
|
|
* and free the allocated page.
|
|
|
|
*/
|
2023-12-06 23:09:28 +00:00
|
|
|
folio = eb->folios[i];
|
2023-12-12 02:28:37 +00:00
|
|
|
WARN_ON(btrfs_folio_test_dirty(fs_info, folio, eb->start, eb->len));
|
2023-11-16 05:19:06 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if the current page is physically contiguous with previous eb
|
|
|
|
* page.
|
2023-12-06 23:09:28 +00:00
|
|
|
* At this stage, either we allocated a large folio, thus @i
|
|
|
|
* would only be 0, or we fall back to per-page allocation.
|
2023-11-16 05:19:06 +00:00
|
|
|
*/
|
2023-12-06 23:09:28 +00:00
|
|
|
if (i && folio_page(eb->folios[i - 1], 0) + 1 != folio_page(folio, 0))
|
2023-11-16 05:19:06 +00:00
|
|
|
page_contig = false;
|
|
|
|
|
2023-12-12 02:28:37 +00:00
|
|
|
if (!btrfs_folio_test_uptodate(fs_info, folio, eb->start, eb->len))
|
2008-01-24 21:13:08 +00:00
|
|
|
uptodate = 0;
|
2011-02-10 17:35:00 +00:00
|
|
|
|
|
|
|
/*
|
2018-07-04 07:24:52 +00:00
|
|
|
* We can't unlock the pages just yet since the extent buffer
|
|
|
|
* hasn't been properly inserted in the radix tree, this
|
2022-05-01 03:15:16 +00:00
|
|
|
* opens a race with btree_release_folio which can free a page
|
2018-07-04 07:24:52 +00:00
|
|
|
* while we are still filling in all pages for the buffer and
|
|
|
|
* we could crash.
|
2011-02-10 17:35:00 +00:00
|
|
|
*/
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
if (uptodate)
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 14:25:08 +00:00
|
|
|
set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
|
2023-11-16 05:19:06 +00:00
|
|
|
/* All pages are physically contiguous, can skip cross page handling. */
|
|
|
|
if (page_contig)
|
2023-12-06 23:09:27 +00:00
|
|
|
eb->addr = folio_address(eb->folios[0]) + offset_in_page(eb->start);
|
2022-07-15 11:59:31 +00:00
|
|
|
again:
|
|
|
|
ret = radix_tree_preload(GFP_NOFS);
|
2023-11-29 22:32:08 +00:00
|
|
|
if (ret)
|
|
|
|
goto out;
|
2022-07-15 11:59:31 +00:00
|
|
|
|
|
|
|
spin_lock(&fs_info->buffer_lock);
|
|
|
|
ret = radix_tree_insert(&fs_info->buffer_radix,
|
|
|
|
start >> fs_info->sectorsize_bits, eb);
|
|
|
|
spin_unlock(&fs_info->buffer_lock);
|
|
|
|
radix_tree_preload_end();
|
|
|
|
if (ret == -EEXIST) {
|
2023-11-29 22:32:08 +00:00
|
|
|
ret = 0;
|
|
|
|
existing_eb = find_extent_buffer(fs_info, start);
|
|
|
|
if (existing_eb)
|
|
|
|
goto out;
|
2022-07-15 11:59:31 +00:00
|
|
|
else
|
|
|
|
goto again;
|
|
|
|
}
|
2008-07-22 15:18:07 +00:00
|
|
|
/* add one reference for the tree */
|
2012-03-13 13:38:00 +00:00
|
|
|
check_buffer_tree_ref(eb);
|
2013-12-13 15:41:51 +00:00
|
|
|
set_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags);
|
2011-02-10 17:35:00 +00:00
|
|
|
|
|
|
|
/*
|
2018-07-04 07:24:52 +00:00
|
|
|
* Now it's safe to unlock the pages because any calls to
|
2022-05-01 03:15:16 +00:00
|
|
|
* btree_release_folio will correctly detect that a page belongs to a
|
2018-07-04 07:24:52 +00:00
|
|
|
* live buffer and won't free them prematurely.
|
2011-02-10 17:35:00 +00:00
|
|
|
*/
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++)
|
2023-12-06 23:09:27 +00:00
|
|
|
unlock_page(folio_page(eb->folios[i], 0));
|
2008-01-24 21:13:08 +00:00
|
|
|
return eb;
|
|
|
|
|
2023-11-29 22:32:08 +00:00
|
|
|
out:
|
2015-02-24 10:47:05 +00:00
|
|
|
WARN_ON(!atomic_dec_and_test(&eb->refs));
|
2023-12-14 22:39:38 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Any attached folios need to be detached before we unlock them. This
|
|
|
|
* is because when we're inserting our new folios into the mapping, and
|
|
|
|
* then attaching our eb to that folio. If we fail to insert our folio
|
|
|
|
* we'll lookup the folio for that index, and grab that EB. We do not
|
|
|
|
* want that to grab this eb, as we're getting ready to free it. So we
|
|
|
|
* have to detach it first and then unlock it.
|
|
|
|
*
|
|
|
|
* We have to drop our reference and NULL it out here because in the
|
|
|
|
* subpage case detaching does a btrfs_folio_dec_eb_refs() for our eb.
|
|
|
|
* Below when we call btrfs_release_extent_buffer() we will call
|
|
|
|
* detach_extent_buffer_folio() on our remaining pages in the !subpage
|
|
|
|
* case. If we left eb->folios[i] populated in the subpage case we'd
|
|
|
|
* double put our reference and be super sad.
|
|
|
|
*/
|
2023-11-29 22:32:08 +00:00
|
|
|
for (int i = 0; i < attached; i++) {
|
2023-12-06 23:09:27 +00:00
|
|
|
ASSERT(eb->folios[i]);
|
2023-12-06 23:09:28 +00:00
|
|
|
detach_extent_buffer_folio(eb, eb->folios[i]);
|
2023-12-06 23:09:27 +00:00
|
|
|
unlock_page(folio_page(eb->folios[i], 0));
|
2023-12-14 22:39:38 +00:00
|
|
|
folio_put(eb->folios[i]);
|
|
|
|
eb->folios[i] = NULL;
|
2010-08-06 17:21:20 +00:00
|
|
|
}
|
2023-11-29 22:32:08 +00:00
|
|
|
/*
|
|
|
|
* Now all pages of that extent buffer is unmapped, set UNMAPPED flag,
|
|
|
|
* so it can be cleaned up without utlizing page->mapping.
|
|
|
|
*/
|
|
|
|
set_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
|
2011-02-10 17:35:00 +00:00
|
|
|
|
2010-10-27 00:57:29 +00:00
|
|
|
btrfs_release_extent_buffer(eb);
|
2023-11-29 22:32:08 +00:00
|
|
|
if (ret < 0)
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
ASSERT(existing_eb);
|
|
|
|
return existing_eb;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
|
|
|
|
{
|
|
|
|
struct extent_buffer *eb =
|
|
|
|
container_of(head, struct extent_buffer, rcu_head);
|
|
|
|
|
|
|
|
__free_extent_buffer(eb);
|
|
|
|
}
|
|
|
|
|
2013-04-26 14:56:29 +00:00
|
|
|
static int release_extent_buffer(struct extent_buffer *eb)
|
2020-02-23 23:16:42 +00:00
|
|
|
__releases(&eb->refs_lock)
|
2012-03-09 21:01:49 +00:00
|
|
|
{
|
2018-06-27 13:38:23 +00:00
|
|
|
lockdep_assert_held(&eb->refs_lock);
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
WARN_ON(atomic_read(&eb->refs) == 0);
|
|
|
|
if (atomic_dec_and_test(&eb->refs)) {
|
2013-12-13 15:41:51 +00:00
|
|
|
if (test_and_clear_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags)) {
|
2013-12-16 18:24:27 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2012-03-09 21:01:49 +00:00
|
|
|
|
2012-05-16 15:00:02 +00:00
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-03-09 21:01:49 +00:00
|
|
|
|
2022-07-15 11:59:31 +00:00
|
|
|
spin_lock(&fs_info->buffer_lock);
|
|
|
|
radix_tree_delete(&fs_info->buffer_radix,
|
|
|
|
eb->start >> fs_info->sectorsize_bits);
|
|
|
|
spin_unlock(&fs_info->buffer_lock);
|
2013-12-13 15:41:51 +00:00
|
|
|
} else {
|
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-05-16 15:00:02 +00:00
|
|
|
}
|
2012-03-09 21:01:49 +00:00
|
|
|
|
2022-09-09 21:53:19 +00:00
|
|
|
btrfs_leak_debug_del_eb(eb);
|
2012-03-09 21:01:49 +00:00
|
|
|
/* Should be safe to release our pages at this point */
|
2018-07-19 15:24:32 +00:00
|
|
|
btrfs_release_extent_buffer_pages(eb);
|
2015-03-16 21:38:02 +00:00
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
2018-06-27 13:38:24 +00:00
|
|
|
if (unlikely(test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags))) {
|
2015-03-16 21:38:02 +00:00
|
|
|
__free_extent_buffer(eb);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#endif
|
2012-03-09 21:01:49 +00:00
|
|
|
call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
|
2012-07-20 20:05:36 +00:00
|
|
|
return 1;
|
2012-03-09 21:01:49 +00:00
|
|
|
}
|
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-07-20 20:05:36 +00:00
|
|
|
|
|
|
|
return 0;
|
2012-03-09 21:01:49 +00:00
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
void free_extent_buffer(struct extent_buffer *eb)
|
|
|
|
{
|
2013-01-29 22:49:37 +00:00
|
|
|
int refs;
|
2008-01-24 21:13:08 +00:00
|
|
|
if (!eb)
|
|
|
|
return;
|
|
|
|
|
2022-08-09 16:36:33 +00:00
|
|
|
refs = atomic_read(&eb->refs);
|
2013-01-29 22:49:37 +00:00
|
|
|
while (1) {
|
2018-10-15 14:04:01 +00:00
|
|
|
if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3)
|
|
|
|
|| (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
|
|
|
|
refs == 1))
|
2013-01-29 22:49:37 +00:00
|
|
|
break;
|
2022-08-09 16:36:33 +00:00
|
|
|
if (atomic_try_cmpxchg(&eb->refs, &refs, refs - 1))
|
2013-01-29 22:49:37 +00:00
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_lock(&eb->refs_lock);
|
|
|
|
if (atomic_read(&eb->refs) == 2 &&
|
|
|
|
test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
|
2012-03-13 13:38:00 +00:00
|
|
|
!extent_buffer_under_io(eb) &&
|
2012-03-09 21:01:49 +00:00
|
|
|
test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
|
|
|
|
atomic_dec(&eb->refs);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* I know this is terrible, but it's temporary until we stop tracking
|
|
|
|
* the uptodate bits and such for the extent buffers.
|
|
|
|
*/
|
2013-04-26 14:56:29 +00:00
|
|
|
release_extent_buffer(eb);
|
2012-03-09 21:01:49 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void free_extent_buffer_stale(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
if (!eb)
|
2008-01-24 21:13:08 +00:00
|
|
|
return;
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_lock(&eb->refs_lock);
|
|
|
|
set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
|
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
|
2012-03-09 21:01:49 +00:00
|
|
|
test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
|
|
|
|
atomic_dec(&eb->refs);
|
2013-04-26 14:56:29 +00:00
|
|
|
release_extent_buffer(eb);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
static void btree_clear_folio_dirty(struct folio *folio)
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
{
|
2023-12-06 23:09:28 +00:00
|
|
|
ASSERT(folio_test_dirty(folio));
|
|
|
|
ASSERT(folio_test_locked(folio));
|
|
|
|
folio_clear_dirty_for_io(folio);
|
|
|
|
xa_lock_irq(&folio->mapping->i_pages);
|
|
|
|
if (!folio_test_dirty(folio))
|
|
|
|
__xa_clear_mark(&folio->mapping->i_pages,
|
|
|
|
folio_index(folio), PAGECACHE_TAG_DIRTY);
|
|
|
|
xa_unlock_irq(&folio->mapping->i_pages);
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void clear_subpage_extent_buffer_dirty(const struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2023-12-06 23:09:28 +00:00
|
|
|
struct folio *folio = eb->folios[0];
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
bool last;
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
/* btree_clear_folio_dirty() needs page locked. */
|
|
|
|
folio_lock(folio);
|
2023-12-12 02:28:37 +00:00
|
|
|
last = btrfs_subpage_clear_and_test_dirty(fs_info, folio, eb->start, eb->len);
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
if (last)
|
2023-12-06 23:09:28 +00:00
|
|
|
btree_clear_folio_dirty(folio);
|
|
|
|
folio_unlock(folio);
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
WARN_ON(atomic_read(&eb->refs) == 0);
|
|
|
|
}
|
|
|
|
|
2023-01-26 21:00:59 +00:00
|
|
|
void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
|
|
|
|
struct extent_buffer *eb)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2023-01-26 21:00:59 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-01-26 21:00:59 +00:00
|
|
|
btrfs_assert_tree_write_locked(eb);
|
|
|
|
|
|
|
|
if (trans && btrfs_header_generation(eb) != trans->transid)
|
|
|
|
return;
|
|
|
|
|
2023-11-23 15:47:16 +00:00
|
|
|
/*
|
|
|
|
* Instead of clearing the dirty flag off of the buffer, mark it as
|
|
|
|
* EXTENT_BUFFER_ZONED_ZEROOUT. This allows us to preserve
|
|
|
|
* write-ordering in zoned mode, without the need to later re-dirty
|
|
|
|
* the extent_buffer.
|
|
|
|
*
|
|
|
|
* The actual zeroout of the buffer will happen later in
|
|
|
|
* btree_csum_one_bio.
|
|
|
|
*/
|
btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer
Btrfs clears the content of an extent buffer marked as
EXTENT_BUFFER_ZONED_ZEROOUT before the bio submission. This mechanism is
introduced to prevent a write hole of an extent buffer, which is once
allocated, marked dirty, but turns out unnecessary and cleaned up within
one transaction operation.
Currently, btrfs_clear_buffer_dirty() marks the extent buffer as
EXTENT_BUFFER_ZONED_ZEROOUT, and skips the entry function. If this call
happens while the buffer is under IO (with the WRITEBACK flag set,
without the DIRTY flag), we can add the ZEROOUT flag and clear the
buffer's content just before a bio submission. As a result:
1) it can lead to adding faulty delayed reference item which leads to a
FS corrupted (EUCLEAN) error, and
2) it writes out cleared tree node on disk
The former issue is previously discussed in [1]. The corruption happens
when it runs a delayed reference update. So, on-disk data is safe.
[1] https://lore.kernel.org/linux-btrfs/3f4f2a0ff1a6c818050434288925bdcf3cd719e5.1709124777.git.naohiro.aota@wdc.com/
The latter one can reach on-disk data. But, as that node is already
processed by btrfs_clear_buffer_dirty(), that will be invalidated in the
next transaction commit anyway. So, the chance of hitting the corruption
is relatively small.
Anyway, we should skip flagging ZEROOUT on a non-DIRTY extent buffer, to
keep the content under IO intact.
Fixes: aa6313e6ff2b ("btrfs: zoned: don't clear dirty flag of extent buffer")
CC: stable@vger.kernel.org # 6.8
Link: https://lore.kernel.org/linux-btrfs/oadvdekkturysgfgi4qzuemd57zudeasynswurjxw3ocdfsef6@sjyufeugh63f/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 05:39:20 +00:00
|
|
|
if (btrfs_is_zoned(fs_info) && test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
|
2023-11-23 15:47:16 +00:00
|
|
|
set_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-01-26 21:00:59 +00:00
|
|
|
if (!test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags))
|
|
|
|
return;
|
|
|
|
|
|
|
|
percpu_counter_add_batch(&fs_info->dirty_metadata_bytes, -eb->len,
|
|
|
|
fs_info->dirty_metadata_batch);
|
|
|
|
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
if (eb->fs_info->nodesize < PAGE_SIZE)
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
return clear_subpage_extent_buffer_dirty(eb);
|
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
num_folios = num_extent_folios(eb);
|
|
|
|
for (int i = 0; i < num_folios; i++) {
|
|
|
|
struct folio *folio = eb->folios[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
if (!folio_test_dirty(folio))
|
2008-11-19 17:44:22 +00:00
|
|
|
continue;
|
2023-12-06 23:09:28 +00:00
|
|
|
folio_lock(folio);
|
|
|
|
btree_clear_folio_dirty(folio);
|
|
|
|
folio_unlock(folio);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2012-03-13 13:38:00 +00:00
|
|
|
WARN_ON(atomic_read(&eb->refs) == 0);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2023-05-08 14:58:38 +00:00
|
|
|
void set_extent_buffer_dirty(struct extent_buffer *eb)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios;
|
2018-09-13 17:44:42 +00:00
|
|
|
bool was_dirty;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
check_buffer_tree_ref(eb);
|
|
|
|
|
2009-03-13 15:00:37 +00:00
|
|
|
was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2023-12-06 23:09:28 +00:00
|
|
|
num_folios = num_extent_folios(eb);
|
2012-03-09 21:01:49 +00:00
|
|
|
WARN_ON(atomic_read(&eb->refs) == 0);
|
2012-03-13 13:38:00 +00:00
|
|
|
WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
|
2024-03-26 05:39:21 +00:00
|
|
|
WARN_ON(test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags));
|
2012-03-13 13:38:00 +00:00
|
|
|
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
if (!was_dirty) {
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
bool subpage = eb->fs_info->nodesize < PAGE_SIZE;
|
2018-09-13 17:46:08 +00:00
|
|
|
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
/*
|
|
|
|
* For subpage case, we can have other extent buffers in the
|
|
|
|
* same page, and in clear_subpage_extent_buffer_dirty() we
|
|
|
|
* have to clear page dirty without subpage lock held.
|
|
|
|
* This can cause race where our page gets dirty cleared after
|
|
|
|
* we just set it.
|
|
|
|
*
|
|
|
|
* Thankfully, clear_subpage_extent_buffer_dirty() has locked
|
|
|
|
* its page for other reasons, we can use page lock to prevent
|
|
|
|
* the above race.
|
|
|
|
*/
|
|
|
|
if (subpage)
|
2023-12-06 23:09:27 +00:00
|
|
|
lock_page(folio_page(eb->folios[0], 0));
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++)
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_folio_set_dirty(eb->fs_info, eb->folios[i],
|
|
|
|
eb->start, eb->len);
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
if (subpage)
|
2023-12-06 23:09:27 +00:00
|
|
|
unlock_page(folio_page(eb->folios[0], 0));
|
2023-05-08 14:58:38 +00:00
|
|
|
percpu_counter_add_batch(&eb->fs_info->dirty_metadata_bytes,
|
|
|
|
eb->len,
|
|
|
|
eb->fs_info->dirty_metadata_batch);
|
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-03-25 07:14:43 +00:00
|
|
|
}
|
2018-09-13 17:46:08 +00:00
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++)
|
|
|
|
ASSERT(folio_test_dirty(eb->folios[i]));
|
2018-09-13 17:46:08 +00:00
|
|
|
#endif
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2015-12-03 12:08:59 +00:00
|
|
|
void clear_extent_buffer_uptodate(struct extent_buffer *eb)
|
2008-05-12 17:39:03 +00:00
|
|
|
{
|
2021-01-26 08:33:54 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios = num_extent_folios(eb);
|
2008-05-12 17:39:03 +00:00
|
|
|
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 14:25:08 +00:00
|
|
|
clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++) {
|
|
|
|
struct folio *folio = eb->folios[i];
|
|
|
|
|
|
|
|
if (!folio)
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* This is special handling for metadata subpage, as regular
|
|
|
|
* btrfs_is_subpage() can not handle cloned/dummy metadata.
|
|
|
|
*/
|
|
|
|
if (fs_info->nodesize >= PAGE_SIZE)
|
2023-12-06 23:09:28 +00:00
|
|
|
folio_clear_uptodate(folio);
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
else
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_subpage_clear_uptodate(fs_info, folio,
|
2023-12-06 23:09:28 +00:00
|
|
|
eb->start, eb->len);
|
2008-05-12 17:39:03 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-12-03 12:08:59 +00:00
|
|
|
void set_extent_buffer_uptodate(struct extent_buffer *eb)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2021-01-26 08:33:54 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios = num_extent_folios(eb);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
|
2023-12-06 23:09:28 +00:00
|
|
|
for (int i = 0; i < num_folios; i++) {
|
|
|
|
struct folio *folio = eb->folios[i];
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This is special handling for metadata subpage, as regular
|
|
|
|
* btrfs_is_subpage() can not handle cloned/dummy metadata.
|
|
|
|
*/
|
|
|
|
if (fs_info->nodesize >= PAGE_SIZE)
|
2023-12-06 23:09:28 +00:00
|
|
|
folio_mark_uptodate(folio);
|
btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine
The reason why we only support 64K page size for subpage is, for 64K
page size we can ensure no matter what the nodesize is, we can fit it
into one page.
When other page size come, especially like 16K, the limitation is a bit
limiting.
To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
non-subpage routine. By this, we can allow 4K sectorsize on 16K page
size.
Although this introduces another smaller limitation, the metadata can
not cross page boundary, which is already met by most recent mkfs.
Another small improvement is, we can avoid the overhead for metadata if
nodesize >= PAGE_SIZE.
For 4K sector size and 64K page size/node size, or 4K sector size and
16K page size/node size, we don't need to allocate extra memory for the
metadata pages.
Please note that, this patch will not yet enable other page size support
yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-01-13 05:22:09 +00:00
|
|
|
else
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_subpage_set_uptodate(fs_info, folio,
|
2023-12-06 23:09:28 +00:00
|
|
|
eb->start, eb->len);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2024-03-18 13:56:53 +00:00
|
|
|
static void clear_extent_buffer_reading(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
|
|
|
|
smp_mb__after_atomic();
|
|
|
|
wake_up_bit(&eb->bflags, EXTENT_BUFFER_READING);
|
|
|
|
}
|
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
static void end_bbio_meta_read(struct btrfs_bio *bbio)
|
2023-05-03 15:24:28 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb = bbio->private;
|
2023-05-03 15:24:40 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2023-05-03 15:24:28 +00:00
|
|
|
bool uptodate = !bbio->bio.bi_status;
|
2023-12-12 02:28:38 +00:00
|
|
|
struct folio_iter fi;
|
2023-05-03 15:24:28 +00:00
|
|
|
u32 bio_offset = 0;
|
|
|
|
|
2024-03-18 13:56:54 +00:00
|
|
|
/*
|
|
|
|
* If the extent buffer is marked UPTODATE before the read operation
|
|
|
|
* completes, other calls to read_extent_buffer_pages() will return
|
|
|
|
* early without waiting for the read to finish, causing data races.
|
|
|
|
*/
|
|
|
|
WARN_ON(test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags));
|
|
|
|
|
2023-05-03 15:24:28 +00:00
|
|
|
eb->read_mirror = bbio->mirror_num;
|
|
|
|
|
|
|
|
if (uptodate &&
|
|
|
|
btrfs_validate_extent_buffer(eb, &bbio->parent_check) < 0)
|
|
|
|
uptodate = false;
|
|
|
|
|
|
|
|
if (uptodate) {
|
|
|
|
set_extent_buffer_uptodate(eb);
|
|
|
|
} else {
|
|
|
|
clear_extent_buffer_uptodate(eb);
|
|
|
|
set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
|
|
|
|
}
|
|
|
|
|
2023-12-12 02:28:38 +00:00
|
|
|
bio_for_each_folio_all(fi, &bbio->bio) {
|
|
|
|
struct folio *folio = fi.folio;
|
2023-05-03 15:24:40 +00:00
|
|
|
u64 start = eb->start + bio_offset;
|
2023-12-12 02:28:38 +00:00
|
|
|
u32 len = fi.length;
|
2023-05-03 15:24:28 +00:00
|
|
|
|
2023-05-03 15:24:40 +00:00
|
|
|
if (uptodate)
|
2023-12-12 02:28:38 +00:00
|
|
|
btrfs_folio_set_uptodate(fs_info, folio, start, len);
|
2023-05-03 15:24:40 +00:00
|
|
|
else
|
2023-12-12 02:28:38 +00:00
|
|
|
btrfs_folio_clear_uptodate(fs_info, folio, start, len);
|
2023-05-03 15:24:40 +00:00
|
|
|
|
|
|
|
bio_offset += len;
|
2023-05-03 15:24:29 +00:00
|
|
|
}
|
2023-05-03 15:24:40 +00:00
|
|
|
|
2024-03-18 13:56:53 +00:00
|
|
|
clear_extent_buffer_reading(eb);
|
2023-05-03 15:24:28 +00:00
|
|
|
free_extent_buffer(eb);
|
|
|
|
|
|
|
|
bio_put(&bbio->bio);
|
|
|
|
}
|
|
|
|
|
2023-05-03 15:24:40 +00:00
|
|
|
int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num,
|
2024-05-30 17:14:12 +00:00
|
|
|
const struct btrfs_tree_parent_check *check)
|
2023-05-03 15:24:26 +00:00
|
|
|
{
|
|
|
|
struct btrfs_bio *bbio;
|
2023-12-06 23:09:28 +00:00
|
|
|
bool ret;
|
2023-05-03 15:24:26 +00:00
|
|
|
|
2023-05-03 15:24:40 +00:00
|
|
|
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We could have had EXTENT_BUFFER_UPTODATE cleared by the write
|
|
|
|
* operation, which could potentially still be in flight. In this case
|
|
|
|
* we simply want to return an error.
|
|
|
|
*/
|
|
|
|
if (unlikely(test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)))
|
|
|
|
return -EIO;
|
|
|
|
|
|
|
|
/* Someone else is already reading the buffer, just wait for it. */
|
|
|
|
if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
|
|
|
|
goto done;
|
|
|
|
|
btrfs: fix race in read_extent_buffer_pages()
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.
To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:
/* (1) */
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
return 0;
/* (2) */
if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
goto done;
At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does
/* (3) */
set_extent_buffer_uptodate(eb);
/* (4) */
clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning. Unfortunately, there is
a racey interleaving:
Thread A | Thread B | Thread C
---------+----------+---------
(1) | |
| (1) |
(2) | |
(3) | |
(4) | |
| (2) |
| | (1)
When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress. This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.
Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-16 01:14:29 +00:00
|
|
|
/*
|
|
|
|
* Between the initial test_bit(EXTENT_BUFFER_UPTODATE) and the above
|
|
|
|
* test_and_set_bit(EXTENT_BUFFER_READING), someone else could have
|
|
|
|
* started and finished reading the same eb. In this case, UPTODATE
|
|
|
|
* will now be set, and we shouldn't read it in again.
|
|
|
|
*/
|
|
|
|
if (unlikely(test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))) {
|
2024-03-18 13:56:53 +00:00
|
|
|
clear_extent_buffer_reading(eb);
|
btrfs: fix race in read_extent_buffer_pages()
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.
To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:
/* (1) */
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
return 0;
/* (2) */
if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
goto done;
At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does
/* (3) */
set_extent_buffer_uptodate(eb);
/* (4) */
clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning. Unfortunately, there is
a racey interleaving:
Thread A | Thread B | Thread C
---------+----------+---------
(1) | |
| (1) |
(2) | |
(3) | |
(4) | |
| (2) |
| | (1)
When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress. This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:
BTRFS critical (device dm-0): corrupted node, root=256
block=8550954455682405139 owner mismatch, have 11858205567642294356
expect [256, 18446744073709551360]
Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.
Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-16 01:14:29 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2023-05-03 15:24:26 +00:00
|
|
|
clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
|
|
|
|
eb->read_mirror = 0;
|
|
|
|
check_buffer_tree_ref(eb);
|
2023-05-03 15:24:36 +00:00
|
|
|
atomic_inc(&eb->refs);
|
2023-05-03 15:24:26 +00:00
|
|
|
|
|
|
|
bbio = btrfs_bio_alloc(INLINE_EXTENT_BUFFER_PAGES,
|
|
|
|
REQ_OP_READ | REQ_META, eb->fs_info,
|
2023-12-12 02:28:38 +00:00
|
|
|
end_bbio_meta_read, eb);
|
2023-05-03 15:24:26 +00:00
|
|
|
bbio->bio.bi_iter.bi_sector = eb->start >> SECTOR_SHIFT;
|
|
|
|
bbio->inode = BTRFS_I(eb->fs_info->btree_inode);
|
|
|
|
bbio->file_offset = eb->start;
|
|
|
|
memcpy(&bbio->parent_check, check, sizeof(*check));
|
|
|
|
if (eb->fs_info->nodesize < PAGE_SIZE) {
|
2023-12-06 23:09:28 +00:00
|
|
|
ret = bio_add_folio(&bbio->bio, eb->folios[0], eb->len,
|
|
|
|
eb->start - folio_pos(eb->folios[0]));
|
|
|
|
ASSERT(ret);
|
2023-05-03 15:24:26 +00:00
|
|
|
} else {
|
2023-12-06 23:09:28 +00:00
|
|
|
int num_folios = num_extent_folios(eb);
|
|
|
|
|
|
|
|
for (int i = 0; i < num_folios; i++) {
|
|
|
|
struct folio *folio = eb->folios[i];
|
|
|
|
|
2024-01-05 05:35:55 +00:00
|
|
|
ret = bio_add_folio(&bbio->bio, folio, eb->folio_size, 0);
|
2023-12-06 23:09:28 +00:00
|
|
|
ASSERT(ret);
|
|
|
|
}
|
2023-05-03 15:24:26 +00:00
|
|
|
}
|
2024-08-27 01:40:11 +00:00
|
|
|
btrfs_submit_bbio(bbio, mirror_num);
|
2023-05-03 15:24:26 +00:00
|
|
|
|
2023-05-03 15:24:40 +00:00
|
|
|
done:
|
|
|
|
if (wait == WAIT_COMPLETE) {
|
|
|
|
wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_READING, TASK_UNINTERRUPTIBLE);
|
|
|
|
if (!test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
|
2023-02-27 15:17:01 +00:00
|
|
|
return -EIO;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2009-01-06 02:25:51 +00:00
|
|
|
|
2023-02-27 15:17:01 +00:00
|
|
|
return 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
|
|
|
static bool report_eb_range(const struct extent_buffer *eb, unsigned long start,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
|
|
|
btrfs_warn(eb->fs_info,
|
2024-01-05 05:35:55 +00:00
|
|
|
"access to eb bytenr %llu len %u out of range start %lu len %lu",
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
|
|
|
eb->start, eb->len, start, len);
|
|
|
|
WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if the [start, start + len) range is valid before reading/writing
|
|
|
|
* the eb.
|
|
|
|
* NOTE: @start and @len are offset inside the eb, not logical address.
|
|
|
|
*
|
|
|
|
* Caller should not touch the dst/src memory if this function returns error.
|
|
|
|
*/
|
|
|
|
static inline int check_eb_range(const struct extent_buffer *eb,
|
|
|
|
unsigned long start, unsigned long len)
|
|
|
|
{
|
|
|
|
unsigned long offset;
|
|
|
|
|
|
|
|
/* start, start + len should not go beyond eb->len nor overflow */
|
|
|
|
if (unlikely(check_add_overflow(start, len, &offset) || offset > eb->len))
|
|
|
|
return report_eb_range(eb, start, len);
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2017-06-29 03:56:53 +00:00
|
|
|
void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
|
|
|
|
unsigned long start, unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-01-05 05:35:55 +00:00
|
|
|
const int unit_size = eb->folio_size;
|
2008-01-24 21:13:08 +00:00
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
char *dst = (char *)dstv;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long i = get_eb_folio_index(eb, start);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-09-19 02:14:42 +00:00
|
|
|
if (check_eb_range(eb, start, len)) {
|
|
|
|
/*
|
|
|
|
* Invalid range hit, reset the memory, so callers won't get
|
2023-12-05 18:26:39 +00:00
|
|
|
* some random garbage for their uninitialized memory.
|
2023-09-19 02:14:42 +00:00
|
|
|
*/
|
|
|
|
memset(dstv, 0, len);
|
Btrfs: fix out of bounds array access while reading extent buffer
There is a corner case that slips through the checkers in functions
reading extent buffer, ie.
if (start < eb->len) and (start + len > eb->len),
then
a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,
b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).
The case is found by switching extent inline ref type from shared data
ref to non-shared data ref, which is a kind of metadata corruption.
It'd use the wrong helper to access the eb,
eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
here is "struct btrfs_shared_data_ref". And if the extent item
happens to be the first item in the eb, then offset/length will get
over eb->len which ends up an invalid memory access.
This is adding proper checks in order to avoid invalid memory access,
ie. 'general protection fault', before it's too late.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-09 17:10:16 +00:00
|
|
|
return;
|
2023-09-19 02:14:42 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-11-16 05:19:06 +00:00
|
|
|
if (eb->addr) {
|
|
|
|
memcpy(dstv, eb->addr + start, len);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
offset = get_eb_offset_in_folio(eb, start);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
char *kaddr;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
cur = min(len, unit_size - offset);
|
|
|
|
kaddr = folio_address(eb->folios[i]);
|
2008-01-24 21:13:08 +00:00
|
|
|
memcpy(dst, kaddr + offset, cur);
|
|
|
|
|
|
|
|
dst += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-08-10 15:42:27 +00:00
|
|
|
int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
|
|
|
|
void __user *dstv,
|
|
|
|
unsigned long start, unsigned long len)
|
2014-01-30 15:24:01 +00:00
|
|
|
{
|
2024-01-05 05:35:55 +00:00
|
|
|
const int unit_size = eb->folio_size;
|
2014-01-30 15:24:01 +00:00
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
char __user *dst = (char __user *)dstv;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long i = get_eb_folio_index(eb, start);
|
2014-01-30 15:24:01 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
WARN_ON(start > eb->len);
|
|
|
|
WARN_ON(start + len > eb->start + eb->len);
|
|
|
|
|
2023-11-16 05:19:06 +00:00
|
|
|
if (eb->addr) {
|
|
|
|
if (copy_to_user_nofault(dstv, eb->addr + start, len))
|
|
|
|
ret = -EFAULT;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
offset = get_eb_offset_in_folio(eb, start);
|
2014-01-30 15:24:01 +00:00
|
|
|
|
|
|
|
while (len > 0) {
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
char *kaddr;
|
2014-01-30 15:24:01 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
cur = min(len, unit_size - offset);
|
|
|
|
kaddr = folio_address(eb->folios[i]);
|
2020-08-10 15:42:27 +00:00
|
|
|
if (copy_to_user_nofault(dst, kaddr + offset, cur)) {
|
2014-01-30 15:24:01 +00:00
|
|
|
ret = -EFAULT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
dst += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-06-29 03:56:53 +00:00
|
|
|
int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
|
|
|
|
unsigned long start, unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-01-05 05:35:55 +00:00
|
|
|
const int unit_size = eb->folio_size;
|
2008-01-24 21:13:08 +00:00
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
char *kaddr;
|
|
|
|
char *ptr = (char *)ptrv;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long i = get_eb_folio_index(eb, start);
|
2008-01-24 21:13:08 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(eb, start, len))
|
|
|
|
return -EINVAL;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-11-16 05:19:06 +00:00
|
|
|
if (eb->addr)
|
|
|
|
return memcmp(ptrv, eb->addr + start, len);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
offset = get_eb_offset_in_folio(eb, start);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
cur = min(len, unit_size - offset);
|
|
|
|
kaddr = folio_address(eb->folios[i]);
|
2008-01-24 21:13:08 +00:00
|
|
|
ret = memcmp(ptr, kaddr + offset, cur);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
|
|
|
|
ptr += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2021-03-25 07:14:42 +00:00
|
|
|
/*
|
|
|
|
* Check that the extent buffer is uptodate.
|
|
|
|
*
|
|
|
|
* For regular sector size == PAGE_SIZE case, check if @page is uptodate.
|
|
|
|
* For subpage case, check if the range covered by the eb has EXTENT_UPTODATE.
|
|
|
|
*/
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
static void assert_eb_folio_uptodate(const struct extent_buffer *eb, int i)
|
2021-03-25 07:14:42 +00:00
|
|
|
{
|
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
struct folio *folio = eb->folios[i];
|
|
|
|
|
|
|
|
ASSERT(folio);
|
2021-03-25 07:14:42 +00:00
|
|
|
|
btrfs: do not WARN_ON() if we have PageError set
Whenever we do any extent buffer operations we call
assert_eb_page_uptodate() to complain loudly if we're operating on an
non-uptodate page. Our overnight tests caught this warning earlier this
week
WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
Call Trace:
extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1f6/0x470
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x26d/0x580
? process_one_work+0x580/0x580
worker_thread+0x55/0x3b0
? process_one_work+0x580/0x580
kthread+0xf0/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer
uptodate when we fail to write it"), however all that fix did was keep
us from finding extent buffers after a failed writeout. It didn't keep
us from continuing to use a buffer that we already had found.
In this case we're searching the commit root to cache the block group,
so we can start committing the transaction and switch the commit root
and then start writing. After the switch we can look up an extent
buffer that hasn't been written yet and start processing that block
group. Then we fail to write that block out and clear Uptodate on the
page, and then we start spewing these errors.
Normally we're protected by the tree lock to a certain degree here. If
we read a block we have that block read locked, and we block the writer
from locking the block before we submit it for the write. However this
isn't necessarily fool proof because the read could happen before we do
the submit_bio and after we locked and unlocked the extent buffer.
Also in this particular case we have path->skip_locking set, so that
won't save us here. We'll simply get a block that was valid when we
read it, but became invalid while we were using it.
What we really want is to catch the case where we've "read" a block but
it's not marked Uptodate. On read we ClearPageError(), so if we're
!Uptodate and !Error we know we didn't do the right thing for reading
the page.
Fix this by checking !Uptodate && !Error, this way we will not complain
if our buffer gets invalidated while we're using it, and we'll maintain
the spirit of the check which is to make sure we have a fully in-cache
block while we're messing with it.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-18 15:17:39 +00:00
|
|
|
/*
|
|
|
|
* If we are using the commit root we could potentially clear a page
|
|
|
|
* Uptodate while we're using the extent buffer that we've previously
|
|
|
|
* looked up. We don't want to complain in this case, as the page was
|
|
|
|
* valid before, we just didn't write it out. Instead we want to catch
|
|
|
|
* the case where we didn't actually read the block properly, which
|
2023-05-03 15:24:37 +00:00
|
|
|
* would have !PageUptodate and !EXTENT_BUFFER_WRITE_ERR.
|
btrfs: do not WARN_ON() if we have PageError set
Whenever we do any extent buffer operations we call
assert_eb_page_uptodate() to complain loudly if we're operating on an
non-uptodate page. Our overnight tests caught this warning earlier this
week
WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50
CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
Workqueue: btrfs-cache btrfs_work_helper
RIP: 0010:assert_eb_page_uptodate+0x3f/0x50
RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246
RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000
RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0
RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000
R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1
R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000
FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0
Call Trace:
extent_buffer_test_bit+0x3f/0x70
free_space_test_bit+0xa6/0xc0
load_free_space_tree+0x1f6/0x470
caching_thread+0x454/0x630
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? rcu_read_lock_sched_held+0x12/0x60
? lock_release+0x1f0/0x2d0
btrfs_work_helper+0xf2/0x3e0
? lock_release+0x1f0/0x2d0
? finish_task_switch.isra.0+0xf9/0x3a0
process_one_work+0x26d/0x580
? process_one_work+0x580/0x580
worker_thread+0x55/0x3b0
? process_one_work+0x580/0x580
kthread+0xf0/0x120
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer
uptodate when we fail to write it"), however all that fix did was keep
us from finding extent buffers after a failed writeout. It didn't keep
us from continuing to use a buffer that we already had found.
In this case we're searching the commit root to cache the block group,
so we can start committing the transaction and switch the commit root
and then start writing. After the switch we can look up an extent
buffer that hasn't been written yet and start processing that block
group. Then we fail to write that block out and clear Uptodate on the
page, and then we start spewing these errors.
Normally we're protected by the tree lock to a certain degree here. If
we read a block we have that block read locked, and we block the writer
from locking the block before we submit it for the write. However this
isn't necessarily fool proof because the read could happen before we do
the submit_bio and after we locked and unlocked the extent buffer.
Also in this particular case we have path->skip_locking set, so that
won't save us here. We'll simply get a block that was valid when we
read it, but became invalid while we were using it.
What we really want is to catch the case where we've "read" a block but
it's not marked Uptodate. On read we ClearPageError(), so if we're
!Uptodate and !Error we know we didn't do the right thing for reading
the page.
Fix this by checking !Uptodate && !Error, this way we will not complain
if our buffer gets invalidated while we're using it, and we'll maintain
the spirit of the check which is to make sure we have a fully in-cache
block while we're messing with it.
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-18 15:17:39 +00:00
|
|
|
*/
|
2023-05-03 15:24:37 +00:00
|
|
|
if (test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
|
|
|
|
return;
|
2021-03-25 07:14:42 +00:00
|
|
|
|
2023-05-03 15:24:37 +00:00
|
|
|
if (fs_info->nodesize < PAGE_SIZE) {
|
2024-05-20 17:40:26 +00:00
|
|
|
folio = eb->folios[0];
|
2023-12-12 02:28:37 +00:00
|
|
|
ASSERT(i == 0);
|
|
|
|
if (WARN_ON(!btrfs_subpage_test_uptodate(fs_info, folio,
|
2023-05-26 12:30:53 +00:00
|
|
|
eb->start, eb->len)))
|
2023-12-12 02:28:37 +00:00
|
|
|
btrfs_subpage_dump_bitmap(fs_info, folio, eb->start, eb->len);
|
2021-03-25 07:14:42 +00:00
|
|
|
} else {
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
WARN_ON(!folio_test_uptodate(folio));
|
2021-03-25 07:14:42 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
static void __write_extent_buffer(const struct extent_buffer *eb,
|
|
|
|
const void *srcv, unsigned long start,
|
|
|
|
unsigned long len, bool use_memmove)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-01-05 05:35:55 +00:00
|
|
|
const int unit_size = eb->folio_size;
|
2008-01-24 21:13:08 +00:00
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
char *kaddr;
|
2024-05-20 18:49:18 +00:00
|
|
|
const char *src = (const char *)srcv;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long i = get_eb_folio_index(eb, start);
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
/* For unmapped (dummy) ebs, no need to check their uptodate status. */
|
|
|
|
const bool check_uptodate = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(eb, start, len))
|
|
|
|
return;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-11-16 05:19:06 +00:00
|
|
|
if (eb->addr) {
|
|
|
|
if (use_memmove)
|
|
|
|
memmove(eb->addr + start, srcv, len);
|
|
|
|
else
|
|
|
|
memcpy(eb->addr + start, srcv, len);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
offset = get_eb_offset_in_folio(eb, start);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
if (check_uptodate)
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
assert_eb_folio_uptodate(eb, i);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
cur = min(len, unit_size - offset);
|
|
|
|
kaddr = folio_address(eb->folios[i]);
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
if (use_memmove)
|
|
|
|
memmove(kaddr + offset, src, cur);
|
|
|
|
else
|
|
|
|
memcpy(kaddr + offset, src, cur);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
src += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
void write_extent_buffer(const struct extent_buffer *eb, const void *srcv,
|
|
|
|
unsigned long start, unsigned long len)
|
|
|
|
{
|
|
|
|
return __write_extent_buffer(eb, srcv, start, len, false);
|
|
|
|
}
|
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
static void memset_extent_buffer(const struct extent_buffer *eb, int c,
|
|
|
|
unsigned long start, unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-01-05 05:35:55 +00:00
|
|
|
const int unit_size = eb->folio_size;
|
2023-07-15 11:08:29 +00:00
|
|
|
unsigned long cur = start;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-11-16 05:19:06 +00:00
|
|
|
if (eb->addr) {
|
|
|
|
memset(eb->addr + start, c, len);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
while (cur < start + len) {
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long index = get_eb_folio_index(eb, cur);
|
|
|
|
unsigned int offset = get_eb_offset_in_folio(eb, cur);
|
|
|
|
unsigned int cur_len = min(start + len - cur, unit_size - offset);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
assert_eb_folio_uptodate(eb, index);
|
|
|
|
memset(folio_address(eb->folios[index]) + offset, c, cur_len);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
cur += cur_len;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
void memzero_extent_buffer(const struct extent_buffer *eb, unsigned long start,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
|
|
|
if (check_eb_range(eb, start, len))
|
|
|
|
return;
|
|
|
|
return memset_extent_buffer(eb, 0, start, len);
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void copy_extent_buffer_full(const struct extent_buffer *dst,
|
|
|
|
const struct extent_buffer *src)
|
2016-11-08 17:30:31 +00:00
|
|
|
{
|
2024-01-05 05:35:55 +00:00
|
|
|
const int unit_size = src->folio_size;
|
2023-07-15 11:08:31 +00:00
|
|
|
unsigned long cur = 0;
|
2016-11-08 17:30:31 +00:00
|
|
|
|
|
|
|
ASSERT(dst->len == src->len);
|
|
|
|
|
2023-07-15 11:08:31 +00:00
|
|
|
while (cur < src->len) {
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long index = get_eb_folio_index(src, cur);
|
|
|
|
unsigned long offset = get_eb_offset_in_folio(src, cur);
|
|
|
|
unsigned long cur_len = min(src->len, unit_size - offset);
|
2023-12-06 23:09:27 +00:00
|
|
|
void *addr = folio_address(src->folios[index]) + offset;
|
2023-07-15 11:08:31 +00:00
|
|
|
|
|
|
|
write_extent_buffer(dst, addr, cur, cur_len);
|
2020-12-02 06:48:04 +00:00
|
|
|
|
2023-07-15 11:08:31 +00:00
|
|
|
cur += cur_len;
|
2020-12-02 06:48:04 +00:00
|
|
|
}
|
2016-11-08 17:30:31 +00:00
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void copy_extent_buffer(const struct extent_buffer *dst,
|
|
|
|
const struct extent_buffer *src,
|
2008-01-24 21:13:08 +00:00
|
|
|
unsigned long dst_offset, unsigned long src_offset,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
2024-01-05 05:35:55 +00:00
|
|
|
const int unit_size = dst->folio_size;
|
2008-01-24 21:13:08 +00:00
|
|
|
u64 dst_len = dst->len;
|
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
char *kaddr;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long i = get_eb_folio_index(dst, dst_offset);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(dst, dst_offset, len) ||
|
|
|
|
check_eb_range(src, src_offset, len))
|
|
|
|
return;
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
WARN_ON(src->len != dst_len);
|
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
offset = get_eb_offset_in_folio(dst, dst_offset);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
assert_eb_folio_uptodate(dst, i);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
cur = min(len, (unsigned long)(unit_size - offset));
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
kaddr = folio_address(dst->folios[i]);
|
2008-01-24 21:13:08 +00:00
|
|
|
read_extent_buffer(src, kaddr + offset, src_offset, cur);
|
|
|
|
|
|
|
|
src_offset += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-09-30 03:50:30 +00:00
|
|
|
/*
|
2023-12-12 05:24:09 +00:00
|
|
|
* Calculate the folio and offset of the byte containing the given bit number.
|
2023-09-07 23:09:25 +00:00
|
|
|
*
|
|
|
|
* @eb: the extent buffer
|
|
|
|
* @start: offset of the bitmap item in the extent buffer
|
|
|
|
* @nr: bit number
|
2023-12-12 05:24:09 +00:00
|
|
|
* @folio_index: return index of the folio in the extent buffer that contains
|
2023-09-07 23:09:25 +00:00
|
|
|
* the given bit number
|
2023-12-12 05:24:09 +00:00
|
|
|
* @folio_offset: return offset into the folio given by folio_index
|
2015-09-30 03:50:30 +00:00
|
|
|
*
|
|
|
|
* This helper hides the ugliness of finding the byte in an extent buffer which
|
|
|
|
* contains a given bit.
|
|
|
|
*/
|
2020-04-29 01:04:10 +00:00
|
|
|
static inline void eb_bitmap_offset(const struct extent_buffer *eb,
|
2015-09-30 03:50:30 +00:00
|
|
|
unsigned long start, unsigned long nr,
|
2023-12-12 05:24:09 +00:00
|
|
|
unsigned long *folio_index,
|
|
|
|
size_t *folio_offset)
|
2015-09-30 03:50:30 +00:00
|
|
|
{
|
|
|
|
size_t byte_offset = BIT_BYTE(nr);
|
|
|
|
size_t offset;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The byte we want is the offset of the extent buffer + the offset of
|
|
|
|
* the bitmap item in the extent buffer + the offset of the byte in the
|
|
|
|
* bitmap item.
|
|
|
|
*/
|
2024-01-05 05:35:55 +00:00
|
|
|
offset = start + offset_in_eb_folio(eb, eb->start) + byte_offset;
|
2015-09-30 03:50:30 +00:00
|
|
|
|
2024-01-05 05:35:55 +00:00
|
|
|
*folio_index = offset >> eb->folio_shift;
|
|
|
|
*folio_offset = offset_in_eb_folio(eb, offset);
|
2015-09-30 03:50:30 +00:00
|
|
|
}
|
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
|
|
|
* Determine whether a bit in a bitmap item is set.
|
|
|
|
*
|
|
|
|
* @eb: the extent buffer
|
|
|
|
* @start: offset of the bitmap item in the extent buffer
|
|
|
|
* @nr: bit number to test
|
2015-09-30 03:50:30 +00:00
|
|
|
*/
|
2020-04-29 01:04:10 +00:00
|
|
|
int extent_buffer_test_bit(const struct extent_buffer *eb, unsigned long start,
|
2015-09-30 03:50:30 +00:00
|
|
|
unsigned long nr)
|
|
|
|
{
|
|
|
|
unsigned long i;
|
|
|
|
size_t offset;
|
2023-12-12 05:24:09 +00:00
|
|
|
u8 *kaddr;
|
2015-09-30 03:50:30 +00:00
|
|
|
|
|
|
|
eb_bitmap_offset(eb, start, nr, &i, &offset);
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
assert_eb_folio_uptodate(eb, i);
|
2023-12-12 05:24:09 +00:00
|
|
|
kaddr = folio_address(eb->folios[i]);
|
2015-09-30 03:50:30 +00:00
|
|
|
return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
|
|
|
|
}
|
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
static u8 *extent_buffer_get_byte(const struct extent_buffer *eb, unsigned long bytenr)
|
|
|
|
{
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long index = get_eb_folio_index(eb, bytenr);
|
2023-07-15 11:08:29 +00:00
|
|
|
|
|
|
|
if (check_eb_range(eb, bytenr, 1))
|
|
|
|
return NULL;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
return folio_address(eb->folios[index]) + get_eb_offset_in_folio(eb, bytenr);
|
2023-07-15 11:08:29 +00:00
|
|
|
}
|
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
|
|
|
* Set an area of a bitmap to 1.
|
|
|
|
*
|
|
|
|
* @eb: the extent buffer
|
|
|
|
* @start: offset of the bitmap item in the extent buffer
|
|
|
|
* @pos: bit number of the first bit
|
|
|
|
* @len: number of bits to set
|
2015-09-30 03:50:30 +00:00
|
|
|
*/
|
2020-04-29 01:04:10 +00:00
|
|
|
void extent_buffer_bitmap_set(const struct extent_buffer *eb, unsigned long start,
|
2015-09-30 03:50:30 +00:00
|
|
|
unsigned long pos, unsigned long len)
|
|
|
|
{
|
2023-07-15 11:08:29 +00:00
|
|
|
unsigned int first_byte = start + BIT_BYTE(pos);
|
|
|
|
unsigned int last_byte = start + BIT_BYTE(pos + len - 1);
|
|
|
|
const bool same_byte = (first_byte == last_byte);
|
|
|
|
u8 mask = BITMAP_FIRST_BYTE_MASK(pos);
|
2016-09-23 00:24:20 +00:00
|
|
|
u8 *kaddr;
|
2015-09-30 03:50:30 +00:00
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
if (same_byte)
|
|
|
|
mask &= BITMAP_LAST_BYTE_MASK(pos + len);
|
2015-09-30 03:50:30 +00:00
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
/* Handle the first byte. */
|
|
|
|
kaddr = extent_buffer_get_byte(eb, first_byte);
|
|
|
|
*kaddr |= mask;
|
|
|
|
if (same_byte)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Handle the byte aligned part. */
|
|
|
|
ASSERT(first_byte + 1 <= last_byte);
|
|
|
|
memset_extent_buffer(eb, 0xff, first_byte + 1, last_byte - first_byte - 1);
|
|
|
|
|
|
|
|
/* Handle the last byte. */
|
|
|
|
kaddr = extent_buffer_get_byte(eb, last_byte);
|
|
|
|
*kaddr |= BITMAP_LAST_BYTE_MASK(pos + len);
|
2015-09-30 03:50:30 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
|
2022-10-27 12:21:42 +00:00
|
|
|
/*
|
|
|
|
* Clear an area of a bitmap.
|
|
|
|
*
|
|
|
|
* @eb: the extent buffer
|
|
|
|
* @start: offset of the bitmap item in the extent buffer
|
|
|
|
* @pos: bit number of the first bit
|
|
|
|
* @len: number of bits to clear
|
2015-09-30 03:50:30 +00:00
|
|
|
*/
|
2020-04-29 01:04:10 +00:00
|
|
|
void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
|
|
|
|
unsigned long start, unsigned long pos,
|
|
|
|
unsigned long len)
|
2015-09-30 03:50:30 +00:00
|
|
|
{
|
2023-07-15 11:08:29 +00:00
|
|
|
unsigned int first_byte = start + BIT_BYTE(pos);
|
|
|
|
unsigned int last_byte = start + BIT_BYTE(pos + len - 1);
|
|
|
|
const bool same_byte = (first_byte == last_byte);
|
|
|
|
u8 mask = BITMAP_FIRST_BYTE_MASK(pos);
|
2016-09-23 00:24:20 +00:00
|
|
|
u8 *kaddr;
|
2015-09-30 03:50:30 +00:00
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
if (same_byte)
|
|
|
|
mask &= BITMAP_LAST_BYTE_MASK(pos + len);
|
2015-09-30 03:50:30 +00:00
|
|
|
|
2023-07-15 11:08:29 +00:00
|
|
|
/* Handle the first byte. */
|
|
|
|
kaddr = extent_buffer_get_byte(eb, first_byte);
|
|
|
|
*kaddr &= ~mask;
|
|
|
|
if (same_byte)
|
|
|
|
return;
|
|
|
|
|
|
|
|
/* Handle the byte aligned part. */
|
|
|
|
ASSERT(first_byte + 1 <= last_byte);
|
|
|
|
memset_extent_buffer(eb, 0, first_byte + 1, last_byte - first_byte - 1);
|
|
|
|
|
|
|
|
/* Handle the last byte. */
|
|
|
|
kaddr = extent_buffer_get_byte(eb, last_byte);
|
|
|
|
*kaddr &= ~BITMAP_LAST_BYTE_MASK(pos + len);
|
2015-09-30 03:50:30 +00:00
|
|
|
}
|
|
|
|
|
2011-04-11 21:52:52 +00:00
|
|
|
static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len)
|
|
|
|
{
|
|
|
|
unsigned long distance = (src > dst) ? src - dst : dst - src;
|
|
|
|
return distance < len;
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void memcpy_extent_buffer(const struct extent_buffer *dst,
|
|
|
|
unsigned long dst_offset, unsigned long src_offset,
|
|
|
|
unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2024-01-05 05:35:55 +00:00
|
|
|
const int unit_size = dst->folio_size;
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
unsigned long cur_off = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(dst, dst_offset, len) ||
|
|
|
|
check_eb_range(dst, src_offset, len))
|
|
|
|
return;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2023-11-16 05:19:06 +00:00
|
|
|
if (dst->addr) {
|
|
|
|
const bool use_memmove = areas_overlap(src_offset, dst_offset, len);
|
|
|
|
|
|
|
|
if (use_memmove)
|
|
|
|
memmove(dst->addr + dst_offset, dst->addr + src_offset, len);
|
|
|
|
else
|
|
|
|
memcpy(dst->addr + dst_offset, dst->addr + src_offset, len);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
while (cur_off < len) {
|
|
|
|
unsigned long cur_src = cur_off + src_offset;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unsigned long folio_index = get_eb_folio_index(dst, cur_src);
|
|
|
|
unsigned long folio_off = get_eb_offset_in_folio(dst, cur_src);
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
unsigned long cur_len = min(src_offset + len - cur_src,
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
unit_size - folio_off);
|
|
|
|
void *src_addr = folio_address(dst->folios[folio_index]) + folio_off;
|
btrfs: refactor main loop in memcpy_extent_buffer()
[BACKGROUND]
Currently memcpy_extent_buffer() does a loop where it would stop at
any page boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
There is a hidden pitfall of the naming memcpy_extent_buffer(), unlike
regular memcpy(), this function can handle overlapping ranges.
So here we extract write_extent_buffer() into a new internal helper,
__write_extent_buffer(), and add a new parameter @use_memmove, to
indicate whether we should use memmove() or regular memcpy().
Now we can go __write_extent_buffer() to handle writing into the dst
range, with proper overlapping detection.
This has a tiny change to the chance of calling memmove().
As the split only happens at the source range page boundaries, the
memcpy/memmove() range would be slightly larger than the old code,
thus slightly increase the chance we call memmove() other than memcopy().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:33 +00:00
|
|
|
const bool use_memmove = areas_overlap(src_offset + cur_off,
|
|
|
|
dst_offset + cur_off, cur_len);
|
|
|
|
|
|
|
|
__write_extent_buffer(dst, src_addr, dst_offset + cur_off, cur_len,
|
|
|
|
use_memmove);
|
|
|
|
cur_off += cur_len;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void memmove_extent_buffer(const struct extent_buffer *dst,
|
|
|
|
unsigned long dst_offset, unsigned long src_offset,
|
|
|
|
unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
unsigned long dst_end = dst_offset + len - 1;
|
|
|
|
unsigned long src_end = src_offset + len - 1;
|
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer() even we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause memory access error
if start + len overflows.
To address above problems, this patch creates a new common function to
check such access, check_eb_range().
- Add overflow check
This function checks start, start + len against eb->len and overflow
check.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured.
And also do btrfs_warn() message for non-debug build.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(dst, dst_offset, len) ||
|
|
|
|
check_eb_range(dst, src_offset, len))
|
|
|
|
return;
|
btrfs: refactor main loop in memmove_extent_buffer()
[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.
Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.
The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)
Src Page boundary
|///////|
|///|////|
Dst
In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.
But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.
So we have to follow the old behavior of handling both page boundaries.
And since we're the last caller of copy_pages(), we can remove it
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:34 +00:00
|
|
|
|
2010-08-06 17:21:20 +00:00
|
|
|
if (dst_offset < src_offset) {
|
2008-01-24 21:13:08 +00:00
|
|
|
memcpy_extent_buffer(dst, dst_offset, src_offset, len);
|
|
|
|
return;
|
|
|
|
}
|
btrfs: refactor main loop in memmove_extent_buffer()
[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.
Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.
The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)
Src Page boundary
|///////|
|///|////|
Dst
In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.
But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.
So we have to follow the old behavior of handling both page boundaries.
And since we're the last caller of copy_pages(), we can remove it
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:34 +00:00
|
|
|
|
2023-11-16 05:19:06 +00:00
|
|
|
if (dst->addr) {
|
|
|
|
memmove(dst->addr + dst_offset, dst->addr + src_offset, len);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
btrfs: refactor main loop in memmove_extent_buffer()
[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.
Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.
The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)
Src Page boundary
|///////|
|///|////|
Dst
In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.
But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.
So we have to follow the old behavior of handling both page boundaries.
And since we're the last caller of copy_pages(), we can remove it
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:34 +00:00
|
|
|
unsigned long src_i;
|
|
|
|
size_t cur;
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
size_t dst_off_in_folio;
|
|
|
|
size_t src_off_in_folio;
|
btrfs: refactor main loop in memmove_extent_buffer()
[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.
Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.
The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)
Src Page boundary
|///////|
|///|////|
Dst
In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.
But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.
So we have to follow the old behavior of handling both page boundaries.
And since we're the last caller of copy_pages(), we can remove it
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:34 +00:00
|
|
|
void *src_addr;
|
|
|
|
bool use_memmove;
|
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
src_i = get_eb_folio_index(dst, src_end);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
dst_off_in_folio = get_eb_offset_in_folio(dst, dst_end);
|
|
|
|
src_off_in_folio = get_eb_offset_in_folio(dst, src_end);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
cur = min_t(unsigned long, len, src_off_in_folio + 1);
|
|
|
|
cur = min(cur, dst_off_in_folio + 1);
|
btrfs: refactor main loop in memmove_extent_buffer()
[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.
Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.
The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)
Src Page boundary
|///////|
|///|////|
Dst
In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.
But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.
So we have to follow the old behavior of handling both page boundaries.
And since we're the last caller of copy_pages(), we can remove it
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:34 +00:00
|
|
|
|
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
These two functions are still using the old page based code, which is
not going to handle larger folios at all.
The migration itself is going to involve the following changes:
- PAGE_SIZE -> folio_size()
- PAGE_SHIFT -> folio_shift()
- get_eb_page_index() -> get_eb_folio_index()
- get_eb_offset_in_page() -> get_eb_offset_in_folio()
And since we're going to support larger folios, although above straight
conversion is good enough, this patch would add extra comments in the
involved functions to explain why the same single line code can now
cover 3 cases:
- folio_size == PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The common, non-subpage case with per-page folio.
- folio_size > PAGE_SIZE, sectorsize == PAGE_SIZE, nodesize >= PAGE_SIZE
The incoming larger folio, non-subpage case.
- folio_size == PAGE_SIZE, sectorsize < PAGE_SIZE, nodesize < PAGE_SIZE
The existing subpage case, we won't larger folio anyway.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-12 02:28:36 +00:00
|
|
|
src_addr = folio_address(dst->folios[src_i]) + src_off_in_folio -
|
2023-12-06 23:09:27 +00:00
|
|
|
cur + 1;
|
btrfs: refactor main loop in memmove_extent_buffer()
[BACKGROUND]
Currently memove_extent_buffer() does a loop where it strop at any page
boundary inside [dst_offset, dst_offset + len) or [src_offset,
src_offset + len).
This is mostly allowing us to do copy_pages(), but if we're going to use
folios we will need to handle multi-page (the old behavior) or single
folio (the new optimization).
The current code would be a burden for future changes.
[ENHANCEMENT]
Instead of sticking with copy_pages(), here we utilize the new
__write_extent_buffer() helper to handle the writes.
Unlike the refactoring in memcpy_extent_buffer(), we can not just rely
on the write_extent_buffer() and only handle page boundaries inside src
range.
The function write_extent_buffer() itself is still doing forward
writing, thus it cannot handle the following case: (already in the
extent buffer memory operation tests, cross page overlapping run 2)
Src Page boundary
|///////|
|///|////|
Dst
In the above case, if we just follow page boundary in the src range, we
have no need to do any split, just one __write_extent_buffer() with
use_memmove = true.
But __write_extent_buffer() would split the dst range into two,
so it first copies the beginning part of the src range into the first half
of the dst range.
After this operation, the beginning of the dst range is already updated,
causing corruption.
So we have to follow the old behavior of handling both page boundaries.
And since we're the last caller of copy_pages(), we can remove it
completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-07-15 11:08:34 +00:00
|
|
|
use_memmove = areas_overlap(src_end - cur + 1, dst_end - cur + 1,
|
|
|
|
cur);
|
|
|
|
|
|
|
|
__write_extent_buffer(dst, src_addr, dst_end - cur + 1, cur,
|
|
|
|
use_memmove);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
dst_end -= cur;
|
|
|
|
src_end -= cur;
|
|
|
|
len -= cur;
|
|
|
|
}
|
|
|
|
}
|
2008-07-22 15:18:07 +00:00
|
|
|
|
2022-07-15 11:59:31 +00:00
|
|
|
#define GANG_LOOKUP_SIZE 16
|
2021-01-26 08:33:56 +00:00
|
|
|
static struct extent_buffer *get_next_extent_buffer(
|
2024-08-28 18:28:56 +00:00
|
|
|
const struct btrfs_fs_info *fs_info, struct folio *folio, u64 bytenr)
|
2021-01-26 08:33:56 +00:00
|
|
|
{
|
2022-07-15 11:59:31 +00:00
|
|
|
struct extent_buffer *gang[GANG_LOOKUP_SIZE];
|
|
|
|
struct extent_buffer *found = NULL;
|
2024-08-28 18:28:56 +00:00
|
|
|
u64 folio_start = folio_pos(folio);
|
|
|
|
u64 cur = folio_start;
|
2021-01-26 08:33:56 +00:00
|
|
|
|
2024-08-28 18:28:56 +00:00
|
|
|
ASSERT(in_range(bytenr, folio_start, PAGE_SIZE));
|
2021-01-26 08:33:56 +00:00
|
|
|
lockdep_assert_held(&fs_info->buffer_lock);
|
|
|
|
|
2024-08-28 18:28:56 +00:00
|
|
|
while (cur < folio_start + PAGE_SIZE) {
|
2022-07-15 11:59:31 +00:00
|
|
|
int ret;
|
|
|
|
int i;
|
|
|
|
|
|
|
|
ret = radix_tree_gang_lookup(&fs_info->buffer_radix,
|
|
|
|
(void **)gang, cur >> fs_info->sectorsize_bits,
|
|
|
|
min_t(unsigned int, GANG_LOOKUP_SIZE,
|
|
|
|
PAGE_SIZE / fs_info->nodesize));
|
|
|
|
if (ret == 0)
|
|
|
|
goto out;
|
|
|
|
for (i = 0; i < ret; i++) {
|
|
|
|
/* Already beyond page end */
|
2024-08-28 18:28:56 +00:00
|
|
|
if (gang[i]->start >= folio_start + PAGE_SIZE)
|
2022-07-15 11:59:31 +00:00
|
|
|
goto out;
|
|
|
|
/* Found one */
|
|
|
|
if (gang[i]->start >= bytenr) {
|
|
|
|
found = gang[i];
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
cur = gang[ret - 1]->start + gang[ret - 1]->len;
|
2021-01-26 08:33:56 +00:00
|
|
|
}
|
2022-07-15 11:59:31 +00:00
|
|
|
out:
|
|
|
|
return found;
|
2021-01-26 08:33:56 +00:00
|
|
|
}
|
|
|
|
|
2024-08-28 18:28:57 +00:00
|
|
|
static int try_release_subpage_extent_buffer(struct folio *folio)
|
2021-01-26 08:33:56 +00:00
|
|
|
{
|
2024-08-28 18:28:57 +00:00
|
|
|
struct btrfs_fs_info *fs_info = folio_to_fs_info(folio);
|
|
|
|
u64 cur = folio_pos(folio);
|
|
|
|
const u64 end = cur + PAGE_SIZE;
|
2021-01-26 08:33:56 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
while (cur < end) {
|
|
|
|
struct extent_buffer *eb = NULL;
|
|
|
|
|
|
|
|
/*
|
2023-11-17 03:54:14 +00:00
|
|
|
* Unlike try_release_extent_buffer() which uses folio private
|
2021-01-26 08:33:56 +00:00
|
|
|
* to grab buffer, for subpage case we rely on radix tree, thus
|
|
|
|
* we need to ensure radix tree consistency.
|
|
|
|
*
|
|
|
|
* We also want an atomic snapshot of the radix tree, thus go
|
|
|
|
* with spinlock rather than RCU.
|
|
|
|
*/
|
|
|
|
spin_lock(&fs_info->buffer_lock);
|
2024-08-28 18:28:57 +00:00
|
|
|
eb = get_next_extent_buffer(fs_info, folio, cur);
|
2021-01-26 08:33:56 +00:00
|
|
|
if (!eb) {
|
|
|
|
/* No more eb in the page range after or at cur */
|
|
|
|
spin_unlock(&fs_info->buffer_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
cur = eb->start + eb->len;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The same as try_release_extent_buffer(), to ensure the eb
|
|
|
|
* won't disappear out from under us.
|
|
|
|
*/
|
|
|
|
spin_lock(&eb->refs_lock);
|
|
|
|
if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
|
|
|
|
spin_unlock(&eb->refs_lock);
|
|
|
|
spin_unlock(&fs_info->buffer_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
spin_unlock(&fs_info->buffer_lock);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If tree ref isn't set then we know the ref on this eb is a
|
|
|
|
* real ref, so just return, this eb will likely be freed soon
|
|
|
|
* anyway.
|
|
|
|
*/
|
|
|
|
if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
|
|
|
|
spin_unlock(&eb->refs_lock);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Here we don't care about the return value, we will always
|
2023-11-17 03:54:14 +00:00
|
|
|
* check the folio private at the end. And
|
2021-01-26 08:33:56 +00:00
|
|
|
* release_extent_buffer() will release the refs_lock.
|
|
|
|
*/
|
|
|
|
release_extent_buffer(eb);
|
|
|
|
}
|
|
|
|
/*
|
2023-11-17 03:54:14 +00:00
|
|
|
* Finally to check if we have cleared folio private, as if we have
|
|
|
|
* released all ebs in the page, the folio private should be cleared now.
|
2021-01-26 08:33:56 +00:00
|
|
|
*/
|
2024-08-28 18:28:57 +00:00
|
|
|
spin_lock(&folio->mapping->i_private_lock);
|
|
|
|
if (!folio_test_private(folio))
|
2021-01-26 08:33:56 +00:00
|
|
|
ret = 1;
|
|
|
|
else
|
|
|
|
ret = 0;
|
2024-08-28 18:28:57 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
2021-01-26 08:33:56 +00:00
|
|
|
return ret;
|
|
|
|
|
|
|
|
}
|
|
|
|
|
2024-08-28 18:28:58 +00:00
|
|
|
int try_release_extent_buffer(struct folio *folio)
|
2010-10-27 00:57:29 +00:00
|
|
|
{
|
2008-07-22 15:18:07 +00:00
|
|
|
struct extent_buffer *eb;
|
|
|
|
|
2024-08-28 18:28:58 +00:00
|
|
|
if (folio_to_fs_info(folio)->nodesize < PAGE_SIZE)
|
|
|
|
return try_release_subpage_extent_buffer(folio);
|
2021-01-26 08:33:56 +00:00
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
/*
|
2023-11-17 03:54:14 +00:00
|
|
|
* We need to make sure nobody is changing folio private, as we rely on
|
|
|
|
* folio private as the pointer to extent buffer.
|
2012-03-09 21:01:49 +00:00
|
|
|
*/
|
2024-08-28 18:28:58 +00:00
|
|
|
spin_lock(&folio->mapping->i_private_lock);
|
2023-11-17 03:54:14 +00:00
|
|
|
if (!folio_test_private(folio)) {
|
2024-08-28 18:28:58 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
2012-03-07 21:20:05 +00:00
|
|
|
return 1;
|
2010-11-22 03:27:44 +00:00
|
|
|
}
|
2008-07-22 15:18:07 +00:00
|
|
|
|
2023-11-17 03:54:14 +00:00
|
|
|
eb = folio_get_private(folio);
|
2012-03-09 21:01:49 +00:00
|
|
|
BUG_ON(!eb);
|
2010-10-27 00:57:29 +00:00
|
|
|
|
|
|
|
/*
|
2012-03-09 21:01:49 +00:00
|
|
|
* This is a little awful but should be ok, we need to make sure that
|
|
|
|
* the eb doesn't disappear out from under us while we're looking at
|
|
|
|
* this page.
|
2010-10-27 00:57:29 +00:00
|
|
|
*/
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_lock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_unlock(&eb->refs_lock);
|
2024-08-28 18:28:58 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
2012-03-09 21:01:49 +00:00
|
|
|
return 0;
|
2009-03-13 15:00:37 +00:00
|
|
|
}
|
2024-08-28 18:28:58 +00:00
|
|
|
spin_unlock(&folio->mapping->i_private_lock);
|
2010-10-27 00:57:29 +00:00
|
|
|
|
2010-10-27 00:57:29 +00:00
|
|
|
/*
|
2012-03-09 21:01:49 +00:00
|
|
|
* If tree ref isn't set then we know the ref on this eb is a real ref,
|
|
|
|
* so just return, this page will likely be freed soon anyway.
|
2010-10-27 00:57:29 +00:00
|
|
|
*/
|
2012-03-09 21:01:49 +00:00
|
|
|
if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
|
|
|
|
spin_unlock(&eb->refs_lock);
|
|
|
|
return 0;
|
2009-03-13 15:00:37 +00:00
|
|
|
}
|
2010-10-27 00:57:29 +00:00
|
|
|
|
2013-04-26 14:56:29 +00:00
|
|
|
return release_extent_buffer(eb);
|
2008-07-22 15:18:07 +00:00
|
|
|
}
|
2020-11-05 15:45:09 +00:00
|
|
|
|
|
|
|
/*
|
2023-09-07 23:09:25 +00:00
|
|
|
* Attempt to readahead a child block.
|
|
|
|
*
|
2020-11-05 15:45:09 +00:00
|
|
|
* @fs_info: the fs_info
|
|
|
|
* @bytenr: bytenr to read
|
2020-11-05 15:45:20 +00:00
|
|
|
* @owner_root: objectid of the root that owns this eb
|
2020-11-05 15:45:09 +00:00
|
|
|
* @gen: generation for the uptodate check, can be 0
|
2020-11-05 15:45:20 +00:00
|
|
|
* @level: level for the eb
|
2020-11-05 15:45:09 +00:00
|
|
|
*
|
|
|
|
* Attempt to readahead a tree block at @bytenr. If @gen is 0 then we do a
|
|
|
|
* normal uptodate check of the eb, without checking the generation. If we have
|
|
|
|
* to read the block we will not block on anything.
|
|
|
|
*/
|
|
|
|
void btrfs_readahead_tree_block(struct btrfs_fs_info *fs_info,
|
2020-11-05 15:45:20 +00:00
|
|
|
u64 bytenr, u64 owner_root, u64 gen, int level)
|
2020-11-05 15:45:09 +00:00
|
|
|
{
|
btrfs: move tree block parentness check into validate_extent_buffer()
[BACKGROUND]
Although both btrfs metadata and data has their read time verification
done at endio time (btrfs_validate_metadata_buffer() and
btrfs_verify_data_csum()), metadata has extra verification, mostly
parentness check including first key/transid/owner_root/level, done at
read_tree_block() and btrfs_read_extent_buffer().
On the other hand, all the data verification is done at endio context.
[ENHANCEMENT]
This patch will make a new union in btrfs_bio, taking the space of the
old data checksums, thus it will not increase the memory usage.
With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
pass the check parameter into read_extent_buffer_pages(), and before
submitting the bio, we can copy the check structure into btrfs_bio.
And finally at endio time, we can grab btrfs_bio::parent_check and pass
it to validate_extent_buffer(), to move the remaining checks into it.
This brings the following benefits:
- Much simpler btrfs_read_extent_buffer()
Now it only needs to iterate through all mirrors.
- Simpler read-time transid check
Previously we go verify_parent_transid() after reading out the extent
buffer.
Now the transid check is done inside the endio function, no other
code can modify the content.
Thus no need to use the extent lock anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-14 05:32:51 +00:00
|
|
|
struct btrfs_tree_parent_check check = {
|
|
|
|
.has_first_key = 0,
|
|
|
|
.level = level,
|
|
|
|
.transid = gen
|
|
|
|
};
|
2020-11-05 15:45:09 +00:00
|
|
|
struct extent_buffer *eb;
|
|
|
|
int ret;
|
|
|
|
|
2020-11-05 15:45:20 +00:00
|
|
|
eb = btrfs_find_create_tree_block(fs_info, bytenr, owner_root, level);
|
2020-11-05 15:45:09 +00:00
|
|
|
if (IS_ERR(eb))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (btrfs_buffer_uptodate(eb, gen, 1)) {
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
btrfs: move tree block parentness check into validate_extent_buffer()
[BACKGROUND]
Although both btrfs metadata and data has their read time verification
done at endio time (btrfs_validate_metadata_buffer() and
btrfs_verify_data_csum()), metadata has extra verification, mostly
parentness check including first key/transid/owner_root/level, done at
read_tree_block() and btrfs_read_extent_buffer().
On the other hand, all the data verification is done at endio context.
[ENHANCEMENT]
This patch will make a new union in btrfs_bio, taking the space of the
old data checksums, thus it will not increase the memory usage.
With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
pass the check parameter into read_extent_buffer_pages(), and before
submitting the bio, we can copy the check structure into btrfs_bio.
And finally at endio time, we can grab btrfs_bio::parent_check and pass
it to validate_extent_buffer(), to move the remaining checks into it.
This brings the following benefits:
- Much simpler btrfs_read_extent_buffer()
Now it only needs to iterate through all mirrors.
- Simpler read-time transid check
Previously we go verify_parent_transid() after reading out the extent
buffer.
Now the transid check is done inside the endio function, no other
code can modify the content.
Thus no need to use the extent lock anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-14 05:32:51 +00:00
|
|
|
ret = read_extent_buffer_pages(eb, WAIT_NONE, 0, &check);
|
2020-11-05 15:45:09 +00:00
|
|
|
if (ret < 0)
|
|
|
|
free_extent_buffer_stale(eb);
|
|
|
|
else
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2023-09-07 23:09:25 +00:00
|
|
|
* Readahead a node's child block.
|
|
|
|
*
|
2020-11-05 15:45:09 +00:00
|
|
|
* @node: parent node we're reading from
|
|
|
|
* @slot: slot in the parent node for the child we want to read
|
|
|
|
*
|
|
|
|
* A helper for btrfs_readahead_tree_block, we simply read the bytenr pointed at
|
|
|
|
* the slot in the node provided.
|
|
|
|
*/
|
|
|
|
void btrfs_readahead_node_child(struct extent_buffer *node, int slot)
|
|
|
|
{
|
|
|
|
btrfs_readahead_tree_block(node->fs_info,
|
|
|
|
btrfs_node_blockptr(node, slot),
|
2020-11-05 15:45:20 +00:00
|
|
|
btrfs_header_owner(node),
|
|
|
|
btrfs_node_ptr_generation(node, slot),
|
|
|
|
btrfs_header_level(node) - 1);
|
2020-11-05 15:45:09 +00:00
|
|
|
}
|