License cleanup: add SPDX GPL-2.0 license identifier to files with no license

Many source files in the tree are missing licensing information, which
makes it harder for compliance tools to determine the correct license.

By default all files without license information are under the default
license of the kernel, which is GPL version 2.

Update the files which contain no license information with the 'GPL-2.0'
SPDX license identifier. The SPDX identifier is a legally binding
shorthand, which can be used instead of the full boilerplate text.

This patch is based on work done by Thomas Gleixner, Kate Stewart and
Philippe Ombredanne.

How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset of
the use cases:
 - file had no licensing information in it,
 - file was a */uapi/* one with no licensing information in it,
 - file was a */uapi/* one with existing licensing information.

Further patches will be generated in subsequent months to fix up cases
where non-standard license headers were used, and references to license
had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied to
a file was done in a spreadsheet of side by side results of the
output of two independent scanners (ScanCode & Windriver) producing SPDX
tag:value files created by Philippe Ombredanne. Philippe prepared the
base worksheet, and did an initial spot review of a few thousand files.

The 4.13 kernel was the starting point of the analysis with 60,537 files
assessed. Kate Stewart did a file by file comparison of the scanner
results in the spreadsheet to determine which SPDX license identifier(s)
to be applied to the file. She confirmed any determination that was not
immediately clear with lawyers working with the Linux Foundation.

Criteria used to select files for SPDX license identifier tagging were:
 - Files considered eligible had to be source code files.
 - Make and config files were included as candidates if they contained >5
   lines of source
 - File already had some variant of a license header in it (even if <5
   lines).

All documentation files were explicitly excluded.

The following heuristics were used to determine which SPDX license
identifiers to apply.

 - when both scanners couldn't find any license traces, file was
   considered to have no license information in it, and the top level
   COPYING file license applied.

   For non */uapi/* files that summary was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0                                              11139

   and resulted in the first patch in this series.

   If that file was a */uapi/* path one, it was "GPL-2.0 WITH
   Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was:

   SPDX license identifier                            # files
   ---------------------------------------------------|-------
   GPL-2.0 WITH Linux-syscall-note                        930

   and resulted in the second patch in this series.

 - if a file had some form of licensing information in it, and was one
   of the */uapi/* ones, it was denoted with the Linux-syscall-note if
   any GPL family license was found in the file or had no licensing in
   it (per prior point). Results summary:

   SPDX license identifier                            # files
   ---------------------------------------------------|------
   GPL-2.0 WITH Linux-syscall-note                       270
   GPL-2.0+ WITH Linux-syscall-note                      169
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
   ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
   LGPL-2.1+ WITH Linux-syscall-note                      15
   GPL-1.0+ WITH Linux-syscall-note                       14
   ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
   LGPL-2.0+ WITH Linux-syscall-note                       4
   LGPL-2.1 WITH Linux-syscall-note                        3
   ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
   ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1

   and that resulted in the third patch in this series.

 - when the two scanners agreed on the detected license(s), that became
   the concluded license(s).

 - when there was disagreement between the two scanners (one detected a
   license but the other didn't, or they both detected different
   licenses) a manual inspection of the file occurred.

 - In most cases a manual inspection of the information in the file
   resulted in a clear resolution of the license that should apply (and
   which scanner probably needed to revisit its heuristics).

 - When it was not immediately clear, the license identifier was
   confirmed with lawyers working with the Linux Foundation.

 - If there was any question as to the appropriate license identifier,
   the file was flagged for further research and to be revisited later
   in time.

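The triage rules listed above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names `triage`, `default_identifier` and the `verdict` values are hypothetical and do not come from the scanners or from the script used for the patches, and a NULL or empty string stands in for "scanner found no license trace".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical outcomes of comparing two scanner results for one file. */
enum verdict {
	VERDICT_DEFAULT,	/* neither scanner found anything: COPYING applies */
	VERDICT_AGREED,		/* both scanners report the same identifier */
	VERDICT_MANUAL		/* disagreement: needs manual inspection */
};

/* A NULL or empty string means "no license trace detected". */
static bool none(const char *id)
{
	return id == NULL || id[0] == '\0';
}

enum verdict triage(const char *scanner_a, const char *scanner_b)
{
	if (none(scanner_a) && none(scanner_b))
		return VERDICT_DEFAULT;
	if (!none(scanner_a) && !none(scanner_b) &&
	    strcmp(scanner_a, scanner_b) == 0)
		return VERDICT_AGREED;
	return VERDICT_MANUAL;
}

/* Identifier applied when no license information was found in the file. */
const char *default_identifier(const char *path)
{
	assert(path != NULL);
	return strstr(path, "/uapi/") ?
		"GPL-2.0 WITH Linux-syscall-note" : "GPL-2.0";
}
```

The sketch leaves out the lawyer confirmation and the "flag for later" steps, which were human decisions rather than mechanical rules.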
In total, over 70 hours of logged manual review was done on the
spreadsheet to determine the SPDX license identifiers to apply to the
source files by Kate, Philippe, Thomas and, in some cases, confirmation
by lawyers working with the Linux Foundation.

Kate also obtained a third independent scan of the 4.13 code base from
FOSSology, and compared selected files where the other two scanners
disagreed against that SPDX file, to see if there were new insights. The
Windriver scanner is based on an older version of FOSSology in part, so
they are related.

Thomas did random spot checks in about 500 files from the spreadsheets
for the uapi headers and agreed with the SPDX license identifier in the
files he inspected. For the non-uapi files Thomas did random spot checks
in about 15000 files.

In the initial set of patches against 4.14-rc6, 3 files were found to have
copy/paste license identifier errors, and have been fixed to reflect the
correct identifier.

Additionally Philippe spent 10 hours this week doing a detailed manual
inspection and review of the 12,461 patched files from the initial patch
version early this week with:
 - a full scancode scan run, collecting the matched texts, detected
   license ids and scores
 - reviewing anything where there was a license detected (about 500+
   files) to ensure that the applied SPDX license was correct
 - reviewing anything where there was no detection but the patch license
   was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
   SPDX license was correct

This produced a worksheet with 20 files needing minor correction. This
worksheet was then exported into 3 different .csv files for the
different types of files to be modified.

These .csv files were then reviewed by Greg. Thomas wrote a script to
parse the csv files and add the proper SPDX tag to the file, in the
format that the file expected. This script was further refined by Greg
based on the output to detect more types of files automatically and to
distinguish between header and source .c files (which need different
comment types). Finally Greg ran the script using the .csv files to
generate the patches.

Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
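
The message notes that header and source .c files need different comment types for the SPDX tag. A minimal sketch of that distinction, assuming the styles described in the kernel's license rules documentation (`//` for .c files, `/* */` for .h files); `format_spdx_line` and `has_suffix` are hypothetical helpers, not the script Thomas and Greg used:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns nonzero if path ends with suffix. */
static int has_suffix(const char *path, const char *suffix)
{
	size_t n = strlen(path), m = strlen(suffix);

	return n >= m && strcmp(path + n - m, suffix) == 0;
}

/*
 * Write the SPDX tag line for the given file into out.
 * Returns the snprintf() result, or -1 for file types this sketch
 * does not handle (Makefiles, scripts, assembly, ...).
 */
int format_spdx_line(const char *path, const char *id, char *out, size_t len)
{
	assert(path && id && out);

	if (has_suffix(path, ".c"))
		return snprintf(out, len, "// SPDX-License-Identifier: %s", id);
	if (has_suffix(path, ".h"))
		return snprintf(out, len, "/* SPDX-License-Identifier: %s */", id);
	return -1;
}
```

For this file, `format_spdx_line("fs/btrfs/extent_io.c", "GPL-2.0", buf, sizeof(buf))` yields the same first line that the source below starts with.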

// SPDX-License-Identifier: GPL-2.0

#include <linux/bitops.h>
#include <linux/slab.h>
#include <linux/bio.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/page-flags.h>
#include <linux/spinlock.h>
#include <linux/blkdev.h>
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/pagevec.h>
#include <linux/prefetch.h>
#include <linux/cleancache.h>
#include "extent_io.h"
#include "extent-io-tree.h"
#include "extent_map.h"
#include "ctree.h"
#include "btrfs_inode.h"
#include "volumes.h"
#include "check-integrity.h"
#include "locking.h"
#include "rcu-string.h"
#include "backref.h"
#include "disk-io.h"

static struct kmem_cache *extent_state_cache;
static struct kmem_cache *extent_buffer_cache;
static struct bio_set btrfs_bioset;

static inline bool extent_state_in_tree(const struct extent_state *state)
{
	return !RB_EMPTY_NODE(&state->rb_node);
}

#ifdef CONFIG_BTRFS_DEBUG
static LIST_HEAD(states);
static DEFINE_SPINLOCK(leak_lock);

static inline void btrfs_leak_debug_add(spinlock_t *lock,
					struct list_head *new,
					struct list_head *head)
{
	unsigned long flags;

	spin_lock_irqsave(lock, flags);
	list_add(new, head);
	spin_unlock_irqrestore(lock, flags);
}

static inline void btrfs_leak_debug_del(spinlock_t *lock,
					struct list_head *entry)
{
	unsigned long flags;

	spin_lock_irqsave(lock, flags);
	list_del(entry);
	spin_unlock_irqrestore(lock, flags);
}

void btrfs_extent_buffer_leak_debug_check(struct btrfs_fs_info *fs_info)
{
	struct extent_buffer *eb;
	unsigned long flags;

	/*
	 * If we didn't get into open_ctree our allocated_ebs will not be
	 * initialized, so just skip this.
	 */
	if (!fs_info->allocated_ebs.next)
		return;

	spin_lock_irqsave(&fs_info->eb_leak_lock, flags);
	while (!list_empty(&fs_info->allocated_ebs)) {
		eb = list_first_entry(&fs_info->allocated_ebs,
				      struct extent_buffer, leak_list);
		pr_err(
	"BTRFS: buffer leak start %llu len %lu refs %d bflags %lu owner %llu\n",
		       eb->start, eb->len, atomic_read(&eb->refs), eb->bflags,
		       btrfs_header_owner(eb));
		list_del(&eb->leak_list);
		kmem_cache_free(extent_buffer_cache, eb);
	}
	spin_unlock_irqrestore(&fs_info->eb_leak_lock, flags);
}

static inline void btrfs_extent_state_leak_debug_check(void)
{
	struct extent_state *state;

	while (!list_empty(&states)) {
		state = list_entry(states.next, struct extent_state, leak_list);
		pr_err("BTRFS: state leak: start %llu end %llu state %u in tree %d refs %d\n",
		       state->start, state->end, state->state,
		       extent_state_in_tree(state),
		       refcount_read(&state->refs));
		list_del(&state->leak_list);
		kmem_cache_free(extent_state_cache, state);
	}
}

#define btrfs_debug_check_extent_io_range(tree, start, end)		\
	__btrfs_debug_check_extent_io_range(__func__, (tree), (start), (end))
static inline void __btrfs_debug_check_extent_io_range(const char *caller,
		struct extent_io_tree *tree, u64 start, u64 end)
{
	struct inode *inode = tree->private_data;
	u64 isize;

	if (!inode || !is_data_inode(inode))
		return;

	isize = i_size_read(inode);
	if (end >= PAGE_SIZE && (end % 2) == 0 && end != isize - 1) {
		btrfs_debug_rl(BTRFS_I(inode)->root->fs_info,
		    "%s: ino %llu isize %llu odd range [%llu,%llu]",
			caller, btrfs_ino(BTRFS_I(inode)), isize, start, end);
	}
}
#else
#define btrfs_leak_debug_add(lock, new, head)	do {} while (0)
#define btrfs_leak_debug_del(lock, entry)	do {} while (0)
#define btrfs_extent_state_leak_debug_check()	do {} while (0)
#define btrfs_debug_check_extent_io_range(c, s, e)	do {} while (0)
#endif

struct tree_entry {
	u64 start;
	u64 end;
	struct rb_node rb_node;
};

struct extent_page_data {
	struct bio *bio;
	/* tells writepage not to lock the state bits for this range
	 * it still does the unlocking
	 */
	unsigned int extent_locked:1;

	/* tells the submit_bio code to use REQ_SYNC */
	unsigned int sync_io:1;
};

static int add_extent_changeset(struct extent_state *state, u32 bits,
				struct extent_changeset *changeset,
				int set)
{
	int ret;

	if (!changeset)
		return 0;
	if (set && (state->state & bits) == bits)
		return 0;
	if (!set && (state->state & bits) == 0)
		return 0;
	changeset->bytes_changed += state->end - state->start + 1;
	ret = ulist_add(&changeset->range_changed, state->start, state->end,
			GFP_ATOMIC);
	return ret;
}

int __must_check submit_one_bio(struct bio *bio, int mirror_num,
				unsigned long bio_flags)
{
	blk_status_t ret = 0;
	struct extent_io_tree *tree = bio->bi_private;

	bio->bi_private = NULL;

	if (is_data_inode(tree->private_data))
		ret = btrfs_submit_data_bio(tree->private_data, bio, mirror_num,
					    bio_flags);
	else
		ret = btrfs_submit_metadata_bio(tree->private_data, bio,
						mirror_num, bio_flags);

	return blk_status_to_errno(ret);
}

/* Cleanup unsubmitted bios */
static void end_write_bio(struct extent_page_data *epd, int ret)
{
	if (epd->bio) {
		epd->bio->bi_status = errno_to_blk_status(ret);
		bio_endio(epd->bio);
		epd->bio = NULL;
	}
}

/*
 * Submit bio from extent page data via submit_one_bio
 *
 * Return 0 if everything is OK.
 * Return <0 for error.
 */
static int __must_check flush_write_bio(struct extent_page_data *epd)
{
	int ret = 0;

	if (epd->bio) {
		ret = submit_one_bio(epd->bio, 0, 0);
		/*
		 * Clean up of epd->bio is handled by its endio function.
		 * And endio is either triggered by successful bio execution
		 * or the error handler of submit bio hook.
		 * So at this point, no matter what happened, we don't need
		 * to clean up epd->bio.
		 */
		epd->bio = NULL;
	}
	return ret;
}

int __init extent_state_cache_init(void)
{
	extent_state_cache = kmem_cache_create("btrfs_extent_state",
			sizeof(struct extent_state), 0,
			SLAB_MEM_SPREAD, NULL);
	if (!extent_state_cache)
		return -ENOMEM;
	return 0;
}

int __init extent_io_init(void)
{
	extent_buffer_cache = kmem_cache_create("btrfs_extent_buffer",
			sizeof(struct extent_buffer), 0,
			SLAB_MEM_SPREAD, NULL);
	if (!extent_buffer_cache)
		return -ENOMEM;

	if (bioset_init(&btrfs_bioset, BIO_POOL_SIZE,
			offsetof(struct btrfs_io_bio, bio),
			BIOSET_NEED_BVECS))
		goto free_buffer_cache;

	if (bioset_integrity_create(&btrfs_bioset, BIO_POOL_SIZE))
		goto free_bioset;

	return 0;

free_bioset:
	bioset_exit(&btrfs_bioset);

free_buffer_cache:
	kmem_cache_destroy(extent_buffer_cache);
	extent_buffer_cache = NULL;
	return -ENOMEM;
}

void __cold extent_state_cache_exit(void)
{
	btrfs_extent_state_leak_debug_check();
	kmem_cache_destroy(extent_state_cache);
}

void __cold extent_io_exit(void)
{
	/*
	 * Make sure all delayed rcu free are flushed before we
	 * destroy caches.
	 */
	rcu_barrier();
	kmem_cache_destroy(extent_buffer_cache);
	bioset_exit(&btrfs_bioset);
}

/*
 * For the file_extent_tree, we want to hold the inode lock when we lookup and
 * update the disk_i_size, but lockdep will complain because our io_tree we hold
 * the tree lock and get the inode lock when setting delalloc.  These two things
 * are unrelated, so make a class for the file_extent_tree so we don't get the
 * two locking patterns mixed up.
 */
static struct lock_class_key file_extent_tree_class;

void extent_io_tree_init(struct btrfs_fs_info *fs_info,
			 struct extent_io_tree *tree, unsigned int owner,
			 void *private_data)
{
	tree->fs_info = fs_info;
	tree->state = RB_ROOT;
	tree->dirty_bytes = 0;
	spin_lock_init(&tree->lock);
	tree->private_data = private_data;
	tree->owner = owner;
	if (owner == IO_TREE_INODE_FILE_EXTENT)
		lockdep_set_class(&tree->lock, &file_extent_tree_class);
}

void extent_io_tree_release(struct extent_io_tree *tree)
{
	spin_lock(&tree->lock);
	/*
	 * Do a single barrier for the waitqueue_active check here, the state
	 * of the waitqueue should not change once extent_io_tree_release is
	 * called.
	 */
	smp_mb();
	while (!RB_EMPTY_ROOT(&tree->state)) {
		struct rb_node *node;
		struct extent_state *state;

		node = rb_first(&tree->state);
		state = rb_entry(node, struct extent_state, rb_node);
		rb_erase(&state->rb_node, &tree->state);
		RB_CLEAR_NODE(&state->rb_node);
		/*
		 * btree io trees aren't supposed to have tasks waiting for
		 * changes in the flags of extent states ever.
		 */
		ASSERT(!waitqueue_active(&state->wq));
		free_extent_state(state);

		cond_resched_lock(&tree->lock);
	}
	spin_unlock(&tree->lock);
}

static struct extent_state *alloc_extent_state(gfp_t mask)
{
	struct extent_state *state;

	/*
	 * The given mask might be not appropriate for the slab allocator,
	 * drop the unsupported bits
	 */
	mask &= ~(__GFP_DMA32|__GFP_HIGHMEM);
	state = kmem_cache_alloc(extent_state_cache, mask);
	if (!state)
		return state;
	state->state = 0;
	state->failrec = NULL;
	RB_CLEAR_NODE(&state->rb_node);
	btrfs_leak_debug_add(&leak_lock, &state->leak_list, &states);
	refcount_set(&state->refs, 1);
	init_waitqueue_head(&state->wq);
	trace_alloc_extent_state(state, mask, _RET_IP_);
	return state;
}

void free_extent_state(struct extent_state *state)
{
	if (!state)
		return;
	if (refcount_dec_and_test(&state->refs)) {
		WARN_ON(extent_state_in_tree(state));
		btrfs_leak_debug_del(&leak_lock, &state->leak_list);
		trace_free_extent_state(state, _RET_IP_);
		kmem_cache_free(extent_state_cache, state);
	}
}

static struct rb_node *tree_insert(struct rb_root *root,
				   struct rb_node *search_start,
				   u64 offset,
				   struct rb_node *node,
				   struct rb_node ***p_in,
				   struct rb_node **parent_in)
{
	struct rb_node **p;
	struct rb_node *parent = NULL;
	struct tree_entry *entry;

	if (p_in && parent_in) {
		p = *p_in;
		parent = *parent_in;
		goto do_insert;
	}

	p = search_start ? &search_start : &root->rb_node;
	while (*p) {
		parent = *p;
		entry = rb_entry(parent, struct tree_entry, rb_node);

		if (offset < entry->start)
			p = &(*p)->rb_left;
		else if (offset > entry->end)
			p = &(*p)->rb_right;
		else
			return parent;
	}

do_insert:
	rb_link_node(node, parent, p);
	rb_insert_color(node, root);
	return NULL;
}

/**
 * __etree_search - search @tree for an entry that contains @offset. Such
 * entry would have entry->start <= offset && entry->end >= offset.
 *
 * @tree - the tree to search
 * @offset - offset that should fall within an entry in @tree
 * @next_ret - pointer to the first entry whose range ends after @offset
 * @prev_ret - pointer to the first entry whose range begins before @offset
 * @p_ret - pointer where new node should be anchored (used when inserting an
 *	    entry in the tree)
 * @parent_ret - points to entry which would have been the parent of the entry,
 *               containing @offset
 *
 * This function returns a pointer to the entry that contains @offset byte
 * address. If no such entry exists, then NULL is returned and the other
 * pointer arguments to the function are filled, otherwise the found entry is
 * returned and other pointers are left untouched.
 */
static struct rb_node *__etree_search(struct extent_io_tree *tree, u64 offset,
				      struct rb_node **next_ret,
				      struct rb_node **prev_ret,
				      struct rb_node ***p_ret,
				      struct rb_node **parent_ret)
{
	struct rb_root *root = &tree->state;
	struct rb_node **n = &root->rb_node;
	struct rb_node *prev = NULL;
	struct rb_node *orig_prev = NULL;
	struct tree_entry *entry;
	struct tree_entry *prev_entry = NULL;

	while (*n) {
		prev = *n;
		entry = rb_entry(prev, struct tree_entry, rb_node);
		prev_entry = entry;

		if (offset < entry->start)
			n = &(*n)->rb_left;
		else if (offset > entry->end)
			n = &(*n)->rb_right;
		else
			return *n;
	}

	if (p_ret)
		*p_ret = n;
	if (parent_ret)
		*parent_ret = prev;

	if (next_ret) {
		orig_prev = prev;
		while (prev && offset > prev_entry->end) {
			prev = rb_next(prev);
			prev_entry = rb_entry(prev, struct tree_entry, rb_node);
		}
		*next_ret = prev;
		prev = orig_prev;
	}

	if (prev_ret) {
		prev_entry = rb_entry(prev, struct tree_entry, rb_node);
		while (prev && offset < prev_entry->start) {
			prev = rb_prev(prev);
			prev_entry = rb_entry(prev, struct tree_entry, rb_node);
		}
		*prev_ret = prev;
	}
	return NULL;
}
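
The lookup rule used by __etree_search() (descend left when the offset is below an entry's start, right when it is above its end, stop when it falls inside) can be tried out in userspace. The following is a simplified sketch, not the kernel code: it assumes a plain unbalanced binary search tree of non-overlapping inclusive ranges instead of an rb-tree, and `struct range_node` and `range_search` are hypothetical names. Like tree_search() below, it falls back to the next entry after the offset when nothing contains it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One inclusive range [start, end] in an unbalanced search tree. */
struct range_node {
	uint64_t start;
	uint64_t end;
	struct range_node *left;
	struct range_node *right;
};

/*
 * Return the node whose range contains offset. If there is none, return
 * the node with the smallest start above offset, or NULL if no such node
 * exists. Ranges in the tree are assumed not to overlap.
 */
const struct range_node *range_search(const struct range_node *root,
				      uint64_t offset)
{
	const struct range_node *next = NULL;

	while (root) {
		if (offset < root->start) {
			/* Last node we turned left at is the successor. */
			next = root;
			root = root->left;
		} else if (offset > root->end) {
			root = root->right;
		} else {
			return root;
		}
	}
	return next;
}
```

The kernel version walks with rb_next() from the last visited node to find the successor; remembering the last left turn during the descent gives the same node for non-overlapping ranges.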

static inline struct rb_node *
tree_search_for_insert(struct extent_io_tree *tree,
		       u64 offset,
		       struct rb_node ***p_ret,
		       struct rb_node **parent_ret)
{
	struct rb_node *next = NULL;
	struct rb_node *ret;

	ret = __etree_search(tree, offset, &next, NULL, p_ret, parent_ret);
	if (!ret)
		return next;
	return ret;
}

static inline struct rb_node *tree_search(struct extent_io_tree *tree,
					  u64 offset)
{
	return tree_search_for_insert(tree, offset, NULL, NULL);
}

/*
 * utility function to look for merge candidates inside a given range.
 * Any extents with matching state are merged together into a single
 * extent in the tree.  Extents with EXTENT_IO in their state field
 * are not merged because the end_io handlers need to be able to do
 * operations on them without sleeping (or doing allocations/splits).
 *
 * This should be called with the tree lock held.
 */
static void merge_state(struct extent_io_tree *tree,
		        struct extent_state *state)
{
	struct extent_state *other;
	struct rb_node *other_node;

	if (state->state & (EXTENT_LOCKED | EXTENT_BOUNDARY))
		return;

	other_node = rb_prev(&state->rb_node);
	if (other_node) {
		other = rb_entry(other_node, struct extent_state, rb_node);
		if (other->end == state->start - 1 &&
		    other->state == state->state) {
			if (tree->private_data &&
			    is_data_inode(tree->private_data))
				btrfs_merge_delalloc_extent(tree->private_data,
							    state, other);
			state->start = other->start;
			rb_erase(&other->rb_node, &tree->state);
			RB_CLEAR_NODE(&other->rb_node);
			free_extent_state(other);
		}
	}
	other_node = rb_next(&state->rb_node);
	if (other_node) {
		other = rb_entry(other_node, struct extent_state, rb_node);
		if (other->start == state->end + 1 &&
		    other->state == state->state) {
			if (tree->private_data &&
			    is_data_inode(tree->private_data))
				btrfs_merge_delalloc_extent(tree->private_data,
							    state, other);
			state->end = other->end;
			rb_erase(&other->rb_node, &tree->state);
			RB_CLEAR_NODE(&other->rb_node);
			free_extent_state(other);
		}
	}
}
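
The merge condition in merge_state() is that two neighbours touch exactly (`other->end == state->start - 1`) and carry identical state bits. A small userspace sketch of that rule, assuming a sorted array of non-overlapping ranges rather than the rb-tree; `struct range`, `can_merge` and `merge_ranges` are hypothetical names, and the EXTENT_LOCKED / EXTENT_BOUNDARY exclusion is not modelled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive byte range with a set of state bits. */
struct range {
	uint64_t start;
	uint64_t end;
	unsigned int bits;
};

/* Neighbours merge only if they touch exactly and have the same bits. */
bool can_merge(const struct range *prev, const struct range *next)
{
	assert(prev && next);
	return prev->end == next->start - 1 && prev->bits == next->bits;
}

/*
 * Merge mergeable neighbours in place. The array must be sorted by start
 * and must not contain overlapping ranges. Returns the new element count.
 */
size_t merge_ranges(struct range *r, size_t n)
{
	size_t out = 0;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 1; i < n; i++) {
		if (can_merge(&r[out], &r[i]))
			r[out].end = r[i].end;	/* grow the kept range */
		else
			r[++out] = r[i];	/* keep as a separate range */
	}
	return out + 1;
}
```

merge_state() only looks at the immediate predecessor and successor of one state, so it does at most two merges per call; the array version above sweeps the whole set to keep the example short.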

static void set_state_bits(struct extent_io_tree *tree,
			   struct extent_state *state, u32 *bits,
			   struct extent_changeset *changeset);

/*
 * insert an extent_state struct into the tree.  'bits' are set on the
 * struct before it is inserted.
 *
 * This may return -EEXIST if the extent is already there, in which case the
 * state struct is freed.
 *
 * The tree lock is not taken internally.  This is a utility function and
 * probably isn't what you want to call (see set/clear_extent_bit).
 */
static int insert_state(struct extent_io_tree *tree,
			struct extent_state *state, u64 start, u64 end,
			struct rb_node ***p,
			struct rb_node **parent,
			u32 *bits, struct extent_changeset *changeset)
{
	struct rb_node *node;

	if (end < start) {
		btrfs_err(tree->fs_info,
			"insert state: end < start %llu %llu", end, start);
		WARN_ON(1);
	}
	state->start = start;
	state->end = end;

	set_state_bits(tree, state, bits, changeset);

	node = tree_insert(&tree->state, NULL, end, &state->rb_node, p, parent);
	if (node) {
		struct extent_state *found;
		found = rb_entry(node, struct extent_state, rb_node);
		btrfs_err(tree->fs_info,
		       "found node %llu %llu on insert of %llu %llu",
		       found->start, found->end, start, end);
		return -EEXIST;
	}
	merge_state(tree, state);
	return 0;
}

/*
 * split a given extent state struct in two, inserting the preallocated
 * struct 'prealloc' as the newly created first half. 'split' indicates an
 * offset inside 'orig' where it should be split.
 *
 * Before calling,
 * the tree has 'orig' at [orig->start, orig->end]. After calling, there
 * are two extent state structs in the tree:
 * prealloc: [orig->start, split - 1]
 * orig: [ split, orig->end ]
 *
 * The tree locks are not taken by this function. They need to be held
 * by the caller.
 */
static int split_state(struct extent_io_tree *tree, struct extent_state *orig,
		       struct extent_state *prealloc, u64 split)
{
	struct rb_node *node;

	if (tree->private_data && is_data_inode(tree->private_data))
		btrfs_split_delalloc_extent(tree->private_data, orig, split);

	prealloc->start = orig->start;
	prealloc->end = split - 1;
	prealloc->state = orig->state;
	orig->start = split;

	node = tree_insert(&tree->state, &orig->rb_node, prealloc->end,
			   &prealloc->rb_node, NULL, NULL);
	if (node) {
		free_extent_state(prealloc);
		return -EEXIST;
	}
	return 0;
}
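
The range arithmetic performed by split_state() can be looked at in isolation. The following is a minimal user-space sketch, not the kernel code: `struct range_state` and `split_range()` are hypothetical names introduced only for illustration, and the red-black tree insertion and the delalloc hook are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extent_state: an inclusive range and its bits. */
struct range_state {
	uint64_t start;
	uint64_t end;
	uint32_t bits;
};

/*
 * Split 'orig' at offset 'split', which must lie inside the range and
 * must not be its first byte. Afterwards:
 *   first: [orig->start, split - 1]
 *   orig:  [split, orig->end]
 * Both halves carry the same bits, as in split_state().
 */
static void split_range(struct range_state *orig, struct range_state *first,
			uint64_t split)
{
	assert(orig->start < split && split <= orig->end);

	first->start = orig->start;
	first->end = split - 1;
	first->bits = orig->bits;
	orig->start = split;
}
```

Splitting a record that covers [0, 1MiB - 1] at offset 128KiB would leave `first` covering [0, 128KiB - 1] and `orig` covering [128KiB, 1MiB - 1], with no byte lost or counted twice.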

static struct extent_state *next_state(struct extent_state *state)
{
	struct rb_node *next = rb_next(&state->rb_node);
	if (next)
		return rb_entry(next, struct extent_state, rb_node);
	else
		return NULL;
}

/*
 * utility function to clear some bits in an extent state struct.
 * it will optionally wake up anyone waiting on this state (wake == 1).
 *
 * If no bits are set on the state struct after clearing things, the
 * struct is freed and removed from the tree
 */
static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
					    struct extent_state *state,
					    u32 *bits, int wake,
					    struct extent_changeset *changeset)
{
	struct extent_state *next;
	u32 bits_to_clear = *bits & ~EXTENT_CTLBITS;
	int ret;

	if ((bits_to_clear & EXTENT_DIRTY) && (state->state & EXTENT_DIRTY)) {
		u64 range = state->end - state->start + 1;
		WARN_ON(range > tree->dirty_bytes);
		tree->dirty_bytes -= range;
	}

	if (tree->private_data && is_data_inode(tree->private_data))
		btrfs_clear_delalloc_extent(tree->private_data, state, bits);

	ret = add_extent_changeset(state, bits_to_clear, changeset, 0);
	BUG_ON(ret < 0);
	state->state &= ~bits_to_clear;
	if (wake)
		wake_up(&state->wq);
	if (state->state == 0) {
		next = next_state(state);
		if (extent_state_in_tree(state)) {
			rb_erase(&state->rb_node, &tree->state);
			RB_CLEAR_NODE(&state->rb_node);
			free_extent_state(state);
		} else {
			WARN_ON(1);
		}
	} else {
		merge_state(tree, state);
		next = next_state(state);
	}
	return next;
}
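
The bit masking and dirty byte accounting in clear_state_bit() can be sketched without the tree. This is an illustration under stated assumptions, not the kernel implementation: the `SK_*` bit values, `struct sk_state`, `struct sk_tree` and `sk_clear_bits()` are hypothetical, `SK_CTL` merely stands in for the control bits grouped under EXTENT_CTLBITS, and an assert replaces WARN_ON().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
#define SK_DIRTY	(1U << 0)
#define SK_LOCKED	(1U << 1)
#define SK_CTL		(1U << 15)	/* stands in for EXTENT_CTLBITS */

struct sk_state {
	uint64_t start;
	uint64_t end;		/* inclusive */
	uint32_t bits;
};

struct sk_tree {
	uint64_t dirty_bytes;
};

/*
 * Clear 'bits' on 'st'. Control bits are masked out first because they
 * are never stored in the state. Returns 1 when the state has no bits
 * left, which is the point where the kernel code removes it from the tree.
 */
static int sk_clear_bits(struct sk_tree *tree, struct sk_state *st,
			 uint32_t bits)
{
	uint32_t to_clear = bits & ~SK_CTL;

	/* Only adjust the counter when the dirty bit really goes away. */
	if ((to_clear & SK_DIRTY) && (st->bits & SK_DIRTY)) {
		uint64_t range = st->end - st->start + 1;

		assert(range <= tree->dirty_bytes);
		tree->dirty_bytes -= range;
	}
	st->bits &= ~to_clear;
	return st->bits == 0;
}
```

Checking both the requested bits and the current state before touching `dirty_bytes` is what keeps the counter from underflowing when the same range is cleared twice.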

static struct extent_state *
alloc_extent_state_atomic(struct extent_state *prealloc)
{
	if (!prealloc)
		prealloc = alloc_extent_state(GFP_ATOMIC);

	return prealloc;
}

static void extent_io_tree_panic(struct extent_io_tree *tree, int err)
{
	struct inode *inode = tree->private_data;

	btrfs_panic(btrfs_sb(inode->i_sb), err,
	"locking error: extent tree was modified by another thread while locked");
}

/*
 * clear some bits on a range in the tree. This may require splitting
 * or inserting elements in the tree, so the gfp mask is used to
 * indicate which allocations or sleeping are allowed.
 *
 * pass 'wake' == 1 to kick any sleepers, and 'delete' == 1 to remove
 * the given range from the tree regardless of state (ie for truncate).
 *
 * the range [start, end] is inclusive.
 *
 * This takes the tree lock, and returns 0 on success and < 0 on error.
 */
int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
		       u32 bits, int wake, int delete,
		       struct extent_state **cached_state,
		       gfp_t mask, struct extent_changeset *changeset)
{
	struct extent_state *state;
	struct extent_state *cached;
	struct extent_state *prealloc = NULL;
	struct rb_node *node;
	u64 last_end;
	int err;
	int clear = 0;

	btrfs_debug_check_extent_io_range(tree, start, end);
	trace_btrfs_clear_extent_bit(tree, start, end - start + 1, bits);

	if (bits & EXTENT_DELALLOC)
		bits |= EXTENT_NORESERVE;

	if (delete)
		bits |= ~EXTENT_CTLBITS;

	if (bits & (EXTENT_LOCKED | EXTENT_BOUNDARY))
		clear = 1;
again:
	if (!prealloc && gfpflags_allow_blocking(mask)) {
		/*
		 * Don't care for allocation failure here because we might end
		 * up not needing the pre-allocated extent state at all, which
		 * is the case if we only have in the tree extent states that
		 * cover our input range and do not also cover any other range.
		 * If we end up needing a new extent state we allocate it later.
		 */
		prealloc = alloc_extent_state(mask);
	}

	spin_lock(&tree->lock);
	if (cached_state) {
		cached = *cached_state;

		if (clear) {
			*cached_state = NULL;
			cached_state = NULL;
		}

		if (cached && extent_state_in_tree(cached) &&
		    cached->start <= start && cached->end > start) {
			if (clear)
				refcount_dec(&cached->refs);
			state = cached;
			goto hit_next;
		}
		if (clear)
			free_extent_state(cached);
	}
	/*
	 * this search will find the extents that end after
	 * our range starts
	 */
	node = tree_search(tree, start);
	if (!node)
		goto out;
	state = rb_entry(node, struct extent_state, rb_node);
hit_next:
	if (state->start > end)
		goto out;
	WARN_ON(state->end < start);
	last_end = state->end;

	/* the state doesn't have the wanted bits, go ahead */
	if (!(state->state & bits)) {
		state = next_state(state);
		goto next;
	}

	/*
	 *     | ---- desired range ---- |
	 *  | state | or
	 *  | ------------- state -------------- |
	 *
	 * We need to split the extent we found, and may flip
	 * bits on second half.
	 *
	 * If the extent we found extends past our range, we
	 * just split and search again. It'll get split again
	 * the next time through.
	 *
	 * If the extent we found is inside our range, we clear
	 * the desired bit on it.
	 */

	if (state->start < start) {
		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);
		err = split_state(tree, state, prealloc, start);
		if (err)
			extent_io_tree_panic(tree, err);

		prealloc = NULL;
		if (err)
			goto out;
		if (state->end <= end) {
			state = clear_state_bit(tree, state, &bits, wake,
						changeset);
			goto next;
		}
		goto search_again;
	}
	/*
	 * | ---- desired range ---- |
	 *                        | state |
	 * We need to split the extent, and clear the bit
	 * on the first half
	 */
	if (state->start <= end && state->end > end) {
		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);
		err = split_state(tree, state, prealloc, end + 1);
		if (err)
			extent_io_tree_panic(tree, err);

		if (wake)
			wake_up(&state->wq);

		clear_state_bit(tree, prealloc, &bits, wake, changeset);

		prealloc = NULL;
		goto out;
	}

	state = clear_state_bit(tree, state, &bits, wake, changeset);
next:
	if (last_end == (u64)-1)
		goto out;
	start = last_end + 1;
	if (start <= end && state && !need_resched())
		goto hit_next;

search_again:
	if (start > end)
		goto out;
	spin_unlock(&tree->lock);
	if (gfpflags_allow_blocking(mask))
		cond_resched();
	goto again;

out:
	spin_unlock(&tree->lock);
	if (prealloc)
		free_extent_state(prealloc);

	return 0;

}

static void wait_on_state(struct extent_io_tree *tree,
			  struct extent_state *state)
		__releases(tree->lock)
		__acquires(tree->lock)
{
	DEFINE_WAIT(wait);
	prepare_to_wait(&state->wq, &wait, TASK_UNINTERRUPTIBLE);
	spin_unlock(&tree->lock);
	schedule();
	spin_lock(&tree->lock);
	finish_wait(&state->wq, &wait);
}

/*
 * waits for one or more bits to clear on a range in the state tree.
 * The range [start, end] is inclusive.
 * The tree lock is taken by this function
 */
static void wait_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
			    u32 bits)
{
	struct extent_state *state;
	struct rb_node *node;

	btrfs_debug_check_extent_io_range(tree, start, end);

	spin_lock(&tree->lock);
again:
	while (1) {
		/*
		 * this search will find all the extents that end after
		 * our range starts
		 */
		node = tree_search(tree, start);
process_node:
		if (!node)
			break;

		state = rb_entry(node, struct extent_state, rb_node);

		if (state->start > end)
			goto out;

		if (state->state & bits) {
			start = state->start;
			refcount_inc(&state->refs);
			wait_on_state(tree, state);
			free_extent_state(state);
			goto again;
		}
		start = state->end + 1;

		if (start > end)
			break;

		if (!cond_resched_lock(&tree->lock)) {
			node = rb_next(node);
			goto process_node;
		}
	}
out:
	spin_unlock(&tree->lock);
}

static void set_state_bits(struct extent_io_tree *tree,
			   struct extent_state *state,
			   u32 *bits, struct extent_changeset *changeset)
{
	u32 bits_to_set = *bits & ~EXTENT_CTLBITS;
	int ret;

	if (tree->private_data && is_data_inode(tree->private_data))
		btrfs_set_delalloc_extent(tree->private_data, state, bits);

	if ((bits_to_set & EXTENT_DIRTY) && !(state->state & EXTENT_DIRTY)) {
		u64 range = state->end - state->start + 1;
		tree->dirty_bytes += range;
	}
	ret = add_extent_changeset(state, bits_to_set, changeset, 1);
	BUG_ON(ret < 0);
	state->state |= bits_to_set;
}

static void cache_state_if_flags(struct extent_state *state,
				 struct extent_state **cached_ptr,
				 unsigned flags)
{
	if (cached_ptr && !(*cached_ptr)) {
		if (!flags || (state->state & flags)) {
			*cached_ptr = state;
			refcount_inc(&state->refs);
		}
	}
}

static void cache_state(struct extent_state *state,
			struct extent_state **cached_ptr)
{
	return cache_state_if_flags(state, cached_ptr,
				    EXTENT_LOCKED | EXTENT_BOUNDARY);
}
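
The caching rule in cache_state_if_flags() is small enough to model outside the kernel. The sketch below is illustrative only: `struct sk_ref_state`, `sk_cache_if_flags()` and the `SK_LOCKED` value are hypothetical, and a plain integer stands in for the kernel's refcount_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_LOCKED	(1U << 1)	/* hypothetical bit value */

struct sk_ref_state {
	int refs;		/* plain counter standing in for refcount_t */
	uint32_t bits;
};

/*
 * Remember 'st' in '*cached' and take a reference, but only when:
 *  - the caller supplied a cache slot,
 *  - the slot is still empty, and
 *  - either no filter was given or 'st' has at least one of 'flags'.
 */
static void sk_cache_if_flags(struct sk_ref_state *st,
			      struct sk_ref_state **cached, uint32_t flags)
{
	if (cached && !*cached) {
		if (!flags || (st->bits & flags)) {
			*cached = st;
			st->refs++;
		}
	}
}
```

Because the slot is filled at most once, the first matching state encountered during a walk is the one that stays cached, and every cached pointer is backed by exactly one extra reference that the caller must drop later.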

/*
 * set some bits on a range in the tree. This may require allocations or
 * sleeping, so the gfp mask is used to indicate what is allowed.
 *
 * If any of the exclusive bits are set, this will fail with -EEXIST if some
 * part of the range already has the desired bits set. The start of the
 * existing range is returned in failed_start in this case.
 *
 * [start, end] is inclusive. This takes the tree lock.
 */
int set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end, u32 bits,
		   u32 exclusive_bits, u64 *failed_start,
		   struct extent_state **cached_state, gfp_t mask,
		   struct extent_changeset *changeset)
{
	struct extent_state *state;
	struct extent_state *prealloc = NULL;
	struct rb_node *node;
	struct rb_node **p;
	struct rb_node *parent;
	int err = 0;
	u64 last_start;
	u64 last_end;

	btrfs_debug_check_extent_io_range(tree, start, end);
	trace_btrfs_set_extent_bit(tree, start, end - start + 1, bits);

	if (exclusive_bits)
		ASSERT(failed_start);
	else
		ASSERT(failed_start == NULL);
again:
	if (!prealloc && gfpflags_allow_blocking(mask)) {
		/*
		 * Don't care for allocation failure here because we might end
		 * up not needing the pre-allocated extent state at all, which
		 * is the case if we only have in the tree extent states that
		 * cover our input range and do not also cover any other range.
		 * If we end up needing a new extent state we allocate it later.
		 */
		prealloc = alloc_extent_state(mask);
	}

	spin_lock(&tree->lock);
	if (cached_state && *cached_state) {
		state = *cached_state;
		if (state->start <= start && state->end > start &&
		    extent_state_in_tree(state)) {
			node = &state->rb_node;
			goto hit_next;
		}
	}
	/*
	 * this search will find all the extents that end after
	 * our range starts.
	 */
	node = tree_search_for_insert(tree, start, &p, &parent);
	if (!node) {
		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);
		err = insert_state(tree, prealloc, start, end,
				   &p, &parent, &bits, changeset);
		if (err)
			extent_io_tree_panic(tree, err);

		cache_state(prealloc, cached_state);
		prealloc = NULL;
		goto out;
	}
	state = rb_entry(node, struct extent_state, rb_node);
hit_next:
	last_start = state->start;
	last_end = state->end;

	/*
	 * | ---- desired range ---- |
	 * | state |
	 *
	 * Just lock what we found and keep going
	 */
	if (state->start == start && state->end <= end) {
		if (state->state & exclusive_bits) {
			*failed_start = state->start;
			err = -EEXIST;
			goto out;
		}

		set_state_bits(tree, state, &bits, changeset);
		cache_state(state, cached_state);
		merge_state(tree, state);
		if (last_end == (u64)-1)
			goto out;
		start = last_end + 1;
		state = next_state(state);
		if (start < end && state && state->start == start &&
		    !need_resched())
			goto hit_next;
		goto search_again;
	}

	/*
	 *     | ---- desired range ---- |
	 * | state |
	 *   or
	 * | ------------- state -------------- |
	 *
	 * We need to split the extent we found, and may flip bits on
	 * second half.
	 *
	 * If the extent we found extends past our
	 * range, we just split and search again. It'll get split
	 * again the next time through.
	 *
	 * If the extent we found is inside our range, we set the
	 * desired bit on it.
	 */
	if (state->start < start) {
		if (state->state & exclusive_bits) {
			*failed_start = start;
			err = -EEXIST;
			goto out;
		}

Btrfs: avoid unnecessary splits when setting bits on an extent io tree
When attempting to set bits on a range of an extent io tree that already
has those bits set we can end up splitting an extent state record, use
the preallocated extent state record, insert it into the red black tree,
do another search on the red black tree, merge the preallocated extent
state record with the previous extent state record, remove that previous
record from the red black tree and then free it. This is all unnecessary
work that consumes time.
This happens specifically at the following case at __set_extent_bit():
$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
So if our extent state represents a range from 0 to 1MiB for example, and
we want to set bits in the range 128KiB to 256KiB for example, and that
extent state record already has all those bits set, we end up splitting
that record, so we end up with extent state records in the tree which
represent the ranges from 0 to 128KiB and from 128KiB to 1MiB. This is
temporary because a subsequent iteration in that function will end up
merging the records.
The splitting requires using the preallocated extent state record, so
a future iteration that needs to do another split will need to allocate
another extent state record in an atomic context, something not ideal
that we try to avoid as much as possible. The splitting also requires
an insertion in the red black tree, and a subsequent merge will require
a deletion from the red black tree and freeing an extent state record.
This change just skips the splitting of an extent state record when it
already has all the bits that we need to set.
Setting a bit that is already set for a range is very common in the
inode's 'file_extent_tree' extent io tree for example, where we keep
setting the EXTENT_DIRTY bit every time we replace an extent.
This change also fixes a bug that happens after the recent patchset from
Josef that avoids having implicit holes after a power failure when not
using the NO_HOLES feature, more specifically the patch with the subject:
"btrfs: introduce the inode->file_extent_tree"
This patch introduced an extent io tree per inode to keep track of
completed ordered extents and figure out at any time what is the safe
value for the inode's disk_i_size. This assumes that for contiguous
ranges in a file we always end up with a single extent state record in
the io tree, but that is not the case, as there is a short time window
where we can have two extent state records representing contiguous
ranges. When this happens we end up setting an incorrect value for the
inode's disk_i_size, resulting in data loss after a clean unmount
of the filesystem. The following example explains how this can happen.
Suppose we have an inode with an i_size and a disk_i_size of 1MiB, so in
the inode's file_extent_tree we have a single extent state record that
represents the range [0, 1MiB) with the EXTENT_DIRTY bit set. Then the
following steps happen:
1) A buffered write against file range [512KiB, 768KiB) is made. At this
point delalloc was not flushed yet;
2) Deduplication from some other inode into this inode's range
[128KiB, 256KiB) is made. This causes btrfs_inode_set_file_extent_range()
to be called, from btrfs_insert_clone_extent(), to mark the range
[128KiB, 256KiB) with EXTENT_DIRTY in the inode's file_extent_tree;
3) When btrfs_inode_set_file_extent_range() calls set_extent_bits(), we
end up at __set_extent_bit(). In the first iteration of that function's
loop we end up in the following branch:
$ cat -n fs/btrfs/extent_io.c
957 static int __must_check
958 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
(...)
1044 /*
1045 * | ---- desired range ---- |
1046 * | state |
1047 * or
1048 * | ------------- state -------------- |
1049 *
(...)
1060 if (state->start < start) {
1061 if (state->state & exclusive_bits) {
1062 *failed_start = start;
1063 err = -EEXIST;
1064 goto out;
1065 }
1066
1067 prealloc = alloc_extent_state_atomic(prealloc);
1068 BUG_ON(!prealloc);
1069 err = split_state(tree, state, prealloc, start);
1070 if (err)
1071 extent_io_tree_panic(tree, err);
1072
1073 prealloc = NULL;
(...)
1089 goto search_again;
This splits the state record into two, one for range [0, 128KiB) and
another for the range [128KiB, 1MiB). Both already have the EXTENT_DIRTY
bit set. Then we jump to the 'search_again' label, where we unlock the
spinlock protecting the extent io tree before jumping to the
'again' label to perform the next iteration;
4) In the meanwhile, delalloc is flushed, the ordered extent for the range
[512KiB, 768KiB) is created and when it completes, at
btrfs_finish_ordered_io(), it calls btrfs_inode_safe_disk_i_size_write()
with a value of 0 for its 'new_size' argument;
5) Before the deduplication task currently at __set_extent_bit() moves to
the next iteration, the task finishing the ordered extent calls
find_first_extent_bit() through btrfs_inode_safe_disk_i_size_write()
and gets 'start' set to 0 and 'end' set to 128KiB - 1, because at this
moment the io tree has two extent state records, one representing the
range [0, 128KiB) and another representing the range [128KiB, 1MiB),
both with EXTENT_DIRTY set. Then we set 'isize' to:
isize = min(isize, end + 1)
= min(1MiB, 128KiB - 1 + 1)
= 128KiB
Then we set the inode's disk_i_size to 128KiB (isize).
After a clean unmount of the filesystem and mounting it again, we have
the file with a size of 128KiB, and effectively lost all the data it
had before in the range from 128KiB to 1MiB.
This change fixes that issue too, as we never end up splitting extent
state records when they already have all the bits we want set.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-13 10:20:02 +00:00
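
The arithmetic behind the data loss described in this commit message can be reproduced with a small user-space model. It is only a sketch under stated assumptions: `struct sk_rec`, `sk_first_with_bits()`, `sk_safe_isize()` and `sk_has_all_bits()` are hypothetical names, the lookup simply returns the first record that has the bit set (standing in for what find_first_extent_bit() reports in step 5), and the case where no record is found is simplified to leaving the size unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_KIB		1024ULL
#define SK_MIB		(1024ULL * SK_KIB)
#define SK_DIRTY	1U	/* hypothetical bit value */

/* One extent state record: an inclusive byte range and its bits. */
struct sk_rec {
	uint64_t start;
	uint64_t end;
	uint32_t bits;
};

/* Return the first record (in offset order) that has any of 'bits' set. */
static const struct sk_rec *sk_first_with_bits(const struct sk_rec *recs,
						size_t n, uint32_t bits)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (recs[i].bits & bits)
			return &recs[i];
	return NULL;
}

/* isize = min(isize, end + 1), using the first dirty record found. */
static uint64_t sk_safe_isize(const struct sk_rec *recs, size_t n,
			      uint64_t isize)
{
	const struct sk_rec *r = sk_first_with_bits(recs, n, SK_DIRTY);
	uint64_t limit;

	if (!r)
		return isize;	/* simplification made for this sketch */
	limit = r->end + 1;
	return isize < limit ? isize : limit;
}

/* The check added by the change: does the record already have every wanted bit? */
static int sk_has_all_bits(uint32_t state_bits, uint32_t wanted)
{
	return (state_bits & wanted) == wanted;
}
```

With a single record covering [0, 1MiB - 1] the computed size stays at 1MiB, while the temporarily split pair [0, 128KiB - 1] and [128KiB, 1MiB - 1] yields 128KiB. Skipping the split when `sk_has_all_bits()` holds means the second situation never becomes visible to a concurrent reader.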
		/*
		 * If this extent already has all the bits we want set, then
		 * skip it, not necessary to split it or do anything with it.
		 */
		if ((state->state & bits) == bits) {
			start = state->end + 1;
			cache_state(state, cached_state);
			goto search_again;
		}

		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);
		err = split_state(tree, state, prealloc, start);
		if (err)
			extent_io_tree_panic(tree, err);

		prealloc = NULL;
		if (err)
			goto out;
		if (state->end <= end) {
			set_state_bits(tree, state, &bits, changeset);
			cache_state(state, cached_state);
			merge_state(tree, state);
			if (last_end == (u64)-1)
				goto out;
			start = last_end + 1;
			state = next_state(state);
			if (start < end && state && state->start == start &&
			    !need_resched())
				goto hit_next;
		}
		goto search_again;
	}
	/*
	 * | ---- desired range ---- |
	 *     | state | or               | state |
	 *
	 * There's a hole, we need to insert something in it and
	 * ignore the extent we found.
	 */
	if (state->start > start) {
		u64 this_end;
		if (end < last_start)
			this_end = end;
		else
			this_end = last_start - 1;

		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);

		/*
		 * Avoid freeing 'prealloc' if it can be merged with
		 * the later extent.
		 */
		err = insert_state(tree, prealloc, start, this_end,
				   NULL, NULL, &bits, changeset);
		if (err)
			extent_io_tree_panic(tree, err);

		cache_state(prealloc, cached_state);
		prealloc = NULL;
		start = this_end + 1;
		goto search_again;
	}
	/*
	 * | ---- desired range ---- |
	 *                        | state |
	 * We need to split the extent, and set the bit
	 * on the first half
	 */
	if (state->start <= end && state->end > end) {
		if (state->state & exclusive_bits) {
			*failed_start = start;
			err = -EEXIST;
			goto out;
		}

		prealloc = alloc_extent_state_atomic(prealloc);
		BUG_ON(!prealloc);
		err = split_state(tree, state, prealloc, end + 1);
		if (err)
			extent_io_tree_panic(tree, err);

		set_state_bits(tree, prealloc, &bits, changeset);
		cache_state(prealloc, cached_state);
		merge_state(tree, prealloc);
		prealloc = NULL;
		goto out;
	}

search_again:
	if (start > end)
		goto out;
	spin_unlock(&tree->lock);
	if (gfpflags_allow_blocking(mask))
		cond_resched();
	goto again;

out:
	spin_unlock(&tree->lock);
	if (prealloc)
		free_extent_state(prealloc);

	return err;

}
|
|
|
|
|
2011-09-26 17:56:12 +00:00
|
|
|
/**
|
2012-07-11 07:26:19 +00:00
|
|
|
* convert_extent_bit - convert all bits in a given range from one bit to
|
|
|
|
* another
|
2011-09-26 17:56:12 +00:00
|
|
|
* @tree: the io tree to search
|
|
|
|
* @start: the start offset in bytes
|
|
|
|
* @end: the end offset in bytes (inclusive)
|
|
|
|
* @bits: the bits to set in this range
|
|
|
|
* @clear_bits: the bits to clear in this range
|
2012-09-27 21:07:30 +00:00
|
|
|
* @cached_state: state that we're going to cache
|
2011-09-26 17:56:12 +00:00
|
|
|
*
|
|
|
|
* This will go through and set bits for the given range. If any states exist
|
|
|
|
* already in this range they are set with the given bit and cleared of the
|
|
|
|
* clear_bits. This is only meant to be used by things that are mergeable, ie
|
|
|
|
* converting from say DELALLOC to DIRTY. This is not meant to be used with
|
|
|
|
* boundary bits like LOCK.
|
2016-04-26 21:54:39 +00:00
|
|
|
*
|
|
|
|
* All allocations are done with GFP_NOFS.
|
2011-09-26 17:56:12 +00:00
|
|
|
*/
|
|
|
|
int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 bits, u32 clear_bits,
|
2016-04-26 21:54:39 +00:00
|
|
|
struct extent_state **cached_state)
|
2011-09-26 17:56:12 +00:00
|
|
|
{
|
|
|
|
struct extent_state *state;
|
|
|
|
struct extent_state *prealloc = NULL;
|
|
|
|
struct rb_node *node;
|
2013-11-26 15:41:47 +00:00
|
|
|
struct rb_node **p;
|
|
|
|
struct rb_node *parent;
|
2011-09-26 17:56:12 +00:00
|
|
|
int err = 0;
|
|
|
|
u64 last_start;
|
|
|
|
u64 last_end;
|
2014-10-13 11:28:39 +00:00
|
|
|
bool first_iteration = true;
|
2011-09-26 17:56:12 +00:00
|
|
|
|
2013-12-13 15:02:44 +00:00
|
|
|
btrfs_debug_check_extent_io_range(tree, start, end);
|
2019-03-01 02:48:00 +00:00
|
|
|
trace_btrfs_convert_extent_bit(tree, start, end - start + 1, bits,
|
|
|
|
clear_bits);
|
2013-04-30 15:22:23 +00:00
|
|
|
|
2011-09-26 17:56:12 +00:00
|
|
|
again:
|
2016-04-26 21:54:39 +00:00
|
|
|
if (!prealloc) {
|
2014-10-13 11:28:39 +00:00
|
|
|
/*
|
|
|
|
* Best effort, don't worry if extent state allocation fails
|
|
|
|
* here for the first iteration. We might have a cached state
|
|
|
|
* that matches exactly the target range, in which case no
|
|
|
|
* extent state allocations are needed. We'll only know this
|
|
|
|
* after locking the tree.
|
|
|
|
*/
|
2016-04-26 21:54:39 +00:00
|
|
|
prealloc = alloc_extent_state(GFP_NOFS);
|
2014-10-13 11:28:39 +00:00
|
|
|
if (!prealloc && !first_iteration)
|
2011-09-26 17:56:12 +00:00
|
|
|
return -ENOMEM;
|
|
|
|
}
|
|
|
|
|
|
|
|
spin_lock(&tree->lock);
|
2012-09-27 21:07:30 +00:00
|
|
|
if (cached_state && *cached_state) {
|
|
|
|
state = *cached_state;
|
|
|
|
if (state->start <= start && state->end > start &&
|
2014-07-06 19:09:59 +00:00
|
|
|
extent_state_in_tree(state)) {
|
2012-09-27 21:07:30 +00:00
|
|
|
node = &state->rb_node;
|
|
|
|
goto hit_next;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-09-26 17:56:12 +00:00
|
|
|
/*
|
|
|
|
* this search will find all the extents that end after
|
|
|
|
* our range starts.
|
|
|
|
*/
|
2013-11-26 15:41:47 +00:00
|
|
|
node = tree_search_for_insert(tree, start, &p, &parent);
|
2011-09-26 17:56:12 +00:00
|
|
|
if (!node) {
|
|
|
|
prealloc = alloc_extent_state_atomic(prealloc);
|
2011-12-08 01:08:40 +00:00
|
|
|
if (!prealloc) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2013-11-26 15:41:47 +00:00
|
|
|
err = insert_state(tree, prealloc, start, end,
|
2015-10-12 06:53:37 +00:00
|
|
|
&p, &parent, &bits, NULL);
|
2011-10-04 03:22:32 +00:00
|
|
|
if (err)
|
|
|
|
extent_io_tree_panic(tree, err);
|
2013-11-26 15:01:34 +00:00
|
|
|
cache_state(prealloc, cached_state);
|
|
|
|
prealloc = NULL;
|
2011-09-26 17:56:12 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
state = rb_entry(node, struct extent_state, rb_node);
|
|
|
|
hit_next:
|
|
|
|
last_start = state->start;
|
|
|
|
last_end = state->end;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* | ---- desired range ---- |
|
|
|
|
* | state |
|
|
|
|
*
|
|
|
|
* Just lock what we found and keep going
|
|
|
|
*/
|
|
|
|
if (state->start == start && state->end <= end) {
|
2015-10-12 06:53:37 +00:00
|
|
|
set_state_bits(tree, state, &bits, NULL);
|
2012-09-27 21:07:30 +00:00
|
|
|
cache_state(state, cached_state);
|
2015-10-12 07:35:38 +00:00
|
|
|
state = clear_state_bit(tree, state, &clear_bits, 0, NULL);
|
2011-09-26 17:56:12 +00:00
|
|
|
if (last_end == (u64)-1)
|
|
|
|
goto out;
|
|
|
|
start = last_end + 1;
|
2012-05-10 10:10:39 +00:00
|
|
|
if (start < end && state && state->start == start &&
|
|
|
|
!need_resched())
|
|
|
|
goto hit_next;
|
2011-09-26 17:56:12 +00:00
|
|
|
goto search_again;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* | ---- desired range ---- |
|
|
|
|
* | state |
|
|
|
|
* or
|
|
|
|
* | ------------- state -------------- |
|
|
|
|
*
|
|
|
|
* We need to split the extent we found, and may flip bits on
|
|
|
|
* the second half.
|
|
|
|
*
|
|
|
|
* If the extent we found extends past our
|
|
|
|
* range, we just split and search again. It'll get split
|
|
|
|
* again the next time through.
|
|
|
|
*
|
|
|
|
* If the extent we found is inside our range, we set the
|
|
|
|
* desired bit on it.
|
|
|
|
*/
|
|
|
|
if (state->start < start) {
|
|
|
|
prealloc = alloc_extent_state_atomic(prealloc);
|
2011-12-08 01:08:40 +00:00
|
|
|
if (!prealloc) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2011-09-26 17:56:12 +00:00
|
|
|
err = split_state(tree, state, prealloc, start);
|
2011-10-04 03:22:32 +00:00
|
|
|
if (err)
|
|
|
|
extent_io_tree_panic(tree, err);
|
2011-09-26 17:56:12 +00:00
|
|
|
prealloc = NULL;
|
|
|
|
if (err)
|
|
|
|
goto out;
|
|
|
|
if (state->end <= end) {
|
2015-10-12 06:53:37 +00:00
|
|
|
set_state_bits(tree, state, &bits, NULL);
|
2012-09-27 21:07:30 +00:00
|
|
|
cache_state(state, cached_state);
|
2015-10-12 07:35:38 +00:00
|
|
|
state = clear_state_bit(tree, state, &clear_bits, 0,
|
|
|
|
NULL);
|
2011-09-26 17:56:12 +00:00
|
|
|
if (last_end == (u64)-1)
|
|
|
|
goto out;
|
|
|
|
start = last_end + 1;
|
2012-05-10 10:10:39 +00:00
|
|
|
if (start < end && state && state->start == start &&
|
|
|
|
!need_resched())
|
|
|
|
goto hit_next;
|
2011-09-26 17:56:12 +00:00
|
|
|
}
|
|
|
|
goto search_again;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* | ---- desired range ---- |
|
|
|
|
* | state | or | state |
|
|
|
|
*
|
|
|
|
* There's a hole, we need to insert something in it and
|
|
|
|
* ignore the extent we found.
|
|
|
|
*/
|
|
|
|
if (state->start > start) {
|
|
|
|
u64 this_end;
|
|
|
|
if (end < last_start)
|
|
|
|
this_end = end;
|
|
|
|
else
|
|
|
|
this_end = last_start - 1;
|
|
|
|
|
|
|
|
prealloc = alloc_extent_state_atomic(prealloc);
|
2011-12-08 01:08:40 +00:00
|
|
|
if (!prealloc) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2011-09-26 17:56:12 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Avoid freeing 'prealloc' if it can be merged with
|
|
|
|
* the later extent.
|
|
|
|
*/
|
|
|
|
err = insert_state(tree, prealloc, start, this_end,
|
2015-10-12 06:53:37 +00:00
|
|
|
NULL, NULL, &bits, NULL);
|
2011-10-04 03:22:32 +00:00
|
|
|
if (err)
|
|
|
|
extent_io_tree_panic(tree, err);
|
2012-09-27 21:07:30 +00:00
|
|
|
cache_state(prealloc, cached_state);
|
2011-09-26 17:56:12 +00:00
|
|
|
prealloc = NULL;
|
|
|
|
start = this_end + 1;
|
|
|
|
goto search_again;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* | ---- desired range ---- |
|
|
|
|
* | state |
|
|
|
|
* We need to split the extent, and set the bit
|
|
|
|
* on the first half
|
|
|
|
*/
|
|
|
|
if (state->start <= end && state->end > end) {
|
|
|
|
prealloc = alloc_extent_state_atomic(prealloc);
|
2011-12-08 01:08:40 +00:00
|
|
|
if (!prealloc) {
|
|
|
|
err = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
2011-09-26 17:56:12 +00:00
|
|
|
|
|
|
|
err = split_state(tree, state, prealloc, end + 1);
|
2011-10-04 03:22:32 +00:00
|
|
|
if (err)
|
|
|
|
extent_io_tree_panic(tree, err);
|
2011-09-26 17:56:12 +00:00
|
|
|
|
2015-10-12 06:53:37 +00:00
|
|
|
set_state_bits(tree, prealloc, &bits, NULL);
|
2012-09-27 21:07:30 +00:00
|
|
|
cache_state(prealloc, cached_state);
|
2015-10-12 07:35:38 +00:00
|
|
|
clear_state_bit(tree, prealloc, &clear_bits, 0, NULL);
|
2011-09-26 17:56:12 +00:00
|
|
|
prealloc = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
search_again:
|
|
|
|
if (start > end)
|
|
|
|
goto out;
|
|
|
|
spin_unlock(&tree->lock);
|
2016-04-26 21:54:39 +00:00
|
|
|
cond_resched();
|
2014-10-13 11:28:39 +00:00
|
|
|
first_iteration = false;
|
2011-09-26 17:56:12 +00:00
|
|
|
goto again;
|
|
|
|
|
|
|
|
out:
|
|
|
|
spin_unlock(&tree->lock);
|
|
|
|
if (prealloc)
|
|
|
|
free_extent_state(prealloc);
|
|
|
|
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/* wrappers around set/clear extent bit */
|
2015-10-12 06:53:37 +00:00
|
|
|
int set_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 bits, struct extent_changeset *changeset)
|
2015-10-12 06:53:37 +00:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We don't support EXTENT_LOCKED yet, as current changeset will
|
|
|
|
* record any bits changed, so for EXTENT_LOCKED case, it will
|
|
|
|
* either fail with -EEXIST or changeset will record the whole
|
|
|
|
* range.
|
|
|
|
*/
|
|
|
|
BUG_ON(bits & EXTENT_LOCKED);
|
|
|
|
|
2020-11-05 09:08:00 +00:00
|
|
|
return set_extent_bit(tree, start, end, bits, 0, NULL, NULL, GFP_NOFS,
|
|
|
|
changeset);
|
2015-10-12 06:53:37 +00:00
|
|
|
}
|
|
|
|
|
2019-03-27 12:24:10 +00:00
|
|
|
int set_extent_bits_nowait(struct extent_io_tree *tree, u64 start, u64 end,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 bits)
|
2019-03-27 12:24:10 +00:00
|
|
|
{
|
2020-11-05 09:08:00 +00:00
|
|
|
return set_extent_bit(tree, start, end, bits, 0, NULL, NULL,
|
|
|
|
GFP_NOWAIT, NULL);
|
2019-03-27 12:24:10 +00:00
|
|
|
}
|
|
|
|
|
2015-10-12 07:35:38 +00:00
|
|
|
int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 bits, int wake, int delete,
|
2017-10-31 15:37:52 +00:00
|
|
|
struct extent_state **cached)
|
2015-10-12 07:35:38 +00:00
|
|
|
{
|
|
|
|
return __clear_extent_bit(tree, start, end, bits, wake, delete,
|
2017-10-31 15:37:52 +00:00
|
|
|
cached, GFP_NOFS, NULL);
|
2015-10-12 07:35:38 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
int clear_record_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 bits, struct extent_changeset *changeset)
|
2015-10-12 07:35:38 +00:00
|
|
|
{
|
|
|
|
/*
|
|
|
|
* Don't support EXTENT_LOCKED case, same reason as
|
|
|
|
* set_record_extent_bits().
|
|
|
|
*/
|
|
|
|
BUG_ON(bits & EXTENT_LOCKED);
|
|
|
|
|
2016-04-26 21:54:39 +00:00
|
|
|
return __clear_extent_bit(tree, start, end, bits, 0, 0, NULL, GFP_NOFS,
|
2015-10-12 07:35:38 +00:00
|
|
|
changeset);
|
|
|
|
}
|
|
|
|
|
2008-09-29 19:18:18 +00:00
|
|
|
/*
|
|
|
|
* Either insert or lock the state struct between start and end; use mask to tell
|
|
|
|
* us if waiting is desired.
|
|
|
|
*/
|
2009-09-02 17:24:36 +00:00
|
|
|
int lock_extent_bits(struct extent_io_tree *tree, u64 start, u64 end,
|
2015-12-03 13:30:40 +00:00
|
|
|
struct extent_state **cached_state)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
u64 failed_start;
|
2015-01-14 18:52:13 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
while (1) {
|
2020-11-05 09:08:00 +00:00
|
|
|
err = set_extent_bit(tree, start, end, EXTENT_LOCKED,
|
|
|
|
EXTENT_LOCKED, &failed_start,
|
|
|
|
cached_state, GFP_NOFS, NULL);
|
2012-03-01 13:57:19 +00:00
|
|
|
if (err == -EEXIST) {
|
2008-01-24 21:13:08 +00:00
|
|
|
wait_extent_bit(tree, failed_start, end, EXTENT_LOCKED);
|
|
|
|
start = failed_start;
|
2012-03-01 13:57:19 +00:00
|
|
|
} else
|
2008-01-24 21:13:08 +00:00
|
|
|
break;
|
|
|
|
WARN_ON(start > end);
|
|
|
|
}
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2012-03-01 13:57:19 +00:00
|
|
|
int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end)
|
Btrfs: nuke fs wide allocation mutex V2
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space rebalancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
2008-10-29 18:49:05 +00:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
u64 failed_start;
|
|
|
|
|
2020-11-05 09:08:00 +00:00
|
|
|
err = set_extent_bit(tree, start, end, EXTENT_LOCKED, EXTENT_LOCKED,
|
|
|
|
&failed_start, NULL, GFP_NOFS, NULL);
|
2008-10-30 18:19:50 +00:00
|
|
|
if (err == -EEXIST) {
|
|
|
|
if (failed_start > start)
|
|
|
|
clear_extent_bit(tree, start, failed_start - 1,
|
2017-10-31 15:37:52 +00:00
|
|
|
EXTENT_LOCKED, 1, 0, NULL);
|
2008-10-29 18:49:05 +00:00
|
|
|
return 0;
|
2008-10-30 18:19:50 +00:00
|
|
|
}
|
2008-10-29 18:49:05 +00:00
|
|
|
return 1;
|
|
|
|
}
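The rollback in try_lock_extent() can be illustrated outside the kernel. The following is a minimal userspace sketch, not the btrfs implementation: `lock_table`, `set_locked`, `clear_locked`, `try_lock_range` and `NBLOCKS` are hypothetical names, and a flat array of flags stands in for the extent io tree. It assumes, as the clear_extent_bit() call above suggests, that a failed attempt may leave the blocks before the conflict locked, so the caller has to undo them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 16 /* hypothetical table size, for illustration only */

/* Hypothetical lock table: one flag per block. */
struct lock_table {
	bool locked[NBLOCKS];
};

/*
 * Lock every block in [start, end]. On a conflict, report the first
 * already-locked block in *failed_start and return -1; blocks locked
 * before the conflict stay locked (assumed to mirror set_extent_bit()).
 */
static int set_locked(struct lock_table *t, size_t start, size_t end,
		      size_t *failed_start)
{
	assert(start <= end && end < NBLOCKS);

	for (size_t i = start; i <= end; i++) {
		if (t->locked[i]) {
			*failed_start = i;
			return -1;
		}
		t->locked[i] = true;
	}
	return 0;
}

/* Unlock every block in [start, end]. */
static void clear_locked(struct lock_table *t, size_t start, size_t end)
{
	for (size_t i = start; i <= end; i++)
		t->locked[i] = false;
}

/*
 * Returns 1 if the whole range was locked, 0 otherwise. On failure the
 * partial progress [start, failed_start - 1] is rolled back.
 */
static int try_lock_range(struct lock_table *t, size_t start, size_t end)
{
	size_t failed_start;

	if (set_locked(t, start, end, &failed_start) < 0) {
		if (failed_start > start)
			clear_locked(t, start, failed_start - 1);
		return 0;
	}
	return 1;
}
```

The `failed_start > start` check matters: when the very first block is already held, nothing was locked by this attempt, and clearing an empty range would underflow the end index.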
|
|
|
|
|
2015-12-03 12:08:59 +00:00
|
|
|
void extent_range_clear_dirty_for_io(struct inode *inode, u64 start, u64 end)
|
2013-03-26 17:07:00 +00:00
|
|
|
{
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
unsigned long index = start >> PAGE_SHIFT;
|
|
|
|
unsigned long end_index = end >> PAGE_SHIFT;
|
2013-03-26 17:07:00 +00:00
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
while (index <= end_index) {
|
|
|
|
page = find_get_page(inode->i_mapping, index);
|
|
|
|
BUG_ON(!page); /* Pages should be in the extent_io_tree */
|
|
|
|
clear_page_dirty_for_io(page);
|
2016-04-01 12:29:47 +00:00
|
|
|
put_page(page);
|
2013-03-26 17:07:00 +00:00
|
|
|
index++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-12-03 12:08:59 +00:00
|
|
|
void extent_range_redirty_for_io(struct inode *inode, u64 start, u64 end)
|
2013-03-26 17:07:00 +00:00
|
|
|
{
|
2016-04-01 12:29:47 +00:00
|
|
|
unsigned long index = start >> PAGE_SHIFT;
|
|
|
|
unsigned long end_index = end >> PAGE_SHIFT;
|
2013-03-26 17:07:00 +00:00
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
while (index <= end_index) {
|
|
|
|
page = find_get_page(inode->i_mapping, index);
|
|
|
|
BUG_ON(!page); /* Pages should be in the extent_io_tree */
|
|
|
|
__set_page_dirty_nobuffers(page);
|
2015-02-11 23:26:55 +00:00
|
|
|
account_page_redirty(page);
|
2016-04-01 12:29:47 +00:00
|
|
|
put_page(page);
|
2013-03-26 17:07:00 +00:00
|
|
|
index++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-09-29 19:18:18 +00:00
|
|
|
/* find the first state struct with 'bits' set after 'start', and
|
|
|
|
* return it. tree->lock must be held. NULL will be returned if
|
|
|
|
* nothing was found after 'start'
|
|
|
|
*/
|
2013-04-25 20:41:01 +00:00
|
|
|
static struct extent_state *
|
2020-11-13 12:51:40 +00:00
|
|
|
find_first_extent_bit_state(struct extent_io_tree *tree, u64 start, u32 bits)
|
2008-02-18 17:12:38 +00:00
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
struct extent_state *state;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* this search will find all the extents that end after
|
|
|
|
* our range starts.
|
|
|
|
*/
|
|
|
|
node = tree_search(tree, start);
|
2009-01-06 02:25:51 +00:00
|
|
|
if (!node)
|
2008-02-18 17:12:38 +00:00
|
|
|
goto out;
|
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (1) {
|
2008-02-18 17:12:38 +00:00
|
|
|
state = rb_entry(node, struct extent_state, rb_node);
|
2009-01-06 02:25:51 +00:00
|
|
|
if (state->end >= start && (state->state & bits))
|
2008-02-18 17:12:38 +00:00
|
|
|
return state;
|
2009-01-06 02:25:51 +00:00
|
|
|
|
2008-02-18 17:12:38 +00:00
|
|
|
node = rb_next(node);
|
|
|
|
if (!node)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2011-07-14 03:19:45 +00:00
|
|
|
/*
|
2020-10-21 06:24:50 +00:00
|
|
|
* Find the first offset in the io tree with one or more @bits set.
|
2011-07-14 03:19:45 +00:00
|
|
|
*
|
2020-10-21 06:24:50 +00:00
|
|
|
* Note: If there are multiple bits set in @bits, any of them will match.
|
|
|
|
*
|
|
|
|
* Return 0 if we find something, and update @start_ret and @end_ret.
|
|
|
|
* Return 1 if we found nothing.
|
2011-07-14 03:19:45 +00:00
|
|
|
*/
|
|
|
|
int find_first_extent_bit(struct extent_io_tree *tree, u64 start,
|
2020-11-13 12:51:40 +00:00
|
|
|
u64 *start_ret, u64 *end_ret, u32 bits,
|
2012-09-27 21:07:30 +00:00
|
|
|
struct extent_state **cached_state)
|
2011-07-14 03:19:45 +00:00
|
|
|
{
|
|
|
|
struct extent_state *state;
|
|
|
|
int ret = 1;
|
|
|
|
|
|
|
|
spin_lock(&tree->lock);
|
2012-09-27 21:07:30 +00:00
|
|
|
if (cached_state && *cached_state) {
|
|
|
|
state = *cached_state;
|
2014-07-06 19:09:59 +00:00
|
|
|
if (state->end == start - 1 && extent_state_in_tree(state)) {
|
2018-08-22 19:14:53 +00:00
|
|
|
while ((state = next_state(state)) != NULL) {
|
2012-09-27 21:07:30 +00:00
|
|
|
if (state->state & bits)
|
|
|
|
goto got_it;
|
|
|
|
}
|
|
|
|
free_extent_state(*cached_state);
|
|
|
|
*cached_state = NULL;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
free_extent_state(*cached_state);
|
|
|
|
*cached_state = NULL;
|
|
|
|
}
|
|
|
|
|
2011-07-14 03:19:45 +00:00
|
|
|
state = find_first_extent_bit_state(tree, start, bits);
|
2012-09-27 21:07:30 +00:00
|
|
|
got_it:
|
2011-07-14 03:19:45 +00:00
|
|
|
if (state) {
|
2014-10-13 11:28:38 +00:00
|
|
|
cache_state_if_flags(state, cached_state, 0);
|
2011-07-14 03:19:45 +00:00
|
|
|
*start_ret = state->start;
|
|
|
|
*end_ret = state->end;
|
|
|
|
ret = 0;
|
|
|
|
}
|
2012-09-27 21:07:30 +00:00
|
|
|
out:
|
2011-07-14 03:19:45 +00:00
|
|
|
spin_unlock(&tree->lock);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-01-17 14:02:21 +00:00
|
|
|
/**
|
|
|
|
* find_contiguous_extent_bit: find a contiguous area of bits
|
|
|
|
* @tree - io tree to check
|
|
|
|
* @start - offset to start the search from
|
|
|
|
* @start_ret - the first offset we found with the bits set
|
|
|
|
* @end_ret - the final contiguous range of the bits that were set
|
|
|
|
* @bits - bits to look for
|
|
|
|
*
|
|
|
|
* set_extent_bit and clear_extent_bit can temporarily split contiguous ranges
|
|
|
|
* to set bits appropriately, and then merge them again. During this time it
|
|
|
|
* will drop the tree->lock, so use this helper if you want to find the actual
|
|
|
|
* contiguous area for given bits. We will search to the first bit we find, and
|
|
|
|
* then walk down the tree until we find a non-contiguous area. The area
|
|
|
|
* returned will be the full contiguous area with the bits set.
|
|
|
|
*/
|
|
|
|
int find_contiguous_extent_bit(struct extent_io_tree *tree, u64 start,
|
2020-11-13 12:51:40 +00:00
|
|
|
u64 *start_ret, u64 *end_ret, u32 bits)
|
2020-01-17 14:02:21 +00:00
|
|
|
{
|
|
|
|
struct extent_state *state;
|
|
|
|
int ret = 1;
|
|
|
|
|
|
|
|
spin_lock(&tree->lock);
|
|
|
|
state = find_first_extent_bit_state(tree, start, bits);
|
|
|
|
if (state) {
|
|
|
|
*start_ret = state->start;
|
|
|
|
*end_ret = state->end;
|
|
|
|
while ((state = next_state(state)) != NULL) {
|
|
|
|
if (state->start > (*end_ret + 1))
|
|
|
|
break;
|
|
|
|
*end_ret = state->end;
|
|
|
|
}
|
|
|
|
ret = 0;
|
|
|
|
}
|
|
|
|
spin_unlock(&tree->lock);
|
|
|
|
return ret;
|
|
|
|
}
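The contiguous walk in find_contiguous_extent_bit() can be sketched in userspace. This is an illustration under stated assumptions, not the kernel code: `range_state` and `find_contiguous_range` are hypothetical names, and a sorted, non-overlapping array stands in for an in-order walk of the rb-tree. As in the listing above, only the first state is checked against `bits`; later states extend the range purely by adjacency.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extent_state. */
struct range_state {
	uint64_t start;
	uint64_t end;	/* inclusive */
	uint32_t bits;
};

/*
 * 'states' must be sorted by start and non-overlapping. Returns 0 and
 * fills *start_ret / *end_ret if a matching range is found, 1 otherwise.
 */
static int find_contiguous_range(const struct range_state *states, size_t n,
				 uint64_t start, uint64_t *start_ret,
				 uint64_t *end_ret, uint32_t bits)
{
	size_t i;

	assert(start_ret && end_ret);

	/* First state ending at or after 'start' with any of 'bits' set. */
	for (i = 0; i < n; i++) {
		if (states[i].end >= start && (states[i].bits & bits))
			break;
	}
	if (i == n)
		return 1;

	*start_ret = states[i].start;
	*end_ret = states[i].end;

	/* Extend while the next state begins right after the current end. */
	for (i++; i < n; i++) {
		if (states[i].start > *end_ret + 1)
			break;
		*end_ret = states[i].end;
	}
	return 0;
}
```

With states covering [0, 4095], [4096, 8191] and [16384, 20479], a search from offset 0 reports [0, 8191]: the first two states are adjacent, and the gap before 16384 ends the walk.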

/**
 * find_first_clear_extent_bit - find the first range that has @bits not set.
 * This range could start before @start.
 *
 * @tree - the tree to search
 * @start - the offset at/after which the found extent should start
 * @start_ret - records the beginning of the range
 * @end_ret - records the end of the range (inclusive)
 * @bits - the set of bits which must be unset
 *
 * Since unallocated range is also considered one which doesn't have the bits
 * set it's possible that @end_ret contains -1, this happens in case the range
 * spans (last_range_end, end of device]. In this case it's up to the caller to
 * trim @end_ret to the appropriate size.
 */
void find_first_clear_extent_bit(struct extent_io_tree *tree, u64 start,
				 u64 *start_ret, u64 *end_ret, u32 bits)
{
	struct extent_state *state;
	struct rb_node *node, *prev = NULL, *next;

	spin_lock(&tree->lock);

	/* Find first extent with bits cleared */
	while (1) {
		node = __etree_search(tree, start, &next, &prev, NULL, NULL);
		if (!node && !next && !prev) {
			/*
			 * Tree is completely empty, send full range and let
			 * caller deal with it
			 */
			*start_ret = 0;
			*end_ret = -1;
			goto out;
		} else if (!node && !next) {
			/*
			 * We are past the last allocated chunk, set start at
			 * the end of the last extent.
			 */
			state = rb_entry(prev, struct extent_state, rb_node);
			*start_ret = state->end + 1;
			*end_ret = -1;
			goto out;
		} else if (!node) {
			node = next;
		}
		/*
		 * At this point 'node' either contains 'start' or start is
		 * before 'node'
		 */
		state = rb_entry(node, struct extent_state, rb_node);

		if (in_range(start, state->start, state->end - state->start + 1)) {
			if (state->state & bits) {
				/*
				 * |--range with bits set--|
				 *    |
				 *    start
				 */
				start = state->end + 1;
			} else {
				/*
				 * 'start' falls within a range that doesn't
				 * have the bits set, so take its start as
				 * the beginning of the desired range
				 *
				 * |--range with bits cleared----|
				 *      |
				 *      start
				 */
				*start_ret = state->start;
				break;
			}
		} else {
			/*
			 * |---prev range---|---hole/unset---|---node range---|
			 *                          |
			 *                        start
			 *
			 *                        or
			 *
			 * |---hole/unset--||--first node--|
			 * 0   |
			 *    start
			 */
			if (prev) {
				state = rb_entry(prev, struct extent_state,
						 rb_node);
				*start_ret = state->end + 1;
			} else {
				*start_ret = 0;
			}
			break;
		}
	}

	/*
	 * Find the longest stretch from start until an entry which has the
	 * bits set
	 */
	while (1) {
		state = rb_entry(node, struct extent_state, rb_node);
		if (state->end >= start && !(state->state & bits)) {
			*end_ret = state->end;
		} else {
			*end_ret = state->start - 1;
			break;
		}

		node = rb_next(node);
		if (!node)
			break;
	}
out:
	spin_unlock(&tree->lock);
}
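
The idea behind find_first_clear_extent_bit (skip past ranges that have the bits set, then stop at the next such range) can be modelled on a plain array. This is a simplified sketch, not the kernel algorithm: `struct sketch_state` and `sketch_first_clear` are hypothetical names, the input is assumed to be sorted and non-overlapping, and holes between entries count as clear. Unlike the kernel function, which starts the result at the containing state or hole, this version returns the maximal clear range, beginning right after the last range with the bits set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent_state: inclusive range plus flag bits. */
struct sketch_state {
	uint64_t start;
	uint64_t end;
	unsigned int bits;
};

static void sketch_first_clear(const struct sketch_state *s, size_t n,
			       uint64_t start, unsigned int bits,
			       uint64_t *start_ret, uint64_t *end_ret)
{
	uint64_t lo = 0;	/* first byte after the last set range seen */
	size_t i;

	for (i = 0; i < n; i++) {
		if (!(s[i].bits & bits))
			continue;	/* clear entries never end the range */

		/* Assumption: no set range ends at the very top of the space. */
		assert(s[i].end != UINT64_MAX);

		if (s[i].start <= start) {
			/* Set range lies before or covers 'start': skip past it. */
			lo = s[i].end + 1;
			if (s[i].end >= start)
				start = lo;
			continue;
		}

		/* The first set range beyond 'start' closes the clear range. */
		*start_ret = lo;
		*end_ret = s[i].start - 1;
		return;
	}

	/* No set range after 'start': open-ended, like *end_ret = -1 above. */
	*start_ret = lo;
	*end_ret = UINT64_MAX;
}
```

As in the kernel helper, an open-ended result is reported with the maximum u64 value, and the caller is expected to trim it.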

/*
 * Find a contiguous range of bytes in the file marked as delalloc, not
 * more than 'max_bytes'.  start and end are used to return the range.
 *
 * true is returned if we find something, false if nothing was in the tree
 */
bool btrfs_find_delalloc_range(struct extent_io_tree *tree, u64 *start,
			       u64 *end, u64 max_bytes,
			       struct extent_state **cached_state)
{
	struct rb_node *node;
	struct extent_state *state;
	u64 cur_start = *start;
	bool found = false;
	u64 total_bytes = 0;

	spin_lock(&tree->lock);
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00

	/*
	 * this search will find all the extents that end after
	 * our range starts.
	 */
	node = tree_search(tree, cur_start);
	if (!node) {
		*end = (u64)-1;
		goto out;
	}

	while (1) {
		state = rb_entry(node, struct extent_state, rb_node);
		if (found && (state->start != cur_start ||
			      (state->state & EXTENT_BOUNDARY))) {
			goto out;
		}
		if (!(state->state & EXTENT_DELALLOC)) {
			if (!found)
				*end = state->end;
			goto out;
		}
		if (!found) {
			*start = state->start;
			*cached_state = state;
			refcount_inc(&state->refs);
		}
		found = true;
		*end = state->end;
		cur_start = state->end + 1;
		node = rb_next(node);
		total_bytes += state->end - state->start + 1;
		if (total_bytes >= max_bytes)
			break;
		if (!node)
			break;
	}
out:
	spin_unlock(&tree->lock);
	return found;
}
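
The accumulation loop of btrfs_find_delalloc_range is easier to follow on a small model. The sketch below is an illustration under assumptions, not kernel code: `struct sketch_extent`, `SKETCH_DELALLOC`, `SKETCH_BOUNDARY` and `sketch_find_delalloc` are hypothetical names, a start-sorted array replaces the rbtree and tree_search(), and the cached state and its reference count are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_DELALLOC 0x1u	/* stand-in for EXTENT_DELALLOC */
#define SKETCH_BOUNDARY 0x2u	/* stand-in for EXTENT_BOUNDARY */

/* Hypothetical stand-in for an extent_state. */
struct sketch_extent {
	uint64_t start;
	uint64_t end;		/* inclusive */
	unsigned int flags;
};

static bool sketch_find_delalloc(const struct sketch_extent *s, size_t n,
				 uint64_t *start, uint64_t *end,
				 uint64_t max_bytes)
{
	uint64_t cur_start = *start;
	uint64_t total_bytes = 0;
	bool found = false;
	size_t i = 0;

	/* Like tree_search(): first entry that ends at or after cur_start. */
	while (i < n && s[i].end < cur_start)
		i++;
	if (i == n) {
		*end = UINT64_MAX;
		return false;
	}

	for (; i < n; i++) {
		/* Stop at a gap or at an entry that must begin a new range. */
		if (found && (s[i].start != cur_start ||
			      (s[i].flags & SKETCH_BOUNDARY)))
			break;
		if (!(s[i].flags & SKETCH_DELALLOC)) {
			if (!found)
				*end = s[i].end;
			break;
		}
		if (!found)
			*start = s[i].start;
		found = true;
		*end = s[i].end;
		cur_start = s[i].end + 1;
		total_bytes += s[i].end - s[i].start + 1;
		if (total_bytes >= max_bytes)
			break;
	}
	return found;
}
```

Note that the size limit is checked only after an entry has been added, so the returned range can exceed max_bytes by up to one entry; the kernel loop above has the same property.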

static int __process_pages_contig(struct address_space *mapping,
				  struct page *locked_page,
				  pgoff_t start_index, pgoff_t end_index,
				  unsigned long page_ops, pgoff_t *index_ret);

static noinline void __unlock_for_delalloc(struct inode *inode,
					   struct page *locked_page,
					   u64 start, u64 end)
{
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
	unsigned long index = start >> PAGE_SHIFT;
	unsigned long end_index = end >> PAGE_SHIFT;

	ASSERT(locked_page);
	if (index == locked_page->index && end_index == index)
		return;

	__process_pages_contig(inode->i_mapping, locked_page, index, end_index,
			       PAGE_UNLOCK, NULL);
}
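
Both __unlock_for_delalloc and lock_delalloc_pages convert byte offsets to page-cache indexes with a right shift and return early when the range covers only the page the caller already holds locked. A minimal sketch of that arithmetic follows. It assumes 4 KiB pages, whereas the real PAGE_SHIFT depends on the architecture and kernel configuration; `SKETCH_PAGE_SHIFT`, `sketch_page_index` and `sketch_only_locked_page` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The real PAGE_SHIFT is architecture dependent. */
#define SKETCH_PAGE_SHIFT 12

/* Page-cache index of the page containing byte offset 'off'. */
static unsigned long sketch_page_index(uint64_t off)
{
	return (unsigned long)(off >> SKETCH_PAGE_SHIFT);
}

/*
 * True when the inclusive byte range [start, end] lies entirely inside the
 * page at 'locked_index', so there is no other page to lock or unlock.
 */
static bool sketch_only_locked_page(uint64_t start, uint64_t end,
				    unsigned long locked_index)
{
	assert(start <= end);
	return sketch_page_index(start) == locked_index &&
	       sketch_page_index(end) == locked_index;
}
```

Because the end offset is inclusive, a range ending at byte 8191 stays within page 1, while one ending at byte 8192 already touches page 2.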

static noinline int lock_delalloc_pages(struct inode *inode,
					struct page *locked_page,
					u64 delalloc_start,
					u64 delalloc_end)
{
	unsigned long index = delalloc_start >> PAGE_SHIFT;
	unsigned long index_ret = index;
	unsigned long end_index = delalloc_end >> PAGE_SHIFT;
	int ret;

	ASSERT(locked_page);
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software-only limit; the disk format supports u64 sized compressed extents.
In order to limit the RAM consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software-only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the CPUs on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
if (index == locked_page->index && index == end_index)
|
|
|
|
return 0;
|
|
|
|
|
2017-02-10 15:42:14 +00:00
|
|
|
ret = __process_pages_contig(inode->i_mapping, locked_page, index,
|
|
|
|
end_index, PAGE_LOCK, &index_ret);
|
|
|
|
if (ret == -EAGAIN)
|
|
|
|
__unlock_for_delalloc(inode, locked_page, delalloc_start,
|
|
|
|
(u64)index_ret << PAGE_SHIFT);
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2018-11-29 03:33:38 +00:00
|
|
|
* Find and lock a contiguous range of bytes in the file marked as delalloc, no
|
|
|
|
* more than @max_bytes. @start and @end are used to return the range,
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
*
|
2018-11-29 03:33:38 +00:00
|
|
|
* Return: true if we find something
|
|
|
|
* false if nothing was in the tree
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
*/
|
2018-11-19 09:38:17 +00:00
|
|
|
EXPORT_FOR_TESTS
|
2018-11-29 03:33:38 +00:00
|
|
|
noinline_for_stack bool find_lock_delalloc_range(struct inode *inode,
|
2013-10-09 16:00:56 +00:00
|
|
|
struct page *locked_page, u64 *start,
|
2018-10-26 11:43:20 +00:00
|
|
|
u64 *end)
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
{
|
2019-06-21 15:02:54 +00:00
|
|
|
struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
|
2018-10-26 11:43:20 +00:00
|
|
|
u64 max_bytes = BTRFS_MAX_EXTENT_SIZE;
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
u64 delalloc_start;
|
|
|
|
u64 delalloc_end;
|
2018-11-29 03:33:38 +00:00
|
|
|
bool found;
|
2009-09-02 19:22:30 +00:00
|
|
|
struct extent_state *cached_state = NULL;
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
int ret;
|
|
|
|
int loops = 0;
|
|
|
|
|
|
|
|
again:
|
|
|
|
/* step one, find a bunch of delalloc bytes starting at start */
|
|
|
|
delalloc_start = *start;
|
|
|
|
delalloc_end = 0;
|
2019-09-23 14:05:20 +00:00
|
|
|
found = btrfs_find_delalloc_range(tree, &delalloc_start, &delalloc_end,
|
|
|
|
max_bytes, &cached_state);
|
2008-10-31 16:46:39 +00:00
|
|
|
if (!found || delalloc_end <= *start) {
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
*start = delalloc_start;
|
|
|
|
*end = delalloc_end;
|
2010-02-02 21:19:11 +00:00
|
|
|
free_extent_state(cached_state);
|
2018-11-29 03:33:38 +00:00
|
|
|
return false;
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
}
|
|
|
|
|
2008-10-31 16:46:39 +00:00
|
|
|
/*
|
|
|
|
* start comes from the offset of locked_page. We have to lock
|
|
|
|
* pages in order, so we can't process delalloc bytes before
|
|
|
|
* locked_page
|
|
|
|
*/
|
2009-01-06 02:25:51 +00:00
|
|
|
if (delalloc_start < *start)
|
2008-10-31 16:46:39 +00:00
|
|
|
delalloc_start = *start;
|
|
|
|
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
/*
|
|
|
|
* make sure to limit the number of pages we try to lock down
|
|
|
|
*/
|
2013-10-08 02:11:09 +00:00
|
|
|
if (delalloc_end + 1 - delalloc_start > max_bytes)
|
|
|
|
delalloc_end = delalloc_start + max_bytes - 1;
|
2009-01-06 02:25:51 +00:00
|
|
|
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
|
|
|
/* step two, lock all the pages after the page that has start */
|
|
|
|
ret = lock_delalloc_pages(inode, locked_page,
|
|
|
|
delalloc_start, delalloc_end);
|
2018-10-26 11:43:21 +00:00
|
|
|
ASSERT(!ret || ret == -EAGAIN);
|
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
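The two size limits and the "give up if it does not shrink" rule from the commit message above can be sketched in plain C. This is a minimal userspace sketch, not the btrfs implementation: the macro and function names are hypothetical, and only the 128k and 256k figures come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; only the numeric limits come from the commit message. */
#define MAX_COMPRESSED_BYTES	(128 * 1024)	/* cap on the on-disk size */
#define MAX_UNCOMPRESSED_BYTES	(256 * 1024)	/* cap on the in-memory size */

/* Clamp how much input is handed to the compressor for one extent. */
static size_t clamp_uncompressed_len(size_t len)
{
	return len > MAX_UNCOMPRESSED_BYTES ? MAX_UNCOMPRESSED_BYTES : len;
}

/*
 * Keep the compressed result only if it fits the on-disk cap and is
 * strictly smaller than the input. When this returns false, a caller
 * would write the data uncompressed and flag the file so that later
 * writes skip the compression attempt.
 */
static bool keep_compressed(size_t uncompressed_len, size_t compressed_len)
{
	/* The caller is expected to have clamped the input first. */
	assert(uncompressed_len <= MAX_UNCOMPRESSED_BYTES);

	return compressed_len <= MAX_COMPRESSED_BYTES &&
	       compressed_len < uncompressed_len;
}
```

Both caps are described as software-only limits, so in this sketch they are compile-time constants rather than values read from the disk format.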
	if (ret == -EAGAIN) {
		/* some of the pages are gone, lets avoid looping by
		 * shortening the size of the delalloc range we're searching
		 */
2009-09-02 19:22:30 +00:00
		free_extent_state(cached_state);
2014-05-21 12:49:54 +00:00
		cached_state = NULL;
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
		if (!loops) {
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
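The shift rules in the commit message above only change the spelling, because the two shift constants were equal. A minimal userspace sketch of that equivalence follows. The DEMO_ prefixed macros are stand-ins invented for this example, and the 4 KiB page size is an assumption, not something the commit states.

```c
#include <assert.h>

/* Userspace stand-ins with a DEMO_ prefix; 4 KiB pages are an assumption. */
#define DEMO_PAGE_SHIFT		12
#define DEMO_PAGE_SIZE		(1UL << DEMO_PAGE_SHIFT)

/* The old aliases, defined here as plain synonyms of the PAGE_* values. */
#define DEMO_PAGE_CACHE_SHIFT	DEMO_PAGE_SHIFT
#define DEMO_PAGE_CACHE_SIZE	DEMO_PAGE_SIZE

/* Old spelling: shift by the difference of the two shift constants. */
static unsigned long old_index(unsigned long idx)
{
	return idx << (DEMO_PAGE_CACHE_SHIFT - DEMO_PAGE_SHIFT);
}

/* New spelling after the semantic patch: the shift by zero is dropped. */
static unsigned long new_index(unsigned long idx)
{
	return idx;
}

/* Check that both spellings agree for one value. */
static void check_same(unsigned long idx)
{
	assert(old_index(idx) == new_index(idx));
}
```

Because the difference of the shifts is zero, the first two rules of the semantic patch are pure no-op removals under this assumption.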
			max_bytes = PAGE_SIZE;
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
			loops = 1;
			goto again;
		} else {
2018-11-29 03:33:38 +00:00
			found = false;
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
			goto out_failed;
		}
	}

	/* step three, lock the state bits for the whole range */
2015-12-03 13:30:40 +00:00
	lock_extent_bits(tree, delalloc_start, delalloc_end, &cached_state);
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00

	/* then test to make sure it is all still delalloc */
	ret = test_range_bit(tree, delalloc_start, delalloc_end,
2009-09-02 19:22:30 +00:00
			     EXTENT_DELALLOC, 1, cached_state);
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
	if (!ret) {
2009-09-02 19:22:30 +00:00
		unlock_extent_cached(tree, delalloc_start, delalloc_end,
2017-12-12 20:43:52 +00:00
				     &cached_state);
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
		__unlock_for_delalloc(inode, locked_page,
			      delalloc_start, delalloc_end);
		cond_resched();
		goto again;
	}
2009-09-02 19:22:30 +00:00
	free_extent_state(cached_state);
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
	*start = delalloc_start;
	*end = delalloc_end;
out_failed:
	return found;
}

2017-02-10 15:41:05 +00:00
static int __process_pages_contig(struct address_space *mapping,
				  struct page *locked_page,
				  pgoff_t start_index, pgoff_t end_index,
				  unsigned long page_ops, pgoff_t *index_ret)
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
{
2017-02-03 01:49:22 +00:00
	unsigned long nr_pages = end_index - start_index + 1;
2020-10-21 06:24:57 +00:00
	unsigned long pages_processed = 0;
2017-02-03 01:49:22 +00:00
	pgoff_t index = start_index;
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
	struct page *pages[16];
2017-02-03 01:49:22 +00:00
	unsigned ret;
2017-02-10 15:41:05 +00:00
	int err = 0;
Btrfs: Add zlib compression support
2008-10-29 18:49:59 +00:00
	int i;
2008-11-07 03:02:51 +00:00

2017-02-10 15:41:05 +00:00
	if (page_ops & PAGE_LOCK) {
		ASSERT(page_ops == PAGE_LOCK);
		ASSERT(index_ret && *index_ret == start_index);
	}

2014-10-06 21:14:22 +00:00
	if ((page_ops & PAGE_SET_ERROR) && nr_pages > 0)
2017-02-03 01:49:22 +00:00
		mapping_set_error(mapping, -EIO);
2014-10-06 21:14:22 +00:00

2009-01-06 02:25:51 +00:00
	while (nr_pages > 0) {
2017-02-03 01:49:22 +00:00
		ret = find_get_pages_contig(mapping, index,
2008-11-11 14:34:41 +00:00
				     min_t(unsigned long,
				     nr_pages, ARRAY_SIZE(pages)), pages);
2017-02-10 15:41:05 +00:00
		if (ret == 0) {
			/*
			 * Only if we're going to lock these pages,
			 * can we find nothing at @index.
			 */
			ASSERT(page_ops & PAGE_LOCK);
2017-03-07 02:20:56 +00:00
			err = -EAGAIN;
			goto out;
2017-02-10 15:41:05 +00:00
		}
2009-09-02 20:53:46 +00:00

2017-02-10 15:41:05 +00:00
		for (i = 0; i < ret; i++) {
2013-07-29 15:20:47 +00:00
			if (page_ops & PAGE_SET_PRIVATE2)
2009-09-02 20:53:46 +00:00
				SetPagePrivate2(pages[i]);

Btrfs: only associate the locked page with one async_chunk struct
The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing. When compression is on, we allocate one async_chunk
structure for every 512K, and then run those pages through the
compression code for IO submission.
writepages starts all of this off with a single page, locked by the
original call to extent_write_cache_pages(), and it's important to keep
track of this page because it has already been through
clear_page_dirty_for_io().
The btrfs async_chunk struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_chunk
struct really owns that page.
But this is racy. If a given delalloc range is broken up into two
async_chunks (chunkA and chunkB), we can end up with something like
this:
compress_file_range(chunkA)
submit_compress_extents(chunkA)
submit compressed bios(chunkA)
put_page(locked_page)
compress_file_range(chunkB)
...
Or:
async_cow_submit
submit_compressed_extents <--- falls back to buffered writeout
cow_file_range
extent_clear_unlock_delalloc
__process_pages_contig
put_page(locked_page)
async_cow_submit
The end result is that chunkA is completed and cleaned up before chunkB
even starts processing. This means we can free locked_page and reuse
it elsewhere. If we get really lucky, it'll have the same page->index
in its new home as it did before.
While we're processing chunkB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobuffers() on chunkB->locked_page.
Without cgroups in use, this creates a phantom dirty page, which
isn't great but isn't the end of the world. What can happen is that the
page goes through the fixup worker and the whole COW machinery again:
in submit_compressed_extents():
while (async extents) {
...
cow_file_range
if (!page_started ...)
extent_write_locked_range
else if (...)
unlock_page
continue;
This hasn't been observed in practice but is still possible.
With cgroups in use, we might crash in the accounting code because
page->mapping->i_wb isn't set.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
IP: percpu_counter_add_batch+0x11/0x70
PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
CPU: 16 PID: 2172 Comm: rm Not tainted
RIP: 0010:percpu_counter_add_batch+0x11/0x70
RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
account_page_cleaned+0x15b/0x1f0
__cancel_dirty_page+0x146/0x200
truncate_cleanup_page+0x92/0xb0
truncate_inode_pages_range+0x202/0x7d0
btrfs_evict_inode+0x92/0x5a0
evict+0xc1/0x190
do_unlinkat+0x176/0x280
do_syscall_64+0x63/0x1a0
entry_SYSCALL_64_after_hwframe+0x42/0xb7
The fix here is to make async_chunk->locked_page NULL everywhere but the
one async_chunk struct that's allowed to do things to the locked page.
Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Chris Mason <clm@fb.com>
[ update changelog from mail thread discussion ]
Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-10 19:28:16 +00:00
|
|
|
if (locked_page && pages[i] == locked_page) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
put_page(pages[i]);
|
2020-10-21 06:24:57 +00:00
|
|
|
pages_processed++;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software-only limit; the disk format supports u64-sized compressed extents.
In order to limit the RAM consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software-only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single-threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
continue;
|
|
|
|
}
|
2013-07-29 15:20:47 +00:00
|
|
|
if (page_ops & PAGE_CLEAR_DIRTY)
|
2008-10-29 18:49:59 +00:00
|
|
|
clear_page_dirty_for_io(pages[i]);
|
2013-07-29 15:20:47 +00:00
|
|
|
if (page_ops & PAGE_SET_WRITEBACK)
|
2008-10-29 18:49:59 +00:00
|
|
|
set_page_writeback(pages[i]);
|
2014-10-06 21:14:22 +00:00
|
|
|
if (page_ops & PAGE_SET_ERROR)
|
|
|
|
SetPageError(pages[i]);
|
2013-07-29 15:20:47 +00:00
|
|
|
if (page_ops & PAGE_END_WRITEBACK)
|
2008-10-29 18:49:59 +00:00
|
|
|
end_page_writeback(pages[i]);
|
2013-07-29 15:20:47 +00:00
|
|
|
if (page_ops & PAGE_UNLOCK)
|
2008-11-07 03:02:51 +00:00
|
|
|
unlock_page(pages[i]);
|
2017-02-10 15:41:05 +00:00
|
|
|
if (page_ops & PAGE_LOCK) {
|
|
|
|
lock_page(pages[i]);
|
|
|
|
if (!PageDirty(pages[i]) ||
|
|
|
|
pages[i]->mapping != mapping) {
|
|
|
|
unlock_page(pages[i]);
|
2020-07-20 01:42:09 +00:00
|
|
|
for (; i < ret; i++)
|
|
|
|
put_page(pages[i]);
|
2017-02-10 15:41:05 +00:00
|
|
|
err = -EAGAIN;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
}
|
2016-04-01 12:29:47 +00:00
|
|
|
put_page(pages[i]);
|
2020-10-21 06:24:57 +00:00
|
|
|
pages_processed++;
|
2008-10-29 18:49:59 +00:00
|
|
|
}
|
|
|
|
nr_pages -= ret;
|
|
|
|
index += ret;
|
|
|
|
cond_resched();
|
|
|
|
}
|
2017-02-10 15:41:05 +00:00
|
|
|
out:
|
|
|
|
if (err && index_ret)
|
2020-10-21 06:24:57 +00:00
|
|
|
*index_ret = start_index + pages_processed - 1;
|
2017-02-10 15:41:05 +00:00
|
|
|
return err;
|
2008-10-29 18:49:59 +00:00
|
|
|
}
|
|
|
|
|
2020-06-03 05:55:06 +00:00
|
|
|
void extent_clear_unlock_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
|
2019-07-17 13:18:16 +00:00
|
|
|
struct page *locked_page,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 clear_bits, unsigned long page_ops)
|
2017-02-03 01:49:22 +00:00
|
|
|
{
|
2020-06-03 05:55:06 +00:00
|
|
|
clear_extent_bit(&inode->io_tree, start, end, clear_bits, 1, 0, NULL);
|
2017-02-03 01:49:22 +00:00
|
|
|
|
2020-06-03 05:55:06 +00:00
|
|
|
__process_pages_contig(inode->vfs_inode.i_mapping, locked_page,
|
2017-02-03 01:49:22 +00:00
|
|
|
start >> PAGE_SHIFT, end >> PAGE_SHIFT,
|
2017-02-10 15:41:05 +00:00
|
|
|
page_ops, NULL);
|
2017-02-03 01:49:22 +00:00
|
|
|
}
|
|
|
|
|
2008-09-29 19:18:18 +00:00
|
|
|
/*
|
|
|
|
* count the number of bytes in the tree that have a given bit(s)
|
|
|
|
* set. This can be fairly slow, except for EXTENT_DIRTY which is
|
|
|
|
* cached. The total number found is returned.
|
|
|
|
*/
|
2008-01-24 21:13:08 +00:00
|
|
|
u64 count_range_bits(struct extent_io_tree *tree,
|
|
|
|
u64 *start, u64 search_end, u64 max_bytes,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 bits, int contig)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
struct extent_state *state;
|
|
|
|
u64 cur_start = *start;
|
|
|
|
u64 total_bytes = 0;
|
2011-02-23 21:23:20 +00:00
|
|
|
u64 last = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
int found = 0;
|
|
|
|
|
2013-10-31 05:00:08 +00:00
|
|
|
if (WARN_ON(search_end <= cur_start))
|
2008-01-24 21:13:08 +00:00
|
|
|
return 0;
|
|
|
|
|
2008-12-17 19:51:42 +00:00
|
|
|
spin_lock(&tree->lock);
|
2008-01-24 21:13:08 +00:00
|
|
|
if (cur_start == 0 && bits == EXTENT_DIRTY) {
|
|
|
|
total_bytes = tree->dirty_bytes;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
/*
|
|
|
|
* this search will find all the extents that end after
|
|
|
|
* our range starts.
|
|
|
|
*/
|
2008-02-01 19:51:59 +00:00
|
|
|
node = tree_search(tree, cur_start);
|
2009-01-06 02:25:51 +00:00
|
|
|
if (!node)
|
2008-01-24 21:13:08 +00:00
|
|
|
goto out;
|
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (1) {
|
2008-01-24 21:13:08 +00:00
|
|
|
state = rb_entry(node, struct extent_state, rb_node);
|
|
|
|
if (state->start > search_end)
|
|
|
|
break;
|
2011-02-23 21:23:20 +00:00
|
|
|
if (contig && found && state->start > last + 1)
|
|
|
|
break;
|
|
|
|
if (state->end >= cur_start && (state->state & bits) == bits) {
|
2008-01-24 21:13:08 +00:00
|
|
|
total_bytes += min(search_end, state->end) + 1 -
|
|
|
|
max(cur_start, state->start);
|
|
|
|
if (total_bytes >= max_bytes)
|
|
|
|
break;
|
|
|
|
if (!found) {
|
2011-05-04 15:11:17 +00:00
|
|
|
*start = max(cur_start, state->start);
|
2008-01-24 21:13:08 +00:00
|
|
|
found = 1;
|
|
|
|
}
|
2011-02-23 21:23:20 +00:00
|
|
|
last = state->end;
|
|
|
|
} else if (contig && found) {
|
|
|
|
break;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
node = rb_next(node);
|
|
|
|
if (!node)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
out:
|
2008-12-17 19:51:42 +00:00
|
|
|
spin_unlock(&tree->lock);
|
2008-01-24 21:13:08 +00:00
|
|
|
return total_bytes;
|
|
|
|
}
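The counting logic of count_range_bits() above can be tried outside the kernel with a small userspace sketch. This is an illustration under stated assumptions, not the kernel implementation: `struct range_state`, `count_bits_sketch()`, `min_u64()` and `max_u64()` are hypothetical names, a sorted array stands in for the rb-tree, and the spinlock and the cached `dirty_bytes` shortcut are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extent_state: inclusive byte range + bits. */
struct range_state {
	uint64_t start;
	uint64_t end;
	uint32_t state;
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/*
 * Count the bytes in [*start, search_end] covered by ranges that have all of
 * @bits set. @ranges must be sorted by start and must not overlap (assumed
 * here; the kernel keeps this ordering in an rb-tree).
 */
static uint64_t count_bits_sketch(const struct range_state *ranges, size_t n,
				  uint64_t *start, uint64_t search_end,
				  uint64_t max_bytes, uint32_t bits, int contig)
{
	uint64_t cur_start, total_bytes = 0, last = 0;
	int found = 0;
	size_t i;

	assert(start != NULL);
	cur_start = *start;
	if (search_end <= cur_start)
		return 0;

	for (i = 0; i < n; i++) {
		const struct range_state *state = &ranges[i];

		if (state->end < cur_start)
			continue;	/* ends before our range starts */
		if (state->start > search_end)
			break;
		if (contig && found && state->start > last + 1)
			break;		/* hole after the first match */
		if ((state->state & bits) == bits) {
			/* Clamp the range to [cur_start, search_end]. */
			total_bytes += min_u64(search_end, state->end) + 1 -
				       max_u64(cur_start, state->start);
			if (total_bytes >= max_bytes)
				break;
			if (!found) {
				*start = max_u64(cur_start, state->start);
				found = 1;
			}
			last = state->end;
		} else if (contig && found) {
			break;		/* bits missing right after a match */
		}
	}
	return total_bytes;
}
```

As in the function above, the `max_bytes` check happens after the whole clamped range has been added, so the returned total can exceed `max_bytes`. The cap only stops the walk early.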
|
2008-12-02 14:54:17 +00:00
|
|
|
|
2008-09-29 19:18:18 +00:00
|
|
|
/*
|
|
|
|
* set the private field for a given byte offset in the tree. If there isn't
|
|
|
|
* an extent_state there already, this does nothing.
|
|
|
|
*/
|
2019-09-23 14:05:21 +00:00
|
|
|
int set_state_failrec(struct extent_io_tree *tree, u64 start,
|
|
|
|
struct io_failure_record *failrec)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
struct extent_state *state;
|
|
|
|
int ret = 0;
|
|
|
|
|
2008-12-17 19:51:42 +00:00
|
|
|
spin_lock(&tree->lock);
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
|
|
|
* this search will find all the extents that end after
|
|
|
|
* our range starts.
|
|
|
|
*/
|
2008-02-01 19:51:59 +00:00
|
|
|
node = tree_search(tree, start);
|
2008-04-01 15:21:40 +00:00
|
|
|
if (!node) {
|
2008-01-24 21:13:08 +00:00
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
state = rb_entry(node, struct extent_state, rb_node);
|
|
|
|
if (state->start != start) {
|
|
|
|
ret = -ENOENT;
|
|
|
|
goto out;
|
|
|
|
}
|
2016-02-11 12:24:13 +00:00
|
|
|
state->failrec = failrec;
|
2008-01-24 21:13:08 +00:00
|
|
|
out:
|
2008-12-17 19:51:42 +00:00
|
|
|
spin_unlock(&tree->lock);
|
2008-01-24 21:13:08 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-07-02 12:23:28 +00:00
|
|
|
struct io_failure_record *get_state_failrec(struct extent_io_tree *tree, u64 start)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
struct rb_node *node;
|
|
|
|
struct extent_state *state;
|
2020-07-02 12:23:28 +00:00
|
|
|
struct io_failure_record *failrec;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2008-12-17 19:51:42 +00:00
|
|
|
spin_lock(&tree->lock);
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
|
|
|
* this search will find all the extents that end after
|
|
|
|
* our range starts.
|
|
|
|
*/
|
2008-02-01 19:51:59 +00:00
|
|
|
node = tree_search(tree, start);
|
2008-04-01 15:21:40 +00:00
|
|
|
if (!node) {
|
2020-07-02 12:23:28 +00:00
|
|
|
failrec = ERR_PTR(-ENOENT);
|
2008-01-24 21:13:08 +00:00
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
state = rb_entry(node, struct extent_state, rb_node);
|
|
|
|
if (state->start != start) {
|
2020-07-02 12:23:28 +00:00
|
|
|
failrec = ERR_PTR(-ENOENT);
|
2008-01-24 21:13:08 +00:00
|
|
|
goto out;
|
|
|
|
}
|
2020-07-02 12:23:28 +00:00
|
|
|
|
|
|
|
failrec = state->failrec;
|
2008-01-24 21:13:08 +00:00
|
|
|
out:
|
2008-12-17 19:51:42 +00:00
|
|
|
spin_unlock(&tree->lock);
|
2020-07-02 12:23:28 +00:00
|
|
|
return failrec;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* searches a range in the state tree for a given mask.
|
2008-01-29 14:59:12 +00:00
|
|
|
* If 'filled' == 1, this returns 1 only if every extent in the tree
|
2008-01-24 21:13:08 +00:00
|
|
|
* has the bits set. Otherwise, 1 is returned if any bit in the
|
|
|
|
* range is found set.
|
|
|
|
*/
|
|
|
|
int test_range_bit(struct extent_io_tree *tree, u64 start, u64 end,
|
2020-11-13 12:51:40 +00:00
|
|
|
u32 bits, int filled, struct extent_state *cached)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
struct extent_state *state = NULL;
|
|
|
|
struct rb_node *node;
|
|
|
|
int bitset = 0;
|
|
|
|
|
2008-12-17 19:51:42 +00:00
|
|
|
spin_lock(&tree->lock);
|
2014-07-06 19:09:59 +00:00
|
|
|
if (cached && extent_state_in_tree(cached) && cached->start <= start &&
|
2011-06-20 18:53:48 +00:00
|
|
|
cached->end > start)
|
2009-09-02 19:22:30 +00:00
|
|
|
node = &cached->rb_node;
|
|
|
|
else
|
|
|
|
node = tree_search(tree, start);
|
2008-01-24 21:13:08 +00:00
|
|
|
while (node && start <= end) {
|
|
|
|
state = rb_entry(node, struct extent_state, rb_node);
|
|
|
|
|
|
|
|
if (filled && state->start > start) {
|
|
|
|
bitset = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (state->start > end)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (state->state & bits) {
|
|
|
|
bitset = 1;
|
|
|
|
if (!filled)
|
|
|
|
break;
|
|
|
|
} else if (filled) {
|
|
|
|
bitset = 0;
|
|
|
|
break;
|
|
|
|
}
|
2009-09-24 00:23:16 +00:00
|
|
|
|
|
|
|
if (state->end == (u64)-1)
|
|
|
|
break;
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
start = state->end + 1;
|
|
|
|
if (start > end)
|
|
|
|
break;
|
|
|
|
node = rb_next(node);
|
|
|
|
if (!node) {
|
|
|
|
if (filled)
|
|
|
|
bitset = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2008-12-17 19:51:42 +00:00
|
|
|
spin_unlock(&tree->lock);
|
2008-01-24 21:13:08 +00:00
|
|
|
return bitset;
|
|
|
|
}
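The difference between the `filled` and non-`filled` modes of test_range_bit() above is easiest to see on a small example. The following is a minimal userspace sketch, assuming a sorted, non-overlapping array in place of the rb-tree and omitting the lock and the `cached` state shortcut. `struct range_state` and `test_bits_sketch()` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extent_state: inclusive byte range + bits. */
struct range_state {
	uint64_t start;
	uint64_t end;
	uint32_t state;
};

/*
 * With @filled set, return 1 only if [start, end] is fully covered by ranges
 * that have some bit of @bits set. Otherwise return 1 if any overlapping
 * range has a bit of @bits set.
 */
static int test_bits_sketch(const struct range_state *ranges, size_t n,
			    uint64_t start, uint64_t end, uint32_t bits,
			    int filled)
{
	int bitset = 0;
	size_t i = 0;

	assert(ranges != NULL || n == 0);

	/* Mimic tree_search(): first range that ends at or after @start. */
	while (i < n && ranges[i].end < start)
		i++;

	while (i < n && start <= end) {
		const struct range_state *state = &ranges[i];

		if (filled && state->start > start) {
			bitset = 0;	/* hole before this range */
			break;
		}
		if (state->start > end)
			break;
		if (state->state & bits) {
			bitset = 1;
			if (!filled)
				break;	/* one hit is enough */
		} else if (filled) {
			bitset = 0;	/* one miss is enough */
			break;
		}
		if (state->end == UINT64_MAX)
			break;		/* avoid overflow in end + 1 */
		start = state->end + 1;
		if (start > end)
			break;
		i++;
		if (i == n) {
			if (filled)
				bitset = 0;	/* ran out before reaching @end */
			break;
		}
	}
	return bitset;
}
```

With ranges `[0, 4095]` set, `[4096, 8191]` clear and `[8192, 12287]` set, a `filled` query over `[0, 8191]` returns 0 while the same query without `filled` returns 1, which is the distinction check_page_uptodate() relies on when it passes `filled = 1`.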
|
|
|
|
|
|
|
|
/*
|
|
|
|
* helper function to set a given page up to date if all the
|
|
|
|
* extents in the tree for that page are up to date
|
|
|
|
*/
|
2012-03-01 13:56:26 +00:00
|
|
|
static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2012-12-21 09:17:45 +00:00
|
|
|
u64 start = page_offset(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
u64 end = start + PAGE_SIZE - 1;
|
2009-09-02 19:22:30 +00:00
|
|
|
if (test_range_bit(tree, start, end, EXTENT_UPTODATE, 1, NULL))
|
2008-01-24 21:13:08 +00:00
|
|
|
SetPageUptodate(page);
|
|
|
|
}
|
|
|
|
|
2017-05-05 15:57:15 +00:00
|
|
|
int free_io_failure(struct extent_io_tree *failure_tree,
|
|
|
|
struct extent_io_tree *io_tree,
|
|
|
|
struct io_failure_record *rec)
|
2011-07-22 13:41:52 +00:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
int err = 0;
|
|
|
|
|
2016-02-11 12:24:13 +00:00
|
|
|
set_state_failrec(failure_tree, rec->start, NULL);
|
2011-07-22 13:41:52 +00:00
|
|
|
ret = clear_extent_bits(failure_tree, rec->start,
|
|
|
|
rec->start + rec->len - 1,
|
2016-04-26 21:54:39 +00:00
|
|
|
EXTENT_LOCKED | EXTENT_DIRTY);
|
2011-07-22 13:41:52 +00:00
|
|
|
if (ret)
|
|
|
|
err = ret;
|
|
|
|
|
2017-05-05 15:57:15 +00:00
|
|
|
ret = clear_extent_bits(io_tree, rec->start,
|
2013-01-29 23:40:14 +00:00
|
|
|
rec->start + rec->len - 1,
|
2016-04-26 21:54:39 +00:00
|
|
|
EXTENT_DAMAGED);
|
2013-01-29 23:40:14 +00:00
|
|
|
if (ret && !err)
|
|
|
|
err = ret;
|
2011-07-22 13:41:52 +00:00
|
|
|
|
|
|
|
kfree(rec);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* this bypasses the standard btrfs submit functions deliberately, as
|
|
|
|
* the standard behavior is to write all copies in a raid setup. here we only
|
|
|
|
* want to write the one bad copy. so we do the mapping for ourselves and issue
|
|
|
|
* submit_bio directly.
|
2012-11-05 14:46:42 +00:00
|
|
|
* to avoid any synchronization issues, wait for the data after writing, which
|
2011-07-22 13:41:52 +00:00
|
|
|
* actually prevents the read that triggered the error from finishing.
|
|
|
|
* currently, there can be no more than two copies of every data bit. thus,
|
|
|
|
* exactly one rewrite is required.
|
|
|
|
*/
|
2017-05-05 15:57:14 +00:00
|
|
|
int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
|
|
|
|
u64 length, u64 logical, struct page *page,
|
|
|
|
unsigned int pg_offset, int mirror_num)
|
2011-07-22 13:41:52 +00:00
|
|
|
{
|
|
|
|
struct bio *bio;
|
|
|
|
struct btrfs_device *dev;
|
|
|
|
u64 map_length = 0;
|
|
|
|
u64 sector;
|
|
|
|
struct btrfs_bio *bbio = NULL;
|
|
|
|
int ret;
|
|
|
|
|
2017-11-27 21:05:09 +00:00
|
|
|
ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
|
2011-07-22 13:41:52 +00:00
|
|
|
BUG_ON(!mirror_num);
|
|
|
|
|
2017-06-12 15:29:41 +00:00
|
|
|
bio = btrfs_io_bio_alloc(1);
|
2013-10-11 22:44:27 +00:00
|
|
|
bio->bi_iter.bi_size = 0;
|
2011-07-22 13:41:52 +00:00
|
|
|
map_length = length;
|
|
|
|
|
2016-05-27 21:21:27 +00:00
|
|
|
/*
|
|
|
|
* Avoid races with device replace and make sure our bbio has devices
|
|
|
|
* associated to its stripes that don't go away while we are doing the
|
|
|
|
* read repair operation.
|
|
|
|
*/
|
|
|
|
btrfs_bio_counter_inc_blocked(fs_info);
|
2017-07-19 07:48:42 +00:00
|
|
|
if (btrfs_is_parity_mirror(fs_info, logical, length)) {
|
2017-03-29 17:53:58 +00:00
|
|
|
/*
|
|
|
|
* Note that we don't use BTRFS_MAP_WRITE because it's supposed
|
|
|
|
* to update all raid stripes, but here we just want to correct
|
|
|
|
* bad stripe, thus BTRFS_MAP_READ is abused to only get the bad
|
|
|
|
* stripe's dev and sector.
|
|
|
|
*/
|
|
|
|
ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
|
|
|
|
&map_length, &bbio, 0);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_bio_counter_dec(fs_info);
|
|
|
|
bio_put(bio);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
ASSERT(bbio->mirror_num == 1);
|
|
|
|
} else {
|
|
|
|
ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
|
|
|
|
&map_length, &bbio, mirror_num);
|
|
|
|
if (ret) {
|
|
|
|
btrfs_bio_counter_dec(fs_info);
|
|
|
|
bio_put(bio);
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
BUG_ON(mirror_num != bbio->mirror_num);
|
2011-07-22 13:41:52 +00:00
|
|
|
}
|
2017-03-29 17:53:58 +00:00
|
|
|
|
|
|
|
sector = bbio->stripes[bbio->mirror_num - 1].physical >> 9;
|
2013-10-11 22:44:27 +00:00
|
|
|
bio->bi_iter.bi_sector = sector;
|
2017-03-29 17:53:58 +00:00
|
|
|
dev = bbio->stripes[bbio->mirror_num - 1].dev;
|
2015-01-20 07:11:34 +00:00
|
|
|
btrfs_put_bbio(bbio);
|
2017-12-04 04:54:52 +00:00
|
|
|
if (!dev || !dev->bdev ||
|
|
|
|
!test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state)) {
|
2016-05-27 21:21:27 +00:00
|
|
|
btrfs_bio_counter_dec(fs_info);
|
2011-07-22 13:41:52 +00:00
|
|
|
bio_put(bio);
|
|
|
|
return -EIO;
|
|
|
|
}
|
2017-08-23 17:10:32 +00:00
|
|
|
bio_set_dev(bio, dev->bdev);
|
2016-11-01 13:40:10 +00:00
|
|
|
bio->bi_opf = REQ_OP_WRITE | REQ_SYNC;
|
2014-09-12 10:44:00 +00:00
|
|
|
bio_add_page(bio, page, length, pg_offset);
|
2011-07-22 13:41:52 +00:00
|
|
|
|
2016-06-05 19:31:41 +00:00
|
|
|
if (btrfsic_submit_bio_wait(bio)) {
|
2011-07-22 13:41:52 +00:00
|
|
|
/* try to remap that extent elsewhere? */
|
2016-05-27 21:21:27 +00:00
|
|
|
btrfs_bio_counter_dec(fs_info);
|
2011-07-22 13:41:52 +00:00
|
|
|
bio_put(bio);
|
2012-05-25 14:06:08 +00:00
|
|
|
btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS);
|
2011-07-22 13:41:52 +00:00
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
|
2015-10-08 08:43:10 +00:00
|
|
|
btrfs_info_rl_in_rcu(fs_info,
|
|
|
|
"read error corrected: ino %llu off %llu (dev %s sector %llu)",
|
2017-05-05 15:57:14 +00:00
|
|
|
ino, start,
|
2014-09-12 10:44:01 +00:00
|
|
|
rcu_str_deref(dev->name), sector);
|
2016-05-27 21:21:27 +00:00
|
|
|
btrfs_bio_counter_dec(fs_info);
|
2011-07-22 13:41:52 +00:00
|
|
|
bio_put(bio);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
int btrfs_repair_eb_io_failure(const struct extent_buffer *eb, int mirror_num)
|
2012-03-27 01:57:36 +00:00
|
|
|
{
|
2019-03-20 10:23:44 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2012-03-27 01:57:36 +00:00
|
|
|
u64 start = eb->start;
|
2018-03-01 17:20:27 +00:00
|
|
|
int i, num_pages = num_extent_pages(eb);
|
2012-04-12 19:55:15 +00:00
|
|
|
int ret = 0;
|
2012-03-27 01:57:36 +00:00
|
|
|
|
2017-07-17 07:45:34 +00:00
|
|
|
if (sb_rdonly(fs_info->sb))
|
2013-11-03 17:06:39 +00:00
|
|
|
return -EROFS;
|
|
|
|
|
2012-03-27 01:57:36 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
struct page *p = eb->pages[i];
|
2014-09-12 10:44:01 +00:00
|
|
|
|
2017-05-05 15:57:14 +00:00
|
|
|
ret = repair_io_failure(fs_info, 0, start, PAGE_SIZE, start, p,
|
2014-09-12 10:44:01 +00:00
|
|
|
start - page_offset(p), mirror_num);
|
2012-03-27 01:57:36 +00:00
|
|
|
if (ret)
|
|
|
|
break;
|
2016-04-01 12:29:47 +00:00
|
|
|
start += PAGE_SIZE;
|
2012-03-27 01:57:36 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2011-07-22 13:41:52 +00:00
|
|
|
/*
|
|
|
|
* each time an IO finishes, we do a fast check in the IO failure tree
|
|
|
|
* to see if we need to process or clean up an io_failure_record
|
|
|
|
*/
|
2017-05-05 15:57:15 +00:00
|
|
|
int clean_io_failure(struct btrfs_fs_info *fs_info,
|
|
|
|
struct extent_io_tree *failure_tree,
|
|
|
|
struct extent_io_tree *io_tree, u64 start,
|
|
|
|
struct page *page, u64 ino, unsigned int pg_offset)
|
2011-07-22 13:41:52 +00:00
|
|
|
{
|
|
|
|
u64 private;
|
|
|
|
struct io_failure_record *failrec;
|
|
|
|
struct extent_state *state;
|
|
|
|
int num_copies;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
private = 0;
|
2017-05-05 15:57:15 +00:00
|
|
|
ret = count_range_bits(failure_tree, &private, (u64)-1, 1,
|
|
|
|
EXTENT_DIRTY, 0);
|
2011-07-22 13:41:52 +00:00
|
|
|
if (!ret)
|
|
|
|
return 0;
|
|
|
|
|
2020-07-02 12:23:28 +00:00
|
|
|
failrec = get_state_failrec(failure_tree, start);
|
|
|
|
if (IS_ERR(failrec))
|
2011-07-22 13:41:52 +00:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
BUG_ON(!failrec->this_mirror);
|
|
|
|
|
|
|
|
if (failrec->in_validation) {
|
|
|
|
/* there was no real error, just free the record */
|
2016-09-20 14:05:02 +00:00
|
|
|
btrfs_debug(fs_info,
|
|
|
|
"clean_io_failure: freeing dummy error at %llu",
|
|
|
|
failrec->start);
|
2011-07-22 13:41:52 +00:00
|
|
|
goto out;
|
|
|
|
}
|
2017-07-17 07:45:34 +00:00
|
|
|
if (sb_rdonly(fs_info->sb))
|
2013-11-03 17:06:39 +00:00
|
|
|
goto out;
|
2011-07-22 13:41:52 +00:00
|
|
|
|
2017-05-05 15:57:15 +00:00
|
|
|
spin_lock(&io_tree->lock);
|
|
|
|
state = find_first_extent_bit_state(io_tree,
|
2011-07-22 13:41:52 +00:00
|
|
|
failrec->start,
|
|
|
|
EXTENT_LOCKED);
|
2017-05-05 15:57:15 +00:00
|
|
|
spin_unlock(&io_tree->lock);
|
2011-07-22 13:41:52 +00:00
|
|
|
|
2013-07-25 11:22:35 +00:00
|
|
|
if (state && state->start <= failrec->start &&
|
|
|
|
state->end >= failrec->start + failrec->len - 1) {
|
2012-11-05 14:46:42 +00:00
|
|
|
num_copies = btrfs_num_copies(fs_info, failrec->logical,
|
|
|
|
failrec->len);
|
2011-07-22 13:41:52 +00:00
|
|
|
if (num_copies > 1) {
|
2017-05-05 15:57:15 +00:00
|
|
|
repair_io_failure(fs_info, ino, start, failrec->len,
|
|
|
|
failrec->logical, page, pg_offset,
|
|
|
|
failrec->failed_mirror);
|
2011-07-22 13:41:52 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
2017-05-05 15:57:15 +00:00
|
|
|
free_io_failure(failure_tree, io_tree, failrec);
|
2011-07-22 13:41:52 +00:00
|
|
|
|
2014-09-12 10:43:58 +00:00
|
|
|
return 0;
|
2011-07-22 13:41:52 +00:00
|
|
|
}
|
|
|
|
|
Btrfs: cleanup the read failure record after write or when the inode is freeing
After the data is written successfully, we should cleanup the read failure record
in that range because
- If we set data COW for the file, the range that the failure record pointed to is
mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writing,
the corrupted data is corrected, so the failure record can be removed. And if
some errors happen on the mirrors, we also needn't worry about it because the
failure record will be recreated if we read the same place again.
Sometimes, we may fail to correct the data, so the failure records will be left
in the tree; we need to free them when we free the inode, or a memory leak happens.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-12 10:44:04 +00:00
|
|
|
/*
|
|
|
|
* Can be called when
|
|
|
|
* - holding the extent lock
|
|
|
|
* - under an ordered extent
|
|
|
|
* - the inode is being freed
|
|
|
|
*/
|
2017-02-20 11:50:57 +00:00
|
|
|
void btrfs_free_io_failure_record(struct btrfs_inode *inode, u64 start, u64 end)
|
2014-09-12 10:44:04 +00:00
|
|
|
{
|
2017-02-20 11:50:57 +00:00
|
|
|
struct extent_io_tree *failure_tree = &inode->io_failure_tree;
|
2014-09-12 10:44:04 +00:00
|
|
|
struct io_failure_record *failrec;
|
|
|
|
struct extent_state *state, *next;
|
|
|
|
|
|
|
|
if (RB_EMPTY_ROOT(&failure_tree->state))
|
|
|
|
return;
|
|
|
|
|
|
|
|
spin_lock(&failure_tree->lock);
|
|
|
|
state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
|
|
|
|
while (state) {
|
|
|
|
if (state->start > end)
|
|
|
|
break;
|
|
|
|
|
|
|
|
ASSERT(state->end <= end);
|
|
|
|
|
|
|
|
next = next_state(state);
|
|
|
|
|
2016-02-11 12:24:13 +00:00
|
|
|
failrec = state->failrec;
|
2014-09-12 10:44:04 +00:00
|
|
|
free_extent_state(state);
|
|
|
|
kfree(failrec);
|
|
|
|
|
|
|
|
state = next;
|
|
|
|
}
|
|
|
|
spin_unlock(&failure_tree->lock);
|
|
|
|
}
|
|
|
|
|
2020-07-02 12:23:29 +00:00
|
|
|
static struct io_failure_record *btrfs_get_io_failure_record(struct inode *inode,
|
|
|
|
u64 start, u64 end)
|
2011-07-22 13:41:52 +00:00
|
|
|
{
|
2016-09-20 14:05:02 +00:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2014-09-12 10:43:59 +00:00
|
|
|
struct io_failure_record *failrec;
|
2011-07-22 13:41:52 +00:00
|
|
|
struct extent_map *em;
|
|
|
|
struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
|
|
|
|
struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
|
|
|
|
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
|
|
|
|
int ret;
|
|
|
|
u64 logical;
|
|
|
|
|
2020-07-02 12:23:28 +00:00
|
|
|
failrec = get_state_failrec(failure_tree, start);
|
2020-07-02 12:23:29 +00:00
|
|
|
if (!IS_ERR(failrec)) {
|
2016-09-20 14:05:02 +00:00
|
|
|
btrfs_debug(fs_info,
|
|
|
|
"Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d",
|
|
|
|
failrec->logical, failrec->start, failrec->len,
|
|
|
|
failrec->in_validation);
|
2011-07-22 13:41:52 +00:00
|
|
|
/*
|
|
|
|
* when data can be on disk more than twice, add to failrec here
|
|
|
|
* (e.g. with a list for failed_mirror) to make
|
|
|
|
* clean_io_failure() clean all those errors at once.
|
|
|
|
*/
|
2020-07-02 12:23:29 +00:00
|
|
|
|
|
|
|
return failrec;
|
2011-07-22 13:41:52 +00:00
|
|
|
}
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-07-02 12:23:29 +00:00
|
|
|
failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
|
|
|
|
if (!failrec)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-07-02 12:23:29 +00:00
|
|
|
failrec->start = start;
|
|
|
|
failrec->len = end - start + 1;
|
|
|
|
failrec->this_mirror = 0;
|
|
|
|
failrec->bio_flags = 0;
|
|
|
|
failrec->in_validation = 0;
|
|
|
|
|
|
|
|
read_lock(&em_tree->lock);
|
|
|
|
em = lookup_extent_mapping(em_tree, start, failrec->len);
|
|
|
|
if (!em) {
|
|
|
|
read_unlock(&em_tree->lock);
|
|
|
|
kfree(failrec);
|
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (em->start > start || em->start + em->len <= start) {
|
|
|
|
free_extent_map(em);
|
|
|
|
em = NULL;
|
|
|
|
}
|
|
|
|
read_unlock(&em_tree->lock);
|
|
|
|
if (!em) {
|
|
|
|
kfree(failrec);
|
|
|
|
return ERR_PTR(-EIO);
|
|
|
|
}
|
|
|
|
|
|
|
|
logical = start - em->start;
|
|
|
|
logical = em->block_start + logical;
|
|
|
|
if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
|
|
|
|
logical = em->block_start;
|
|
|
|
failrec->bio_flags = EXTENT_BIO_COMPRESSED;
|
|
|
|
extent_set_compress_type(&failrec->bio_flags, em->compress_type);
|
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_debug(fs_info,
|
|
|
|
"Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu",
|
|
|
|
logical, start, failrec->len);
|
|
|
|
|
|
|
|
failrec->logical = logical;
|
|
|
|
free_extent_map(em);
|
|
|
|
|
|
|
|
/* Set the bits in the private failure tree */
|
|
|
|
ret = set_extent_bits(failure_tree, start, end,
|
|
|
|
EXTENT_LOCKED | EXTENT_DIRTY);
|
|
|
|
if (ret >= 0) {
|
|
|
|
ret = set_state_failrec(failure_tree, start, failrec);
|
|
|
|
/* Set the bits in the inode's tree */
|
|
|
|
ret = set_extent_bits(tree, start, end, EXTENT_DAMAGED);
|
|
|
|
} else if (ret < 0) {
|
|
|
|
kfree(failrec);
|
|
|
|
return ERR_PTR(ret);
|
|
|
|
}
|
|
|
|
|
|
|
|
return failrec;
|
2014-09-12 10:43:59 +00:00
|
|
|
}
|
|
|
|
|
2020-04-16 21:46:18 +00:00
|
|
|
static bool btrfs_check_repairable(struct inode *inode, bool needs_validation,
|
|
|
|
struct io_failure_record *failrec,
|
|
|
|
int failed_mirror)
|
2014-09-12 10:43:59 +00:00
|
|
|
{
|
2016-09-20 14:05:02 +00:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2014-09-12 10:43:59 +00:00
|
|
|
int num_copies;
|
|
|
|
|
2016-09-20 14:05:02 +00:00
|
|
|
num_copies = btrfs_num_copies(fs_info, failrec->logical, failrec->len);
|
2011-07-22 13:41:52 +00:00
|
|
|
if (num_copies == 1) {
|
|
|
|
/*
|
|
|
|
* we only have a single copy of the data, so don't bother with
|
|
|
|
* all the retry and error correction code that follows. no
|
|
|
|
* matter what the error is, it is very likely to persist.
|
|
|
|
*/
|
2016-09-20 14:05:02 +00:00
|
|
|
btrfs_debug(fs_info,
|
|
|
|
"Check Repairable: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d",
|
|
|
|
num_copies, failrec->this_mirror, failed_mirror);
|
2017-07-13 22:00:50 +00:00
|
|
|
return false;
|
2011-07-22 13:41:52 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* there are two premises:
|
|
|
|
* a) deliver good data to the caller
|
|
|
|
* b) correct the bad sectors on disk
|
|
|
|
*/
|
2020-04-16 21:46:14 +00:00
|
|
|
if (needs_validation) {
|
2011-07-22 13:41:52 +00:00
|
|
|
/*
|
|
|
|
* to fulfill b), we need to know the exact failing sectors, as
|
|
|
|
* we don't want to rewrite any more than the failed ones. thus,
|
|
|
|
* we need separate read requests for the failed bio
|
|
|
|
*
|
|
|
|
* if the following BUG_ON triggers, our validation request got
|
|
|
|
* merged. we need separate requests for our algorithm to work.
|
|
|
|
*/
|
|
|
|
BUG_ON(failrec->in_validation);
|
|
|
|
failrec->in_validation = 1;
|
|
|
|
failrec->this_mirror = failed_mirror;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* we're ready to fulfill a) and b) alongside. get a good copy
|
|
|
|
* of the failed sector and if we succeed, we have setup
|
|
|
|
* everything for repair_io_failure to do the rest for us.
|
|
|
|
*/
|
|
|
|
if (failrec->in_validation) {
|
|
|
|
BUG_ON(failrec->this_mirror != failed_mirror);
|
|
|
|
failrec->in_validation = 0;
|
|
|
|
failrec->this_mirror = 0;
|
|
|
|
}
|
|
|
|
failrec->failed_mirror = failed_mirror;
|
|
|
|
failrec->this_mirror++;
|
|
|
|
if (failrec->this_mirror == failed_mirror)
|
|
|
|
failrec->this_mirror++;
|
|
|
|
}
|
|
|
|
|
2013-07-25 11:22:34 +00:00
|
|
|
if (failrec->this_mirror > num_copies) {
|
2016-09-20 14:05:02 +00:00
|
|
|
btrfs_debug(fs_info,
|
|
|
|
"Check Repairable: (fail) num_copies=%d, next_mirror %d, failed_mirror %d",
|
|
|
|
num_copies, failrec->this_mirror, failed_mirror);
|
2017-07-13 22:00:50 +00:00
|
|
|
return false;
|
2011-07-22 13:41:52 +00:00
|
|
|
}
|
|
|
|
|
2017-07-13 22:00:50 +00:00
|
|
|
return true;
|
2014-09-12 10:43:59 +00:00
|
|
|
}
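The mirror selection in btrfs_check_repairable() can be exercised outside the kernel. The following is a rough sketch under stated assumptions: `struct failrec_sketch` and `pick_next_mirror` are hypothetical names, `num_copies` is passed in rather than looked up, and the kernel's BUG_ON() checks are approximated with assert().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of struct io_failure_record used by the selection logic. */
struct failrec_sketch {
	int this_mirror;
	int failed_mirror;
	int in_validation;
};

/*
 * Choose the mirror for the next repair read. Returns false when no
 * further mirror is left to try, true otherwise (rec->this_mirror then
 * holds the mirror to read from).
 */
static bool pick_next_mirror(struct failrec_sketch *rec, int num_copies,
			     bool needs_validation, int failed_mirror)
{
	assert(failed_mirror >= 1);

	/* With a single copy there is nothing to retry from. */
	if (num_copies == 1)
		return false;

	if (needs_validation) {
		/* Re-read the failed mirror to find the exact bad sector. */
		assert(!rec->in_validation);
		rec->in_validation = 1;
		rec->this_mirror = failed_mirror;
	} else {
		if (rec->in_validation) {
			/* Validation finished: restart the mirror scan. */
			assert(rec->this_mirror == failed_mirror);
			rec->in_validation = 0;
			rec->this_mirror = 0;
		}
		rec->failed_mirror = failed_mirror;
		rec->this_mirror++;
		/* Never retry the mirror that just failed. */
		if (rec->this_mirror == failed_mirror)
			rec->this_mirror++;
	}

	return rec->this_mirror <= num_copies;
}
```

With two copies and mirror 1 failing, the first call selects mirror 2; a second call runs past `num_copies` and reports that the range cannot be repaired.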
|
|
|
|
|
2020-04-16 21:46:14 +00:00
|
|
|
static bool btrfs_io_needs_validation(struct inode *inode, struct bio *bio)
|
2014-09-12 10:43:59 +00:00
|
|
|
{
|
2020-04-16 21:46:14 +00:00
|
|
|
u64 len = 0;
|
2020-04-16 21:46:25 +00:00
|
|
|
const u32 blocksize = inode->i_sb->s_blocksize;
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-04-16 21:46:15 +00:00
|
|
|
/*
|
|
|
|
* If bi_status is BLK_STS_OK, then this was a checksum error, not an
|
|
|
|
* I/O error. In this case, we already know exactly which sector was
|
|
|
|
* bad, so we don't need to validate.
|
|
|
|
*/
|
|
|
|
if (bio->bi_status == BLK_STS_OK)
|
|
|
|
return false;
|
2011-07-22 13:41:52 +00:00
|
|
|
|
2020-04-16 21:46:14 +00:00
|
|
|
/*
|
|
|
|
* We need to validate each sector individually if the failed I/O was
|
|
|
|
* for multiple sectors.
|
2020-04-16 21:46:25 +00:00
|
|
|
*
|
|
|
|
* There are a few possible bios that can end up here:
|
|
|
|
* 1. A buffered read bio, which is not cloned.
|
|
|
|
* 2. A direct I/O read bio, which is cloned.
|
|
|
|
* 3. A (buffered or direct) repair bio, which is not cloned.
|
|
|
|
*
|
|
|
|
* For cloned bios (case 2), we can get the size from
|
|
|
|
* btrfs_io_bio->iter; for non-cloned bios (cases 1 and 3), we can get
|
|
|
|
* it from the bvecs.
|
2020-04-16 21:46:14 +00:00
|
|
|
*/
|
2020-04-16 21:46:25 +00:00
|
|
|
if (bio_flagged(bio, BIO_CLONED)) {
|
|
|
|
if (btrfs_io_bio(bio)->iter.bi_size > blocksize)
|
2020-04-16 21:46:14 +00:00
|
|
|
return true;
|
2020-04-16 21:46:25 +00:00
|
|
|
} else {
|
|
|
|
struct bio_vec *bvec;
|
|
|
|
int i;
|
2013-07-25 11:22:34 +00:00
|
|
|
|
2020-04-16 21:46:25 +00:00
|
|
|
bio_for_each_bvec_all(bvec, bio, i) {
|
|
|
|
len += bvec->bv_len;
|
|
|
|
if (len > blocksize)
|
|
|
|
return true;
|
|
|
|
}
|
2013-07-25 11:22:34 +00:00
|
|
|
}
|
2020-04-16 21:46:14 +00:00
|
|
|
return false;
|
2014-09-12 10:43:59 +00:00
|
|
|
}
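For the non-cloned case, the check above boils down to summing segment lengths and stopping as soon as the total exceeds one block. A minimal sketch, assuming the segment lengths are already available as a plain array (`spans_multiple_blocks` and its parameters are hypothetical, not part of the bio API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if the segments together cover more than one block, which
 * is the condition under which each sector must be validated on its own.
 */
static bool spans_multiple_blocks(const uint32_t *seg_len, size_t nsegs,
				  uint32_t blocksize)
{
	uint64_t len = 0;

	assert(blocksize > 0);

	for (size_t i = 0; i < nsegs; i++) {
		len += seg_len[i];
		/* Early exit: no need to look at the remaining segments. */
		if (len > blocksize)
			return true;
	}
	return false;
}
```

The bi_status == BLK_STS_OK shortcut and the cloned-bio branch are left out here; they are decided before any lengths are summed.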
|
|
|
|
|
2020-04-16 21:46:25 +00:00
|
|
|
blk_status_t btrfs_submit_read_repair(struct inode *inode,
|
2020-12-02 06:47:58 +00:00
|
|
|
struct bio *failed_bio, u32 bio_offset,
|
2020-04-16 21:46:25 +00:00
|
|
|
struct page *page, unsigned int pgoff,
|
|
|
|
u64 start, u64 end, int failed_mirror,
|
|
|
|
submit_bio_hook_t *submit_bio_hook)
|
2014-09-12 10:43:59 +00:00
|
|
|
{
|
|
|
|
struct io_failure_record *failrec;
|
2020-04-16 21:46:25 +00:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2014-09-12 10:43:59 +00:00
|
|
|
struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
|
2017-05-05 15:57:15 +00:00
|
|
|
struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
|
2020-04-16 21:46:25 +00:00
|
|
|
struct btrfs_io_bio *failed_io_bio = btrfs_io_bio(failed_bio);
|
2020-12-02 06:47:58 +00:00
|
|
|
const int icsum = bio_offset >> fs_info->sectorsize_bits;
|
2020-04-16 21:46:14 +00:00
|
|
|
bool need_validation;
|
2020-04-16 21:46:25 +00:00
|
|
|
struct bio *repair_bio;
|
|
|
|
struct btrfs_io_bio *repair_io_bio;
|
2017-06-03 07:38:06 +00:00
|
|
|
blk_status_t status;
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-04-16 21:46:25 +00:00
|
|
|
btrfs_debug(fs_info,
|
|
|
|
"repair read error: read error at %llu", start);
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2016-06-05 19:31:51 +00:00
|
|
|
BUG_ON(bio_op(failed_bio) == REQ_OP_WRITE);
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-07-02 12:23:29 +00:00
|
|
|
failrec = btrfs_get_io_failure_record(inode, start, end);
|
|
|
|
if (IS_ERR(failrec))
|
|
|
|
return errno_to_blk_status(PTR_ERR(failrec));
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-04-16 21:46:14 +00:00
|
|
|
need_validation = btrfs_io_needs_validation(inode, failed_bio);
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-04-16 21:46:14 +00:00
|
|
|
if (!btrfs_check_repairable(inode, need_validation, failrec,
|
2017-07-13 22:00:50 +00:00
|
|
|
failed_mirror)) {
|
2017-05-05 15:57:15 +00:00
|
|
|
free_io_failure(failure_tree, tree, failrec);
|
2020-04-16 21:46:25 +00:00
|
|
|
return BLK_STS_IOERR;
|
2014-09-12 10:43:59 +00:00
|
|
|
}
|
|
|
|
|
2020-04-16 21:46:25 +00:00
|
|
|
repair_bio = btrfs_io_bio_alloc(1);
|
|
|
|
repair_io_bio = btrfs_io_bio(repair_bio);
|
|
|
|
repair_bio->bi_opf = REQ_OP_READ;
|
2020-04-16 21:46:14 +00:00
|
|
|
if (need_validation)
|
2020-04-16 21:46:25 +00:00
|
|
|
repair_bio->bi_opf |= REQ_FAILFAST_DEV;
|
|
|
|
repair_bio->bi_end_io = failed_bio->bi_end_io;
|
|
|
|
repair_bio->bi_iter.bi_sector = failrec->logical >> 9;
|
|
|
|
repair_bio->bi_private = failed_bio->bi_private;
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-04-16 21:46:25 +00:00
|
|
|
if (failed_io_bio->csum) {
|
2020-07-02 09:27:30 +00:00
|
|
|
const u32 csum_size = fs_info->csum_size;
|
2020-04-16 21:46:25 +00:00
|
|
|
|
|
|
|
repair_io_bio->csum = repair_io_bio->csum_inline;
|
|
|
|
memcpy(repair_io_bio->csum,
|
|
|
|
failed_io_bio->csum + csum_size * icsum, csum_size);
|
|
|
|
}
|
2014-09-12 10:43:59 +00:00
|
|
|
|
2020-04-16 21:46:25 +00:00
|
|
|
bio_add_page(repair_bio, page, failrec->len, pgoff);
|
|
|
|
repair_io_bio->logical = failrec->start;
|
|
|
|
repair_io_bio->iter = repair_bio->bi_iter;
|
2011-07-22 13:41:52 +00:00
|
|
|
|
2016-09-20 14:05:02 +00:00
|
|
|
btrfs_debug(btrfs_sb(inode->i_sb),
|
2020-04-16 21:46:25 +00:00
|
|
|
"repair read error: submitting new read to mirror %d, in_validation=%d",
|
|
|
|
failrec->this_mirror, failrec->in_validation);
|
2011-07-22 13:41:52 +00:00
|
|
|
|
2020-04-16 21:46:25 +00:00
|
|
|
status = submit_bio_hook(inode, repair_bio, failrec->this_mirror,
|
|
|
|
failrec->bio_flags);
|
2017-06-03 07:38:06 +00:00
|
|
|
if (status) {
|
2017-05-05 15:57:15 +00:00
|
|
|
free_io_failure(failure_tree, tree, failrec);
|
2020-04-16 21:46:25 +00:00
|
|
|
bio_put(repair_bio);
|
2014-09-12 10:43:57 +00:00
|
|
|
}
|
2020-04-16 21:46:25 +00:00
|
|
|
return status;
|
2011-07-22 13:41:52 +00:00
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/* lots and lots of room for performance fixes in the end_bio funcs */
|
|
|
|
|
2015-12-03 12:08:59 +00:00
|
|
|
void end_extent_writepage(struct page *page, int err, u64 start, u64 end)
|
2012-02-15 15:23:57 +00:00
|
|
|
{
|
|
|
|
int uptodate = (err == 0);
|
2014-06-12 05:39:58 +00:00
|
|
|
int ret = 0;
|
2012-02-15 15:23:57 +00:00
|
|
|
|
2018-11-08 08:18:08 +00:00
|
|
|
btrfs_writepage_endio_finish_ordered(page, start, end, uptodate);
|
2012-02-15 15:23:57 +00:00
|
|
|
|
|
|
|
if (!uptodate) {
|
|
|
|
ClearPageUptodate(page);
|
|
|
|
SetPageError(page);
|
2017-05-09 17:14:01 +00:00
|
|
|
ret = err < 0 ? err : -EIO;
|
2014-05-12 04:47:36 +00:00
|
|
|
mapping_set_error(page->mapping, ret);
|
2012-02-15 15:23:57 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
|
|
|
* after a writepage IO is done, we need to:
|
|
|
|
* clear the uptodate bits on error
|
|
|
|
* clear the writeback bits in the extent tree for this IO
|
|
|
|
* end_page_writeback if the page has no more pending IO
|
|
|
|
*
|
|
|
|
* Scheduling is not allowed, so the extent state tree is expected
|
|
|
|
* to have one and only one object corresponding to this IO.
|
|
|
|
*/
|
2015-07-20 13:29:37 +00:00
|
|
|
static void end_bio_extent_writepage(struct bio *bio)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2017-06-03 07:38:06 +00:00
|
|
|
int error = blk_status_to_errno(bio->bi_status);
|
2013-11-07 20:20:26 +00:00
|
|
|
struct bio_vec *bvec;
|
2008-01-24 21:13:08 +00:00
|
|
|
u64 start;
|
|
|
|
u64 end;
|
2019-02-15 11:13:19 +00:00
|
|
|
struct bvec_iter_all iter_all;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2017-07-13 16:10:07 +00:00
|
|
|
ASSERT(!bio_flagged(bio, BIO_CLONED));
|
2019-04-25 07:03:00 +00:00
|
|
|
bio_for_each_segment_all(bvec, bio, iter_all) {
|
2008-01-24 21:13:08 +00:00
|
|
|
struct page *page = bvec->bv_page;
|
2016-06-22 22:54:23 +00:00
|
|
|
struct inode *inode = page->mapping->host;
|
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2008-08-20 12:51:49 +00:00
|
|
|
|
2013-05-15 15:38:55 +00:00
|
|
|
/* We always issue full-page reads, but if some block
|
|
|
|
* in a page fails to read, blk_update_request() will
|
|
|
|
* advance bv_offset and adjust bv_len to compensate.
|
|
|
|
* Print a warning for nonzero offsets, and an error
|
|
|
|
* if they don't add up to a full page. */
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. It is a constant source of confusion whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will also
be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
if (bvec->bv_offset || bvec->bv_len != PAGE_SIZE) {
|
|
|
|
if (bvec->bv_offset + bvec->bv_len != PAGE_SIZE)
|
2016-06-22 22:54:23 +00:00
|
|
|
btrfs_err(fs_info,
|
2013-12-20 16:37:06 +00:00
|
|
|
"partial page write in btrfs with offset %u and length %u",
|
|
|
|
bvec->bv_offset, bvec->bv_len);
|
|
|
|
else
|
2016-06-22 22:54:23 +00:00
|
|
|
btrfs_info(fs_info,
|
2016-09-20 14:05:00 +00:00
|
|
|
"incomplete page write in btrfs with offset %u and length %u",
|
2013-12-20 16:37:06 +00:00
|
|
|
bvec->bv_offset, bvec->bv_len);
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2013-05-15 15:38:55 +00:00
|
|
|
start = page_offset(page);
|
|
|
|
end = start + bvec->bv_offset + bvec->bv_len - 1;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2017-06-03 07:38:06 +00:00
|
|
|
end_extent_writepage(page, error, start, end);
|
2013-05-15 15:38:55 +00:00
|
|
|
end_page_writeback(page);
|
2013-11-07 20:20:26 +00:00
|
|
|
}
|
2008-09-24 15:48:04 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
bio_put(bio);
|
|
|
|
}
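The offset/length check in end_bio_extent_writepage() above can be sketched in userspace as follows. This is only an illustration under stated assumptions: SKETCH_PAGE_SIZE, seg_kind, classify_segment() and segment_end() are hypothetical names that do not exist in the kernel, and a 4 KiB page is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical labels for the three outcomes of the check. */
enum seg_kind {
	SEG_FULL,	/* offset 0, length of a full page: nothing is logged */
	SEG_INCOMPLETE,	/* nonzero offset, but still ends at the page end */
	SEG_PARTIAL,	/* does not end at the page end */
};

/* Mirrors the nested if/else on bv_offset and bv_len. */
static enum seg_kind classify_segment(uint32_t bv_offset, uint32_t bv_len)
{
	if (bv_offset == 0 && bv_len == SKETCH_PAGE_SIZE)
		return SEG_FULL;
	if (bv_offset + bv_len != SKETCH_PAGE_SIZE)
		return SEG_PARTIAL;
	return SEG_INCOMPLETE;
}

/* Inclusive end offset of the written range, as computed for 'end'. */
static uint64_t segment_end(uint64_t page_start, uint32_t bv_offset,
			    uint32_t bv_len)
{
	assert(bv_len > 0);
	return page_start + bv_offset + bv_len - 1;
}
```

In this sketch SEG_PARTIAL corresponds to the "partial page write" error message and SEG_INCOMPLETE to the "incomplete page write" info message.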
|
|
|
|
|
btrfs: add structure to keep track of extent range in end_bio_extent_readpage
In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.
Hidden behind the strange dance, it's just calling
endio_readpage_release_extent() on each bvec range.
Here is an example to explain the original work flow:
Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
end_bio_extent_readpage() entered
|- extent_start = 0;
|- extent_len = 0;
|- bio_for_each_segment_all() {
| |- /* Got the 1st bvec */
| |- start = SZ_1M;
| |- end = SZ_1M + SZ_4K - 1;
| |- uptodate = 1;
| |- if (extent_len == 0) {
| | |- extent_start = start; /* SZ_1M */
| | |- extent_len = end + 1 - start; /* SZ_4K */
| | }
| |
| |- /* Got the 2nd bvec */
| |- start = SZ_1M + SZ_4K;
| |- end = SZ_1M + SZ_8K - 1;
| |- uptodate = 1;
| |- if (extent_start + extent_len == start) {
| | |- extent_len += end + 1 - start; /* SZ_8K */
| | }
| } /* All bio vec iterated */
|
|- if (extent_len) {
|- endio_readpage_release_extent(tree, extent_start, extent_len,
uptodate);
/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
As the above flow shows, the existing code in end_bio_extent_readpage()
accumulates extent_start/extent_len, and when the contiguous range
stops, calls endio_readpage_release_extent() for the range.
However, the current behavior leaves several things unconsidered:
- The inode can change
For bio, its pages don't need to have contiguous page_offset.
This means, even pages from different inodes can be packed into one
bio.
- bvec cross page boundary
There is a feature called multi-page bvec, where bvec->bv_len can go
beyond bvec->bv_page boundary.
- Poor readability
This patch will address the problem:
- Introduce a proper structure, processed_extent, to record processed
extent range
- Integrate inode/start/end/uptodate check into
endio_readpage_release_extent()
- Add more comment on each step.
This should greatly improve the readability, now in
end_bio_extent_readpage() there are only two
endio_readpage_release_extent() calls.
- Add inode check for contiguity
Now we also ensure the inode is the same one before checking if the
range is contiguous.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-13 12:51:28 +00:00
|
|
|
/*
|
|
|
|
* Record previously processed extent range
|
|
|
|
*
|
|
|
|
* For endio_readpage_release_extent() to handle a full extent range, reducing
|
|
|
|
* the extent io operations.
|
|
|
|
*/
|
|
|
|
struct processed_extent {
|
|
|
|
struct btrfs_inode *inode;
|
|
|
|
/* Start of the range in @inode */
|
|
|
|
u64 start;
|
|
|
|
/* End of the range in @inode */
|
|
|
|
u64 end;
|
|
|
|
bool uptodate;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to release processed extent range
|
|
|
|
*
|
|
|
|
* May not release the extent range right now if the current range is
|
|
|
|
* contiguous to processed extent.
|
|
|
|
*
|
|
|
|
* Will release processed extent when @inode or @uptodate changes, or when
|
|
|
|
* the range is no longer contiguous to the processed range.
|
|
|
|
*
|
|
|
|
* Passing @inode == NULL will force processed extent to be released.
|
|
|
|
*/
|
|
|
|
static void endio_readpage_release_extent(struct processed_extent *processed,
|
|
|
|
struct btrfs_inode *inode, u64 start, u64 end,
|
|
|
|
bool uptodate)
|
2013-07-25 11:22:35 +00:00
|
|
|
{
|
|
|
|
struct extent_state *cached = NULL;
|
2020-11-13 12:51:28 +00:00
|
|
|
struct extent_io_tree *tree;
|
|
|
|
|
|
|
|
/* The first extent, initialize @processed */
|
|
|
|
if (!processed->inode)
|
|
|
|
goto update;
|
2013-07-25 11:22:35 +00:00
|
|
|
|
2020-11-13 12:51:28 +00:00
|
|
|
/*
|
|
|
|
* Contiguous to processed extent, just update the end.
|
|
|
|
*
|
|
|
|
* Several things to notice:
|
|
|
|
*
|
|
|
|
* - bio can be merged as long as on-disk bytenr is contiguous
|
|
|
|
* This means we can have page belonging to other inodes, thus need to
|
|
|
|
* check if the inode still matches.
|
|
|
|
* - bvec can contain range beyond current page for multi-page bvec
|
|
|
|
* Thus we need to do processed->end + 1 >= start check
|
|
|
|
*/
|
|
|
|
if (processed->inode == inode && processed->uptodate == uptodate &&
|
|
|
|
processed->end + 1 >= start && end >= processed->end) {
|
|
|
|
processed->end = end;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
tree = &processed->inode->io_tree;
|
|
|
|
/*
|
|
|
|
* Now we don't have range contiguous to the processed range, release
|
|
|
|
* the processed range now.
|
|
|
|
*/
|
|
|
|
if (processed->uptodate && tree->track_uptodate)
|
|
|
|
set_extent_uptodate(tree, processed->start, processed->end,
|
|
|
|
&cached, GFP_ATOMIC);
|
|
|
|
unlock_extent_cached_atomic(tree, processed->start, processed->end,
|
|
|
|
&cached);
|
|
|
|
|
|
|
|
update:
|
|
|
|
/* Update processed to current range */
|
|
|
|
processed->inode = inode;
|
|
|
|
processed->start = start;
|
|
|
|
processed->end = end;
|
|
|
|
processed->uptodate = uptodate;
|
2013-07-25 11:22:35 +00:00
|
|
|
}
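The coalescing logic of endio_readpage_release_extent() above can be tried outside the kernel with a small model. The following is a minimal sketch, not the kernel implementation: range_acc, flush_range() and feed_range() are hypothetical names, the owner pointer stands in for the inode, and the flush merely records what would have been passed to set_extent_uptodate() and the unlock call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct processed_extent. */
struct range_acc {
	const void *owner;	/* stands in for the inode pointer */
	uint64_t start;
	uint64_t end;		/* inclusive */
	bool uptodate;
	unsigned int released;	/* number of ranges flushed so far */
	uint64_t last_start;	/* most recently flushed range */
	uint64_t last_end;
};

/* Stand-in for the extent tree update and unlock: only records the range. */
static void flush_range(struct range_acc *acc)
{
	acc->released++;
	acc->last_start = acc->start;
	acc->last_end = acc->end;
}

/* Feed one range; owner == NULL forces the accumulated range out. */
static void feed_range(struct range_acc *acc, const void *owner,
		       uint64_t start, uint64_t end, bool uptodate)
{
	assert(start <= end);

	if (acc->owner) {
		/* Same owner and status, touching or overlapping: extend. */
		if (acc->owner == owner && acc->uptodate == uptodate &&
		    acc->end + 1 >= start && end >= acc->end) {
			acc->end = end;
			return;
		}
		/* Not contiguous any more: release what was accumulated. */
		flush_range(acc);
	}

	/* Start accumulating from the current range. */
	acc->owner = owner;
	acc->start = start;
	acc->end = end;
	acc->uptodate = uptodate;
}
```

Feeding two adjacent 4 KiB ranges for the same owner leaves released at 0 and extends end; a range from a different owner, or a final call with a NULL owner, flushes the accumulated range.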
|
|
|
|
|
2020-11-13 12:51:29 +00:00
|
|
|
static void endio_readpage_update_page_status(struct page *page, bool uptodate)
|
|
|
|
{
|
|
|
|
if (uptodate) {
|
|
|
|
SetPageUptodate(page);
|
|
|
|
} else {
|
|
|
|
ClearPageUptodate(page);
|
|
|
|
SetPageError(page);
|
|
|
|
}
|
|
|
|
unlock_page(page);
|
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
|
|
|
* after a readpage IO is done, we need to:
|
|
|
|
* clear the uptodate bits on error
|
|
|
|
* set the uptodate bits if things worked
|
|
|
|
* set the page up to date if all extents in the tree are uptodate
|
|
|
|
* clear the lock bit in the extent tree
|
|
|
|
* unlock the page if there are no other extents locked for it
|
|
|
|
*
|
|
|
|
* Scheduling is not allowed, so the extent state tree is expected
|
|
|
|
* to have one and only one object corresponding to this IO.
|
|
|
|
*/
|
2015-07-20 13:29:37 +00:00
|
|
|
static void end_bio_extent_readpage(struct bio *bio)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2013-11-07 20:20:26 +00:00
|
|
|
struct bio_vec *bvec;
|
2017-06-03 07:38:06 +00:00
|
|
|
int uptodate = !bio->bi_status;
|
2013-07-25 11:22:34 +00:00
|
|
|
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
|
2017-05-05 15:57:15 +00:00
|
|
|
struct extent_io_tree *tree, *failure_tree;
|
2020-11-13 12:51:28 +00:00
|
|
|
struct processed_extent processed = { 0 };
|
2020-12-02 06:47:58 +00:00
|
|
|
/*
|
|
|
|
* The offset to the beginning of a bio, since one bio can never be
|
|
|
|
* larger than UINT_MAX, u32 here is enough.
|
|
|
|
*/
|
|
|
|
u32 bio_offset = 0;
|
2012-04-16 13:42:26 +00:00
|
|
|
int mirror;
|
2008-01-24 21:13:08 +00:00
|
|
|
int ret;
|
2019-02-15 11:13:19 +00:00
|
|
|
struct bvec_iter_all iter_all;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2017-07-13 16:10:07 +00:00
|
|
|
ASSERT(!bio_flagged(bio, BIO_CLONED));
|
2019-04-25 07:03:00 +00:00
|
|
|
bio_for_each_segment_all(bvec, bio, iter_all) {
|
2008-01-24 21:13:08 +00:00
|
|
|
struct page *page = bvec->bv_page;
|
2013-06-17 21:14:39 +00:00
|
|
|
struct inode *inode = page->mapping->host;
|
2016-09-20 14:05:02 +00:00
|
|
|
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
|
2020-12-02 06:47:58 +00:00
|
|
|
const u32 sectorsize = fs_info->sectorsize;
|
|
|
|
u64 start;
|
|
|
|
u64 end;
|
|
|
|
u32 len;
|
2011-04-06 10:02:20 +00:00
|
|
|
|
2016-09-20 14:05:02 +00:00
|
|
|
btrfs_debug(fs_info,
|
|
|
|
"end_bio_extent_readpage: bi_sector=%llu, err=%d, mirror=%u",
|
2020-11-26 14:41:27 +00:00
|
|
|
bio->bi_iter.bi_sector, bio->bi_status,
|
2016-09-20 14:05:02 +00:00
|
|
|
io_bio->mirror_num);
|
2013-06-17 21:14:39 +00:00
|
|
|
tree = &BTRFS_I(inode)->io_tree;
|
2017-05-05 15:57:15 +00:00
|
|
|
failure_tree = &BTRFS_I(inode)->io_failure_tree;
|
2008-08-20 12:51:49 +00:00
|
|
|
|
2020-10-21 06:24:58 +00:00
|
|
|
/*
|
|
|
|
* We always issue full-sector reads, but if some block in a
|
|
|
|
* page fails to read, blk_update_request() will advance
|
|
|
|
* bv_offset and adjust bv_len to compensate. Print a warning
|
|
|
|
* for unaligned offsets, and an error if they don't add up to
|
|
|
|
* a full sector.
|
|
|
|
*/
|
|
|
|
if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
|
|
|
|
btrfs_err(fs_info,
|
|
|
|
"partial page read in btrfs with offset %u and length %u",
|
|
|
|
bvec->bv_offset, bvec->bv_len);
|
|
|
|
else if (!IS_ALIGNED(bvec->bv_offset + bvec->bv_len,
|
|
|
|
sectorsize))
|
|
|
|
btrfs_info(fs_info,
|
|
|
|
"incomplete page read with offset %u and length %u",
|
|
|
|
bvec->bv_offset, bvec->bv_len);
|
|
|
|
|
|
|
|
start = page_offset(page) + bvec->bv_offset;
|
|
|
|
end = start + bvec->bv_len - 1;
|
2013-07-25 11:22:34 +00:00
|
|
|
len = bvec->bv_len;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2013-05-17 22:30:14 +00:00
|
|
|
mirror = io_bio->mirror_num;
|
2018-11-22 08:17:49 +00:00
|
|
|
if (likely(uptodate)) {
|
2020-09-18 13:34:36 +00:00
|
|
|
if (is_data_inode(inode))
|
2020-12-02 06:47:58 +00:00
|
|
|
ret = btrfs_verify_data_csum(io_bio,
|
|
|
|
bio_offset, page, start, end,
|
|
|
|
mirror);
|
2020-09-18 13:34:33 +00:00
|
|
|
else
|
|
|
|
ret = btrfs_validate_metadata_buffer(io_bio,
|
2020-11-12 08:47:57 +00:00
|
|
|
page, start, end, mirror);
|
2012-08-27 14:30:03 +00:00
|
|
|
if (ret)
|
2008-01-24 21:13:08 +00:00
|
|
|
uptodate = 0;
|
2012-08-27 14:30:03 +00:00
|
|
|
else
|
2017-05-05 15:57:15 +00:00
|
|
|
clean_io_failure(BTRFS_I(inode)->root->fs_info,
|
|
|
|
failure_tree, tree, start,
|
|
|
|
page,
|
|
|
|
btrfs_ino(BTRFS_I(inode)), 0);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2012-03-27 01:57:36 +00:00
|
|
|
|
2013-07-25 11:22:33 +00:00
|
|
|
if (likely(uptodate))
|
|
|
|
goto readpage_ok;
|
|
|
|
|
2020-09-18 13:34:36 +00:00
|
|
|
if (is_data_inode(inode)) {
|
2017-03-24 22:04:50 +00:00
|
|
|
|
2011-12-01 14:30:36 +00:00
|
|
|
/*
|
2018-11-22 08:17:49 +00:00
|
|
|
* The generic bio_readpage_error handles errors the
|
|
|
|
* following way: If possible, new read requests are
|
|
|
|
* created and submitted and will end up in
|
|
|
|
* end_bio_extent_readpage as well (if we're lucky,
|
|
|
|
* not in the !uptodate case). In that case it returns
|
|
|
|
* 0 and we just go on with the next page in our bio.
|
|
|
|
* If it can't handle the error it will return -EIO and
|
|
|
|
* we remain responsible for that page.
|
2011-12-01 14:30:36 +00:00
|
|
|
*/
|
2020-12-02 06:47:58 +00:00
|
|
|
if (!btrfs_submit_read_repair(inode, bio, bio_offset,
|
|
|
|
page,
|
2020-04-16 21:46:25 +00:00
|
|
|
start - page_offset(page),
|
|
|
|
start, end, mirror,
|
2020-09-18 13:34:37 +00:00
|
|
|
btrfs_submit_data_bio)) {
|
2018-11-22 08:17:49 +00:00
|
|
|
uptodate = !bio->bi_status;
|
2020-12-02 06:47:58 +00:00
|
|
|
ASSERT(bio_offset + len > bio_offset);
|
|
|
|
bio_offset += len;
|
2018-11-22 08:17:49 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
|
|
|
|
eb = (struct extent_buffer *)page->private;
|
|
|
|
set_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
|
|
|
|
eb->read_mirror = mirror;
|
|
|
|
atomic_dec(&eb->io_pages);
|
|
|
|
if (test_and_clear_bit(EXTENT_BUFFER_READAHEAD,
|
|
|
|
&eb->bflags))
|
|
|
|
btree_readahead_hook(eb, -EIO);
|
2008-04-09 20:28:12 +00:00
|
|
|
}
|
2013-07-25 11:22:33 +00:00
|
|
|
readpage_ok:
|
2013-07-25 11:22:35 +00:00
|
|
|
if (likely(uptodate)) {
|
2013-06-17 21:14:39 +00:00
|
|
|
loff_t i_size = i_size_read(inode);
|
2016-04-01 12:29:47 +00:00
|
|
|
pgoff_t end_index = i_size >> PAGE_SHIFT;
|
2014-08-19 15:32:22 +00:00
|
|
|
unsigned off;
|
2013-06-17 21:14:39 +00:00
|
|
|
|
|
|
|
/* Zero out the end if this page straddles i_size */
|
2018-12-05 14:23:03 +00:00
|
|
|
off = offset_in_page(i_size);
|
2014-08-19 15:32:22 +00:00
|
|
|
if (page->index == end_index && off)
|
2016-04-01 12:29:47 +00:00
|
|
|
zero_user_segment(page, off, PAGE_SIZE);
|
2008-01-29 14:59:12 +00:00
|
|
|
}
|
2020-12-02 06:47:58 +00:00
|
|
|
ASSERT(bio_offset + len > bio_offset);
|
|
|
|
bio_offset += len;
|
2013-07-25 11:22:35 +00:00
|
|
|
|
2020-11-13 12:51:29 +00:00
|
|
|
/* Update page status and unlock */
|
|
|
|
endio_readpage_update_page_status(page, uptodate);
|
2020-11-13 12:51:28 +00:00
|
|
|
endio_readpage_release_extent(&processed, BTRFS_I(inode),
|
|
|
|
start, end, uptodate);
|
2013-11-07 20:20:26 +00:00
|
|
|
}
|
2020-11-13 12:51:28 +00:00
|
|
|
/* Release the last extent */
|
|
|
|
endio_readpage_release_extent(&processed, NULL, 0, 0, false);
|
2018-11-22 16:16:49 +00:00
|
|
|
btrfs_io_bio_free_csum(io_bio);
|
2008-01-24 21:13:08 +00:00
|
|
|
bio_put(bio);
|
|
|
|
}
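The "zero out the end if this page straddles i_size" step in end_bio_extent_readpage() above reduces to a shift, a mask and a memset. A minimal userspace sketch follows, assuming 4 KiB pages; SKETCH_PAGE_SHIFT, SKETCH_PAGE_SIZE and zero_past_eof() are hypothetical names, and a plain byte buffer stands in for the page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE (1u << SKETCH_PAGE_SHIFT)

/*
 * Zero the bytes of 'page' that lie beyond the file size, but only when
 * this page is the one that contains the end of the file.
 */
static void zero_past_eof(unsigned char *page, uint64_t page_index,
			  uint64_t i_size)
{
	/* Index of the page containing i_size. */
	uint64_t end_index = i_size >> SKETCH_PAGE_SHIFT;
	/* Offset of i_size within that page. */
	uint32_t off = (uint32_t)(i_size & (SKETCH_PAGE_SIZE - 1));

	assert(page != NULL);

	if (page_index == end_index && off)
		memset(page + off, 0, SKETCH_PAGE_SIZE - off);
}
```

When i_size is page aligned, off is 0 and nothing is zeroed, which matches the `&& off` condition in the loop above.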
|
|
|
|
|
2013-05-17 22:30:14 +00:00
|
|
|
/*
|
2017-06-12 15:29:39 +00:00
|
|
|
* Initialize the members up to but not including 'bio'. Use after allocating a
|
|
|
|
* new bio by bio_alloc_bioset as it does not initialize the bytes outside of
|
|
|
|
* 'bio' because use of __GFP_ZERO is not supported.
|
2013-05-17 22:30:14 +00:00
|
|
|
*/
|
2017-06-12 15:29:39 +00:00
|
|
|
static inline void btrfs_io_bio_init(struct btrfs_io_bio *btrfs_bio)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2017-06-12 15:29:39 +00:00
|
|
|
memset(btrfs_bio, 0, offsetof(struct btrfs_io_bio, bio));
|
|
|
|
}
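The pattern used by btrfs_io_bio_init() above, zeroing every member that precedes an embedded structure and later recovering the container from a pointer to that embedded structure, can be sketched in plain C. The types and helpers below (inner, wrapper, wrapper_init(), wrapper_of()) are hypothetical and only illustrate the offsetof() arithmetic; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical embedded object, playing the role of 'bio'. */
struct inner {
	int payload;
};

/* Hypothetical container, playing the role of the wrapping structure. */
struct wrapper {
	int mirror_num;
	void *csum;
	struct inner inner;	/* placed last in this sketch */
};

/* Zero only the members located before 'inner'; 'inner' is untouched. */
static void wrapper_init(struct wrapper *w)
{
	assert(w != NULL);
	memset(w, 0, offsetof(struct wrapper, inner));
}

/* Recover the container from a pointer to its embedded member. */
static struct wrapper *wrapper_of(struct inner *in)
{
	return (struct wrapper *)((char *)in - offsetof(struct wrapper, inner));
}
```

The point of stopping the memset at the embedded member is that the member has already been set up by its own allocator and must not be overwritten.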
2008-01-24 21:13:08 +00:00

2013-05-17 22:30:14 +00:00
/*
2017-06-02 15:26:26 +00:00
 * The following helpers allocate a bio. As it's backed by a bioset, it'll
 * never fail. We're returning a bio right now but you can call btrfs_io_bio
 * for the appropriate container_of magic
2013-05-17 22:30:14 +00:00
 */
2019-06-18 18:00:16 +00:00
struct bio *btrfs_bio_alloc(u64 first_byte)
2008-01-24 21:13:08 +00:00
{
	struct bio *bio;

2018-05-20 22:25:56 +00:00
	bio = bio_alloc_bioset(GFP_NOFS, BIO_MAX_PAGES, &btrfs_bioset);
2017-06-02 16:35:36 +00:00
	bio->bi_iter.bi_sector = first_byte >> 9;
2017-06-12 15:29:39 +00:00
	btrfs_io_bio_init(btrfs_io_bio(bio));
2008-01-24 21:13:08 +00:00
	return bio;
}
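The `first_byte >> 9` in btrfs_bio_alloc() converts a byte offset into a 512-byte sector number. A minimal userspace sketch of that arithmetic follows; the macro and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT_BITS 9	/* hypothetical name; 1 << 9 == 512 bytes */

/* Convert a byte offset to a 512-byte sector number (rounds down). */
static uint64_t bytes_to_sector(uint64_t first_byte)
{
	return first_byte >> SECTOR_SHIFT_BITS;
}

/* Convert a sector number back to the byte offset where it starts. */
static uint64_t sector_to_bytes(uint64_t sector)
{
	/* Guard against shifting significant bits out of the value. */
	assert(sector <= (UINT64_MAX >> SECTOR_SHIFT_BITS));
	return sector << SECTOR_SHIFT_BITS;
}
```

A byte offset of 4096 therefore maps to sector 8, and an unaligned offset such as 1023 rounds down to sector 1.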

2017-06-02 15:48:13 +00:00
struct bio *btrfs_bio_clone(struct bio *bio)
2013-05-17 22:30:14 +00:00
{
2014-09-12 10:43:54 +00:00
	struct btrfs_io_bio *btrfs_bio;
	struct bio *new;
2013-05-17 22:30:14 +00:00

2017-06-02 15:26:26 +00:00
	/* Bio allocation backed by a bioset does not fail */
2018-05-20 22:25:56 +00:00
	new = bio_clone_fast(bio, GFP_NOFS, &btrfs_bioset);
2017-06-02 15:26:26 +00:00
	btrfs_bio = btrfs_io_bio(new);
2017-06-12 15:29:39 +00:00
	btrfs_io_bio_init(btrfs_bio);
2017-06-02 15:26:26 +00:00
	btrfs_bio->iter = bio->bi_iter;
2014-09-12 10:43:54 +00:00
	return new;
}
2013-05-17 22:30:14 +00:00

2017-06-12 15:29:41 +00:00
struct bio *btrfs_io_bio_alloc(unsigned int nr_iovecs)
2013-05-17 22:30:14 +00:00
{
2013-07-25 11:22:34 +00:00
	struct bio *bio;

2017-06-02 15:26:26 +00:00
	/* Bio allocation backed by a bioset does not fail */
2018-05-20 22:25:56 +00:00
	bio = bio_alloc_bioset(GFP_NOFS, nr_iovecs, &btrfs_bioset);
2017-06-12 15:29:39 +00:00
	btrfs_io_bio_init(btrfs_io_bio(bio));
2013-07-25 11:22:34 +00:00
	return bio;
2013-05-17 22:30:14 +00:00
}

2017-05-16 17:57:14 +00:00
struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
2017-05-16 00:43:31 +00:00
{
	struct bio *bio;
	struct btrfs_io_bio *btrfs_bio;

	/* this will never fail when it's backed by a bioset */
2018-05-20 22:25:56 +00:00
	bio = bio_clone_fast(orig, GFP_NOFS, &btrfs_bioset);
2017-05-16 00:43:31 +00:00
	ASSERT(bio);

	btrfs_bio = btrfs_io_bio(bio);
2017-06-12 15:29:39 +00:00
	btrfs_io_bio_init(btrfs_bio);
2017-05-16 00:43:31 +00:00

	bio_trim(bio, offset >> 9, size >> 9);
2017-05-15 22:33:27 +00:00
	btrfs_bio->iter = bio->bi_iter;
2017-05-16 00:43:31 +00:00
	return bio;
}
2013-05-17 22:30:14 +00:00

2017-06-06 17:14:26 +00:00
/*
 * @opf: bio REQ_OP_* and REQ_* flags as one value
2017-06-12 17:50:41 +00:00
 * @wbc: optional writeback control for io accounting
 * @page: page to add to the bio
 * @pg_offset: starting offset in the page
 * @size: portion of page that we want to write
 * @offset: disk byte offset of the new bio, also used to check whether we
 *          are adding a contiguous page to the previous one
2017-06-06 17:22:55 +00:00
 * @bio_ret: must be valid pointer, newly allocated bio will be stored there
2017-06-12 17:50:41 +00:00
 * @end_io_func: end_io callback for new bio
 * @mirror_num: desired mirror to read/write
 * @prev_bio_flags: flags of previous bio to see if we can merge the current one
 * @bio_flags: flags of the current bio to see if we can merge them
2017-06-06 17:14:26 +00:00
 */
2020-02-05 18:09:28 +00:00
static int submit_extent_page(unsigned int opf,
2015-07-02 20:57:22 +00:00
			      struct writeback_control *wbc,
2017-10-04 15:30:11 +00:00
			      struct page *page, u64 offset,
2017-10-04 15:10:34 +00:00
			      size_t size, unsigned long pg_offset,
2008-01-24 21:13:08 +00:00
			      struct bio **bio_ret,
2008-04-09 20:28:12 +00:00
			      bio_end_io_t end_io_func,
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
			      int mirror_num,
			      unsigned long prev_bio_flags,
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
          File layout

     [0 - 8K]                      [8K - 24K]
         |                             |
         |                             |
    points to extent X,         points to extent X,
    offset 4K, length of 8K     offset 0, length 16K

    [extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. As a consequence, the compressed read end io
handler (compression.c:end_compressed_bio_read()) finishes once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
    seq=`basename $0`
    seqres=$RESULT_DIR/$seq
    echo "QA output created by $seq"
    tmp=/tmp/$$
    status=1	# failure is the default!
    trap "_cleanup; exit \$status" 0 1 2 3 15

    _cleanup()
    {
        rm -f $tmp.*
    }

    # get standard environment, filters and checks
    . ./common/rc
    . ./common/filter

    # real QA test starts here
    _need_to_be_root
    _supported_fs btrfs
    _supported_os Linux
    _require_scratch
    _require_cloner

    rm -f $seqres.full

    test_clone_and_read_compressed_extent()
    {
        local mount_opts=$1

        _scratch_mkfs >>$seqres.full 2>&1
        _scratch_mount $mount_opts

        # Create a test file with a single extent that is compressed (the
        # data we write into it is highly compressible no matter which
        # compression algorithm is used, zlib or lzo).
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
                        -c "pwrite -S 0xbb 4K 8K" \
                        -c "pwrite -S 0xcc 12K 4K" \
                        $SCRATCH_MNT/foo | _filter_xfs_io

        # Now clone our extent into an adjacent offset.
        $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
            $SCRATCH_MNT/foo $SCRATCH_MNT/foo

        # Same as before but for this file we clone the extent into a lower
        # file offset.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
                        -c "pwrite -S 0xbb 12K 8K" \
                        -c "pwrite -S 0xcc 20K 4K" \
                        $SCRATCH_MNT/bar | _filter_xfs_io

        $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
            $SCRATCH_MNT/bar $SCRATCH_MNT/bar

        echo "File digests before unmounting filesystem:"
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch

        # Evicting the inode or clearing the page cache before reading
        # again the file would also trigger the bug - reads were returning
        # all bytes in the range corresponding to the second reference to
        # the extent with a value of 0, but the correct data was persisted
        # (it was a bug exclusively in the read path). The issue happened
        # only if the same readpages() call targeted pages belonging to the
        # first and second ranges that point to the same compressed extent.
        _scratch_remount

        echo "File digests after mounting filesystem again:"
        # Must match the same digests we got before.
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        md5sum $SCRATCH_MNT/bar | _filter_scratch
    }

    echo -e "\nTesting with zlib compression..."
    test_clone_and_read_compressed_extent "-o compress=zlib"

    _scratch_unmount

    echo -e "\nTesting with lzo compression..."
    test_clone_and_read_compressed_extent "-o compress=lzo"

    status=0
    exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
			      unsigned long bio_flags,
			      bool force_bio_submit)
2008-01-24 21:13:08 +00:00
{
	int ret = 0;
	struct bio *bio;
2020-10-21 06:25:01 +00:00
	size_t io_size = min_t(size_t, size, PAGE_SIZE);
2017-10-04 15:30:11 +00:00
	sector_t sector = offset >> 9;
2020-02-05 18:09:28 +00:00
	struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree;
2008-01-24 21:13:08 +00:00

2017-06-06 17:22:55 +00:00
	ASSERT(bio_ret);

	if (*bio_ret) {
2017-06-12 18:00:43 +00:00
		bool contig;
		bool can_merge = true;

2008-01-24 21:13:08 +00:00
		bio = *bio_ret;
2017-06-12 18:00:43 +00:00
		if (prev_bio_flags & EXTENT_BIO_COMPRESSED)
2013-10-11 22:44:27 +00:00
			contig = bio->bi_iter.bi_sector == sector;
2008-10-29 18:49:59 +00:00
		else
2012-09-25 22:05:12 +00:00
			contig = bio_end_sector(bio) == sector;
2008-10-29 18:49:59 +00:00

2020-10-21 06:25:01 +00:00
		if (btrfs_bio_fits_in_stripe(page, io_size, bio, bio_flags))
2017-06-12 18:00:43 +00:00
			can_merge = false;

		if (prev_bio_flags != bio_flags || !contig || !can_merge ||
2015-09-14 08:09:31 +00:00
		    force_bio_submit ||
2020-10-21 06:25:01 +00:00
		    bio_add_page(bio, page, io_size, pg_offset) < io_size) {
2016-06-05 19:31:51 +00:00
			ret = submit_one_bio(bio, mirror_num, prev_bio_flags);
2015-01-05 16:01:03 +00:00
			if (ret < 0) {
				*bio_ret = NULL;
2012-03-12 15:03:00 +00:00
				return ret;
2015-01-05 16:01:03 +00:00
			}
2008-01-24 21:13:08 +00:00
			bio = NULL;
		} else {
2015-07-02 20:57:22 +00:00
			if (wbc)
2020-10-21 06:25:01 +00:00
				wbc_account_cgroup_owner(wbc, page, io_size);
2008-01-24 21:13:08 +00:00
			return 0;
		}
	}
2008-10-29 18:49:59 +00:00

2019-06-18 18:00:16 +00:00
	bio = btrfs_bio_alloc(offset);
2020-10-21 06:25:01 +00:00
	bio_add_page(bio, page, io_size, pg_offset);
2008-01-24 21:13:08 +00:00
	bio->bi_end_io = end_io_func;
	bio->bi_private = tree;
2017-06-27 17:51:28 +00:00
	bio->bi_write_hint = page->mapping->host->i_write_hint;
2017-06-06 17:14:26 +00:00
	bio->bi_opf = opf;
2015-07-02 20:57:22 +00:00
	if (wbc) {
2019-11-18 22:27:55 +00:00
		struct block_device *bdev;

		bdev = BTRFS_I(page->mapping->host)->root->fs_info->fs_devices->latest_bdev;
		bio_set_dev(bio, bdev);
2015-07-02 20:57:22 +00:00
		wbc_init_bio(wbc, bio);
2020-10-21 06:25:01 +00:00
		wbc_account_cgroup_owner(wbc, page, io_size);
2015-07-02 20:57:22 +00:00
	}
2008-01-29 14:59:12 +00:00

2017-06-06 17:22:55 +00:00
	*bio_ret = bio;
2008-01-24 21:13:08 +00:00

	return ret;
}

2013-04-25 20:41:01 +00:00
static void attach_extent_buffer_page(struct extent_buffer *eb,
				      struct page *page)
2008-01-24 21:13:08 +00:00
{
2020-10-21 06:25:02 +00:00
	/*
	 * If the page is mapped to btree inode, we should hold the private
	 * lock to prevent race.
	 * For cloned or dummy extent buffers, their pages are not mapped and
	 * will not race with any other ebs.
	 */
	if (page->mapping)
		lockdep_assert_held(&page->mapping->private_lock);

2020-06-02 04:47:45 +00:00
	if (!PagePrivate(page))
		attach_page_private(page, eb);
	else
2012-03-07 21:20:05 +00:00
		WARN_ON(page->private != (unsigned long)eb);
2008-01-24 21:13:08 +00:00
}

2012-03-07 21:20:05 +00:00
void set_page_extent_mapped(struct page *page)
2008-01-24 21:13:08 +00:00
{
2020-06-02 04:47:45 +00:00
	if (!PagePrivate(page))
		attach_page_private(page, (void *)EXTENT_PAGE_PRIVATE);
2008-01-24 21:13:08 +00:00
}

2013-07-25 11:22:37 +00:00
static struct extent_map *
__get_extent_map(struct inode *inode, struct page *page, size_t pg_offset,
2020-09-14 09:37:06 +00:00
		 u64 start, u64 len, struct extent_map **em_cached)
2013-07-25 11:22:37 +00:00
{
	struct extent_map *em;

	if (em_cached && *em_cached) {
		em = *em_cached;
2014-02-25 14:15:12 +00:00
		if (extent_map_in_tree(em) && start >= em->start &&
2013-07-25 11:22:37 +00:00
		    start < extent_map_end(em)) {
2017-03-03 08:55:12 +00:00
			refcount_inc(&em->refs);
2013-07-25 11:22:37 +00:00
			return em;
		}

		free_extent_map(em);
		*em_cached = NULL;
	}

2020-09-14 09:37:06 +00:00
	em = btrfs_get_extent(BTRFS_I(inode), page, pg_offset, start, len);
2013-07-25 11:22:37 +00:00
	if (em_cached && !IS_ERR_OR_NULL(em)) {
		BUG_ON(*em_cached);
2017-03-03 08:55:12 +00:00
		refcount_inc(&em->refs);
2013-07-25 11:22:37 +00:00
		*em_cached = em;
	}
	return em;
}
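The one-entry cache pattern in __get_extent_map() can be sketched in userspace as follows. This is only an approximation under stated assumptions: `struct fake_em`, `fake_table`, `fake_lookup()` and `get_em_cached()` are hypothetical, plain integers replace refcount_t, the extent_map_in_tree() check is omitted, and the lookup table stands in for btrfs_get_extent().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent map: a byte range plus a reference count. */
struct fake_em {
	uint64_t start;
	uint64_t len;
	int refs;
};

/* Two made-up extents standing in for the real extent tree. */
static struct fake_em fake_table[2] = {
	{ 0, 4096, 0 },
	{ 4096, 8192, 0 },
};

static uint64_t fake_em_end(const struct fake_em *em)
{
	return em->start + em->len;	/* exclusive end */
}

/* Stand-in for the slow lookup; returns a referenced map or NULL. */
static struct fake_em *fake_lookup(uint64_t start)
{
	for (size_t i = 0; i < 2; i++) {
		if (start >= fake_table[i].start &&
		    start < fake_em_end(&fake_table[i])) {
			fake_table[i].refs++;	/* reference for the caller */
			return &fake_table[i];
		}
	}
	return NULL;
}

/* Return the cached map if it covers @start, else look it up and cache it. */
static struct fake_em *get_em_cached(struct fake_em **cache, uint64_t start)
{
	struct fake_em *em;

	assert(cache != NULL);
	if (*cache) {
		em = *cache;
		if (start >= em->start && start < fake_em_end(em)) {
			em->refs++;	/* the caller gets its own reference */
			return em;
		}
		em->refs--;		/* stale entry: drop the cache's reference */
		*cache = NULL;
	}

	em = fake_lookup(start);
	if (em) {
		em->refs++;		/* extra reference held by the cache */
		*cache = em;
	}
	return em;
}
```

The point of the two increments is that the cache and the caller each own a reference, so the caller can drop its reference without invalidating the cached pointer.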
2008-01-24 21:13:08 +00:00
/*
 * basic readpage implementation. Locked extent state structs are inserted
 * into the tree that are removed when the IO is done (by the end_io
 * handlers)
2012-03-12 15:03:00 +00:00
 * XXX JDM: This needs looking at to ensure proper page locking
2016-07-11 17:39:07 +00:00
 * return 0 on success, otherwise return error
2008-01-24 21:13:08 +00:00
 */
2020-09-14 11:39:16 +00:00
int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
		      struct bio **bio, unsigned long *bio_flags,
		      unsigned int read_flags, u64 *prev_em_start)
2008-01-24 21:13:08 +00:00
{
	struct inode *inode = page->mapping->host;
2012-12-21 09:17:45 +00:00
	u64 start = page_offset(page);
2017-06-06 17:50:13 +00:00
	const u64 end = start + PAGE_SIZE - 1;
2008-01-24 21:13:08 +00:00
	u64 cur = start;
	u64 extent_offset;
	u64 last_byte = i_size_read(inode);
	u64 block_start;
	u64 cur_end;
	struct extent_map *em;
2016-07-11 17:39:07 +00:00
	int ret = 0;
2008-01-24 21:13:08 +00:00
	int nr = 0;
2011-04-19 12:29:38 +00:00
	size_t pg_offset = 0;
2008-01-24 21:13:08 +00:00
	size_t iosize;
	size_t blocksize = inode->i_sb->s_blocksize;
2016-01-27 19:17:20 +00:00
	unsigned long this_bio_flag = 0;
2020-02-05 18:09:42 +00:00
	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
2020-02-05 18:09:30 +00:00

2008-01-24 21:13:08 +00:00
	set_page_extent_mapped(page);

2011-05-26 16:01:56 +00:00
	if (!PageUptodate(page)) {
		if (cleancache_get_page(page) == 0) {
			BUG_ON(blocksize != PAGE_SIZE);
2013-07-25 11:22:36 +00:00
			unlock_extent(tree, start, end);
2011-05-26 16:01:56 +00:00
			goto out;
		}
	}

mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
	if (page->index == last_byte >> PAGE_SHIFT) {
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software-only limit; the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
char *userpage;
|
2018-12-05 14:23:03 +00:00
|
|
|
size_t zero_offset = offset_in_page(last_byte);
|
2008-10-29 18:49:59 +00:00
|
|
|
|
|
|
|
if (zero_offset) {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
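The reason the first two rewrite rules can simply drop the shift can be illustrated with a minimal user-space sketch. It assumes, as the message above describes, that the page cache shift equals the page shift; the `SKETCH_*` macros and both helper functions are hypothetical names invented for this illustration and are not kernel identifiers.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. The kernel value is architecture dependent. */
#define SKETCH_PAGE_SHIFT        12
/* Assumption taken from the commit message: both shifts are equal. */
#define SKETCH_PAGE_CACHE_SHIFT  SKETCH_PAGE_SHIFT

/* Pre-patch style: convert a page cache index using the shift difference. */
static unsigned long sketch_index_old(unsigned long index)
{
	/* The difference is 0, so this shift changes nothing. */
	return index << (SKETCH_PAGE_CACHE_SHIFT - SKETCH_PAGE_SHIFT);
}

/* Post-patch style: the expression is used as is. */
static unsigned long sketch_index_new(unsigned long index)
{
	return index;
}
```

Because the shift amount is a compile-time zero under this assumption, both helpers return the same value for every input, which is why a purely mechanical rewrite is safe.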
2016-04-01 12:29:47 +00:00
|
|
|
iosize = PAGE_SIZE - zero_offset;
|
2011-11-25 15:14:28 +00:00
|
|
|
userpage = kmap_atomic(page);
|
2008-10-29 18:49:59 +00:00
|
|
|
memset(userpage + zero_offset, 0, iosize);
|
|
|
|
flush_dcache_page(page);
|
2011-11-25 15:14:28 +00:00
|
|
|
kunmap_atomic(userpage);
|
2008-10-29 18:49:59 +00:00
|
|
|
}
|
|
|
|
}
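The tail-zeroing block above can be sketched in user space as follows. This is only an illustration under stated assumptions: the page is a plain byte buffer, `SKETCH_PAGE_SIZE` is an assumed 4 KiB page size, the masking stands in for `offset_in_page()`, and the check that the page is the one containing `last_byte` is left to the caller. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 4 KiB pages; the kernel code uses PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Zero the part of a page buffer that lies past the end of the file.
 * Returns the number of bytes zeroed, or 0 when last_byte is page aligned
 * and nothing needs to be cleared.
 */
static size_t sketch_zero_page_tail(unsigned char *page, uint64_t last_byte)
{
	/* Stand-in for offset_in_page(last_byte). */
	size_t zero_offset = (size_t)(last_byte & (SKETCH_PAGE_SIZE - 1));
	size_t iosize;

	if (zero_offset == 0)
		return 0;

	iosize = SKETCH_PAGE_SIZE - zero_offset;
	memset(page + zero_offset, 0, iosize);
	return iosize;
}
```

The kernel version additionally maps the page with `kmap_atomic()` and calls `flush_dcache_page()`; neither has a user-space equivalent here, so both are omitted.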
|
2008-01-24 21:13:08 +00:00
|
|
|
while (cur <= end) {
|
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
      File layout

      [0 - 8K]                     [8K - 24K]
          |                             |
          |                             |
   points to extent X,         points to extent X,
   offset 4K, length of 8K     offset 0, length 16K

   [extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. As a consequence, the compressed read end io
handler (compression.c:end_compressed_bio_read()) finishes once the extent
is decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple ranges pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated to the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
local mount_opts=$1
_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount $mount_opts
# Create a test file with a single extent that is compressed (the
# data we write into it is highly compressible no matter which
# compression algorithm is used, zlib or lzo).
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
-c "pwrite -S 0xbb 4K 8K" \
-c "pwrite -S 0xcc 12K 4K" \
$SCRATCH_MNT/foo | _filter_xfs_io
# Now clone our extent into an adjacent offset.
$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
$SCRATCH_MNT/foo $SCRATCH_MNT/foo
# Same as before but for this file we clone the extent into a lower
# file offset.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
-c "pwrite -S 0xbb 12K 8K" \
-c "pwrite -S 0xcc 20K 4K" \
$SCRATCH_MNT/bar | _filter_xfs_io
$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
$SCRATCH_MNT/bar $SCRATCH_MNT/bar
echo "File digests before unmounting filesystem:"
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
# Evicting the inode or clearing the page cache before reading
# again the file would also trigger the bug - reads were returning
# all bytes in the range corresponding to the second reference to
# the extent with a value of 0, but the correct data was persisted
# (it was a bug exclusively in the read path). The issue happened
# only if the same readpages() call targeted pages belonging to the
# first and second ranges that point to the same compressed extent.
_scratch_remount
echo "File digests after mounting filesystem again:"
# Must match the same digests we got before.
md5sum $SCRATCH_MNT/foo | _filter_scratch
md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
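The decision described in the fix ("submit the current bio whenever a range pointing to a compressed extent was preceded by a range with a different extent map") might be sketched as below. This is a hedged user-space illustration, not the kernel code: `struct sketch_extent_map`, `SKETCH_NO_PREV` and `sketch_force_bio_submit()` are hypothetical, and comparing extent map start offsets to detect "a different extent map" is an assumption drawn from the wording of the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for an extent map. */
struct sketch_extent_map {
	uint64_t start;      /* file offset where this mapping begins */
	bool     compressed; /* true if the mapping points to a compressed extent */
};

/* Sentinel meaning "no previous extent map seen yet". */
#define SKETCH_NO_PREV ((uint64_t)-1)

/*
 * Return true when the bio built so far should be submitted before adding
 * pages for this extent map, then remember this map as the previous one.
 */
static bool sketch_force_bio_submit(const struct sketch_extent_map *em,
				    uint64_t *prev_em_start)
{
	bool force = em->compressed &&
		     prev_em_start &&
		     *prev_em_start != SKETCH_NO_PREV &&
		     *prev_em_start != em->start; /* assumed comparison key */

	if (prev_em_start)
		*prev_em_start = em->start;
	return force;
}
```

With the file layout from the example, the map for [0 - 8K] is seen first and does not force a submit; the map for [8K - 24K] has a different start, so the pending bio is submitted and each range is read with its own bio.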
2015-09-14 08:09:31 +00:00
|
|
|
bool force_bio_submit = false;
|
2017-10-04 15:30:11 +00:00
|
|
|
u64 offset;
|
2013-02-11 16:33:00 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
if (cur >= last_byte) {
|
|
|
|
char *userpage;
|
2011-04-06 10:02:20 +00:00
|
|
|
struct extent_state *cached = NULL;
|
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
iosize = PAGE_SIZE - pg_offset;
|
2011-11-25 15:14:28 +00:00
|
|
|
userpage = kmap_atomic(page);
|
2011-04-19 12:29:38 +00:00
|
|
|
memset(userpage + pg_offset, 0, iosize);
|
2008-01-24 21:13:08 +00:00
|
|
|
flush_dcache_page(page);
|
2011-11-25 15:14:28 +00:00
|
|
|
kunmap_atomic(userpage);
|
2008-01-24 21:13:08 +00:00
|
|
|
set_extent_uptodate(tree, cur, cur + iosize - 1,
|
2011-04-06 10:02:20 +00:00
|
|
|
&cached, GFP_NOFS);
|
2016-01-27 19:17:20 +00:00
|
|
|
unlock_extent_cached(tree, cur,
|
2017-12-12 20:43:52 +00:00
|
|
|
cur + iosize - 1, &cached);
|
2008-01-24 21:13:08 +00:00
|
|
|
break;
|
|
|
|
}
|
2013-07-25 11:22:37 +00:00
|
|
|
em = __get_extent_map(inode, page, pg_offset, cur,
|
2020-09-14 09:37:06 +00:00
|
|
|
end - cur + 1, em_cached);
|
2011-04-19 16:00:01 +00:00
|
|
|
if (IS_ERR_OR_NULL(em)) {
|
2008-01-24 21:13:08 +00:00
|
|
|
SetPageError(page);
|
2016-01-27 19:17:20 +00:00
|
|
|
unlock_extent(tree, cur, end);
|
2008-01-24 21:13:08 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
extent_offset = cur - em->start;
|
|
|
|
BUG_ON(extent_map_end(em) <= cur);
|
|
|
|
BUG_ON(end < cur);
|
|
|
|
|
2010-12-17 06:21:50 +00:00
|
|
|
if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
|
2013-08-06 18:42:50 +00:00
|
|
|
this_bio_flag |= EXTENT_BIO_COMPRESSED;
|
2010-12-17 06:21:50 +00:00
|
|
|
extent_set_compress_type(&this_bio_flag,
|
|
|
|
em->compress_type);
|
|
|
|
}
|
2008-10-29 18:49:59 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
iosize = min(extent_map_end(em) - cur, end - cur + 1);
|
|
|
|
cur_end = min(extent_map_end(em) - 1, end);
|
2013-02-26 08:10:22 +00:00
|
|
|
iosize = ALIGN(iosize, blocksize);
|
2020-09-15 15:41:40 +00:00
|
|
|
if (this_bio_flag & EXTENT_BIO_COMPRESSED)
|
2017-10-04 15:30:11 +00:00
|
|
|
offset = em->block_start;
|
2020-09-15 15:41:40 +00:00
|
|
|
else
|
2017-10-04 15:30:11 +00:00
|
|
|
offset = em->block_start + extent_offset;
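The size and offset arithmetic in the lines above can be followed with a small user-space sketch. It is an illustration under assumptions: `struct sketch_em` is a hypothetical reduction of the extent map, the end of the map is taken to be `start + len`, and `blocksize` is assumed to be a power of two so that rounding up can use a mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced extent map used only for this sketch. */
struct sketch_em {
	uint64_t start;       /* file offset of the mapping */
	uint64_t len;         /* length of the mapping in the file */
	uint64_t block_start; /* disk byte where the extent begins */
	bool     compressed;
};

struct sketch_io {
	uint64_t iosize; /* bytes to read, rounded up to the block size */
	uint64_t offset; /* disk byte the read starts from */
};

static struct sketch_io sketch_read_geometry(const struct sketch_em *em,
					     uint64_t cur, uint64_t end,
					     uint64_t blocksize)
{
	uint64_t em_end = em->start + em->len; /* assumed extent_map_end() */
	uint64_t extent_offset = cur - em->start;
	uint64_t left_in_em = em_end - cur;
	uint64_t left_in_range = end - cur + 1;
	struct sketch_io io;

	/* blocksize must be a non-zero power of two for the mask below. */
	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);

	io.iosize = left_in_em < left_in_range ? left_in_em : left_in_range;
	io.iosize = (io.iosize + blocksize - 1) & ~(blocksize - 1);

	/* A compressed extent is always read from its first disk byte. */
	io.offset = em->compressed ? em->block_start
				   : em->block_start + extent_offset;
	return io;
}
```

For a compressed extent the whole on-disk extent has to be read and decompressed, so the offset ignores `extent_offset`; for an uncompressed one the read can start directly at the requested position inside the extent.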
|
2008-01-24 21:13:08 +00:00
|
|
|
block_start = em->block_start;
|
2008-10-30 18:25:28 +00:00
|
|
|
if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
|
|
|
|
block_start = EXTENT_MAP_HOLE;
|
2015-09-14 08:09:31 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we have a file range that points to a compressed extent
|
2020-08-05 02:48:34 +00:00
|
|
|
* and it's followed by a consecutive file range that points
|
2015-09-14 08:09:31 +00:00
		 * to the same compressed extent (possibly with a different
		 * offset and/or length, so it either points to the whole extent
		 * or only part of it), we must make sure we do not submit a
		 * single bio to populate the pages for the 2 ranges because
		 * this makes the compressed extent read zero out the pages
		 * belonging to the 2nd range. Imagine the following scenario:
		 *
		 *  File layout
		 *  [0 - 8K]                     [8K - 24K]
		 *    |                               |
		 *    |                               |
		 * points to extent X,         points to extent X,
		 * offset 4K, length of 8K     offset 0, length 16K
		 *
		 * [extent X, compressed length = 4K uncompressed length = 16K]
		 *
		 * If the bio to read the compressed extent covers both ranges,
		 * it will decompress extent X into the pages belonging to the
		 * first range and then it will stop, zeroing out the remaining
		 * pages that belong to the other range that points to extent X.
		 * So here we make sure we submit 2 bios, one for the first
		 * range and another one for the second range. Both will target
		 * the same physical extent from disk, but we can't currently
		 * make the compressed bio endio callback populate the pages
		 * for both ranges because each compressed bio is tightly
		 * coupled with a single extent map, and each range can have
		 * an extent map with a different offset value relative to the
		 * uncompressed data of our extent and different lengths. This
		 * is a corner case so we prioritize correctness over optimal
		 * behavior (we submit 2 bios for the same extent).
		 */
		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) &&
		    prev_em_start && *prev_em_start != (u64)-1 &&
Btrfs: fix corruption reading shared and compressed extents after hole punching
In the past we had data corruption when reading compressed extents that
are shared within the same file and are consecutive. This was fixed by
commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and
shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read
corruption of compressed and shared extents"). However, those fixes missed
the case where the shared and compressed extents are referenced with a
non-zero offset. The following shell script reproduces the issue:
#!/bin/bash
mkfs.btrfs -f /dev/sdc &> /dev/null
mount -o compress /dev/sdc /mnt/sdc
# Create a file with 3 consecutive compressed extents, each with an
# uncompressed size of 128K and a compressed size of 4K.
for ((i = 1; i <= 3; i++)); do
    head -c 4096 /dev/zero
    for ((j = 1; j <= 31; j++)); do
        head -c 4096 /dev/zero | tr '\0' "\377"
    done
done > /mnt/sdc/foobar
sync
echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)"
# Clone the first extent into offsets 128K and 256K.
xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar
xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar
sync
echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)"
# Punch holes into the regions that are already full of zeroes.
xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar
xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar
sync
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
echo "Dropping page cache..."
sysctl -q vm.drop_caches=1
echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)"
umount /dev/sdc
When running the script we get the following output:
Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
linked 131072/131072 bytes at offset 131072
128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec)
linked 131072/131072 bytes at offset 262144
128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec)
Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar
Dropping page cache...
Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar
This happens because after reading all the pages of the extent in the
range from 128K to 256K, for example, we read the hole at offset 256K.
Then, when reading the page at offset 260K, we don't submit the existing
bio, which is responsible for filling only the pages in the range 128K
to 256K. Instead we add the pages from the range 260K to 384K to the
existing bio and submit it after iterating over the entire range. Once
the bio completes, the uncompressed data fills only the pages in the
range 128K to 256K because there's no more data read from disk, leaving
the pages in the range 260K to 384K unfilled. It is just a slightly
different variant of what was solved by commit 005efedf2c7d0 ("Btrfs:
fix read corruption of compressed and shared extents").
Fix this by forcing a bio submit, during readpages(), whenever we find a
compressed extent map for a page that differs from the extent map of the
previous page or that starts at a different file offset (in case it
refers to the same compressed extent). The comparison now uses the extent
map's start offset instead of its original start offset.
A test case for fstests follows soon.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents")
Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents")
Cc: stable@vger.kernel.org # 4.3+
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-14 15:17:20 +00:00
		    *prev_em_start != em->start)
Btrfs: fix read corruption of compressed and shared extents
If a file has a range pointing to a compressed extent, followed by
another range that points to the same compressed extent and a read
operation attempts to read both ranges (either completely or part of
them), the pages that correspond to the second range are incorrectly
filled with zeroes.
Consider the following example:
  File layout
  [0 - 8K]                     [8K - 24K]
      |                             |
      |                             |
 points to extent X,         points to extent X,
 offset 4K, length of 8K     offset 0, length 16K

 [extent X, compressed length = 4K uncompressed length = 16K]
If a readpages() call spans the 2 ranges, a single bio to read the extent
is submitted - extent_io.c:submit_extent_page() would only create a new
bio to cover the second range pointing to the extent if the extent it
points to had a different logical address than the extent associated with
the first range. As a consequence, the compressed read end io handler
(compression.c:end_compressed_bio_read()) finishes once the extent is
decompressed into the pages covering the first range, leaving the
remaining pages (belonging to the second range) filled with zeroes (done
by compression.c:btrfs_clear_biovec_end()).
So fix this by submitting the current bio whenever we find a range
pointing to a compressed extent that was preceded by a range with a
different extent map. This is the simplest solution for this corner
case. Making the end io callback populate both ranges (or more, if we
have multiple ranges pointing to the same extent) is a much more complex
solution since each bio is tightly coupled with a single extent map and
the extent maps associated with the ranges pointing to the shared extent
can have different offsets and lengths.
The following test case for fstests triggers the issue:
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
_cleanup()
{
	rm -f $tmp.*
}
# get standard environment, filters and checks
. ./common/rc
. ./common/filter
# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
rm -f $seqres.full
test_clone_and_read_compressed_extent()
{
	local mount_opts=$1

	_scratch_mkfs >>$seqres.full 2>&1
	_scratch_mount $mount_opts

	# Create a test file with a single extent that is compressed (the
	# data we write into it is highly compressible no matter which
	# compression algorithm is used, zlib or lzo).
	$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \
			-c "pwrite -S 0xbb 4K 8K" \
			-c "pwrite -S 0xcc 12K 4K" \
			$SCRATCH_MNT/foo | _filter_xfs_io

	# Now clone our extent into an adjacent offset.
	$CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \
		$SCRATCH_MNT/foo $SCRATCH_MNT/foo

	# Same as before but for this file we clone the extent into a lower
	# file offset.
	$XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \
			-c "pwrite -S 0xbb 12K 8K" \
			-c "pwrite -S 0xcc 20K 4K" \
			$SCRATCH_MNT/bar | _filter_xfs_io

	$CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \
		$SCRATCH_MNT/bar $SCRATCH_MNT/bar

	echo "File digests before unmounting filesystem:"
	md5sum $SCRATCH_MNT/foo | _filter_scratch
	md5sum $SCRATCH_MNT/bar | _filter_scratch

	# Evicting the inode or clearing the page cache before reading
	# the file again would also trigger the bug - reads were returning
	# all bytes in the range corresponding to the second reference to
	# the extent with a value of 0, but the correct data was persisted
	# (it was a bug exclusively in the read path). The issue happened
	# only if the same readpages() call targeted pages belonging to the
	# first and second ranges that point to the same compressed extent.
	_scratch_remount

	echo "File digests after mounting filesystem again:"
	# Must match the same digests we got before.
	md5sum $SCRATCH_MNT/foo | _filter_scratch
	md5sum $SCRATCH_MNT/bar | _filter_scratch
}
echo -e "\nTesting with zlib compression..."
test_clone_and_read_compressed_extent "-o compress=zlib"
_scratch_unmount
echo -e "\nTesting with lzo compression..."
test_clone_and_read_compressed_extent "-o compress=lzo"
status=0
exit
Cc: stable@vger.kernel.org
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-14 08:09:31 +00:00
			force_bio_submit = true;

		if (prev_em_start)
Btrfs: fix corruption reading shared and compressed extents after hole punching
2019-02-14 15:17:20 +00:00
			*prev_em_start = em->start;
Btrfs: fix read corruption of compressed and shared extents
2015-09-14 08:09:31 +00:00

2008-01-24 21:13:08 +00:00
		free_extent_map(em);
		em = NULL;

		/* we've found a hole, just zero and go on */
		if (block_start == EXTENT_MAP_HOLE) {
			char *userpage;
2011-04-06 10:02:20 +00:00
			struct extent_state *cached = NULL;

2011-11-25 15:14:28 +00:00
			userpage = kmap_atomic(page);
2011-04-19 12:29:38 +00:00
			memset(userpage + pg_offset, 0, iosize);
2008-01-24 21:13:08 +00:00
			flush_dcache_page(page);
2011-11-25 15:14:28 +00:00
			kunmap_atomic(userpage);
2008-01-24 21:13:08 +00:00

			set_extent_uptodate(tree, cur, cur + iosize - 1,
2011-04-06 10:02:20 +00:00
					    &cached, GFP_NOFS);
2016-01-27 19:17:20 +00:00
			unlock_extent_cached(tree, cur,
2017-12-12 20:43:52 +00:00
					     cur + iosize - 1, &cached);
2008-01-24 21:13:08 +00:00
			cur = cur + iosize;
2011-04-19 12:29:38 +00:00
			pg_offset += iosize;
2008-01-24 21:13:08 +00:00
			continue;
		}
		/* the get_extent function already copied into the page */
2009-09-02 19:22:30 +00:00
		if (test_range_bit(tree, cur, cur_end,
				   EXTENT_UPTODATE, 1, NULL)) {
2008-09-05 20:09:51 +00:00
			check_page_uptodate(tree, page);
2016-01-27 19:17:20 +00:00
			unlock_extent(tree, cur, cur + iosize - 1);
2008-01-24 21:13:08 +00:00
			cur = cur + iosize;
2011-04-19 12:29:38 +00:00
			pg_offset += iosize;
2008-01-24 21:13:08 +00:00
			continue;
		}
2008-01-29 14:59:12 +00:00
		/* we have an inline extent but it didn't get marked up
		 * to date. Error out
		 */
		if (block_start == EXTENT_MAP_INLINE) {
			SetPageError(page);
2016-01-27 19:17:20 +00:00
			unlock_extent(tree, cur, cur + iosize - 1);
2008-01-29 14:59:12 +00:00
			cur = cur + iosize;
2011-04-19 12:29:38 +00:00
			pg_offset += iosize;
2008-01-29 14:59:12 +00:00
			continue;
		}
2008-01-24 21:13:08 +00:00

2020-02-05 18:09:28 +00:00
		ret = submit_extent_page(REQ_OP_READ | read_flags, NULL,
2020-09-15 15:41:40 +00:00
					 page, offset, iosize,
2019-10-03 15:29:05 +00:00
					 pg_offset, bio,
2020-09-14 09:37:11 +00:00
					 end_bio_extent_readpage, 0,
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
					 *bio_flags,
Btrfs: fix read corruption of compressed and shared extents
2015-09-14 08:09:31 +00:00
					 this_bio_flag,
					 force_bio_submit);
2013-02-11 16:33:00 +00:00
		if (!ret) {
			nr++;
			*bio_flags = this_bio_flag;
		} else {
2008-01-24 21:13:08 +00:00
			SetPageError(page);
2016-01-27 19:17:20 +00:00
			unlock_extent(tree, cur, cur + iosize - 1);
2016-07-11 17:39:07 +00:00
			goto out;
2012-10-05 20:40:32 +00:00
		}
2008-01-24 21:13:08 +00:00
		cur = cur + iosize;
2011-04-19 12:29:38 +00:00
		pg_offset += iosize;
2008-01-24 21:13:08 +00:00
	}
2011-05-26 16:01:56 +00:00
out:
2008-01-24 21:13:08 +00:00
	if (!nr) {
		if (!PageError(page))
			SetPageUptodate(page);
		unlock_page(page);
	}
2016-07-11 17:39:07 +00:00
	return ret;
2008-01-24 21:13:08 +00:00
}

2020-02-05 18:09:40 +00:00
static inline void contiguous_readpages(struct page *pages[], int nr_pages,
2013-07-25 11:22:36 +00:00
					     u64 start, u64 end,
2013-07-25 11:22:37 +00:00
					     struct extent_map **em_cached,
2017-10-24 08:50:39 +00:00
					     struct bio **bio,
2016-06-05 19:31:51 +00:00
					     unsigned long *bio_flags,
2015-09-28 08:56:26 +00:00
					     u64 *prev_em_start)
2013-07-25 11:22:36 +00:00
{
2019-05-07 07:19:23 +00:00
	struct btrfs_inode *inode = BTRFS_I(pages[0]->mapping->host);
2013-07-25 11:22:36 +00:00
	int index;

2020-02-05 18:09:33 +00:00
	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
2013-07-25 11:22:36 +00:00

	for (index = 0; index < nr_pages; index++) {
2020-09-14 11:39:16 +00:00
		btrfs_do_readpage(pages[index], em_cached, bio, bio_flags,
				  REQ_RAHEAD, prev_em_start);
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion whether a
PAGE_CACHE_* or a PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
		put_page(pages[index]);
2013-07-25 11:22:36 +00:00
	}
}

2017-02-10 18:33:41 +00:00
static void update_nr_written(struct writeback_control *wbc,
2016-03-08 00:56:21 +00:00
			      unsigned long nr_written)
2009-04-20 19:50:09 +00:00
{
	wbc->nr_to_write -= nr_written;
}

2008-01-24 21:13:08 +00:00
/*
2014-05-21 20:35:51 +00:00
 * helper for __extent_writepage, doing all of the delayed allocation setup.
 *
2018-11-01 12:09:46 +00:00
 * This returns 1 if btrfs_run_delalloc_range function did all the work required
2014-05-21 20:35:51 +00:00
 * to write the page (copy into inline extent). In this case the IO has
 * been started and the page is already unlocked.
 *
 * This returns 0 if all went well (page still locked)
 * This returns < 0 if there were errors (page still locked)
2008-01-24 21:13:08 +00:00
 */
2020-06-05 07:42:10 +00:00
static noinline_for_stack int writepage_delalloc(struct btrfs_inode *inode,
2018-11-08 08:18:07 +00:00
		struct page *page, struct writeback_control *wbc,
		u64 delalloc_start, unsigned long *nr_written)
2014-05-21 20:35:51 +00:00
{
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straightforward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files,
so I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
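The shift rules in the script above rely on the two shifts being equal, so that shifting by their difference is a shift by zero. A minimal user-space sketch of that reasoning follows. It assumes 4 KiB pages and assumes the old page-cache macro was a plain alias of the page macro, as the commit message implies; every SKETCH_/sketch_ name is hypothetical and not kernel API.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. The real PAGE_SHIFT depends on the architecture. */
#define SKETCH_PAGE_SHIFT       12
/* Assumption: the old page-cache macro was a plain alias of the page macro. */
#define SKETCH_PAGE_CACHE_SHIFT SKETCH_PAGE_SHIFT
#define SKETCH_PAGE_SIZE        (1UL << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_CACHE_SIZE  (1UL << SKETCH_PAGE_CACHE_SHIFT)

/* Before the patch: convert a page-cache index to a page index. */
static unsigned long sketch_index_before(unsigned long index)
{
	return index << (SKETCH_PAGE_CACHE_SHIFT - SKETCH_PAGE_SHIFT);
}

/* After the patch: the shift by zero is dropped. */
static unsigned long sketch_index_after(unsigned long index)
{
	return index;
}

/* Asserts that both forms agree for every index below limit. */
static void sketch_check_range(unsigned long limit)
{
	unsigned long i;

	for (i = 0; i < limit; i++)
		assert(sketch_index_before(i) == sketch_index_after(i));
}
```

Calling sketch_check_range(1024) from a test program exercises the equivalence; if the two shifts ever differed, the first rule of the script would no longer be a pure simplification.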
|
|
|
u64 page_end = delalloc_start + PAGE_SIZE - 1;
|
2018-11-29 03:33:38 +00:00
|
|
|
bool found;
|
2014-05-21 20:35:51 +00:00
|
|
|
u64 delalloc_to_write = 0;
|
|
|
|
u64 delalloc_end = 0;
|
|
|
|
int ret;
|
|
|
|
int page_started = 0;
|
|
|
|
|
|
|
|
|
|
|
|
while (delalloc_end < page_end) {
|
2020-06-05 07:42:10 +00:00
|
|
|
found = find_lock_delalloc_range(&inode->vfs_inode, page,
|
2014-05-21 20:35:51 +00:00
|
|
|
&delalloc_start,
|
2018-10-26 11:43:20 +00:00
|
|
|
&delalloc_end);
|
2018-11-29 03:33:38 +00:00
|
|
|
if (!found) {
|
2014-05-21 20:35:51 +00:00
|
|
|
delalloc_start = delalloc_end + 1;
|
|
|
|
continue;
|
|
|
|
}
|
2020-06-05 07:42:10 +00:00
|
|
|
ret = btrfs_run_delalloc_range(inode, page, delalloc_start,
|
2018-11-01 12:09:46 +00:00
|
|
|
delalloc_end, &page_started, nr_written, wbc);
|
2014-05-21 20:35:51 +00:00
|
|
|
if (ret) {
|
|
|
|
SetPageError(page);
|
2018-11-01 12:09:46 +00:00
|
|
|
/*
|
|
|
|
* btrfs_run_delalloc_range should return < 0 for error
|
|
|
|
* but just in case, we use > 0 here meaning the IO is
|
|
|
|
* started, so we don't want to return > 0 unless
|
|
|
|
* things are going well.
|
2014-05-21 20:35:51 +00:00
|
|
|
*/
|
2020-07-16 15:17:19 +00:00
|
|
|
return ret < 0 ? ret : -EIO;
|
2014-05-21 20:35:51 +00:00
|
|
|
}
|
|
|
|
/*
|
2016-04-01 12:29:48 +00:00
|
|
|
* delalloc_end is already one less than the total length, so
|
|
|
|
* we don't subtract one from PAGE_SIZE
|
2014-05-21 20:35:51 +00:00
|
|
|
*/
|
|
|
|
delalloc_to_write += (delalloc_end - delalloc_start +
|
2016-04-01 12:29:48 +00:00
|
|
|
PAGE_SIZE) >> PAGE_SHIFT;
|
2014-05-21 20:35:51 +00:00
|
|
|
delalloc_start = delalloc_end + 1;
|
|
|
|
}
|
|
|
|
if (wbc->nr_to_write < delalloc_to_write) {
|
|
|
|
int thresh = 8192;
|
|
|
|
|
|
|
|
if (delalloc_to_write < thresh * 2)
|
|
|
|
thresh = delalloc_to_write;
|
|
|
|
wbc->nr_to_write = min_t(u64, delalloc_to_write,
|
|
|
|
thresh);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* did the fill delalloc function already unlock and start
|
|
|
|
* the IO?
|
|
|
|
*/
|
|
|
|
if (page_started) {
|
|
|
|
/*
|
|
|
|
* we've unlocked the page, so we can't update
|
|
|
|
* the mapping's writeback index, just update
|
|
|
|
* nr_to_write.
|
|
|
|
*/
|
|
|
|
wbc->nr_to_write -= *nr_written;
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2020-07-16 15:17:19 +00:00
|
|
|
return 0;
|
2014-05-21 20:35:51 +00:00
|
|
|
}
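The page-count arithmetic and the nr_to_write adjustment in writepage_delalloc above can be checked in user space. The following is a minimal sketch, assuming 4 KiB pages; the SKETCH_/sketch_ names are hypothetical, and both counters are modeled as unsigned 64-bit values for simplicity, which is an assumption rather than the kernel's exact types.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. The kernel's PAGE_SHIFT is architecture-dependent. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/*
 * Number of pages covered by the inclusive byte range [start, end].
 * Because end is already "last byte", (end - start) is one less than the
 * length, so adding a full page size (not page size - 1) rounds up.
 */
static uint64_t sketch_range_to_pages(uint64_t start, uint64_t end)
{
	assert(end >= start);
	return (end - start + SKETCH_PAGE_SIZE) >> SKETCH_PAGE_SHIFT;
}

/*
 * Mirrors the threshold logic: when the writeback budget is smaller than
 * the delalloc page count, raise it to that count, capped at 8192 pages
 * once the count reaches twice that threshold.
 */
static uint64_t sketch_adjust_nr_to_write(uint64_t nr_to_write,
					  uint64_t delalloc_to_write)
{
	uint64_t thresh = 8192;

	if (nr_to_write >= delalloc_to_write)
		return nr_to_write;
	if (delalloc_to_write < thresh * 2)
		thresh = delalloc_to_write;
	return delalloc_to_write < thresh ? delalloc_to_write : thresh;
}
```

For example, a range of exactly one page (bytes 0 through 4095) yields one page, while extending it by a single byte yields two.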
|
|
|
|
|
|
|
|
/*
|
|
|
|
* helper for __extent_writepage. This calls the writepage start hooks,
|
|
|
|
* and does the loop to map the page into extents and bios.
|
|
|
|
*
|
|
|
|
* We return 1 if the IO is started and the page is unlocked,
|
|
|
|
* 0 if all went well (page still locked)
|
|
|
|
* < 0 if there were errors (page still locked)
|
|
|
|
*/
|
2020-06-03 05:55:33 +00:00
|
|
|
static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode,
|
2014-05-21 20:35:51 +00:00
|
|
|
struct page *page,
|
|
|
|
struct writeback_control *wbc,
|
|
|
|
struct extent_page_data *epd,
|
|
|
|
loff_t i_size,
|
|
|
|
unsigned long nr_written,
|
2019-10-29 17:28:55 +00:00
|
|
|
int *nr_ret)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2020-06-03 05:55:33 +00:00
|
|
|
struct extent_io_tree *tree = &inode->io_tree;
|
2012-12-21 09:17:45 +00:00
|
|
|
u64 start = page_offset(page);
|
2016-04-01 12:29:47 +00:00
|
|
|
u64 page_end = start + PAGE_SIZE - 1;
|
2008-01-24 21:13:08 +00:00
|
|
|
u64 end;
|
|
|
|
u64 cur = start;
|
|
|
|
u64 extent_offset;
|
|
|
|
u64 block_start;
|
|
|
|
u64 iosize;
|
|
|
|
struct extent_map *em;
|
2008-07-18 16:01:11 +00:00
|
|
|
size_t pg_offset = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
size_t blocksize;
|
2014-05-21 20:35:51 +00:00
|
|
|
int ret = 0;
|
|
|
|
int nr = 0;
|
2019-10-29 17:28:55 +00:00
|
|
|
const unsigned int write_flags = wbc_to_write_flags(wbc);
|
2014-05-21 20:35:51 +00:00
|
|
|
bool compressed;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software-only limit; the disk format supports u64-sized compressed extents.
In order to limit the RAM consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software-only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single-threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the CPUs on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
|
2018-11-01 12:09:47 +00:00
|
|
|
ret = btrfs_writepage_cow_fixup(page, start, page_end);
|
|
|
|
if (ret) {
|
|
|
|
/* Fixup worker will requeue */
|
2020-01-21 16:51:43 +00:00
|
|
|
redirty_page_for_writepage(wbc, page);
|
2018-11-01 12:09:47 +00:00
|
|
|
update_nr_written(wbc, nr_written);
|
|
|
|
unlock_page(page);
|
|
|
|
return 1;
|
2008-07-17 16:53:51 +00:00
|
|
|
}
|
|
|
|
|
2009-04-20 19:50:09 +00:00
|
|
|
/*
|
|
|
|
* we don't want to touch the inode after unlocking the page,
|
|
|
|
* so we update the mapping writeback index now
|
|
|
|
*/
|
2017-02-10 18:33:41 +00:00
|
|
|
update_nr_written(wbc, nr_written + 1);
|
2008-11-07 03:02:51 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
end = page_end;
|
2020-06-03 05:55:33 +00:00
|
|
|
blocksize = inode->vfs_inode.i_sb->s_blocksize;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
while (cur <= end) {
|
2014-05-21 20:35:51 +00:00
|
|
|
u64 em_end;
|
2017-10-04 15:30:11 +00:00
|
|
|
u64 offset;
|
2016-05-04 09:46:10 +00:00
|
|
|
|
2014-05-21 20:35:51 +00:00
|
|
|
if (cur >= i_size) {
|
2018-11-01 12:09:48 +00:00
|
|
|
btrfs_writepage_endio_finish_ordered(page, cur,
|
2018-11-08 08:18:08 +00:00
|
|
|
page_end, 1);
|
2008-01-24 21:13:08 +00:00
|
|
|
break;
|
|
|
|
}
|
2020-06-03 05:55:33 +00:00
|
|
|
em = btrfs_get_extent(inode, NULL, 0, cur, end - cur + 1);
|
2011-04-19 16:00:01 +00:00
|
|
|
if (IS_ERR_OR_NULL(em)) {
|
2008-01-24 21:13:08 +00:00
|
|
|
SetPageError(page);
|
2014-05-09 16:17:40 +00:00
|
|
|
ret = PTR_ERR_OR_ZERO(em);
|
2008-01-24 21:13:08 +00:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
extent_offset = cur - em->start;
|
2014-05-21 20:35:51 +00:00
|
|
|
em_end = extent_map_end(em);
|
|
|
|
BUG_ON(em_end <= cur);
|
2008-01-24 21:13:08 +00:00
|
|
|
BUG_ON(end < cur);
|
2014-05-21 20:35:51 +00:00
|
|
|
iosize = min(em_end - cur, end - cur + 1);
|
2013-02-26 08:10:22 +00:00
|
|
|
iosize = ALIGN(iosize, blocksize);
|
2017-10-04 15:30:11 +00:00
|
|
|
offset = em->block_start + extent_offset;
|
2008-01-24 21:13:08 +00:00
|
|
|
block_start = em->block_start;
|
2008-10-29 18:49:59 +00:00
|
|
|
compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
|
2008-01-24 21:13:08 +00:00
|
|
|
free_extent_map(em);
|
|
|
|
em = NULL;
|
|
|
|
|
2008-10-29 18:49:59 +00:00
|
|
|
/*
|
|
|
|
* compressed and inline extents are written through other
|
|
|
|
* paths in the FS
|
|
|
|
*/
|
|
|
|
if (compressed || block_start == EXTENT_MAP_HOLE ||
|
2008-01-24 21:13:08 +00:00
|
|
|
block_start == EXTENT_MAP_INLINE) {
|
2019-12-03 01:34:24 +00:00
|
|
|
if (compressed)
|
2008-10-29 18:49:59 +00:00
|
|
|
nr++;
|
2019-12-03 01:34:24 +00:00
|
|
|
else
|
|
|
|
btrfs_writepage_endio_finish_ordered(page, cur,
|
|
|
|
cur + iosize - 1, 1);
|
2008-10-29 18:49:59 +00:00
|
|
|
cur += iosize;
|
2008-07-18 16:01:11 +00:00
|
|
|
pg_offset += iosize;
|
2008-01-24 21:13:08 +00:00
|
|
|
continue;
|
|
|
|
}
|
2008-10-29 18:49:59 +00:00
|
|
|
|
2018-07-18 18:32:52 +00:00
|
|
|
btrfs_set_range_writeback(tree, cur, cur + iosize - 1);
|
2016-05-04 09:46:10 +00:00
|
|
|
if (!PageWriteback(page)) {
|
2020-06-03 05:55:33 +00:00
|
|
|
btrfs_err(inode->root->fs_info,
|
2016-05-04 09:46:10 +00:00
|
|
|
"page %lu not writeback, cur %llu end %llu",
|
|
|
|
page->index, cur, end);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2008-07-18 16:01:11 +00:00
|
|
|
|
2020-02-05 18:09:28 +00:00
|
|
|
ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
|
2017-10-04 15:30:11 +00:00
|
|
|
page, offset, iosize, pg_offset,
|
2019-10-03 15:29:05 +00:00
|
|
|
&epd->bio,
|
2016-05-04 09:46:10 +00:00
|
|
|
end_bio_extent_writepage,
|
|
|
|
0, 0, 0, false);
|
Btrfs: add another missing end_page_writeback on submit_extent_page failure
If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns
without clearing the writeback bit of the failed page.
__extent_writepage_io, which is a caller of submit_extent_page,
does not clear the remaining writeback bit anywhere.
As a result, this causes a hang in filemap_fdatawait_range,
because it waits for the writeback bit of the failed page to be cleared.
So, we have to call end_page_writeback to clear the writeback bit.
To reproduce the hang, we inject a fault like
    if (should_failtest()) { // I define should_failtest()
        bio = NULL;
    } else {
        bio = btrfs_bio_alloc(...);
    }
in submit_extent_page.
We should also check whether the page has the bit set before calling
end_page_writeback, to avoid a conflict with the other end_page_writeback
in bio_endio.
Thus, we add PageWriteback checks not only in __extent_writepage_io
but also in write_one_eb, because it is missing the check.
Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-09 08:24:33 +00:00
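The error path described in the commit message above can be modeled in user space. This is a minimal sketch, not the kernel code: struct sketch_page, sketch_submit and sketch_write_one are hypothetical stand-ins, and the inject_fault flag plays the role of the author's should_failtest().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two page flags involved. */
struct sketch_page {
	bool writeback;
	bool error;
};

/*
 * Stand-in for submit_extent_page(): fails with -ENOMEM when a fault is
 * injected, and leaves the writeback flag untouched either way.
 */
static int sketch_submit(struct sketch_page *page, bool inject_fault)
{
	assert(page != NULL);
	return inject_fault ? -ENOMEM : 0;
}

/* Models one iteration of the write loop, including the added cleanup. */
static int sketch_write_one(struct sketch_page *page, bool inject_fault)
{
	int ret;

	page->writeback = true;		/* marked writeback before submission */
	ret = sketch_submit(page, inject_fault);
	if (ret) {
		page->error = true;
		/* Clear the flag only if it is still set, as in the fix. */
		if (page->writeback)
			page->writeback = false;
	}
	return ret;
}
```

On success the flag stays set in this model, because in the real path it is the I/O completion handler that clears it; a waiter only hangs when the failure path forgets to do so.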
|
|
|
if (ret) {
|
2016-05-04 09:46:10 +00:00
|
|
|
SetPageError(page);
|
2017-02-09 08:24:33 +00:00
|
|
|
if (PageWriteback(page))
|
|
|
|
end_page_writeback(page);
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
cur = cur + iosize;
|
2008-07-18 16:01:11 +00:00
|
|
|
pg_offset += iosize;
|
2008-01-24 21:13:08 +00:00
|
|
|
nr++;
|
|
|
|
}
|
2014-05-21 20:35:51 +00:00
|
|
|
*nr_ret = nr;
|
|
|
|
return ret;
|
|
|
|
}
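The per-iteration I/O size in __extent_writepage_io above is bounded by both the end of the extent map and the end of the page, then rounded up to the block size. A minimal user-space sketch of that computation follows; the sketch_ names are hypothetical, em_end is treated as exclusive and end as inclusive to match the checks in the loop, and the block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a nonzero power of two. */
static uint64_t sketch_align(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Size of the next chunk starting at cur:
 *   em_end - cur   bytes left in the extent map (em_end is exclusive)
 *   end - cur + 1  bytes left in the page (end is inclusive)
 * The smaller of the two is rounded up to the block size.
 */
static uint64_t sketch_iosize(uint64_t cur, uint64_t em_end, uint64_t end,
			      uint64_t blocksize)
{
	uint64_t to_em_end, to_page_end, iosize;

	assert(em_end > cur);	/* corresponds to BUG_ON(em_end <= cur) */
	assert(end >= cur);	/* corresponds to BUG_ON(end < cur) */

	to_em_end = em_end - cur;
	to_page_end = end - cur + 1;
	iosize = to_em_end < to_page_end ? to_em_end : to_page_end;
	return sketch_align(iosize, blocksize);
}
```

With a 4096-byte block size, an extent that ends 1000 bytes into the page still produces a 4096-byte chunk, because the rounding happens after the minimum is taken.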
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the writepage semantics are similar to regular writepage. extent
|
|
|
|
* records are inserted to lock ranges in the tree, and as dirty areas
|
|
|
|
* are found, they are marked writeback. Then the lock bits are removed
|
|
|
|
* and the end_io handler clears the writeback ranges
|
2019-03-20 06:27:42 +00:00
|
|
|
*
|
|
|
|
* Return 0 if everything goes well.
|
|
|
|
* Return <0 for error.
|
2014-05-21 20:35:51 +00:00
|
|
|
*/
|
|
|
|
static int __extent_writepage(struct page *page, struct writeback_control *wbc,
|
2017-11-30 17:00:02 +00:00
|
|
|
struct extent_page_data *epd)
|
2014-05-21 20:35:51 +00:00
|
|
|
{
|
|
|
|
struct inode *inode = page->mapping->host;
|
|
|
|
u64 start = page_offset(page);
|
2016-04-01 12:29:47 +00:00
|
|
|
u64 page_end = start + PAGE_SIZE - 1;
|
2014-05-21 20:35:51 +00:00
|
|
|
int ret;
|
|
|
|
int nr = 0;
|
2019-12-03 01:34:20 +00:00
|
|
|
size_t pg_offset;
|
2014-05-21 20:35:51 +00:00
|
|
|
loff_t i_size = i_size_read(inode);
|
2016-04-01 12:29:47 +00:00
|
|
|
unsigned long end_index = i_size >> PAGE_SHIFT;
|
2014-05-21 20:35:51 +00:00
|
|
|
unsigned long nr_written = 0;
|
|
|
|
|
|
|
|
trace___extent_writepage(page, inode, wbc);
|
|
|
|
|
|
|
|
WARN_ON(!PageLocked(page));
|
|
|
|
|
|
|
|
ClearPageError(page);
|
|
|
|
|
2018-12-05 14:23:03 +00:00
|
|
|
pg_offset = offset_in_page(i_size);
|
2014-05-21 20:35:51 +00:00
|
|
|
if (page->index > end_index ||
|
|
|
|
(page->index == end_index && !pg_offset)) {
|
2016-04-01 12:29:47 +00:00
|
|
|
page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
|
2014-05-21 20:35:51 +00:00
|
|
|
unlock_page(page);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (page->index == end_index) {
|
|
|
|
char *userpage;
|
|
|
|
|
|
|
|
userpage = kmap_atomic(page);
|
|
|
|
memset(userpage + pg_offset, 0,
|
2016-04-01 12:29:47 +00:00
|
|
|
PAGE_SIZE - pg_offset);
|
2014-05-21 20:35:51 +00:00
|
|
|
kunmap_atomic(userpage);
|
|
|
|
flush_dcache_page(page);
|
|
|
|
}
|
|
|
|
|
|
|
|
set_page_extent_mapped(page);
|
|
|
|
|
2018-11-08 08:18:06 +00:00
|
|
|
if (!epd->extent_locked) {
|
2020-06-05 07:42:10 +00:00
|
|
|
ret = writepage_delalloc(BTRFS_I(inode), page, wbc, start,
|
|
|
|
&nr_written);
|
2018-11-08 08:18:06 +00:00
|
|
|
if (ret == 1)
|
2019-12-03 01:34:21 +00:00
|
|
|
return 0;
|
2018-11-08 08:18:06 +00:00
|
|
|
if (ret)
|
|
|
|
goto done;
|
|
|
|
}
|
2014-05-21 20:35:51 +00:00
|
|
|
|
2020-06-03 05:55:33 +00:00
|
|
|
ret = __extent_writepage_io(BTRFS_I(inode), page, wbc, epd, i_size,
|
|
|
|
nr_written, &nr);
|
2014-05-21 20:35:51 +00:00
|
|
|
if (ret == 1)
|
2019-12-03 01:34:21 +00:00
|
|
|
return 0;
|
2014-05-21 20:35:51 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
done:
|
|
|
|
if (nr == 0) {
|
|
|
|
/* make sure the mapping tag for page dirty gets cleared */
|
|
|
|
set_page_writeback(page);
|
|
|
|
end_page_writeback(page);
|
|
|
|
}
|
2014-05-09 16:17:40 +00:00
|
|
|
if (PageError(page)) {
|
|
|
|
ret = ret < 0 ? ret : -EIO;
|
|
|
|
end_extent_writepage(page, ret, start, page_end);
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
unlock_page(page);
|
2019-03-20 06:27:42 +00:00
|
|
|
ASSERT(ret <= 0);
|
2014-05-21 20:35:51 +00:00
|
|
|
return ret;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
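The end-of-file handling in `__extent_writepage()` above can be modelled in user space. This is a minimal sketch, not kernel code: it assumes 4 KiB pages, and the names `SKETCH_PAGE_SHIFT`, `classify_page()` and `zero_page_tail()` are hypothetical helpers introduced only for illustration. It mirrors the `end_index` / `pg_offset` comparison and the `memset()` of the tail of the last page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12                       /* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

enum eof_action {
	PAGE_BEYOND_EOF,     /* nothing to write: invalidate, unlock, return 0 */
	PAGE_STRADDLES_EOF,  /* zero the tail past i_size, then write */
	PAGE_INSIDE_FILE,    /* write the whole page */
};

/* Classify a page index against the file size, following the checks above. */
static enum eof_action classify_page(uint64_t page_index, uint64_t i_size)
{
	uint64_t end_index = i_size >> SKETCH_PAGE_SHIFT;
	size_t pg_offset = (size_t)(i_size & (SKETCH_PAGE_SIZE - 1));

	if (page_index > end_index || (page_index == end_index && !pg_offset))
		return PAGE_BEYOND_EOF;
	if (page_index == end_index)
		return PAGE_STRADDLES_EOF;
	return PAGE_INSIDE_FILE;
}

/* Zero everything in the page from the end-of-file offset onwards. */
static void zero_page_tail(unsigned char *page, uint64_t i_size)
{
	size_t pg_offset = (size_t)(i_size & (SKETCH_PAGE_SIZE - 1));

	assert(pg_offset < SKETCH_PAGE_SIZE);
	memset(page + pg_offset, 0, SKETCH_PAGE_SIZE - pg_offset);
}
```

With a file size of 8193 bytes, page index 2 straddles the end of the file and only its first byte holds file data, while a file size of exactly 8192 bytes puts page index 2 entirely past the end.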
|
|
|
|
|
2013-04-24 20:41:19 +00:00
|
|
|
void wait_on_extent_buffer_writeback(struct extent_buffer *eb)
|
2012-03-13 13:38:00 +00:00
|
|
|
{
|
sched: Remove proliferation of wait_on_bit() action functions
The current "wait_on_bit" interface requires an 'action'
function to be provided which does the actual waiting.
There are over 20 such functions, many of them identical.
Most cases can be satisfied by one of just two functions, one
which uses io_schedule() and one which just uses schedule().
So:
Rename wait_on_bit and wait_on_bit_lock to
wait_on_bit_action and wait_on_bit_lock_action
to make it explicit that they need an action function.
Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
which are *not* given an action function but implicitly use
a standard one.
The decision to error-out if a signal is pending is now made
based on the 'mode' argument rather than being encoded in the action
function.
All instances of the old wait_on_bit and wait_on_bit_lock which
can use the new version have been changed accordingly and their
action functions have been discarded.
wait_on_bit{_lock} does not return any specific error code in the
event of a signal so the caller must check for non-zero and
interpolate their own error code as appropriate.
The wait_on_bit() call in __fscache_wait_on_invalidate() was
ambiguous as it specified TASK_UNINTERRUPTIBLE but used
fscache_wait_bit_interruptible as an action function.
David Howells confirms this should be uniformly
"uninterruptible"
The main remaining user of wait_on_bit{,_lock}_action is NFS
which needs to use a freezer-aware schedule() call.
A comment in fs/gfs2/glock.c notes that having multiple 'action'
functions is useful as they display differently in the 'wchan'
field of 'ps'. (and /proc/$PID/wchan).
As the new bit_wait{,_io} functions are tagged "__sched", they
will not show up at all, but something higher in the stack. So
the distinction will still be visible, only with different
function names (gfs2_glock_wait versus gfs2_glock_dq_wait in the
gfs2/glock.c case).
Since first version of this patch (against 3.15) two new action
functions appeared, one in NFS and one in CIFS. CIFS also now
uses an action function that makes the same freezer aware
schedule call as NFS.
Signed-off-by: NeilBrown <neilb@suse.de>
Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steve French <sfrench@samba.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brown
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-07 05:16:04 +00:00
|
|
|
wait_on_bit_io(&eb->bflags, EXTENT_BUFFER_WRITEBACK,
|
|
|
|
TASK_UNINTERRUPTIBLE);
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
Btrfs: fix unwritten extent buffers and hangs on future writeback attempts
The lock_extent_buffer_for_io() returns 1 to the caller to tell it everything
went fine and the caller needs to start writeback for the extent buffer
(submit a bio, etc), 0 to tell the caller everything went fine but it does
not need to start writeback for the extent buffer, and a negative value if
some error happened.
When it's about to return 1 it tries to lock all pages, and if a try lock
on a page fails, and we didn't flush any existing bio in our "epd", it
calls flush_write_bio(epd) and overwrites the return value of 1 to 0 or
an error. The page might have been locked elsewhere, not with the goal
of starting writeback of the extent buffer, and even by some code other
than btrfs, like page migration for example, so it does not mean the
writeback of the extent buffer was already started by some other task,
so returning a 0 tells the caller (btree_write_cache_pages()) to not
start writeback for the extent buffer. Note that epd might currently have
either no bio, so flush_write_bio() returns 0 (success) or it might have
a bio for another extent buffer with a lower index (logical address).
Since we return 0 with the EXTENT_BUFFER_WRITEBACK bit set on the
extent buffer and writeback is never started for the extent buffer,
future attempts to writeback the extent buffer will hang forever waiting
on that bit to be cleared, since it can only be cleared after writeback
completes. Such hang is reported with a trace like the following:
[49887.347053] INFO: task btrfs-transacti:1752 blocked for more than 122 seconds.
[49887.347059] Not tainted 5.2.13-gentoo #2
[49887.347060] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[49887.347062] btrfs-transacti D 0 1752 2 0x80004000
[49887.347064] Call Trace:
[49887.347069] ? __schedule+0x265/0x830
[49887.347071] ? bit_wait+0x50/0x50
[49887.347072] ? bit_wait+0x50/0x50
[49887.347074] schedule+0x24/0x90
[49887.347075] io_schedule+0x3c/0x60
[49887.347077] bit_wait_io+0x8/0x50
[49887.347079] __wait_on_bit+0x6c/0x80
[49887.347081] ? __lock_release.isra.29+0x155/0x2d0
[49887.347083] out_of_line_wait_on_bit+0x7b/0x80
[49887.347084] ? var_wake_function+0x20/0x20
[49887.347087] lock_extent_buffer_for_io+0x28c/0x390
[49887.347089] btree_write_cache_pages+0x18e/0x340
[49887.347091] do_writepages+0x29/0xb0
[49887.347093] ? kmem_cache_free+0x132/0x160
[49887.347095] ? convert_extent_bit+0x544/0x680
[49887.347097] filemap_fdatawrite_range+0x70/0x90
[49887.347099] btrfs_write_marked_extents+0x53/0x120
[49887.347100] btrfs_write_and_wait_transaction.isra.4+0x38/0xa0
[49887.347102] btrfs_commit_transaction+0x6bb/0x990
[49887.347103] ? start_transaction+0x33e/0x500
[49887.347105] transaction_kthread+0x139/0x15c
So fix this by not overwriting the return value (ret) with the result
from flush_write_bio(). We also need to clear the EXTENT_BUFFER_WRITEBACK
bit in case flush_write_bio() returns an error, otherwise it will hang
any future attempts to writeback the extent buffer, and undo all work
done before (set back EXTENT_BUFFER_DIRTY, etc).
This is a regression introduced in the 5.2 kernel.
Fixes: 2e3c25136adfb ("btrfs: extent_io: add proper error handling to lock_extent_buffer_for_io()")
Fixes: f4340622e0226 ("btrfs: extent_io: Move the BUG_ON() in flush_write_bio() one level up")
Reported-by: Zdenek Sojka <zsojka@seznam.cz>
Link: https://lore.kernel.org/linux-btrfs/GpO.2yos.3WGDOLpx6t%7D.1TUDYM@seznam.cz/T/#u
Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Link: https://lore.kernel.org/linux-btrfs/5c4688ac-10a7-fb07-70e8-c5d31a3fbb38@profihost.ag/T/#t
Reported-by: Drazen Kacar <drazen.kacar@oradian.com>
Link: https://lore.kernel.org/linux-btrfs/DB8PR03MB562876ECE2319B3E579590F799C80@DB8PR03MB5628.eurprd03.prod.outlook.com/
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204377
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-11 16:42:00 +00:00
|
|
|
static void end_extent_buffer_writeback(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
clear_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
|
|
|
|
smp_mb__after_atomic();
|
|
|
|
wake_up_bit(&eb->bflags, EXTENT_BUFFER_WRITEBACK);
|
|
|
|
}
|
|
|
|
|
2019-03-20 06:27:46 +00:00
|
|
|
/*
|
btrfs: fix the comment on lock_extent_buffer_for_io
The return value of that function is completely wrong.
That function only returns 0 if the extent buffer doesn't need to be
submitted. The "ret = 1" and "ret = 0" are determined by the return
value of "test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)".
And if we get ret == 1, it's because the extent buffer is dirty, and we
set its status to EXTENT_BUFFER_WRITEBACK, and continue to page
locking.
While if we get ret == 0, it means the extent is not dirty from the
beginning, so we don't need to write it back.
The caller also follows this, in btree_write_cache_pages(), if
lock_extent_buffer_for_io() returns 0, we just skip the extent buffer
completely.
So the comment is completely wrong.
Since we're here, also change the description a little. The write bio
flushing won't be visible to the caller, thus it's not a major feature.
In the main description, only describe the locking part to make the
point more clear.
For reference, added in commit 2e3c25136adf ("btrfs: extent_io: add
proper error handling to lock_extent_buffer_for_io()")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-10-21 06:24:49 +00:00
|
|
|
* Lock extent buffer status and pages for writeback.
|
2019-03-20 06:27:46 +00:00
|
|
|
*
|
2020-10-21 06:24:49 +00:00
|
|
|
* May try to flush write bio if we can't get the lock.
|
|
|
|
*
|
|
|
|
* Return 0 if the extent buffer doesn't need to be submitted.
|
|
|
|
* (E.g. the extent buffer is not dirty)
|
|
|
|
* Return >0 if the extent buffer is submitted to bio.
|
|
|
|
* Return <0 if something went wrong, no page is locked.
|
2019-03-20 06:27:46 +00:00
|
|
|
*/
|
2019-03-20 10:21:41 +00:00
|
|
|
static noinline_for_stack int lock_extent_buffer_for_io(struct extent_buffer *eb,
|
2014-05-20 03:55:27 +00:00
|
|
|
struct extent_page_data *epd)
|
2012-03-13 13:38:00 +00:00
|
|
|
{
|
2019-03-20 10:21:41 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2019-03-20 06:27:46 +00:00
|
|
|
int i, num_pages, failed_page_nr;
|
2012-03-13 13:38:00 +00:00
|
|
|
int flush = 0;
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (!btrfs_try_tree_write_lock(eb)) {
|
2019-03-20 06:27:41 +00:00
|
|
|
ret = flush_write_bio(epd);
|
2019-03-20 06:27:46 +00:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
flush = 1;
|
2012-03-13 13:38:00 +00:00
|
|
|
btrfs_tree_lock(eb);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags)) {
|
|
|
|
btrfs_tree_unlock(eb);
|
|
|
|
if (!epd->sync_io)
|
|
|
|
return 0;
|
|
|
|
if (!flush) {
|
2019-03-20 06:27:41 +00:00
|
|
|
ret = flush_write_bio(epd);
|
2019-03-20 06:27:46 +00:00
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
2012-03-13 13:38:00 +00:00
|
|
|
flush = 1;
|
|
|
|
}
|
2012-03-21 16:09:56 +00:00
|
|
|
while (1) {
|
|
|
|
wait_on_extent_buffer_writeback(eb);
|
|
|
|
btrfs_tree_lock(eb);
|
|
|
|
if (!test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags))
|
|
|
|
break;
|
2012-03-13 13:38:00 +00:00
|
|
|
btrfs_tree_unlock(eb);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2012-07-20 20:25:24 +00:00
|
|
|
/*
|
|
|
|
* We need to do this to prevent races in people who check if the eb is
|
|
|
|
* under IO since we can end up having no IO bits set for a short period
|
|
|
|
* of time.
|
|
|
|
*/
|
|
|
|
spin_lock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)) {
|
|
|
|
set_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags);
|
2012-07-20 20:25:24 +00:00
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
btrfs_set_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
|
2017-06-20 18:01:20 +00:00
|
|
|
percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
|
|
|
|
-eb->len,
|
|
|
|
fs_info->dirty_metadata_batch);
|
2012-03-13 13:38:00 +00:00
|
|
|
ret = 1;
|
2012-07-20 20:25:24 +00:00
|
|
|
} else {
|
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
btrfs_tree_unlock(eb);
|
|
|
|
|
|
|
|
if (!ret)
|
|
|
|
return ret;
|
|
|
|
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2012-03-13 13:38:00 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
struct page *p = eb->pages[i];
|
2012-03-13 13:38:00 +00:00
|
|
|
|
|
|
|
if (!trylock_page(p)) {
|
|
|
|
if (!flush) {
|
2019-09-11 16:42:00 +00:00
|
|
|
int err;
|
|
|
|
|
|
|
|
err = flush_write_bio(epd);
|
|
|
|
if (err < 0) {
|
|
|
|
ret = err;
|
2019-03-20 06:27:46 +00:00
|
|
|
failed_page_nr = i;
|
|
|
|
goto err_unlock;
|
|
|
|
}
|
2012-03-13 13:38:00 +00:00
|
|
|
flush = 1;
|
|
|
|
}
|
|
|
|
lock_page(p);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
2019-03-20 06:27:46 +00:00
|
|
|
err_unlock:
|
|
|
|
/* Unlock already locked pages */
|
|
|
|
for (i = 0; i < failed_page_nr; i++)
|
|
|
|
unlock_page(eb->pages[i]);
|
2019-09-11 16:42:00 +00:00
|
|
|
/*
|
|
|
|
* Clear EXTENT_BUFFER_WRITEBACK and wake up anyone waiting on it.
|
|
|
|
* Also set back EXTENT_BUFFER_DIRTY so future attempts to this eb can
|
|
|
|
* be made and undo everything done before.
|
|
|
|
*/
|
|
|
|
btrfs_tree_lock(eb);
|
|
|
|
spin_lock(&eb->refs_lock);
|
|
|
|
set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
|
|
|
|
end_extent_buffer_writeback(eb);
|
|
|
|
spin_unlock(&eb->refs_lock);
|
|
|
|
percpu_counter_add_batch(&fs_info->dirty_metadata_bytes, eb->len,
|
|
|
|
fs_info->dirty_metadata_batch);
|
|
|
|
btrfs_clear_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN);
|
|
|
|
btrfs_tree_unlock(eb);
|
2019-03-20 06:27:46 +00:00
|
|
|
return ret;
|
2012-03-13 13:38:00 +00:00
|
|
|
}
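The return contract and the error unwinding of `lock_extent_buffer_for_io()` above can be illustrated with a single-threaded user-space model. This is only a sketch under stated assumptions: `struct sketch_eb`, `struct sketch_page`, the `SKETCH_*` flag values and the `flush_fails` test knob are all hypothetical, there is no real locking or bio flushing, and a failed flush is simulated by a boolean. What it does show is the shape of the logic: return 0 for a clean buffer, move DIRTY to WRITEBACK and return 1 when all pages are locked, and on failure unlock the pages taken so far and restore DIRTY so a later attempt is not left waiting on a WRITEBACK bit that nobody clears.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_DIRTY     (1u << 0)   /* hypothetical flag values */
#define SKETCH_WRITEBACK (1u << 1)

struct sketch_page {
	bool locked;
	bool flush_fails;    /* test knob: simulate a flush error at this page */
};

struct sketch_eb {
	unsigned int flags;
	struct sketch_page *pages;
	size_t num_pages;
};

/*
 * Returns 0 if the buffer is not dirty (nothing to submit), 1 if it was
 * dirty and all of its pages are now locked, or a negative errno with no
 * page left locked and the DIRTY state restored.
 */
static int sketch_lock_for_io(struct sketch_eb *eb)
{
	size_t i, failed_page_nr = 0;
	int ret = 0;

	if (eb->flags & SKETCH_DIRTY) {
		eb->flags &= ~SKETCH_DIRTY;
		eb->flags |= SKETCH_WRITEBACK;
		ret = 1;
	}
	if (!ret)
		return ret;

	for (i = 0; i < eb->num_pages; i++) {
		if (eb->pages[i].flush_fails) {
			/* Keep the error in ret; do not turn the 1 into a 0. */
			ret = -EIO;
			failed_page_nr = i;
			goto err_unlock;
		}
		eb->pages[i].locked = true;
	}
	return ret;

err_unlock:
	/* Unlock only the pages locked before the failure. */
	for (i = 0; i < failed_page_nr; i++)
		eb->pages[i].locked = false;
	/* Undo the state change so a later writeback attempt can proceed. */
	eb->flags |= SKETCH_DIRTY;
	eb->flags &= ~SKETCH_WRITEBACK;
	return ret;
}
```

A caller in this model would skip the buffer on 0, submit it on 1, and propagate any negative value.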
|
|
|
|
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this latter case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the 3 new flags for the btree inode also achieves the goal of
AS_EIO/AS_ENOSPC in the following case: writepages() returns success
after starting writeback for all dirty pages, and the writeback of those
pages finishes with errors before filemap_fdatawait_range() is called.
Because we were not using AS_EIO/AS_ENOSPC, filemap_fdatawait_range()
would return success, as it could not know that writeback errors
happened (the pages were no longer tagged for writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
static void set_btree_ioerr(struct page *page)
{
	struct extent_buffer *eb = (struct extent_buffer *)page->private;
|
2019-09-13 13:54:07 +00:00
|
|
|
struct btrfs_fs_info *fs_info;
|
2014-09-26 11:25:56 +00:00
|
|
|
|
|
|
|
	SetPageError(page);
	if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
		return;
|
|
|
|
|
2019-09-13 13:54:07 +00:00
|
|
|
	/*
	 * If we error out, we should add back the dirty_metadata_bytes
	 * to make it consistent.
	 */
	fs_info = eb->fs_info;
	percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
				 eb->len, fs_info->dirty_metadata_batch);
|
|
|
|
|
2014-09-26 11:25:56 +00:00
|
|
|
	/*
	 * If writeback for a btree extent that doesn't belong to a log tree
	 * failed, increment the counter transaction->eb_write_errors.
	 * We do this because while the transaction is running and before it's
	 * committing (when we call filemap_fdata[write|wait]_range against
	 * the btree inode), we might have
	 * btree_inode->i_mapping->a_ops->writepages() called by the VM - if it
	 * returns an error or an error happens during writeback, when we're
	 * committing the transaction we wouldn't know about it, since the pages
	 * can be no longer dirty nor marked anymore for writeback (if a
	 * subsequent modification to the extent buffer didn't happen before the
	 * transaction commit), which makes filemap_fdata[write|wait]_range not
	 * able to find the pages tagged with SetPageError at transaction
	 * commit time. So if this happens we must abort the transaction,
	 * otherwise we commit a super block with btree roots that point to
	 * btree nodes/leafs whose content on disk is invalid - either garbage
	 * or the content of some node/leaf from a past generation that got
	 * cowed or deleted and is no longer valid.
	 *
	 * Note: setting AS_EIO/AS_ENOSPC in the btree inode's i_mapping would
	 * not be enough - we need to distinguish between log tree extents vs
	 * non-log tree extents, and the next filemap_fdatawait_range() call
	 * will catch and clear such errors in the mapping - and that call might
	 * be from a log sync and not from a transaction commit. Also, checking
	 * for the eb flag EXTENT_BUFFER_WRITE_ERR at transaction commit time is
	 * not done and would not be reliable - the eb might have been released
	 * from memory and reading it back again means that flag would not be
	 * set (since it's a runtime flag, not persisted on disk).
	 *
	 * Using the flags below in the btree inode also makes us achieve the
	 * goal of AS_EIO/AS_ENOSPC when writepages() returns success, started
	 * writeback for all dirty pages and before filemap_fdatawait_range()
	 * is called, the writeback for all dirty pages had already finished
	 * with errors - because we were not using AS_EIO/AS_ENOSPC,
	 * filemap_fdatawait_range() would return success, as it could not know
	 * that writeback errors happened (the pages were no longer tagged for
	 * writeback).
	 */
	switch (eb->log_index) {
	case -1:
|
2016-09-02 19:40:02 +00:00
|
|
|
set_bit(BTRFS_FS_BTREE_ERR, &eb->fs_info->flags);
|
2014-09-26 11:25:56 +00:00
|
|
|
break;
|
|
|
|
case 0:
|
2016-09-02 19:40:02 +00:00
|
|
|
set_bit(BTRFS_FS_LOG1_ERR, &eb->fs_info->flags);
|
2014-09-26 11:25:56 +00:00
|
|
|
break;
|
|
|
|
case 1:
|
2016-09-02 19:40:02 +00:00
|
|
|
set_bit(BTRFS_FS_LOG2_ERR, &eb->fs_info->flags);
|
2014-09-26 11:25:56 +00:00
|
|
|
		break;
	default:
		BUG(); /* unexpected, logic error */
	}
}
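
The flag handling in set_btree_ioerr() can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the btrfs implementation: struct fs_state, record_write_error() and test_and_clear_error() are hypothetical names, the three bit constants are stand-ins for BTRFS_FS_BTREE_ERR, BTRFS_FS_LOG1_ERR and BTRFS_FS_LOG2_ERR, and C11 atomics are assumed in place of the kernel's set_bit()/test_and_clear_bit(). It shows why a sticky, filesystem-wide bit per tree still reports a failed write after the pages themselves are no longer dirty or tagged for writeback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the three btrfs error flags. */
enum {
	FS_BTREE_ERR = 1u << 0,	/* non-log tree write failed */
	FS_LOG1_ERR  = 1u << 1,	/* log tree with index 0 failed */
	FS_LOG2_ERR  = 1u << 2,	/* log tree with index 1 failed */
};

struct fs_state {
	atomic_uint flags;	/* sticky, filesystem-wide error bits */
};

/* Record a failed metadata write; log_index mirrors eb->log_index. */
static void record_write_error(struct fs_state *fs, int log_index)
{
	unsigned int bit;

	switch (log_index) {
	case -1:
		bit = FS_BTREE_ERR;
		break;
	case 0:
		bit = FS_LOG1_ERR;
		break;
	case 1:
		bit = FS_LOG2_ERR;
		break;
	default:
		assert(!"unexpected log_index");	/* plays the role of BUG() */
		return;
	}
	atomic_fetch_or(&fs->flags, bit);
}

/* Atomically clear one error bit and report whether it had been set. */
static bool test_and_clear_error(struct fs_state *fs, unsigned int bit)
{
	return atomic_fetch_and(&fs->flags, ~bit) & bit;
}
```

In this sketch a commit-like caller would check FS_BTREE_ERR and abort if it is set, while a log sync would only look at the bit for its own log tree. That division between callers is an assumption drawn from the comment above, not from code shown here.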
|
|
|
|
|
2015-07-20 13:29:37 +00:00
|
|
|
static void end_bio_extent_buffer_writepage(struct bio *bio)
|
2012-03-13 13:38:00 +00:00
|
|
|
{
|
2013-11-07 20:20:26 +00:00
|
|
|
struct bio_vec *bvec;
|
2012-03-13 13:38:00 +00:00
|
|
|
struct extent_buffer *eb;
|
2019-04-25 07:03:00 +00:00
|
|
|
int done;
|
2019-02-15 11:13:19 +00:00
|
|
|
struct bvec_iter_all iter_all;
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2017-07-13 16:10:07 +00:00
|
|
|
ASSERT(!bio_flagged(bio, BIO_CLONED));
|
2019-04-25 07:03:00 +00:00
|
|
|
bio_for_each_segment_all(bvec, bio, iter_all) {
|
2012-03-13 13:38:00 +00:00
|
|
|
		struct page *page = bvec->bv_page;

		eb = (struct extent_buffer *)page->private;
		BUG_ON(!eb);
		done = atomic_dec_and_test(&eb->io_pages);
|
|
|
|
|
2017-06-03 07:38:06 +00:00
|
|
|
if (bio->bi_status ||
|
2015-07-20 13:29:37 +00:00
|
|
|
test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
|
2012-03-13 13:38:00 +00:00
|
|
|
ClearPageUptodate(page);
|
2014-09-26 11:25:56 +00:00
|
|
|
set_btree_ioerr(page);
|
2012-03-13 13:38:00 +00:00
|
|
|
		}

		end_page_writeback(page);

		if (!done)
			continue;

		end_extent_buffer_writeback(eb);
|
2013-11-07 20:20:26 +00:00
|
|
|
}
|
2012-03-13 13:38:00 +00:00
|
|
|
|
|
|
|
	bio_put(bio);
}
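
The io_pages accounting shared by write_one_eb() and end_bio_extent_buffer_writepage() follows a countdown pattern: the counter starts at the number of pages, every completed page decrements it, and whoever brings it to zero ends writeback for the whole buffer. The sketch below is a single-threaded user-space illustration under stated assumptions: struct buf, begin_write(), page_done() and submit_failed() are hypothetical names, and C11 atomics stand in for atomic_dec_and_test() and atomic_sub_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct buf {
	atomic_int io_pages;	/* pages whose I/O has not completed yet */
	bool under_writeback;	/* plays the role of EXTENT_BUFFER_WRITEBACK */
};

/* Start a write of num_pages pages (compare atomic_set() in write_one_eb). */
static void begin_write(struct buf *b, int num_pages)
{
	b->under_writeback = true;
	atomic_store(&b->io_pages, num_pages);
}

/* Per-page completion: only the last page ends writeback for the buffer. */
static void page_done(struct buf *b)
{
	if (atomic_fetch_sub(&b->io_pages, 1) == 1)
		b->under_writeback = false;
}

/*
 * Submission failed at page index i: pages i..num_pages-1 will never get
 * a completion callback, so their share is subtracted in one step.
 */
static void submit_failed(struct buf *b, int num_pages, int i)
{
	int remaining = num_pages - i;

	assert(remaining > 0);
	if (atomic_fetch_sub(&b->io_pages, remaining) == remaining)
		b->under_writeback = false;
}
```

Subtracting num_pages - i on the error path keeps the counter consistent with completions that are still in flight for pages 0..i-1: whichever side reaches zero last is the one that ends writeback.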
|
|
|
|
|
2014-05-20 03:55:27 +00:00
|
|
|
static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
|
2012-03-13 13:38:00 +00:00
|
|
|
			struct writeback_control *wbc,
			struct extent_page_data *epd)
{
	u64 offset = eb->start;
|
2016-09-23 20:44:44 +00:00
|
|
|
u32 nritems;
|
2018-03-01 17:20:27 +00:00
|
|
|
int i, num_pages;
|
2016-09-23 20:44:44 +00:00
|
|
|
unsigned long start, end;
|
2017-08-25 00:19:48 +00:00
|
|
|
unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
|
2012-04-23 18:00:51 +00:00
|
|
|
int ret = 0;
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2014-09-26 11:25:56 +00:00
|
|
|
clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2012-03-13 13:38:00 +00:00
|
|
|
atomic_set(&eb->io_pages, num_pages);
|
2012-09-25 18:25:58 +00:00
|
|
|
|
2016-09-23 20:44:44 +00:00
|
|
|
	/* set btree blocks beyond nritems with 0 to avoid stale content. */
	nritems = btrfs_header_nritems(eb);
|
2016-09-15 00:22:57 +00:00
|
|
|
	if (btrfs_header_level(eb) > 0) {
		end = btrfs_node_key_ptr_offset(nritems);
|
|
|
|
|
2016-11-08 17:09:03 +00:00
|
|
|
memzero_extent_buffer(eb, end, eb->len - end);
|
2016-09-23 20:44:44 +00:00
|
|
|
	} else {
		/*
		 * leaf:
		 * header 0 1 2 .. N ... data_N .. data_2 data_1 data_0
		 */
		start = btrfs_item_nr_offset(nritems);
|
2019-03-20 10:33:10 +00:00
|
|
|
end = BTRFS_LEAF_DATA_OFFSET + leaf_data_end(eb);
|
2016-11-08 17:09:03 +00:00
|
|
|
memzero_extent_buffer(eb, start, end - start);
|
2016-09-15 00:22:57 +00:00
|
|
|
}
|
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
struct page *p = eb->pages[i];
|
2012-03-13 13:38:00 +00:00
|
|
|
|
|
|
|
		clear_page_dirty_for_io(p);
		set_page_writeback(p);
|
2020-02-05 18:09:28 +00:00
|
|
|
ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc,
|
2019-10-03 15:29:05 +00:00
|
|
|
p, offset, PAGE_SIZE, 0,
|
2017-02-10 18:29:38 +00:00
|
|
|
&epd->bio,
|
2016-06-05 19:31:51 +00:00
|
|
|
end_bio_extent_buffer_writepage,
|
Btrfs: remove bio_flags which indicates a meta block of log-tree
Since both committing transaction and writing log-tree are doing
plugging on metadata IO, we can unify to use %sync_writers to benefit
both cases, instead of checking bio_flags while writing meta blocks of
log-tree.
We can remove this bio_flags because, in order to write dirty blocks,
the log tree also uses btrfs_write_marked_extents(), inside which we
have enabled %sync_writers. Therefore every write is done
synchronously, and so is checksumming.
Please also note that bio_flags is applied per-context while
%sync_writers is applied per-inode, so this might incur some overhead, i.e.
1) while the log tree is flushing its dirty blocks via
btrfs_write_marked_extents(), %sync_writers is increased by one.
2) in the meantime, some writeback operations may happen on btrfs's
metadata inode, so these writes go synchronously, too.
However, AFAICS, the overhead is not big, while the win is that we
unify the two places that need synchronous writes and remove a
special hack/flag.
This removes the bio_flags related stuff for writing log-tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-09-13 18:18:22 +00:00
|
|
|
0, 0, 0, false);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (ret) {
|
2014-09-26 11:25:56 +00:00
|
|
|
set_btree_ioerr(p);
|
Btrfs: add another missing end_page_writeback on submit_extent_page failure
If btrfs_bio_alloc fails in submit_extent_page, submit_extent_page returns
without clearing the writeback bit of the failed page.
__extent_writepage_io, which is a caller of submit_extent_page,
does not clear the remaining writeback bit anywhere.
As a result, this causes a hang in filemap_fdatawait_range,
because it waits for the writeback bit of the failed page to be cleared.
So, we have to call end_page_writeback to clear the writeback bit.
To reproduce the hang, we inject a fault in submit_extent_page like
	if (should_failtest()) { // I define should_failtest()
		bio = NULL;
	} else {
		bio = btrfs_bio_alloc(...);
	}
We should also check whether the page has the bit set before calling
end_page_writeback, to avoid a conflict with the other
end_page_writeback in bio_endio.
Thus, we add PageWriteback checks not only in __extent_writepage_io
but also in write_one_eb, because it misses the check.
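The guarded cleanup described above can be sketched in user space as
follows. This is only an illustration under stated assumptions:
struct model_page, model_end_page_writeback() and
model_cleanup_failed_page() are hypothetical stand-ins, not kernel
APIs, and only the writeback bit is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the writeback bit is modelled. */
struct model_page {
	bool writeback;	/* models the writeback bit of the page */
	int end_calls;	/* how many times writeback was ended */
};

/* Models end_page_writeback(): must only run on a page under writeback. */
void model_end_page_writeback(struct model_page *p)
{
	assert(p->writeback);
	p->writeback = false;
	p->end_calls++;
}

/*
 * Error-path cleanup: end writeback only if the bit is still set, so a
 * completion path that already cleared it is not repeated.
 */
void model_cleanup_failed_page(struct model_page *p)
{
	if (p->writeback)
		model_end_page_writeback(p);
}
```

Calling model_cleanup_failed_page() twice on the same page ends
writeback only once, which is the property the PageWriteback check is
meant to provide.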
Signed-off-by: Takafumi Kubota <takafumi.kubota1012@sslab.ics.keio.ac.jp>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-02-09 08:24:33 +00:00
|
|
|
if (PageWriteback(p))
|
|
|
|
end_page_writeback(p);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
|
|
|
|
end_extent_buffer_writeback(eb);
|
|
|
|
ret = -EIO;
|
|
|
|
break;
|
|
|
|
}
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
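To see why the first two rules of the script are safe, here is a small
sketch assuming PAGE_CACHE_SHIFT equals PAGE_SHIFT, which is the
assumption the message says the code already relied on. The value 12
and the two helper names are illustrative only.

```c
#include <assert.h>

#define PAGE_SHIFT		12		/* assumed: 4 KiB pages */
#define PAGE_CACHE_SHIFT	PAGE_SHIFT	/* assumed equal, as the text describes */

/* Old style: shift by the difference of the two constants. */
unsigned long old_style_count(unsigned long n)
{
	assert(PAGE_CACHE_SHIFT == PAGE_SHIFT);
	return n << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* New style after the semantic patch: the shift by zero is dropped. */
unsigned long new_style_count(unsigned long n)
{
	return n;
}
```

With both shifts equal, the difference is zero and the expression
reduces to its operand, so the rewrite does not change any value.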
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
offset += PAGE_SIZE;
|
2017-02-10 18:33:41 +00:00
|
|
|
update_nr_written(wbc, 1);
|
2012-03-13 13:38:00 +00:00
|
|
|
unlock_page(p);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (unlikely(ret)) {
|
|
|
|
for (; i < num_pages; i++) {
|
2014-10-04 16:56:45 +00:00
|
|
|
struct page *p = eb->pages[i];
|
2014-09-23 14:22:33 +00:00
|
|
|
clear_page_dirty_for_io(p);
|
2012-03-13 13:38:00 +00:00
|
|
|
unlock_page(p);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
int btree_write_cache_pages(struct address_space *mapping,
|
|
|
|
struct writeback_control *wbc)
|
|
|
|
{
|
|
|
|
struct extent_buffer *eb, *prev_eb = NULL;
|
|
|
|
struct extent_page_data epd = {
|
|
|
|
.bio = NULL,
|
|
|
|
.extent_locked = 0,
|
|
|
|
.sync_io = wbc->sync_mode == WB_SYNC_ALL,
|
|
|
|
};
|
btrfs: Don't submit any btree write bio if the fs has errors
[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.
BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
Read of size 8 at addr ffff888067cf6848 by task umount/1922
CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5b/0x8b
print_address_description+0x70/0x280
kasan_report+0x13a/0x19b
btrfs_queue_work+0x2c1/0x390
btrfs_wq_submit_bio+0x1cd/0x240
btree_submit_bio_hook+0x18c/0x2a0
submit_one_bio+0x1be/0x320
flush_write_bio.isra.41+0x2c/0x70
btree_write_cache_pages+0x3bb/0x7f0
do_writepages+0x5c/0x130
__writeback_single_inode+0xa3/0x9a0
writeback_single_inode+0x23d/0x390
write_inode_now+0x1b5/0x280
iput+0x2ef/0x600
close_ctree+0x341/0x750
generic_shutdown_super+0x126/0x370
kill_anon_super+0x31/0x50
btrfs_kill_super+0x36/0x2b0
deactivate_locked_super+0x80/0xc0
deactivate_super+0x13c/0x150
cleanup_mnt+0x9a/0x130
task_work_run+0x11a/0x1b0
exit_to_usermode_loop+0x107/0x130
do_syscall_64+0x1e5/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[CAUSE]
The fuzzed image has a completely screwed up extent tree:
leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 0 count 1
item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 271 offset 0 count 1
item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
item 3 key (29360128 169 0) itemoff 3803 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 4 key (29368320 169 1) itemoff 3770 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 5 key (29372416 169 0) itemoff 3737 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.
In short, the bug is caused by:
- Existing tree block gets allocated to log tree
  This got its generation bumped.
- Log tree balance cleared the dirty bit of the offending tree block
  It will not be written back to disk, thus no WRITTEN flag.
- Original owner of the tree block gets COWed
  Since the tree block has a higher transid and no WRITTEN flag, it's
  reused, and not traced by transaction::dirty_pages.
- Transaction aborted
  Tree blocks get cleaned according to transaction::dirty_pages. But the
  offending tree block is not recorded at all.
- Filesystem unmount
  All pages are assumed to be clean, so all workqueues are destroyed,
  then iput(btree_inode) is called.
  But the offending tree block is still dirty, which triggers writeback
  and causes the use-after-free bug.
The detailed sequence looks like this:
- Initial status
  eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
  not traced by any dirty extent_io_tree.
- New tree block is allocated
  Since there is no backref for 29421568, it's re-allocated as a new
  tree block.
  Keep in mind that tree block 29421568 is still referenced by the
  extent tree.
- Tree block 29421568 is filled for log tree
  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
  traced by btrfs_root::dirty_log_pages
- Some log tree operations
  Since the fs is using node size 4096, the log tree can easily go a
  level higher.
- Log tree needs balance
  Tree block 29421568 gets all its content pushed to the right, thus
  now it is empty, and we don't need it.
  btrfs_clean_tree_block() from __push_leaf_right() gets called.
  eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
  traced by btrfs_root::dirty_log_pages
- Log tree write back
  btree_write_cache_pages() goes through the dirty page ranges, but
  since the page of tree block 29421568 has already been cleaned, it's
  not written back to disk. Thus it doesn't have the WRITTEN bit set.
  But the ranges in dirty_log_pages are cleared.
  eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
  not traced by any dirty extent_io_tree.
- Extent tree update when committing transaction
  Since tree block 29421568 has a transid equal to the running trans,
  and has no WRITTEN bit, should_cow_block() will use it directly
  without adding it to btrfs_transaction::dirty_pages.
  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
  not traced by any dirty extent_io_tree.
  At this stage, we're doomed. We have a dirty eb not tracked by any
  extent io tree.
- Transaction gets aborted due to corrupted extent tree
  Btrfs cleans up dirty pages according to transaction::dirty_pages and
  btrfs_root::dirty_log_pages.
  But since tree block 29421568 is not tracked by either of them, it's
  still dirty.
  eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
  not traced by any dirty extent_io_tree.
- Filesystem unmount
  Since all cleanup is assumed to be done, all workqueues are destroyed.
  Then iput(btree_inode) is called, expecting no dirty pages.
  But tree block 29421568 is still dirty, thus triggering writeback.
  Since all workqueues are already freed, we cause use-after-free.
This shows us that log tree blocks + a bad extent tree can cause wild
dirty pages.
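The reuse-without-COW step in the sequence above can be illustrated
with a deliberately simplified model. struct model_eb and
model_should_cow() are hypothetical names; the real should_cow_block()
checks more conditions than the two described in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two extent buffer properties named above. */
struct model_eb {
	uint64_t generation;	/* generation stored in the block header */
	bool written;		/* models the WRITTEN header flag */
};

/*
 * Simplified decision: a block created in the running transaction that
 * was never written back can be modified in place, so no COW is done.
 */
bool model_should_cow(const struct model_eb *eb, uint64_t running_transid)
{
	assert(eb != 0);
	if (eb->generation == running_transid && !eb->written)
		return false;
	return true;
}
```

In the initial state (gen=8, WRITTEN set) the block would be COWed by a
transaction with transid 9. After the log tree reuses it (gen=9, no
WRITTEN flag), the same check says it can be reused directly.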
[FIX]
To fix the problem, don't submit any btree write bio if the filesystem
has any error. This is the last safety net, just in case other cleanup
hasn't caught it.
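A minimal user-space sketch of that safety net follows. The type
struct model_fs_info, both functions and the placeholder error value -1
are assumptions made for illustration. The message itself does not name
the error code that is returned.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the filesystem state relevant to the fix. */
struct model_fs_info {
	bool error_state;	/* models the fs-wide error state bit */
	int bios_submitted;	/* counts submitted btree write bios */
};

/* Models submitting the pending btree write bio. */
int model_flush_write_bio(struct model_fs_info *fs)
{
	fs->bios_submitted++;
	return 0;
}

/* Last step of btree writeback: refuse to submit once the fs has an error. */
int model_finish_btree_writeback(struct model_fs_info *fs)
{
	assert(fs != 0);
	if (fs->error_state)
		return -1;	/* placeholder error value, an assumption */
	return model_flush_write_bio(fs);
}
```

With the error state set, no bio reaches the (possibly destroyed)
workqueues. A healthy filesystem still submits as before.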
Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 06:12:44 +00:00
|
|
|
struct btrfs_fs_info *fs_info = BTRFS_I(mapping->host)->root->fs_info;
|
2012-03-13 13:38:00 +00:00
|
|
|
int ret = 0;
|
|
|
|
int done = 0;
|
|
|
|
int nr_to_write_done = 0;
|
|
|
|
struct pagevec pvec;
|
|
|
|
int nr_pages;
|
|
|
|
pgoff_t index;
|
|
|
|
pgoff_t end; /* Inclusive */
|
|
|
|
int scanned = 0;
|
2017-12-05 22:30:38 +00:00
|
|
|
xa_mark_t tag;
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2017-11-16 01:37:52 +00:00
|
|
|
pagevec_init(&pvec);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (wbc->range_cyclic) {
|
|
|
|
index = mapping->writeback_index; /* Start from prev offset */
|
|
|
|
end = -1;
|
2020-01-03 15:38:44 +00:00
|
|
|
/*
|
|
|
|
* Starting from the beginning does not need to cycle over the
|
|
|
|
* range, mark it as scanned.
|
|
|
|
*/
|
|
|
|
scanned = (index == 0);
|
2012-03-13 13:38:00 +00:00
|
|
|
} else {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-01 12:29:47 +00:00
|
|
|
index = wbc->range_start >> PAGE_SHIFT;
|
|
|
|
end = wbc->range_end >> PAGE_SHIFT;
|
2012-03-13 13:38:00 +00:00
|
|
|
scanned = 1;
|
|
|
|
}
|
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL)
|
|
|
|
tag = PAGECACHE_TAG_TOWRITE;
|
|
|
|
else
|
|
|
|
tag = PAGECACHE_TAG_DIRTY;
|
|
|
|
retry:
|
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL)
|
|
|
|
tag_pages_for_writeback(mapping, index, end);
|
|
|
|
while (!done && !nr_to_write_done && (index <= end) &&
|
2017-11-16 01:34:37 +00:00
|
|
|
(nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
|
2017-11-16 01:35:19 +00:00
|
|
|
tag))) {
|
2012-03-13 13:38:00 +00:00
|
|
|
unsigned i;
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
struct page *page = pvec.pages[i];
|
|
|
|
|
|
|
|
if (!PagePrivate(page))
|
|
|
|
continue;
|
|
|
|
|
2012-09-14 17:43:01 +00:00
|
|
|
spin_lock(&mapping->private_lock);
|
|
|
|
if (!PagePrivate(page)) {
|
|
|
|
spin_unlock(&mapping->private_lock);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
eb = (struct extent_buffer *)page->private;
|
2012-09-14 17:43:01 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Shouldn't happen and normally this would be a BUG_ON
|
|
|
|
* but no sense in crashing the users box for something
|
|
|
|
* we can survive anyway.
|
|
|
|
*/
|
2013-10-31 05:00:08 +00:00
|
|
|
if (WARN_ON(!eb)) {
|
2012-09-14 17:43:01 +00:00
|
|
|
spin_unlock(&mapping->private_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2012-09-14 17:43:01 +00:00
|
|
|
if (eb == prev_eb) {
|
|
|
|
spin_unlock(&mapping->private_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
continue;
|
2012-09-14 17:43:01 +00:00
|
|
|
}
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2012-09-14 17:43:01 +00:00
|
|
|
ret = atomic_inc_not_zero(&eb->refs);
|
|
|
|
spin_unlock(&mapping->private_lock);
|
|
|
|
if (!ret)
|
2012-03-13 13:38:00 +00:00
|
|
|
continue;
|
|
|
|
|
|
|
|
prev_eb = eb;
|
2019-03-20 10:21:41 +00:00
|
|
|
ret = lock_extent_buffer_for_io(eb, &epd);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (!ret) {
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
continue;
|
2019-09-11 16:42:28 +00:00
|
|
|
} else if (ret < 0) {
|
|
|
|
done = 1;
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
break;
|
2012-03-13 13:38:00 +00:00
|
|
|
}
|
|
|
|
|
2019-03-20 10:27:57 +00:00
|
|
|
ret = write_one_eb(eb, wbc, &epd);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (ret) {
|
|
|
|
done = 1;
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the filesystem may choose to bump up nr_to_write.
|
|
|
|
* We have to make sure to honor the new nr_to_write
|
|
|
|
* at any time
|
|
|
|
*/
|
|
|
|
nr_to_write_done = wbc->nr_to_write <= 0;
|
|
|
|
}
|
|
|
|
pagevec_release(&pvec);
|
|
|
|
cond_resched();
|
|
|
|
}
|
|
|
|
if (!scanned && !done) {
|
|
|
|
/*
|
|
|
|
* We hit the last page and there is more work to be done: wrap
|
|
|
|
* back to the start of the file
|
|
|
|
*/
|
|
|
|
scanned = 1;
|
|
|
|
index = 0;
|
|
|
|
goto retry;
|
|
|
|
}
|
2019-03-20 06:27:43 +00:00
|
|
|
ASSERT(ret <= 0);
|
|
|
|
if (ret < 0) {
|
|
|
|
end_write_bio(&epd, ret);
|
|
|
|
return ret;
|
|
|
|
}
|
btrfs: Don't submit any btree write bio if the fs has errors
2020-02-12 06:12:44 +00:00
|
|
|
/*
|
|
|
|
* If something went wrong, don't allow any metadata write bio to be
|
|
|
|
* submitted.
|
|
|
|
*
|
|
|
|
* This would prevent use-after-free if we had dirty pages not
|
|
|
|
* cleaned up, which can still happen by fuzzed images.
|
|
|
|
*
|
|
|
|
* - Bad extent tree
|
|
|
|
* Allowing existing tree block to be allocated for other trees.
|
|
|
|
*
|
|
|
|
* - Log tree operations
|
|
|
|
* Existing tree blocks get allocated to log tree, bumps its
|
|
|
|
* generation, then get cleaned in tree re-balance.
|
|
|
|
* Such tree block will not be written back, since it's clean,
|
|
|
|
* thus no WRITTEN flag set.
|
|
|
|
* And after log writes back, this tree block is not traced by
|
|
|
|
* any dirty extent_io_tree.
|
|
|
|
*
|
|
|
|
* - Offending tree block gets re-dirtied from its original owner
|
|
|
|
* Since it has bumped generation, no WRITTEN flag, it can be
|
|
|
|
* reused without COWing. This tree block will not be traced
|
|
|
|
* by btrfs_transaction::dirty_pages.
|
|
|
|
*
|
|
|
|
* Now such dirty tree block will not be cleaned by any dirty
|
|
|
|
* extent io tree. Thus we don't want to submit such wild eb
|
|
|
|
* if the fs already has error.
|
|
|
|
*/
|
|
|
|
if (!test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state)) {
|
|
|
|
ret = flush_write_bio(&epd);
|
|
|
|
} else {
|
btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases
Eric reported seeing this message while running generic/475
BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
Full stack trace:
BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
BTRFS info (device dm-0): forced readonly
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
------------[ cut here ]------------
BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure
BTRFS: Transaction aborted (error -117)
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10
WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs]
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10
CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs]
RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001
RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7
RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000
R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70
FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0
Call Trace:
? finish_wait+0x90/0x90
? __mutex_unlock_slowpath+0x45/0x2a0
? lock_acquire+0xa3/0x440
? lockref_put_or_lock+0x9/0x30
? dput+0x20/0x4a0
? dput+0x20/0x4a0
? do_raw_spin_unlock+0x4b/0xc0
? _raw_spin_unlock+0x1f/0x30
btrfs_sync_file+0x335/0x490 [btrfs]
do_fsync+0x38/0x70
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x50/0xe0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6a6ef1b6e3
Code: Bad RIP value.
RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3
RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace af146e0e38433456 ]---
BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
This ret came from btrfs_write_marked_extents(). If we get an aborted
transaction via EIO before, we'll see it in btree_write_cache_pages()
and return EUCLEAN, which gets printed as "Filesystem corrupted".
Except we shouldn't be returning EUCLEAN here, we need to be returning
EROFS because EUCLEAN is reserved for actual corruption, not IO errors.
We are inconsistent about our handling of BTRFS_FS_STATE_ERROR
elsewhere, but we want to use EROFS for this particular case. The
original transaction abort has the real error code for why we ended up
with an aborted transaction, all subsequent actions just need to return
EROFS because they may not have a trans handle and have no idea about
the original cause of the abort.
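The error-code policy described above can be sketched as follows. This
is a user-space illustration only: struct model_fs, model_abort() and
model_later_op() are hypothetical names, and EIO/EROFS come from the
standard errno.h.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model: the fs remembers why the transaction was aborted. */
struct model_fs {
	int abort_errno;	/* 0 while healthy, else the original error */
};

/* The first abort records the real cause; later aborts do not overwrite it. */
void model_abort(struct model_fs *fs, int err)
{
	assert(err < 0);
	if (!fs->abort_errno)
		fs->abort_errno = err;
}

/* Any later operation only reports that the fs is no longer writable. */
int model_later_op(const struct model_fs *fs)
{
	return fs->abort_errno ? -EROFS : 0;
}
```

The original cause (-EIO here) stays attached to the abort, while
callers that come later see -EROFS rather than a code that would
suggest corruption.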
After patch "btrfs: don't WARN if we abort a transaction with EROFS" the
stacktrace will not be dumped either.
Reported-by: Eric Sandeen <esandeen@redhat.com>
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add full test stacktrace ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-21 14:38:37 +00:00
|
|
|
ret = -EROFS;
|
btrfs: Don't submit any btree write bio if the fs has errors
[BUG]
There is a fuzzed image which could cause KASAN report at unmount time.
BUG: KASAN: use-after-free in btrfs_queue_work+0x2c1/0x390
Read of size 8 at addr ffff888067cf6848 by task umount/1922
CPU: 0 PID: 1922 Comm: umount Tainted: G W 5.0.21 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x5b/0x8b
print_address_description+0x70/0x280
kasan_report+0x13a/0x19b
btrfs_queue_work+0x2c1/0x390
btrfs_wq_submit_bio+0x1cd/0x240
btree_submit_bio_hook+0x18c/0x2a0
submit_one_bio+0x1be/0x320
flush_write_bio.isra.41+0x2c/0x70
btree_write_cache_pages+0x3bb/0x7f0
do_writepages+0x5c/0x130
__writeback_single_inode+0xa3/0x9a0
writeback_single_inode+0x23d/0x390
write_inode_now+0x1b5/0x280
iput+0x2ef/0x600
close_ctree+0x341/0x750
generic_shutdown_super+0x126/0x370
kill_anon_super+0x31/0x50
btrfs_kill_super+0x36/0x2b0
deactivate_locked_super+0x80/0xc0
deactivate_super+0x13c/0x150
cleanup_mnt+0x9a/0x130
task_work_run+0x11a/0x1b0
exit_to_usermode_loop+0x107/0x130
do_syscall_64+0x1e5/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
[CAUSE]
The fuzzed image has a completely screwd up extent tree:
leaf 29421568 gen 8 total ptrs 6 free space 3587 owner EXTENT_TREE
refs 2 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 5938
item 0 key (12587008 168 4096) itemoff 3942 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 0 count 1
item 1 key (12591104 168 8192) itemoff 3889 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 271 offset 0 count 1
item 2 key (12599296 168 4096) itemoff 3836 itemsize 53
extent refs 1 gen 9 flags 1
ref#0: extent data backref root 5 objectid 259 offset 4096 count 1
item 3 key (29360128 169 0) itemoff 3803 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 4 key (29368320 169 1) itemoff 3770 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
item 5 key (29372416 169 0) itemoff 3737 itemsize 33
extent refs 1 gen 9 flags 2
ref#0: tree block backref root 5
Note that leaf 29421568 doesn't have its backref in the extent tree.
Thus extent allocator can re-allocate leaf 29421568 for other trees.
In short, the bug is caused by:
- Existing tree block gets allocated to log tree
This got its generation bumped.
- Log tree balance cleaned dirty bit of offending tree block
It will not be written back to disk, thus no WRITTEN flag.
- Original owner of the tree block gets COWed
Since the tree block has higher transid, no WRITTEN flag, it's reused,
and not traced by transaction::dirty_pages.
- Transaction aborted
Tree blocks get cleaned according to transaction::dirty_pages. But the
offending tree block is not recorded at all.
- Filesystem unmount
All pages are assumed to be are clean, destroying all workqueue, then
call iput(btree_inode).
But offending tree block is still dirty, which triggers writeback, and
causes use-after-free bug.
The detailed sequence looks like this:
- Initial status
eb: 29421568, header=WRITTEN bflags_dirty=0, page_dirty=0, gen=8,
not traced by any dirty extent_iot_tree.
- New tree block is allocated
Since there is no backref for 29421568, it's re-allocated as new tree
block.
Keep in mind that tree block 29421568 is still referred by extent
tree.
- Tree block 29421568 is filled for log tree
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9 << (gen bumped)
traced by btrfs_root::dirty_log_pages
- Some log tree operations
Since the fs is using node size 4096, the log tree can easily go a
level higher.
- Log tree needs balance
Tree block 29421568 gets all its content pushed to right, thus now
it is empty, and we don't need it.
btrfs_clean_tree_block() from __push_leaf_right() gets called.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
traced by btrfs_root::dirty_log_pages
- Log tree write back
btree_write_cache_pages() goes through dirty pages ranges, but since
page of tree block 29421568 gets cleaned already, it's not written
back to disk. Thus it doesn't have WRITTEN bit set.
But ranges in dirty_log_pages are cleared.
eb: 29421568, header=0 bflags_dirty=0, page_dirty=0, gen=9
not traced by any dirty extent_io_tree.
- Extent tree update when committing transaction
Since tree block 29421568 has transid equal to running trans, and has
no WRITTEN bit, should_cow_block() will use it directly without adding
it to btrfs_transaction::dirty_pages.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_io_tree.
At this stage, we're doomed. We have a dirty eb not tracked by any
extent io tree.
- Transaction gets aborted due to corrupted extent tree
Btrfs cleans up dirty pages according to transaction::dirty_pages and
btrfs_root::dirty_log_pages.
But since tree block 29421568 is tracked by neither of them, it's
still dirty.
eb: 29421568, header=0 bflags_dirty=1, page_dirty=1, gen=9
not traced by any dirty extent_io_tree.
- Filesystem unmount
Since all cleanup is assumed to be done, all workqueues are destroyed.
Then iput(btree_inode) is called, expecting no dirty pages.
But tree block 29421568 is still dirty, thus triggering writeback.
Since all workqueues are already freed, we cause a use-after-free.
This shows that log tree blocks combined with a bad extent tree can
cause wild dirty pages.
[FIX]
To fix the problem, don't submit any btree write bio if the filesystem
has any error. This is the last safety net, just in case other cleanup
hasn't caught it.
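The idea of the fix can be sketched in user space as follows. This is a minimal model and not the kernel code: `struct fs_model`, `struct bio_model` and `finish_btree_write()` are hypothetical names introduced for illustration, and the use of `-EROFS` as the error code is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the filesystem-wide error state. */
struct fs_model {
	bool has_error;
};

/* Hypothetical stand-in for a pending btree write bio. */
struct bio_model {
	int pages;      /* pages queued in the bio */
	bool submitted; /* set when the bio is sent to the device */
	int status;     /* 0, or the negative errno the bio was ended with */
};

/*
 * Last safety net: only submit the pending bio when the filesystem is
 * healthy. Otherwise end the bio with an error so nothing reaches the disk.
 */
static int finish_btree_write(const struct fs_model *fs, struct bio_model *bio)
{
	assert(fs != NULL && bio != NULL);

	if (!fs->has_error) {
		bio->submitted = true;
		bio->status = 0;
		return 0;
	}

	bio->submitted = false;
	bio->status = -EROFS; /* assumed error code for this sketch */
	return -EROFS;
}
```

With a healthy filesystem the bio is submitted as before; once the error flag is set, a stray dirty tree block can no longer start I/O at unmount time.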
Link: https://github.com/bobfuzzer/CVE/tree/master/CVE-2019-19377
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12 06:12:44 +00:00
|
|
|
end_write_bio(&epd, ret);
|
|
|
|
}
|
2012-03-13 13:38:00 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
/**
|
2008-09-08 15:18:08 +00:00
|
|
|
* write_cache_pages - walk the list of dirty pages of the given address space and write all of them.
|
2008-01-24 21:13:08 +00:00
|
|
|
* @mapping: address space structure to write
|
|
|
|
* @wbc: subtract the number of written pages from *@wbc->nr_to_write
|
2017-06-23 02:30:28 +00:00
|
|
|
* @epd: data passed to __extent_writepage function
|
2008-01-24 21:13:08 +00:00
|
|
|
*
|
|
|
|
* If a page is already under I/O, write_cache_pages() skips it, even
|
|
|
|
* if it's dirty. This is desirable behaviour for memory-cleaning writeback,
|
|
|
|
* but it is INCORRECT for data-integrity system calls such as fsync(). fsync()
|
|
|
|
* and msync() need to guarantee that all the data which was dirty at the time
|
|
|
|
* the call was made get new I/O started against them. If wbc->sync_mode is
|
|
|
|
* WB_SYNC_ALL then we were called for data integrity and we must wait for
|
|
|
|
* existing IO to complete.
|
|
|
|
*/
|
2017-02-10 18:38:24 +00:00
|
|
|
static int extent_write_cache_pages(struct address_space *mapping,
|
2008-09-08 15:18:08 +00:00
|
|
|
struct writeback_control *wbc,
|
2017-11-30 17:00:02 +00:00
|
|
|
struct extent_page_data *epd)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2012-06-27 21:18:41 +00:00
|
|
|
struct inode *inode = mapping->host;
|
2008-01-24 21:13:08 +00:00
|
|
|
int ret = 0;
|
|
|
|
int done = 0;
|
2009-09-18 20:03:16 +00:00
|
|
|
int nr_to_write_done = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
struct pagevec pvec;
|
|
|
|
int nr_pages;
|
|
|
|
pgoff_t index;
|
|
|
|
pgoff_t end; /* Inclusive */
|
2016-03-08 00:56:21 +00:00
|
|
|
pgoff_t done_index;
|
|
|
|
int range_whole = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
int scanned = 0;
|
2017-12-05 22:30:38 +00:00
|
|
|
xa_mark_t tag;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2012-06-27 21:18:41 +00:00
|
|
|
/*
|
|
|
|
* We have to hold onto the inode so that ordered extents can do their
|
|
|
|
* work when the IO finishes. The alternative to this is failing to add
|
|
|
|
* an ordered extent if the igrab() fails there and that is a huge pain
|
|
|
|
* to deal with, so instead just hold onto the inode throughout the
|
|
|
|
* writepages operation. If it fails here we are freeing up the inode
|
|
|
|
* anyway and we'd rather not waste our time writing out stuff that is
|
|
|
|
* going to be truncated anyway.
|
|
|
|
*/
|
|
|
|
if (!igrab(inode))
|
|
|
|
return 0;
|
|
|
|
|
2017-11-16 01:37:52 +00:00
|
|
|
pagevec_init(&pvec);
|
2008-01-24 21:13:08 +00:00
|
|
|
if (wbc->range_cyclic) {
|
|
|
|
index = mapping->writeback_index; /* Start from prev offset */
|
|
|
|
end = -1;
|
2020-01-03 15:38:44 +00:00
|
|
|
/*
|
|
|
|
* Starting from the beginning, there is no need to cycle over the
|
|
|
|
* range, so mark it as scanned.
|
|
|
|
*/
|
|
|
|
scanned = (index == 0);
|
2008-01-24 21:13:08 +00:00
|
|
|
} else {
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
index = wbc->range_start >> PAGE_SHIFT;
|
|
|
|
end = wbc->range_end >> PAGE_SHIFT;
|
2016-03-08 00:56:21 +00:00
|
|
|
if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
|
|
|
|
range_whole = 1;
|
2008-01-24 21:13:08 +00:00
|
|
|
scanned = 1;
|
|
|
|
}
|
2018-11-01 06:49:03 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We do the tagged writepage as long as the snapshot flush bit is set
|
|
|
|
* and we are the first one to do the filemap_flush() on this inode.
|
|
|
|
*
|
|
|
|
* The nr_to_write == LONG_MAX is needed to make sure other flushers do
|
|
|
|
* not race in and drop the bit.
|
|
|
|
*/
|
|
|
|
if (range_whole && wbc->nr_to_write == LONG_MAX &&
|
|
|
|
test_and_clear_bit(BTRFS_INODE_SNAPSHOT_FLUSH,
|
|
|
|
&BTRFS_I(inode)->runtime_flags))
|
|
|
|
wbc->tagged_writepages = 1;
|
|
|
|
|
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
|
2011-07-15 21:26:38 +00:00
|
|
|
tag = PAGECACHE_TAG_TOWRITE;
|
|
|
|
else
|
|
|
|
tag = PAGECACHE_TAG_DIRTY;
|
2008-01-24 21:13:08 +00:00
|
|
|
retry:
|
2018-11-01 06:49:03 +00:00
|
|
|
if (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)
|
2011-07-15 21:26:38 +00:00
|
|
|
tag_pages_for_writeback(mapping, index, end);
|
2016-03-08 00:56:21 +00:00
|
|
|
done_index = index;
|
2009-09-18 20:03:16 +00:00
|
|
|
while (!done && !nr_to_write_done && (index <= end) &&
|
2017-11-16 01:35:19 +00:00
|
|
|
(nr_pages = pagevec_lookup_range_tag(&pvec, mapping,
|
|
|
|
&index, end, tag))) {
|
2008-01-24 21:13:08 +00:00
|
|
|
unsigned i;
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
struct page *page = pvec.pages[i];
|
|
|
|
|
btrfs: Avoid getting stuck during cyclic writebacks
During a cyclic writeback, extent_write_cache_pages() uses done_index
to update the writeback_index after the current run is over. However,
instead of current index + 1, it gets set to the current index itself.
Unfortunately, this, combined with returning on EOF instead of looping
back, can lead to the following pathological behavior.
1. There is a single file which has accumulated enough dirty pages to
trigger balance_dirty_pages() and the writer appending to the file
with a series of short writes.
2. balance_dirty_pages kicks in, wakes up background writeback and sleeps.
3. Writeback kicks in and the cursor is on the last page of the dirty
file. Writeback is started or skipped if already in progress. As
it's EOF, extent_write_cache_pages() returns and the cursor is set
to done_index which is pointing to the last page.
4. Writeback is done. Nothing happens till balance_dirty_pages
finishes, at which point we go back to #1.
This can almost completely stall out writing back of the file and keep
the system over dirty threshold for a long time which can mess up the
whole system. We encountered this issue in production with a package
handling application which can reliably reproduce the issue when
running under tight memory limits.
Reading the comment in the error handling section, this seems to be to
avoid accidentally skipping a page in case the write attempt on the
page doesn't succeed. However, this concern seems bogus.
On each page, the code either:
* Skips and moves on to the next page.
* Fails to issue and sets done_index to index + 1.
* Successfully issues and continues to the next page if budget allows
and not EOF.
IOW, as long as it's not EOF and there's budget, the code never
retries writing back the same page. Only when a page happens to be
the last page of a particular run, we end up retrying the page, which
can't possibly guarantee anything data integrity related. Besides,
cyclic writes are only used for non-syncing writebacks meaning that
there's no data integrity implication to begin with.
Fix it by always setting done_index past the current page being
processed.
Note that this problem exists in other writepages too.
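The difference between the two cursor policies can be illustrated with a deliberately simplified model. It assumes every page from the cursor to EOF is issued successfully and that there is no write budget; `writeback_run()` and its `past_current` switch are hypothetical names for this sketch and do not exist in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * One simplified writeback run over pages [cursor, last_page]. Returns the
 * cursor to store for the next run. With past_current == false the cursor
 * is left on the last page processed (the old behaviour described above);
 * with past_current == true it is moved one past that page (the fix).
 */
static unsigned long writeback_run(unsigned long cursor,
				   unsigned long last_page,
				   bool past_current)
{
	unsigned long done_index = cursor;
	unsigned long index;

	/* Guard against index wrapping around in the loop below. */
	assert(last_page < ULONG_MAX);

	for (index = cursor; index <= last_page; index++)
		done_index = past_current ? index + 1 : index;

	return done_index;
}
```

Starting with the cursor on the last page of a 10-page file (index 9), the old policy returns 9 again, so the next run starts on the same page, while the fixed policy returns 10 and moves on.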
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-03 14:27:13 +00:00
|
|
|
done_index = page->index + 1;
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
2018-04-10 23:36:56 +00:00
|
|
|
* At this point we hold neither the i_pages lock nor
|
|
|
|
* the page lock: the page may be truncated or
|
|
|
|
* invalidated (changing page->mapping to NULL),
|
|
|
|
* or even swizzled back from swapper_space to
|
|
|
|
* tmpfs file mapping
|
2008-01-24 21:13:08 +00:00
|
|
|
*/
|
2013-02-11 16:33:00 +00:00
|
|
|
if (!trylock_page(page)) {
|
2019-03-20 06:27:41 +00:00
|
|
|
ret = flush_write_bio(epd);
|
|
|
|
BUG_ON(ret < 0);
|
2013-02-11 16:33:00 +00:00
|
|
|
lock_page(page);
|
2011-11-01 14:08:06 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
if (unlikely(page->mapping != mapping)) {
|
|
|
|
unlock_page(page);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2008-11-19 17:44:22 +00:00
|
|
|
if (wbc->sync_mode != WB_SYNC_NONE) {
|
2019-03-20 06:27:41 +00:00
|
|
|
if (PageWriteback(page)) {
|
|
|
|
ret = flush_write_bio(epd);
|
|
|
|
BUG_ON(ret < 0);
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
wait_on_page_writeback(page);
|
2008-11-19 17:44:22 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
if (PageWriteback(page) ||
|
|
|
|
!clear_page_dirty_for_io(page)) {
|
|
|
|
unlock_page(page);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2017-11-30 17:00:02 +00:00
|
|
|
ret = __extent_writepage(page, wbc, epd);
|
2016-03-08 00:56:21 +00:00
|
|
|
if (ret < 0) {
|
|
|
|
done = 1;
|
|
|
|
break;
|
|
|
|
}
|
2009-09-18 20:03:16 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* the filesystem may choose to bump up nr_to_write.
|
|
|
|
* We have to make sure to honor the new nr_to_write
|
|
|
|
* at any time
|
|
|
|
*/
|
|
|
|
nr_to_write_done = wbc->nr_to_write <= 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
pagevec_release(&pvec);
|
|
|
|
cond_resched();
|
|
|
|
}
|
2016-03-08 00:56:22 +00:00
|
|
|
if (!scanned && !done) {
|
2008-01-24 21:13:08 +00:00
|
|
|
/*
|
|
|
|
* We hit the last page and there is more work to be done: wrap
|
|
|
|
* back to the start of the file
|
|
|
|
*/
|
|
|
|
scanned = 1;
|
|
|
|
index = 0;
|
2020-01-23 20:33:02 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we're looping we could run into a page that is locked by a
|
|
|
|
* writer and that writer could be waiting on writeback for a
|
|
|
|
* page in our current bio, and thus deadlock, so flush the
|
|
|
|
* write bio here.
|
|
|
|
*/
|
|
|
|
ret = flush_write_bio(epd);
|
|
|
|
if (!ret)
|
|
|
|
goto retry;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2016-03-08 00:56:21 +00:00
|
|
|
|
|
|
|
if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole))
|
|
|
|
mapping->writeback_index = done_index;
|
|
|
|
|
2012-06-27 21:18:41 +00:00
|
|
|
btrfs_add_delayed_iput(inode);
|
2016-03-08 00:56:22 +00:00
|
|
|
return ret;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
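The range_cyclic scan order used by the function above (start at the saved cursor, run to the end, then wrap to page 0 at most once) can be modelled in user space. This is only a sketch: the dirty bitmap, the `order` output array and `cyclic_scan()` are hypothetical, and locking, write budget and error handling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Write back every dirty page, starting at 'start' and wrapping to page 0
 * at most once. Page indices are stored in 'order' in the order they were
 * visited; the return value is how many pages were written.
 */
static size_t cyclic_scan(bool *dirty, size_t npages, size_t start,
			  size_t *order)
{
	size_t written = 0;
	size_t index = start;
	/* Starting at page 0 already covers the whole range. */
	bool scanned = (start == 0);

	assert(start <= npages);
retry:
	for (; index < npages; index++) {
		if (!dirty[index])
			continue;
		dirty[index] = false; /* models clear_page_dirty_for_io() */
		order[written++] = index;
	}

	if (!scanned) {
		/* Hit the end of the file: wrap back to the start once. */
		scanned = true;
		index = 0;
		goto retry;
	}
	return written;
}
```

For six pages with pages 0, 2, 4 and 5 dirty and the cursor at page 3, the visit order is 4, 5, 0, 2.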
|
|
|
|
|
2017-12-08 13:55:59 +00:00
|
|
|
int extent_write_full_page(struct page *page, struct writeback_control *wbc)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
struct extent_page_data epd = {
|
|
|
|
.bio = NULL,
|
2008-11-07 03:02:51 +00:00
|
|
|
.extent_locked = 0,
|
2009-04-20 19:50:09 +00:00
|
|
|
.sync_io = wbc->sync_mode == WB_SYNC_ALL,
|
2008-01-24 21:13:08 +00:00
|
|
|
};
|
|
|
|
|
|
|
|
ret = __extent_writepage(page, wbc, &epd);
|
2019-03-20 06:27:42 +00:00
|
|
|
ASSERT(ret <= 0);
|
|
|
|
if (ret < 0) {
|
|
|
|
end_write_bio(&epd, ret);
|
|
|
|
return ret;
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2019-03-20 06:27:42 +00:00
|
|
|
ret = flush_write_bio(&epd);
|
|
|
|
ASSERT(ret <= 0);
|
2008-01-24 21:13:08 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-12-08 13:55:58 +00:00
|
|
|
int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
|
2008-11-07 03:02:51 +00:00
|
|
|
int mode)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct address_space *mapping = inode->i_mapping;
|
|
|
|
struct page *page;
|
2016-04-01 12:29:47 +00:00
|
|
|
unsigned long nr_pages = (end - start + PAGE_SIZE) >>
|
|
|
|
PAGE_SHIFT;
|
2008-11-07 03:02:51 +00:00
|
|
|
|
|
|
|
struct extent_page_data epd = {
|
|
|
|
.bio = NULL,
|
|
|
|
.extent_locked = 1,
|
2009-04-20 19:50:09 +00:00
|
|
|
.sync_io = mode == WB_SYNC_ALL,
|
2008-11-07 03:02:51 +00:00
|
|
|
};
|
|
|
|
struct writeback_control wbc_writepages = {
|
|
|
|
.sync_mode = mode,
|
|
|
|
.nr_to_write = nr_pages * 2,
|
|
|
|
.range_start = start,
|
|
|
|
.range_end = end + 1,
|
2019-07-10 19:28:17 +00:00
|
|
|
/* We're called from an async helper function */
|
|
|
|
.punt_to_cgroup = 1,
|
|
|
|
.no_cgroup_owner = 1,
|
2008-11-07 03:02:51 +00:00
|
|
|
};
|
|
|
|
|
2019-07-10 19:28:18 +00:00
|
|
|
wbc_attach_fdatawrite_inode(&wbc_writepages, inode);
|
2009-01-06 02:25:51 +00:00
|
|
|
while (start <= end) {
|
2016-04-01 12:29:47 +00:00
|
|
|
page = find_get_page(mapping, start >> PAGE_SHIFT);
|
2008-11-07 03:02:51 +00:00
|
|
|
if (clear_page_dirty_for_io(page))
|
|
|
|
ret = __extent_writepage(page, &wbc_writepages, &epd);
|
|
|
|
else {
|
2018-11-01 12:09:48 +00:00
|
|
|
btrfs_writepage_endio_finish_ordered(page, start,
|
2018-11-08 08:18:08 +00:00
|
|
|
start + PAGE_SIZE - 1, 1);
|
2008-11-07 03:02:51 +00:00
|
|
|
unlock_page(page);
|
|
|
|
}
|
2016-04-01 12:29:47 +00:00
|
|
|
put_page(page);
|
|
|
|
start += PAGE_SIZE;
|
2008-11-07 03:02:51 +00:00
|
|
|
}
|
|
|
|
|
2019-03-20 06:27:45 +00:00
|
|
|
ASSERT(ret <= 0);
|
2019-07-10 19:28:18 +00:00
|
|
|
if (ret == 0)
|
|
|
|
ret = flush_write_bio(&epd);
|
|
|
|
else
|
2019-03-20 06:27:45 +00:00
|
|
|
end_write_bio(&epd, ret);
|
2019-07-10 19:28:18 +00:00
|
|
|
|
|
|
|
wbc_detach_inode(&wbc_writepages);
|
2008-11-07 03:02:51 +00:00
|
|
|
return ret;
|
|
|
|
}
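The page count at the top of extent_write_locked_range() treats `end` as inclusive. A small worked sketch of that arithmetic follows; it assumes 4 KiB pages, a page-aligned `start`, and an `end` that is the last byte of a page. The `SKETCH_*` macros and `pages_in_locked_range()` are hypothetical names for this example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Number of pages covered by the inclusive byte range [start, end]. */
static uint64_t pages_in_locked_range(uint64_t start, uint64_t end)
{
	assert(start <= end);
	return (end - start + SKETCH_PAGE_SIZE) >> SKETCH_PAGE_SHIFT;
}
```

For example, the range [0, 4095] covers one page and [4096, 16383] covers three. The function above then doubles this count to set nr_to_write.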
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2018-04-19 07:46:38 +00:00
|
|
|
int extent_writepages(struct address_space *mapping,
|
2008-01-24 21:13:08 +00:00
|
|
|
struct writeback_control *wbc)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
struct extent_page_data epd = {
|
|
|
|
.bio = NULL,
|
2008-11-07 03:02:51 +00:00
|
|
|
.extent_locked = 0,
|
2009-04-20 19:50:09 +00:00
|
|
|
.sync_io = wbc->sync_mode == WB_SYNC_ALL,
|
2008-01-24 21:13:08 +00:00
|
|
|
};
|
|
|
|
|
2017-06-23 02:30:28 +00:00
|
|
|
ret = extent_write_cache_pages(mapping, wbc, &epd);
|
2019-03-20 06:27:48 +00:00
|
|
|
ASSERT(ret <= 0);
|
|
|
|
if (ret < 0) {
|
|
|
|
end_write_bio(&epd, ret);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
ret = flush_write_bio(&epd);
|
2008-01-24 21:13:08 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-06-02 04:47:05 +00:00
|
|
|
void extent_readahead(struct readahead_control *rac)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
struct bio *bio = NULL;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
unsigned long bio_flags = 0;
|
Btrfs: improve multi-thread buffer read
While testing with my buffer read fio jobs[1], I found that btrfs does not
perform well enough.
Here is a scenario in the fio jobs:
We have 4 threads, "t1 t2 t3 t4", starting to buffer read the same file,
and all of them will race on add_to_page_cache_lru(). If one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.
What's more, reading a page takes some time to finish, during which the
other threads can slide in and process the remaining pages:
     t1          t2          t3          t4
  add Page1
  read Page1  add Page2
     |        read Page2  add Page3
     |           |        read Page3  add Page4
     |           |           |        read Page4
-----|-----------|-----------|-----------|--------
     v           v           v           v
    bio         bio         bio         bio
Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio. Thus, we can end up with far more bios
than we need.
Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.
With that said, we can make each bio hold more pages and reduce the number
of bios we need.
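The effect on the number of bios can be sketched with a small counting model. It assumes that a bio can only hold consecutive pages, as stated above, and ignores any size limit on a bio; `count_bios()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/*
 * Count how many bios are needed to read 'nr' pages submitted in the given
 * order, when a bio can only be extended by the page that directly follows
 * the previous one.
 */
static int count_bios(const unsigned long *pages, int nr)
{
	int bios = 0;
	int i;

	assert(nr >= 0);

	for (i = 0; i < nr; i++) {
		/* A gap (or the very first page) starts a new bio. */
		if (i == 0 || pages[i] != pages[i - 1] + 1)
			bios++;
	}
	return bios;
}
```

In the scenario above each thread submits a single page on its own, so four calls with one page each give four bios, whereas one caller holding pages 1 to 4 needs a single bio.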
Here are some numbers taken from fio results:
             w/o patch               w patch
           -------------  --------  ---------------
READ:         745MB/s       +25%       934MB/s
[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/
[READ]
filename=foobar
size=2000M
invalidate=1
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-07-21 03:43:09 +00:00
|
|
|
struct page *pagepool[16];
|
2013-07-25 11:22:37 +00:00
|
|
|
struct extent_map *em_cached = NULL;
|
2015-09-28 08:56:26 +00:00
|
|
|
u64 prev_em_start = (u64)-1;
|
2020-06-02 04:47:05 +00:00
|
|
|
int nr;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-06-02 04:47:05 +00:00
|
|
|
while ((nr = readahead_page_batch(rac, pagepool))) {
|
|
|
|
u64 contig_start = page_offset(pagepool[0]);
|
|
|
|
u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1;
|
2019-03-11 07:55:38 +00:00
|
|
|
|
2020-06-02 04:47:05 +00:00
|
|
|
ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
|
2019-03-11 07:55:38 +00:00
|
|
|
|
2020-06-02 04:47:05 +00:00
|
|
|
contiguous_readpages(pagepool, nr, contig_start, contig_end,
|
|
|
|
&em_cached, &bio, &bio_flags, &prev_em_start);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2012-07-21 03:43:09 +00:00
|
|
|
|
2013-07-25 11:22:37 +00:00
|
|
|
if (em_cached)
|
|
|
|
free_extent_map(em_cached);
|
|
|
|
|
2020-06-02 04:47:05 +00:00
|
|
|
if (bio) {
|
|
|
|
if (submit_one_bio(bio, 0, bio_flags))
|
|
|
|
return;
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
}
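The contiguity check in extent_readahead() relies on each batch being a run of consecutive pages. The sketch below reproduces that arithmetic in user space under the assumption of 4 KiB pages; `struct byte_range`, `batch_range()` and the `SKETCH_*` macros are hypothetical, and page indices stand in for `struct page` pointers.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumed 4 KiB pages */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/* Inclusive byte range covered by one batch of pages. */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/* Compute the byte range of a batch given the page index of each page. */
static struct byte_range batch_range(const uint64_t *page_index, int nr)
{
	struct byte_range r;

	assert(nr > 0);

	r.start = page_index[0] << SKETCH_PAGE_SHIFT;
	r.end = (page_index[nr - 1] << SKETCH_PAGE_SHIFT) + SKETCH_PAGE_SIZE - 1;

	/* Same invariant as the ASSERT() above: the batch has no holes. */
	assert(r.start + (uint64_t)nr * SKETCH_PAGE_SIZE - 1 == r.end);
	return r;
}
```

A batch of pages 4, 5 and 6 covers bytes 16384 through 28671.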
|
|
|
|
|
|
|
|
/*
|
|
|
|
* basic invalidatepage code, this waits on any locked or writeback
|
|
|
|
* ranges corresponding to the page, and then deletes any extent state
|
|
|
|
* records from the tree
|
|
|
|
*/
|
|
|
|
int extent_invalidatepage(struct extent_io_tree *tree,
|
|
|
|
struct page *page, unsigned long offset)
|
|
|
|
{
|
2010-02-03 19:33:23 +00:00
|
|
|
struct extent_state *cached_state = NULL;
|
2012-12-21 09:17:45 +00:00
|
|
|
u64 start = page_offset(page);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
the script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
	u64 end = start + PAGE_SIZE - 1;
2008-01-24 21:13:08 +00:00
	size_t blocksize = page->mapping->host->i_sb->s_blocksize;

2020-11-13 12:51:39 +00:00
	/* This function is only called for the btree inode */
	ASSERT(tree->owner == IO_TREE_BTREE_INODE_IO);

2013-02-26 08:10:22 +00:00
	start += ALIGN(offset, blocksize);
2008-01-24 21:13:08 +00:00
	if (start > end)
		return 0;

2015-12-03 13:30:40 +00:00
	lock_extent_bits(tree, start, end, &cached_state);
2009-09-02 17:24:36 +00:00
	wait_on_page_writeback(page);
2020-11-13 12:51:39 +00:00

	/*
	 * Currently for btree io tree, only EXTENT_LOCKED is utilized,
	 * so here we only need to unlock the extent range to free any
	 * existing extent state.
	 */
	unlock_extent_cached(tree, start, end, &cached_state);
2008-01-24 21:13:08 +00:00
	return 0;
}

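The range arithmetic in extent_invalidatepage() above can be tried outside the kernel. The following is a minimal userspace sketch, not kernel code: the names sketch_align(), sketch_invalidate_range() and SKETCH_PAGE_SIZE are hypothetical, and a 4096-byte page is assumed. It only mirrors how start is rounded up to the block size and compared against the last byte of the page.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for this sketch; the kernel's PAGE_SIZE depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Round x up to a multiple of a. a must be a power of two, as with the kernel's ALIGN(). */
static uint64_t sketch_align(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/*
 * Compute the inclusive byte range [*start, *end] that would be locked when
 * invalidating a page at file offset page_off from `offset` bytes into it.
 * Returns 0 if the block-aligned start is past the end of the page (nothing
 * to lock), 1 otherwise.
 */
static int sketch_invalidate_range(uint64_t page_off, uint64_t offset,
				   uint64_t blocksize,
				   uint64_t *start, uint64_t *end)
{
	*start = page_off + sketch_align(offset, blocksize);
	*end = page_off + SKETCH_PAGE_SIZE - 1;
	return *start <= *end;
}
```

With a block size equal to the page size, any non-zero offset rounds start up past end, so the function returns early without touching the extent state. With a smaller block size, a partial invalidation still covers the blocks after the aligned offset.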
2008-04-18 14:29:50 +00:00
/*
 * a helper for releasepage, this tests for areas of the page that
 * are locked or under IO and drops the related state bits if it is safe
 * to drop the page.
 */
2018-04-19 07:46:35 +00:00
static int try_release_extent_state(struct extent_io_tree *tree,
2013-04-25 20:41:01 +00:00
				    struct page *page, gfp_t mask)
2008-04-18 14:29:50 +00:00
{
2012-12-21 09:17:45 +00:00
	u64 start = page_offset(page);
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-01 12:29:47 +00:00
	u64 end = start + PAGE_SIZE - 1;
2008-04-18 14:29:50 +00:00
	int ret = 1;

2019-03-14 13:28:31 +00:00
	if (test_range_bit(tree, start, end, EXTENT_LOCKED, 0, NULL)) {
2008-04-18 14:29:50 +00:00
		ret = 0;
2019-03-14 13:28:31 +00:00
	} else {
2009-09-24 00:28:46 +00:00
		/*
btrfs: update the number of bytes used by an inode atomically
There are several occasions where we do not update the inode's number of
used bytes atomically, resulting in a concurrent stat(2) syscall reporting
a number of used blocks that is not valid, that is, a value that matches
neither what we had before the operation nor what we get after the
operation completes.
In extreme cases it can result in stat(2) reporting zero used blocks, which
can cause problems for some userspace tools where they can consider a file
with a non-zero size and zero used blocks as completely sparse and skip
reading data, as reported/discussed a long time ago in some threads like
the following:
https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
The cases where this can happen are the following:
-> Case 1
If we do a write (buffered or direct IO) against a file region for which
there is already an allocated extent (or multiple extents), then we have a
short time window where we can report a number of used blocks to stat(2)
that does not take into account the file region being overwritten. This
short time window happens when completing the ordered extent(s).
This happens because when we drop the extents in the write range we
decrement the inode's number of bytes and later on when we insert the new
extent(s) we increment the number of bytes in the inode, resulting in a
short time window where a stat(2) syscall can get an incorrect number of
used blocks.
If we do writes that overwrite an entire file, then we have a short time
window where we report 0 used blocks to stat(2).
Example reproducer:
$ cat reproducer-1.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
expected=$(stat -c %b $MNT/foobar)
# Create a process to keep calling stat(2) on the file and see if the
# reported number of blocks used (disk space used) changes, it should
# not because we are not increasing the file size nor punching holes.
stat_loop $MNT/foobar $expected &
loop_pid=$!
for ((i = 0; i < 50000; i++)); do
xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
done
kill $loop_pid &> /dev/null
wait
umount $DEV
$ ./reproducer-1.sh
ERROR: unexpected used blocks (got: 0 expected: 128)
ERROR: unexpected used blocks (got: 0 expected: 128)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 2
If we do a buffered write against a file region that does not have any
allocated extents, like a hole or beyond EOF, then during ordered extent
completion we have a short time window where a concurrent stat(2) syscall
can report a number of used blocks that does not correspond to the value
before or after the write operation, a value that is actually larger than
the value after the write completes.
This happens because once we start a buffered write into an unallocated
file range we increment the inode's 'new_delalloc_bytes', to make sure
any stat(2) call gets a correct used blocks value before delalloc is
flushed and completes. However at ordered extent completion, after we
inserted the new extent, we increment the inode's number of bytes used
with the size of the new extent, and only later, when clearing the range
in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
counter with the size of the extent. So this results in a short time
window where a concurrent stat(2) syscall can report a number of used
blocks that accounts for the new extent twice.
Example reproducer:
$ cat reproducer-2.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
stat_loop()
{
trap "wait; exit" SIGTERM
local filepath=$1
local expected=$2
local got
while :; do
got=$(stat -c %b $filepath)
if [ $got -ne $expected ]; then
echo -n "ERROR: unexpected used blocks"
echo " (got: $got expected: $expected)"
fi
done
}
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f $DEV > /dev/null
# mkfs.ext4 -F $DEV > /dev/null
# mkfs.f2fs -f $DEV > /dev/null
# mkfs.reiserfs -f $DEV > /dev/null
mount $DEV $MNT
touch $MNT/foobar
write_size=$((64 * 1024))
for ((i = 0; i < 16384; i++)); do
offset=$(($i * $write_size))
xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
blocks_used=$(stat -c %b $MNT/foobar)
# Fsync the file to trigger writeback and keep calling stat(2) on it
# to see if the number of blocks used changes.
stat_loop $MNT/foobar $blocks_used &
loop_pid=$!
xfs_io -c "fsync" $MNT/foobar
kill $loop_pid &> /dev/null
wait $loop_pid
done
umount $DEV
$ ./reproducer-2.sh
ERROR: unexpected used blocks (got: 265472 expected: 265344)
ERROR: unexpected used blocks (got: 284032 expected: 283904)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
-> Case 3
Another case where such problems happen is during other operations that
replace extents in a file range with other extents. Those operations are
extent cloning, deduplication and fallocate's zero range operation.
The cause of the problem is similar to the first case. When we drop the
extents from a range, we decrement the inode's number of bytes, and later
on, after inserting the new extents we increment it. Since this is not
done atomically, a concurrent stat(2) call can see and return a number of
used blocks that is smaller than it should be and does not match the number
of used blocks before or after the clone/deduplication/zero operation.
Like for the first case, when doing a clone, deduplication or zero range
operation against an entire file, we end up having a time window where we
can report 0 used blocks to a stat(2) call.
Example reproducer:
$ cat reproducer-3.sh
#!/bin/bash
MNT=/mnt/sdi
DEV=/dev/sdi
mkfs.btrfs -f $DEV > /dev/null
# mkfs.xfs -f -m reflink=1 $DEV > /dev/null
mount $DEV $MNT
extent_size=$((64 * 1024))
num_extents=16384
file_size=$(($extent_size * $num_extents))
# File foo has many small extents.
xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
> /dev/null
# File bar has far fewer extents and has exactly the same data as foo.
xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
expected=$(stat -c %b $MNT/foo)
# Now deduplicate bar into foo. While the deduplication is in progress,
# the number of used blocks/file size reported by stat should not change.
xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null &
dedupe_pid=$!
while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
used=$(stat -c %b $MNT/foo)
if [ $used -ne $expected ]; then
echo "Unexpected blocks used: $used (expected: $expected)"
fi
done
umount $DEV
$ ./reproducer-3.sh
Unexpected blocks used: 2076800 (expected: 2097152)
Unexpected blocks used: 2097024 (expected: 2097152)
Unexpected blocks used: 2079872 (expected: 2097152)
(...)
Note that since this is a short time window where the race can happen, the
reproducer may not be able to always trigger the bug in one run, or it may
trigger it multiple times.
So fix this by:
1) Making btrfs_drop_extents() not decrement the VFS inode's number of
bytes, and instead return the number of bytes;
2) Making any code that drops extents and adds new extents update the
inode's number of bytes atomically, while holding the btrfs inode's
spinlock, which is also used by the stat(2) callback to get the inode's
number of bytes;
3) For ranges in the inode's iotree that are marked as 'delalloc new',
corresponding to previously unallocated ranges, increment the inode's
number of bytes when clearing the 'delalloc new' bit from the range,
in the same critical section that decrements the inode's
'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
An alternative would be to have btrfs_getattr() wait for any IO (ordered
extents in progress) and lock the whole range (0 to (u64)-1) while it
computes the number of blocks used. But that would mean blocking
stat(2), which is a heavily used syscall that is expected to be fast, on
writes, clone/dedupe, fallocate, page reads, fiemap, etc.
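Point 2 of the fix can be illustrated with a small userspace sketch. It is not the btrfs implementation: struct sk_inode and the sk_* helpers are hypothetical, and a C11 atomic_flag stands in for the btrfs inode's spinlock. It only shows the idea that the dropped and the added byte counts are applied in one critical section, and that the reader takes the same lock, so it can never observe the intermediate (too small) value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of a btrfs inode. */
struct sk_inode {
	atomic_flag lock;	/* plays the role of the inode's spinlock */
	uint64_t nbytes;	/* bytes used, the source of stat(2)'s block count */
};

static void sk_lock(struct sk_inode *inode)
{
	/* Busy-wait until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(&inode->lock,
						 memory_order_acquire))
		;
}

static void sk_unlock(struct sk_inode *inode)
{
	atomic_flag_clear_explicit(&inode->lock, memory_order_release);
}

/*
 * Apply "dropped" and "added" together. The drop step is assumed to have
 * returned the number of bytes it removed instead of decrementing nbytes
 * on its own.
 */
static void sk_replace_extents(struct sk_inode *inode, uint64_t dropped,
			       uint64_t added)
{
	sk_lock(inode);
	assert(inode->nbytes >= dropped);
	inode->nbytes = inode->nbytes - dropped + added;
	sk_unlock(inode);
}

/* Reader side, as a stat(2) callback would do it: same lock, one read. */
static uint64_t sk_stat_bytes(struct sk_inode *inode)
{
	uint64_t val;

	sk_lock(inode);
	val = inode->nbytes;
	sk_unlock(inode);
	return val;
}
```

Overwriting a whole 64K file then goes from 65536 bytes to 65536 bytes in one step, instead of passing through 0 as in the first reproducer.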
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-11-04 11:07:34 +00:00
		 * At this point we can safely clear everything except the
		 * locked bit, the nodatasum bit and the delalloc new bit.
		 * The delalloc new bit will be cleared by ordered extent
		 * completion.
2009-09-24 00:28:46 +00:00
		 */
2017-10-31 15:30:47 +00:00
		ret = __clear_extent_bit(tree, start, end,
btrfs: update the number of bytes used by an inode atomically
2020-11-04 11:07:34 +00:00
			 ~(EXTENT_LOCKED | EXTENT_NODATASUM | EXTENT_DELALLOC_NEW),
			 0, 0, NULL, mask, NULL);
2011-02-14 17:52:08 +00:00

		/* if clear_extent_bit failed for enomem reasons,
		 * we can't allow the release to continue.
		 */
		if (ret < 0)
			ret = 0;
		else
			ret = 1;
2008-04-18 14:29:50 +00:00
	}
	return ret;
}

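The bit handling in try_release_extent_state() above comes down to one mask: everything is cleared except the locked, nodatasum and delalloc-new bits. A minimal sketch of that mask logic follows. The SK_EXTENT_* values and sk_try_release_state() are hypothetical (the real EXTENT_* bits are defined in the btrfs headers), a single word stands in for the extent state of the whole page range, and the ENOMEM path of the real function is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values, chosen only for this sketch. */
enum {
	SK_EXTENT_DIRTY		= 1u << 0,
	SK_EXTENT_UPTODATE	= 1u << 1,
	SK_EXTENT_LOCKED	= 1u << 2,
	SK_EXTENT_NODATASUM	= 1u << 3,
	SK_EXTENT_DELALLOC_NEW	= 1u << 4,
};

/*
 * Returns 0 if the range is locked (the page must not be released and the
 * state is left untouched). Otherwise clears every bit except the ones that
 * have to survive and returns 1.
 */
static int sk_try_release_state(uint32_t *state)
{
	const uint32_t keep = SK_EXTENT_LOCKED | SK_EXTENT_NODATASUM |
			      SK_EXTENT_DELALLOC_NEW;

	assert(state != NULL);
	if (*state & SK_EXTENT_LOCKED)
		return 0;
	/* Clearing ~keep is the same as keeping only the bits in keep. */
	*state &= keep;
	return 1;
}
```

Passing the complement of the bits to preserve, rather than a list of the bits to clear, means a state bit added later is cleared by default unless it is added to the mask.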
2008-01-24 21:13:08 +00:00
/*
 * a helper for releasepage. As long as there are no locked extents
 * in the range corresponding to the page, both state records and extent
 * map records are removed
 */
2018-04-19 07:46:34 +00:00
int try_release_extent_mapping(struct page *page, gfp_t mask)
2008-01-24 21:13:08 +00:00
{
	struct extent_map *em;
2012-12-21 09:17:45 +00:00
	u64 start = page_offset(page);
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
2016-04-01 12:29:47 +00:00
	u64 end = start + PAGE_SIZE - 1;
Btrfs: fix file data corruption after cloning a range and fsync
When we clone a range into a file we can end up dropping existing
extent maps (or trimming them) and replacing them with new ones if the
range to be cloned overlaps with a range in the destination inode.
When that happens we add the new extent maps to the list of modified
extents in the inode's extent map tree, so that a "fast" fsync (the flag
BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
and log corresponding extent items. However, at the end of range cloning
operation we do truncate all the pages in the affected range (in order to
ensure future reads will not get stale data). Sometimes this truncation
will release the corresponding extent maps besides the pages from the page
cache. If this happens, then a "fast" fsync operation will miss logging
some extent items, because it relies exclusively on the extent maps being
present in the inode's extent tree, leading to data loss/corruption if
the fsync ends up using the same transaction used by the clone operation
(that transaction was not committed in the meanwhile). An extent map is
released through the callback btrfs_invalidatepage(), which gets called by
truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
latter ends up calling try_release_extent_mapping() which will release the
extent map if some conditions are met, like the file size being greater
than 16Mb, gfp flags allowing blocking, the range not being locked (which
is the case during the clone operation) and the extent map not being
flagged as pinned (also the case for cloning).
The following example, turned into a test for fstests, reproduces the
issue:
$ mkfs.btrfs -f /dev/sdb
$ mount /dev/sdb /mnt
$ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
$ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
# reflink destination offset corresponds to the size of file bar,
# 2728Kb minus 4Kb.
$ xfs_io -c "reflink /mnt/foo 0 2724K 15908K" /mnt/bar
$ xfs_io -c "fsync" /mnt/bar
$ md5sum /mnt/bar
95a95813a8c2abc9aa75a6c2914a077e /mnt/bar
<power fail>
$ mount /dev/sdb /mnt
$ md5sum /mnt/bar
207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
# digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
# power failure
In the above example, the destination offset of the clone operation
corresponds to the size of the "bar" file minus 4Kb. So during the clone
operation, the extent map covering the range from 2572Kb to 2728Kb gets
trimmed so that it ends at offset 2724Kb, and a new extent map covering
the range from 2724Kb to 11724Kb is created. So at the end of the clone
operation when we ask to truncate the pages in the range from 2724Kb to
2724Kb + 15908Kb, the page invalidation callback ends up removing the new
extent map (through try_release_extent_mapping()) when the page at offset
2724Kb is passed to that callback.
Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
map is removed at try_release_extent_mapping(), forcing the next fsync to
search for modified extents in the fs/subvolume tree instead of relying on
the presence of extent maps in memory. This way we can continue doing a
"fast" fsync if the destination range of a clone operation does not
overlap with an existing range or if any of the criteria necessary to
remove an extent map at try_release_extent_mapping() is not met (file
size not bigger than 16Mb or gfp flags do not allow blocking).
CC: stable@vger.kernel.org # 3.16+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-07-12 00:36:43 +00:00
	struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host);
	struct extent_io_tree *tree = &btrfs_inode->io_tree;
	struct extent_map_tree *map = &btrfs_inode->extent_tree;
2008-04-18 14:29:50 +00:00

2015-11-07 00:28:21 +00:00
	if (gfpflags_allow_blocking(mask) &&
2015-12-14 16:42:10 +00:00
	    page->mapping->host->i_size > SZ_16M) {
2008-02-15 15:40:50 +00:00
		u64 len;
2008-01-29 14:59:12 +00:00
		while (start <= end) {
2020-07-22 11:28:52 +00:00
			struct btrfs_fs_info *fs_info;
			u64 cur_gen;

2008-02-15 15:40:50 +00:00
			len = end - start + 1;
2009-09-02 20:24:52 +00:00
			write_lock(&map->lock);
2008-02-15 15:40:50 +00:00
			em = lookup_extent_mapping(map, start, len);
2012-02-16 07:23:58 +00:00
			if (!em) {
2009-09-02 20:24:52 +00:00
				write_unlock(&map->lock);
2008-01-29 14:59:12 +00:00
				break;
			}
2008-07-18 16:01:11 +00:00
			if (test_bit(EXTENT_FLAG_PINNED, &em->flags) ||
			    em->start != start) {
2009-09-02 20:24:52 +00:00
				write_unlock(&map->lock);
2008-01-29 14:59:12 +00:00
				free_extent_map(em);
				break;
			}
btrfs: fix race between page release and a fast fsync
When releasing an extent map, done through the page release callback, we
can race with an ongoing fast fsync and cause the fsync to miss a new
extent and not log it. The steps for this to happen are the following:
1) A page is dirtied for some inode I;
2) Writeback for that page is triggered by a path other than fsync, for
example by the system due to memory pressure;
3) When the ordered extent for the extent (a single 4K page) finishes,
we unpin the corresponding extent map and set its generation to N,
the current transaction's generation;
4) The btrfs_releasepage() callback is invoked by the system due to
memory pressure for that no longer dirty page of inode I;
5) At the same time, some task calls fsync on inode I, joins transaction
N, and at btrfs_log_inode() it sees that the inode does not have the
full sync flag set, so we proceed with a fast fsync. But before we get
into btrfs_log_changed_extents() and lock the inode's extent map tree:
6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
and we remove the extent map for the new 4Kb extent, because it is
neither pinned anymore nor locked. By calling remove_extent_mapping(),
we remove the extent map from the list of modified extents, since the
extent map does not have the logging flag set. We unlock the inode's
extent map tree;
7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
locks the inode's extent map tree and iterates its list of modified
extents, which no longer has the 4Kb extent in it, so it does not log
the extent;
8) The fsync finishes;
9) Before transaction N is committed, a power failure happens. After
replaying the log, the 4K extent of inode I will be missing, since
it was not logged due to the race with try_release_extent_mapping().
So fix this by teaching try_release_extent_mapping() to not remove an
extent map if it's still in the list of modified extents.
Fixes: ff44c6e36dc9dc ("Btrfs: do not hold the write_lock on the extent tree while logging")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-22 11:28:37 +00:00
|
|
|
if (test_range_bit(tree, em->start,
|
|
|
|
extent_map_end(em) - 1,
|
|
|
|
EXTENT_LOCKED, 0, NULL))
|
|
|
|
goto next;
|
|
|
|
/*
|
|
|
|
* If it's not in the list of modified extents, used
|
|
|
|
* by a fast fsync, we can remove it. If it's being
|
|
|
|
* logged we can safely remove it since fsync took an
|
|
|
|
* extra reference on the em.
|
|
|
|
*/
|
|
|
|
if (list_empty(&em->list) ||
|
2020-07-22 11:28:52 +00:00
|
|
|
test_bit(EXTENT_FLAG_LOGGING, &em->flags))
|
|
|
|
goto remove_em;
|
|
|
|
/*
|
|
|
|
* If it's in the list of modified extents, remove it
|
|
|
|
* only if its generation is older than the current one,
|
|
|
|
* in which case we don't need it for a fast fsync.
|
|
|
|
* Otherwise don't remove it, we could be racing with an
|
|
|
|
* ongoing fast fsync that could miss the new extent.
|
|
|
|
*/
|
|
|
|
fs_info = btrfs_inode->root->fs_info;
|
|
|
|
spin_lock(&fs_info->trans_lock);
|
|
|
|
cur_gen = fs_info->generation;
|
|
|
|
spin_unlock(&fs_info->trans_lock);
|
|
|
|
if (em->generation >= cur_gen)
|
|
|
|
goto next;
|
|
|
|
remove_em:
|
btrfs: do not set the full sync flag on the inode during page release
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we always set the full
sync flag on the inode, which forces the next fsync to use a slower code
path.
This hurts performance for workloads that dirty an amount of data that
exceeds or is very close to the system's RAM and do frequent fsync
operations (as database servers can, for example). In particular, if there
are concurrent fsyncs against different files, by falling back to a full
fsync we do a lot more checksum lookups in the checksums btree, as we do
it for all the extents created in the current transaction, instead of only
the new ones since the last fsync. These checksums lookups not only take
some time but, more importantly, they also cause contention on the
checksums btree locks due to the concurrency with checksum insertions in
the btree by ordered extents from other inodes.
We actually don't need to set the full sync flag on the inode, because we
only remove extent maps that are in the list of modified extents if they
were created in a past transaction, in which case an fsync skips them as
it's pointless to log them. So stop setting the full fsync flag on the
inode whenever we remove an extent map.
This patch is part of a patchset that consists of 3 patches, which have
the following subjects:
1/3 btrfs: fix race between page release and a fast fsync
2/3 btrfs: release old extent maps during page release
3/3 btrfs: do not set the full sync flag on the inode during page release
Performance tests were run against a branch (misc-next) containing the
whole patchset. The test exercises a workload where there are multiple
processes writing to files and fsyncing them (each writing and fsyncing
its own file), and in total the amount of data dirtied ranges from 2x to
4x the system's RAM memory (16GiB), so that the page release callback is
invoked frequently.
The following script, using fio, was used to perform the tests:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.
The obtained results were the following, and the last line printed by
fio is pasted (includes aggregated throughput and test run time).
*****************************************************
**** 1 job, 32GiB file, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
After patchset:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
(+0.7% throughput, -0.8% run time)
*****************************************************
**** 2 jobs, 16GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
After patchset:
WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
(+19.1% throughput, -16.1% runtime)
*****************************************************
**** 4 jobs, 8GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
After patchset:
WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
(+37.8% throughput, -27.5% runtime)
*****************************************************
**** 8 jobs, 4GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
After patchset:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
(+74.7% throughput, -43.0% run time)
*****************************************************
**** 16 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
After patchset:
WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
(+146.3% throughput, -59.5% run time)
*****************************************************
**** 32 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
After patchset:
WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
(+212.7% throughput, -68.0% run time)
*****************************************************
**** 64 jobs, 1GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
After patchset:
WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
(+197.6% throughput, -66.4% run time)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-22 11:29:01 +00:00
|
|
|
/*
|
|
|
|
* We only remove extent maps that are not in the list of
|
|
|
|
* modified extents or that are in the list but with a
|
|
|
|
* generation lower than the current generation, so there
|
|
|
|
* is no need to set the full fsync flag on the inode (it
|
|
|
|
* hurts the fsync performance for workloads with a data
|
|
|
|
* size that exceeds or is close to the system's memory).
|
|
|
|
*/
|
2020-07-22 11:28:52 +00:00
|
|
|
remove_extent_mapping(map, em);
|
|
|
|
/* once for the rb tree */
|
|
|
|
free_extent_map(em);
|
2020-07-22 11:28:37 +00:00
|
|
|
next:
|
2008-01-29 14:59:12 +00:00
|
|
|
start = extent_map_end(em);
|
2009-09-02 20:24:52 +00:00
|
|
|
write_unlock(&map->lock);
|
2008-01-29 14:59:12 +00:00
|
|
|
|
|
|
|
/* once for us */
|
2008-01-24 21:13:08 +00:00
|
|
|
free_extent_map(em);
|
2020-05-08 21:15:37 +00:00
|
|
|
|
|
|
|
cond_resched(); /* Allow large-extent preemption. */
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
}
|
2018-04-19 07:46:35 +00:00
|
|
|
return try_release_extent_state(tree, page, mask);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
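The removal rule that try_release_extent_mapping() applies above can be sketched as a small userspace model. This is only an illustration under simplifying assumptions: struct em_model and em_model_can_release() are hypothetical names, the booleans stand in for the kernel's flag and list tests, and the model folds the loop's "break" and "goto next" outcomes into a single "do not release" answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the extent map state the checks read. */
struct em_model {
	bool pinned;           /* models EXTENT_FLAG_PINNED */
	bool range_locked;     /* models the EXTENT_LOCKED range test */
	bool in_modified_list; /* models !list_empty(&em->list) */
	bool logging;          /* models EXTENT_FLAG_LOGGING */
	uint64_t generation;   /* models em->generation */
};

/* Returns true when the model says the extent map may be removed. */
static bool em_model_can_release(const struct em_model *em, uint64_t cur_gen)
{
	assert(em != NULL);

	/* Pinned or locked extent maps are never removed. */
	if (em->pinned || em->range_locked)
		return false;

	/* Not tracked for a fast fsync, or already being logged. */
	if (!em->in_modified_list || em->logging)
		return true;

	/* Still in the modified list: only drop it if it is from an older generation. */
	return em->generation < cur_gen;
}
```

The last comparison is the one that closes the race: an extent map created in the current transaction stays in the list of modified extents, so a concurrent fast fsync can still find it.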
|
|
|
|
|
2011-02-23 21:23:20 +00:00
|
|
|
/*
|
|
|
|
* helper function for fiemap, which doesn't want to see any holes.
|
|
|
|
* This maps until we find something past 'last'
|
|
|
|
*/
|
2020-08-31 11:42:45 +00:00
|
|
|
static struct extent_map *get_extent_skip_holes(struct btrfs_inode *inode,
|
2017-06-23 02:09:57 +00:00
|
|
|
u64 offset, u64 last)
|
2011-02-23 21:23:20 +00:00
|
|
|
{
|
2020-08-31 11:42:45 +00:00
|
|
|
u64 sectorsize = btrfs_inode_sectorsize(inode);
|
2011-02-23 21:23:20 +00:00
|
|
|
struct extent_map *em;
|
|
|
|
u64 len;
|
|
|
|
|
|
|
|
if (offset >= last)
|
|
|
|
return NULL;
|
|
|
|
|
2013-10-31 05:03:04 +00:00
|
|
|
while (1) {
|
2011-02-23 21:23:20 +00:00
|
|
|
len = last - offset;
|
|
|
|
if (len == 0)
|
|
|
|
break;
|
2013-02-26 08:10:22 +00:00
|
|
|
len = ALIGN(len, sectorsize);
|
2020-08-31 11:42:45 +00:00
|
|
|
em = btrfs_get_extent_fiemap(inode, offset, len);
|
2011-04-19 16:00:01 +00:00
|
|
|
if (IS_ERR_OR_NULL(em))
|
2011-02-23 21:23:20 +00:00
|
|
|
return em;
|
|
|
|
|
|
|
|
/* if this isn't a hole return it */
|
2017-11-23 08:51:43 +00:00
|
|
|
if (em->block_start != EXTENT_MAP_HOLE)
|
2011-02-23 21:23:20 +00:00
|
|
|
return em;
|
|
|
|
|
|
|
|
/* this is a hole, advance to the next extent */
|
|
|
|
offset = extent_map_end(em);
|
|
|
|
free_extent_map(em);
|
|
|
|
if (offset >= last)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
return NULL;
|
|
|
|
}
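The hole-skipping loop above can be illustrated with a userspace sketch that walks a plain array instead of calling btrfs_get_extent_fiemap(). The names extent_model, lookup_model() and skip_holes_model() are hypothetical, and error handling and sector alignment are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent map: a byte range that may be a hole. */
struct extent_model {
	uint64_t start;
	uint64_t len;
	bool hole; /* models em->block_start == EXTENT_MAP_HOLE */
};

/* Index of the extent containing offset, or -1 if none does. */
static int lookup_model(const struct extent_model *map, size_t n, uint64_t offset)
{
	for (size_t i = 0; i < n; i++)
		if (offset >= map[i].start && offset - map[i].start < map[i].len)
			return (int)i;
	return -1;
}

/* Index of the first non-hole extent found from offset before last, or -1. */
static int skip_holes_model(const struct extent_model *map, size_t n,
			    uint64_t offset, uint64_t last)
{
	assert(map != NULL);

	while (offset < last) {
		int i = lookup_model(map, n, offset);

		if (i < 0)
			return -1;
		if (!map[i].hole)
			return i;
		/* This is a hole, advance to the end of it and look again. */
		offset = map[i].start + map[i].len;
	}
	return -1;
}
```

As in the kernel helper, the search gives up once the offset reaches 'last', so a range that contains only holes yields no extent.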
|
|
|
|
|
btrfs: fiemap: Cache and merge fiemap extent before submit it to user
[BUG]
Cycle mounting btrfs can cause fiemap to return a different result.
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
# umount /mnt/btrfs
# mount /dev/vdb5 /mnt/btrfs
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..31]: 25088..25119 32 0x0
1: [32..63]: 25120..25151 32 0x0
2: [64..95]: 25152..25183 32 0x0
3: [96..127]: 25184..25215 32 0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
[REASON]
Btrfs will try to merge extent map when inserting new extent map.
btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=0 len=64k)
   |        |  Found on-disk (ino, EXTENT_DATA, 0)
   |        |- add_extent_mapping()
   |        |- Return (em->start=0, len=16k)
   |
   |- fiemap_fill_next_extent(logic=0 phys=X len=16k)
   |
   |- get_extent_skip_holes(start=0 len=64k)
   |  |- btrfs_get_extent_fiemap(start=0 len=64k)
   |     |- btrfs_get_extent(start=16k len=48k)
   |        |  Found on-disk (ino, EXTENT_DATA, 16k)
   |        |- add_extent_mapping()
   |        |  |- try_merge_map()
   |        |     Merge with previous em start=0 len=16k
   |        |     resulting em start=0 len=32k
   |        |- Return (em->start=0, len=32K)    << Merged result
   |- Strip off the unrelated range (0~16K) of return em
   |- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
      ^^^ Causing split fiemap extent.
And since in add_extent_mapping(), em is already merged, in next
fiemap() call, we will get merged result.
[FIX]
Here we introduce a new structure, fiemap_cache, which records previous
fiemap extent.
We always try to merge the current fiemap extent with the cached one
before calling fiemap_fill_next_extent().
Only when the current fiemap extent cannot be merged with the cached one
do we call fiemap_fill_next_extent() to submit the cached one.
This way, all mergeable fiemap extents get merged.
It can also be done in fs/ioctl.c, however the problem is if
fieinfo->fi_extents_max == 0, we have no space to cache previous fiemap
extent.
So I choose to merge it in btrfs.
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-07 02:43:15 +00:00
|
|
|
/*
|
|
|
|
* To cache previous fiemap extent
|
|
|
|
*
|
|
|
|
* Will be used for merging fiemap extent
|
|
|
|
*/
|
|
|
|
struct fiemap_cache {
|
|
|
|
u64 offset;
|
|
|
|
u64 phys;
|
|
|
|
u64 len;
|
|
|
|
u32 flags;
|
|
|
|
bool cached;
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper to submit fiemap extent.
|
|
|
|
*
|
|
|
|
* Will try to merge current fiemap extent specified by @offset, @phys,
|
|
|
|
* @len and @flags with cached one.
|
|
|
|
* And only when we fail to merge, cached one will be submitted as
|
|
|
|
* fiemap extent.
|
|
|
|
*
|
|
|
|
* Return value is the same as fiemap_fill_next_extent().
|
|
|
|
*/
|
|
|
|
static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
|
|
|
|
struct fiemap_cache *cache,
|
|
|
|
u64 offset, u64 phys, u64 len, u32 flags)
|
|
|
|
{
|
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
if (!cache->cached)
|
|
|
|
goto assign;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sanity check, extent_fiemap() should have ensured that new
|
2018-11-28 11:05:13 +00:00
|
|
|
* fiemap extent won't overlap with cached one.
|
2017-04-07 02:43:15 +00:00
|
|
|
* Not recoverable.
|
|
|
|
*
|
|
|
|
* NOTE: Physical address can overlap, due to compression
|
|
|
|
*/
|
|
|
|
if (cache->offset + cache->len > offset) {
|
|
|
|
WARN_ON(1);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Only merges fiemap extents if
|
|
|
|
* 1) Their logical addresses are continuous
|
|
|
|
*
|
|
|
|
* 2) Their physical addresses are continuous
|
|
|
|
* So truly compressed (physical size smaller than logical size)
|
|
|
|
* extents won't get merged with each other
|
|
|
|
*
|
|
|
|
* 3) Share same flags except FIEMAP_EXTENT_LAST
|
|
|
|
* So regular extent won't get merged with prealloc extent
|
|
|
|
*/
|
|
|
|
if (cache->offset + cache->len == offset &&
|
|
|
|
cache->phys + cache->len == phys &&
|
|
|
|
(cache->flags & ~FIEMAP_EXTENT_LAST) ==
|
|
|
|
(flags & ~FIEMAP_EXTENT_LAST)) {
|
|
|
|
cache->len += len;
|
|
|
|
cache->flags |= flags;
|
|
|
|
goto try_submit_last;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Not mergeable, need to submit cached one */
|
|
|
|
ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
|
|
|
|
cache->len, cache->flags);
|
|
|
|
cache->cached = false;
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
assign:
|
|
|
|
cache->cached = true;
|
|
|
|
cache->offset = offset;
|
|
|
|
cache->phys = phys;
|
|
|
|
cache->len = len;
|
|
|
|
cache->flags = flags;
|
|
|
|
try_submit_last:
|
|
|
|
if (cache->flags & FIEMAP_EXTENT_LAST) {
|
|
|
|
ret = fiemap_fill_next_extent(fieinfo, cache->offset,
|
|
|
|
cache->phys, cache->len, cache->flags);
|
|
|
|
cache->cached = false;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
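The cache-and-merge behaviour of emit_fiemap_extent() can be tried out in userspace with the sketch below. It is a model under stated assumptions, not the kernel code: sink_model and sink_fill() are hypothetical replacements for fiemap_extent_info and fiemap_fill_next_extent(), MODEL_EXTENT_LAST mirrors the FIEMAP_EXTENT_LAST bit (0x1), and sink_fill() simply returns 0 on success and 1 when its fixed buffer is full.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_EXTENT_LAST 0x1u /* mirrors FIEMAP_EXTENT_LAST */
#define MODEL_SINK_MAX 16

struct cache_model {
	uint64_t offset, phys, len;
	uint32_t flags;
	bool cached;
};

/* Hypothetical destination for emitted extents. */
struct sink_model {
	struct cache_model out[MODEL_SINK_MAX];
	int count;
};

/* Simplified stand-in for fiemap_fill_next_extent(). */
static int sink_fill(struct sink_model *s, const struct cache_model *c)
{
	if (s->count >= MODEL_SINK_MAX)
		return 1;
	s->out[s->count++] = *c;
	return 0;
}

static int emit_model(struct sink_model *s, struct cache_model *c,
		      uint64_t offset, uint64_t phys, uint64_t len, uint32_t flags)
{
	int ret = 0;

	assert(s != NULL && c != NULL);
	if (c->cached) {
		/* A new extent must not overlap the cached one. */
		if (c->offset + c->len > offset)
			return -EINVAL;
		/* Merge when logically and physically adjacent with equal flags. */
		if (c->offset + c->len == offset && c->phys + c->len == phys &&
		    (c->flags & ~MODEL_EXTENT_LAST) == (flags & ~MODEL_EXTENT_LAST)) {
			c->len += len;
			c->flags |= flags;
			goto try_submit_last;
		}
		/* Not mergeable: submit the cached extent first. */
		ret = sink_fill(s, c);
		c->cached = false;
		if (ret)
			return ret;
	}
	c->cached = true;
	c->offset = offset;
	c->phys = phys;
	c->len = len;
	c->flags = flags;
try_submit_last:
	if (c->flags & MODEL_EXTENT_LAST) {
		ret = sink_fill(s, c);
		c->cached = false;
	}
	return ret;
}
```

Feeding the model four adjacent 16K extents, with the last one flagged as the final extent, produces a single 64K entry in the sink, which matches the merged fiemap output the commit message describes.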
|
|
|
|
|
|
|
|
/*
|
2017-06-22 02:01:21 +00:00
|
|
|
* Emit last fiemap cache
|
2017-04-07 02:43:15 +00:00
|
|
|
*
|
2017-06-22 02:01:21 +00:00
|
|
|
* The last fiemap cache may still be cached in the following case:
|
|
|
|
* 0 4k 8k
|
|
|
|
* |<- Fiemap range ->|
|
|
|
|
* |<------------ First extent ----------->|
|
|
|
|
*
|
|
|
|
* In this case, the first extent range will be cached but not emitted.
|
|
|
|
* So we must emit it before ending extent_fiemap().
|
2017-04-07 02:43:15 +00:00
|
|
|
*/
|
2019-03-20 10:29:46 +00:00
|
|
|
static int emit_last_fiemap_cache(struct fiemap_extent_info *fieinfo,
|
2017-06-22 02:01:21 +00:00
|
|
|
struct fiemap_cache *cache)
|
2017-04-07 02:43:15 +00:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (!cache->cached)
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
|
|
|
|
cache->len, cache->flags);
|
|
|
|
cache->cached = false;
|
|
|
|
if (ret > 0)
|
|
|
|
ret = 0;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-08-31 11:42:49 +00:00
|
|
|
int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
|
2020-06-23 18:56:12 +00:00
|
|
|
u64 start, u64 len)
|
2009-01-21 19:39:14 +00:00
|
|
|
{
|
2010-11-23 19:36:57 +00:00
|
|
|
int ret = 0;
|
2009-01-21 19:39:14 +00:00
|
|
|
u64 off = start;
|
|
|
|
u64 max = start + len;
|
|
|
|
u32 flags = 0;
|
2010-11-23 19:36:57 +00:00
|
|
|
u32 found_type;
|
|
|
|
u64 last;
|
2011-02-23 21:23:20 +00:00
|
|
|
u64 last_for_get_extent = 0;
|
2009-01-21 19:39:14 +00:00
|
|
|
u64 disko = 0;
|
2020-08-31 11:42:49 +00:00
|
|
|
u64 isize = i_size_read(&inode->vfs_inode);
|
2010-11-23 19:36:57 +00:00
|
|
|
struct btrfs_key found_key;
|
2009-01-21 19:39:14 +00:00
|
|
|
struct extent_map *em = NULL;
|
2010-02-03 19:33:23 +00:00
|
|
|
struct extent_state *cached_state = NULL;
|
2010-11-23 19:36:57 +00:00
|
|
|
struct btrfs_path *path;
|
2020-08-31 11:42:49 +00:00
|
|
|
struct btrfs_root *root = inode->root;
|
btrfs: fiemap: Cache and merge fiemap extent before submit it to user
[BUG]
Cycle mount btrfs can cause fiemap to return different result.
Like:
# mount /dev/vdb5 /mnt/btrfs
# dd if=/dev/zero bs=16K count=4 oflag=dsync of=/mnt/btrfs/file
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
# umount /mnt/btrfs
# mount /dev/vdb5 /mnt/btrfs
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..31]: 25088..25119 32 0x0
1: [32..63]: 25120..25151 32 0x0
2: [64..95]: 25152..25183 32 0x0
3: [96..127]: 25184..25215 32 0x1
But after above fiemap, we get correct merged result if we call fiemap
again.
# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/test/file:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 25088..25215 128 0x1
[REASON]
Btrfs will try to merge extent map when inserting new extent map.
btrfs_fiemap(start=0 len=(u64)-1)
|- extent_fiemap(start=0 len=(u64)-1)
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=0 len=64k)
| | Found on-disk (ino, EXTENT_DATA, 0)
| |- add_extent_mapping()
| |- Return (em->start=0, len=16k)
|
|- fiemap_fill_next_extent(logic=0 phys=X len=16k)
|
|- get_extent_skip_holes(start=0 len=64k)
| |- btrfs_get_extent_fiemap(start=0 len=64k)
| |- btrfs_get_extent(start=16k len=48k)
| | Found on-disk (ino, EXTENT_DATA, 16k)
| |- add_extent_mapping()
| | |- try_merge_map()
| | Merge with previous em start=0 len=16k
| | resulting em start=0 len=32k
| |- Return (em->start=0, len=32K) << Merged result
|- Strip off the unrelated range (0~16K) of the returned em
|- fiemap_fill_next_extent(logic=16K phys=X+16K len=16K)
^^^ Causing split fiemap extent.
And since the em is already merged in add_extent_mapping(), the next
fiemap() call returns the merged result.
[FIX]
Here we introduce a new structure, fiemap_cache, which records the previous
fiemap extent.
We always try to merge the current fiemap extent into the cached one before
calling fiemap_fill_next_extent().
Only when the current fiemap extent cannot be merged with the cached one do
we call fiemap_fill_next_extent() to submit the cached one.
This way, all mergeable fiemap extents get merged.
It could also be done in fs/ioctl.c, but the problem is that if
fieinfo->fi_extents_max == 0, we have no space to cache the previous fiemap
extent.
So I chose to merge it in btrfs.
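The merge-before-submit idea described in [FIX] can be sketched in plain userspace C. This is a minimal illustration, not the btrfs code: `extent_rec`, `merge_cache`, `extent_sink`, `cache_add()` and `cache_flush()` are hypothetical names, and the merge rule used here (logically and physically contiguous, identical flags) is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration only; not the kernel structures. */
struct extent_rec {
	uint64_t logical;   /* file offset of the extent */
	uint64_t physical;  /* disk offset of the extent */
	uint64_t len;       /* length in bytes */
	uint32_t flags;
};

struct merge_cache {
	struct extent_rec rec;  /* the extent held back so far */
	bool cached;            /* true when rec holds a pending extent */
};

/* Stand-in for the consumer that fiemap_fill_next_extent() feeds. */
struct extent_sink {
	struct extent_rec out[16];
	size_t count;
};

static void sink_emit(struct extent_sink *sink, const struct extent_rec *rec)
{
	assert(sink->count < 16);
	sink->out[sink->count++] = *rec;
}

/*
 * Try to merge rec into the cached extent. Only when merging is not
 * possible is the cached extent submitted and replaced by rec.
 * Assumed merge rule: contiguous in file and on disk, same flags.
 */
static void cache_add(struct merge_cache *cache, struct extent_sink *sink,
		      struct extent_rec rec)
{
	if (cache->cached &&
	    cache->rec.logical + cache->rec.len == rec.logical &&
	    cache->rec.physical + cache->rec.len == rec.physical &&
	    cache->rec.flags == rec.flags) {
		cache->rec.len += rec.len;
		return;
	}
	if (cache->cached)
		sink_emit(sink, &cache->rec);
	cache->rec = rec;
	cache->cached = true;
}

/* Submit whatever is still held back once iteration is finished. */
static void cache_flush(struct merge_cache *cache, struct extent_sink *sink)
{
	if (cache->cached) {
		sink_emit(sink, &cache->rec);
		cache->cached = false;
	}
}
```

Feeding four contiguous 16K extents through `cache_add()` and then calling `cache_flush()` leaves a single 64K extent in the sink, which mirrors the merged output shown in the [BUG] section.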
Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-04-07 02:43:15 +00:00
|
|
|
struct fiemap_cache cache = { 0 };
|
2019-05-15 13:31:04 +00:00
|
|
|
struct ulist *roots;
|
|
|
|
struct ulist *tmp_ulist;
|
2009-01-21 19:39:14 +00:00
|
|
|
int end = 0;
|
2011-02-23 21:23:20 +00:00
|
|
|
u64 em_start = 0;
|
|
|
|
u64 em_len = 0;
|
|
|
|
u64 em_end = 0;
|
2009-01-21 19:39:14 +00:00
|
|
|
|
|
|
|
if (len == 0)
|
|
|
|
return -EINVAL;
|
|
|
|
|
2010-11-23 19:36:57 +00:00
|
|
|
path = btrfs_alloc_path();
|
|
|
|
if (!path)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
2019-05-15 13:31:04 +00:00
|
|
|
roots = ulist_alloc(GFP_KERNEL);
|
|
|
|
tmp_ulist = ulist_alloc(GFP_KERNEL);
|
|
|
|
if (!roots || !tmp_ulist) {
|
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out_free_ulist;
|
|
|
|
}
|
|
|
|
|
2020-08-31 11:42:49 +00:00
|
|
|
start = round_down(start, btrfs_inode_sectorsize(inode));
|
|
|
|
len = round_up(max, btrfs_inode_sectorsize(inode)) - start;
|
2011-11-17 16:34:31 +00:00
|
|
|
|
2011-02-23 21:23:20 +00:00
|
|
|
/*
|
|
|
|
* lookup the last file extent. We're not using i_size here
|
|
|
|
* because there might be preallocation past i_size
|
|
|
|
*/
|
2020-08-31 11:42:49 +00:00
|
|
|
ret = btrfs_lookup_file_extent(NULL, root, path, btrfs_ino(inode), -1,
|
|
|
|
0);
|
2010-11-23 19:36:57 +00:00
|
|
|
if (ret < 0) {
|
2019-05-15 13:31:04 +00:00
|
|
|
goto out_free_ulist;
|
2016-05-18 00:21:48 +00:00
|
|
|
} else {
|
|
|
|
WARN_ON(!ret);
|
|
|
|
if (ret == 1)
|
|
|
|
ret = 0;
|
2010-11-23 19:36:57 +00:00
|
|
|
}
|
2016-05-18 00:21:48 +00:00
|
|
|
|
2010-11-23 19:36:57 +00:00
|
|
|
path->slots[0]--;
|
|
|
|
btrfs_item_key_to_cpu(path->nodes[0], &found_key, path->slots[0]);
|
2014-06-04 16:41:45 +00:00
|
|
|
found_type = found_key.type;
|
2010-11-23 19:36:57 +00:00
|
|
|
|
2011-02-23 21:23:20 +00:00
|
|
|
/* No extents, but there might be delalloc bits */
|
2020-08-31 11:42:49 +00:00
|
|
|
if (found_key.objectid != btrfs_ino(inode) ||
|
2010-11-23 19:36:57 +00:00
|
|
|
found_type != BTRFS_EXTENT_DATA_KEY) {
|
2011-02-23 21:23:20 +00:00
|
|
|
/* have to trust i_size as the end */
|
|
|
|
last = (u64)-1;
|
|
|
|
last_for_get_extent = isize;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* remember the start of the last extent. There are a
|
|
|
|
* bunch of different factors that go into the length of the
|
|
|
|
* extent, so it's much less complex to remember where it started
|
|
|
|
*/
|
|
|
|
last = found_key.offset;
|
|
|
|
last_for_get_extent = last + 1;
|
2010-11-23 19:36:57 +00:00
|
|
|
}
|
2013-09-22 04:54:23 +00:00
|
|
|
btrfs_release_path(path);
|
2010-11-23 19:36:57 +00:00
|
|
|
|
2011-02-23 21:23:20 +00:00
|
|
|
/*
|
|
|
|
* we might have some extents allocated but more delalloc past those
|
|
|
|
* extents. So we trust isize unless the start of the last extent is
|
|
|
|
* beyond isize
|
|
|
|
*/
|
|
|
|
if (last < isize) {
|
|
|
|
last = (u64)-1;
|
|
|
|
last_for_get_extent = isize;
|
|
|
|
}
|
|
|
|
|
2020-08-31 11:42:49 +00:00
|
|
|
lock_extent_bits(&inode->io_tree, start, start + len - 1,
|
2012-03-01 13:57:19 +00:00
|
|
|
&cached_state);
|
2011-02-23 21:23:20 +00:00
|
|
|
|
2020-08-31 11:42:49 +00:00
|
|
|
em = get_extent_skip_holes(inode, start, last_for_get_extent);
|
2009-01-21 19:39:14 +00:00
|
|
|
if (!em)
|
|
|
|
goto out;
|
|
|
|
if (IS_ERR(em)) {
|
|
|
|
ret = PTR_ERR(em);
|
|
|
|
goto out;
|
|
|
|
}
|
2010-11-23 19:36:57 +00:00
|
|
|
|
2009-01-21 19:39:14 +00:00
|
|
|
while (!end) {
|
2013-07-05 17:52:51 +00:00
|
|
|
u64 offset_in_extent = 0;
|
2011-03-08 16:54:40 +00:00
|
|
|
|
|
|
|
/* break if the extent we found is outside the range */
|
|
|
|
if (em->start >= max || extent_map_end(em) < off)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* get_extent may return an extent that starts before our
|
|
|
|
* requested range. We have to make sure the ranges
|
|
|
|
* we return to fiemap always move forward and don't
|
|
|
|
* overlap, so adjust the offsets here
|
|
|
|
*/
|
|
|
|
em_start = max(em->start, off);
|
2009-01-21 19:39:14 +00:00
|
|
|
|
2011-03-08 16:54:40 +00:00
|
|
|
/*
|
|
|
|
* record the offset from the start of the extent
|
2013-07-05 17:52:51 +00:00
|
|
|
* for adjusting the disk offset below. Only do this if the
|
|
|
|
* extent isn't compressed since our in ram offset may be past
|
|
|
|
* what we have actually allocated on disk.
|
2011-03-08 16:54:40 +00:00
|
|
|
*/
|
2013-07-05 17:52:51 +00:00
|
|
|
if (!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
|
|
|
|
offset_in_extent = em_start - em->start;
|
2011-02-23 21:23:20 +00:00
|
|
|
em_end = extent_map_end(em);
|
2011-03-08 16:54:40 +00:00
|
|
|
em_len = em_end - em_start;
|
2009-01-21 19:39:14 +00:00
|
|
|
flags = 0;
|
2018-06-20 09:02:30 +00:00
|
|
|
if (em->block_start < EXTENT_MAP_LAST_BYTE)
|
|
|
|
disko = em->block_start + offset_in_extent;
|
|
|
|
else
|
|
|
|
disko = 0;
|
2009-01-21 19:39:14 +00:00
|
|
|
|
2011-03-08 16:54:40 +00:00
|
|
|
/*
|
|
|
|
* bump off for our next call to get_extent
|
|
|
|
*/
|
|
|
|
off = extent_map_end(em);
|
|
|
|
if (off >= max)
|
|
|
|
end = 1;
|
|
|
|
|
2009-04-03 14:33:45 +00:00
|
|
|
if (em->block_start == EXTENT_MAP_LAST_BYTE) {
|
2009-01-21 19:39:14 +00:00
|
|
|
end = 1;
|
|
|
|
flags |= FIEMAP_EXTENT_LAST;
|
2009-04-03 14:33:45 +00:00
|
|
|
} else if (em->block_start == EXTENT_MAP_INLINE) {
|
2009-01-21 19:39:14 +00:00
|
|
|
flags |= (FIEMAP_EXTENT_DATA_INLINE |
|
|
|
|
FIEMAP_EXTENT_NOT_ALIGNED);
|
2009-04-03 14:33:45 +00:00
|
|
|
} else if (em->block_start == EXTENT_MAP_DELALLOC) {
|
2009-01-21 19:39:14 +00:00
|
|
|
flags |= (FIEMAP_EXTENT_DELALLOC |
|
|
|
|
FIEMAP_EXTENT_UNKNOWN);
|
2014-09-10 20:20:45 +00:00
|
|
|
} else if (fieinfo->fi_extents_max) {
|
|
|
|
u64 bytenr = em->block_start -
|
|
|
|
(em->start - em->orig_start);
|
2013-09-22 04:54:23 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* As btrfs supports shared space, this information
|
|
|
|
* can be exported to userspace tools via
|
2014-09-10 20:20:45 +00:00
|
|
|
* flag FIEMAP_EXTENT_SHARED. If fi_extents_max == 0
|
|
|
|
* then we're just getting a count and we can skip the
|
|
|
|
* lookup stuff.
|
2013-09-22 04:54:23 +00:00
|
|
|
*/
|
2020-08-31 11:42:49 +00:00
|
|
|
ret = btrfs_check_shared(root, btrfs_ino(inode),
|
2019-05-15 13:31:04 +00:00
|
|
|
bytenr, roots, tmp_ulist);
|
2014-09-10 20:20:45 +00:00
|
|
|
if (ret < 0)
|
2013-09-22 04:54:23 +00:00
|
|
|
goto out_free;
|
2014-09-10 20:20:45 +00:00
|
|
|
if (ret)
|
2013-09-22 04:54:23 +00:00
|
|
|
flags |= FIEMAP_EXTENT_SHARED;
|
2014-09-10 20:20:45 +00:00
|
|
|
ret = 0;
|
2009-01-21 19:39:14 +00:00
|
|
|
}
|
|
|
|
if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags))
|
|
|
|
flags |= FIEMAP_EXTENT_ENCODED;
|
2015-05-19 14:44:04 +00:00
|
|
|
if (test_bit(EXTENT_FLAG_PREALLOC, &em->flags))
|
|
|
|
flags |= FIEMAP_EXTENT_UNWRITTEN;
|
2009-01-21 19:39:14 +00:00
|
|
|
|
|
|
|
free_extent_map(em);
|
|
|
|
em = NULL;
|
2011-02-23 21:23:20 +00:00
|
|
|
if ((em_start >= last) || em_len == (u64)-1 ||
|
|
|
|
(last == (u64)-1 && isize <= em_end)) {
|
2009-01-21 19:39:14 +00:00
|
|
|
flags |= FIEMAP_EXTENT_LAST;
|
|
|
|
end = 1;
|
|
|
|
}
|
|
|
|
|
2011-02-23 21:23:20 +00:00
|
|
|
/* now scan forward to see if this is really the last extent. */
|
2020-08-31 11:42:49 +00:00
|
|
|
em = get_extent_skip_holes(inode, off, last_for_get_extent);
|
2011-02-23 21:23:20 +00:00
|
|
|
if (IS_ERR(em)) {
|
|
|
|
ret = PTR_ERR(em);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
if (!em) {
|
2010-11-23 19:36:57 +00:00
|
|
|
flags |= FIEMAP_EXTENT_LAST;
|
|
|
|
end = 1;
|
|
|
|
}
|
2017-04-07 02:43:15 +00:00
|
|
|
ret = emit_fiemap_extent(fieinfo, &cache, em_start, disko,
|
|
|
|
em_len, flags);
|
2015-03-24 22:12:56 +00:00
|
|
|
if (ret) {
|
|
|
|
if (ret == 1)
|
|
|
|
ret = 0;
|
2011-02-23 21:23:20 +00:00
|
|
|
goto out_free;
|
2015-03-24 22:12:56 +00:00
|
|
|
}
|
2009-01-21 19:39:14 +00:00
|
|
|
}
|
|
|
|
out_free:
|
2017-04-07 02:43:15 +00:00
|
|
|
if (!ret)
|
2019-03-20 10:29:46 +00:00
|
|
|
ret = emit_last_fiemap_cache(fieinfo, &cache);
|
2009-01-21 19:39:14 +00:00
|
|
|
free_extent_map(em);
|
|
|
|
out:
|
2020-08-31 11:42:49 +00:00
|
|
|
unlock_extent_cached(&inode->io_tree, start, start + len - 1,
|
2017-12-12 20:43:52 +00:00
|
|
|
&cached_state);
|
2019-05-15 13:31:04 +00:00
|
|
|
|
|
|
|
out_free_ulist:
|
2019-07-05 07:26:24 +00:00
|
|
|
btrfs_free_path(path);
|
2019-05-15 13:31:04 +00:00
|
|
|
ulist_free(roots);
|
|
|
|
ulist_free(tmp_ulist);
|
2009-01-21 19:39:14 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2010-08-06 17:21:20 +00:00
|
|
|
static void __free_extent_buffer(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
kmem_cache_free(extent_buffer_cache, eb);
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
int extent_buffer_under_io(const struct extent_buffer *eb)
|
2013-08-07 18:54:37 +00:00
|
|
|
{
|
|
|
|
return (atomic_read(&eb->io_pages) ||
|
|
|
|
test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags) ||
|
|
|
|
test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2018-07-19 15:24:32 +00:00
|
|
|
* Release all pages attached to the extent buffer.
|
2013-08-07 18:54:37 +00:00
|
|
|
*/
|
2018-07-19 15:24:32 +00:00
|
|
|
static void btrfs_release_extent_buffer_pages(struct extent_buffer *eb)
|
2013-08-07 18:54:37 +00:00
|
|
|
{
|
2018-06-27 13:38:22 +00:00
|
|
|
int i;
|
|
|
|
int num_pages;
|
2018-06-27 13:38:24 +00:00
|
|
|
int mapped = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
|
2013-08-07 18:54:37 +00:00
|
|
|
|
|
|
|
BUG_ON(extent_buffer_under_io(eb));
|
|
|
|
|
2018-06-27 13:38:22 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
|
|
|
for (i = 0; i < num_pages; i++) {
|
|
|
|
struct page *page = eb->pages[i];
|
2013-08-07 18:54:37 +00:00
|
|
|
|
2015-02-09 09:31:45 +00:00
|
|
|
if (!page)
|
|
|
|
continue;
|
|
|
|
if (mapped)
|
2013-08-07 18:54:37 +00:00
|
|
|
spin_lock(&page->mapping->private_lock);
|
2015-02-09 09:31:45 +00:00
|
|
|
/*
|
|
|
|
* We do this since we'll remove the pages after we've
|
|
|
|
* removed the eb from the radix tree, so we could race
|
|
|
|
* and have this page now attached to the new eb. So
|
|
|
|
* only clear page_private if it's still connected to
|
|
|
|
* this eb.
|
|
|
|
*/
|
|
|
|
if (PagePrivate(page) &&
|
|
|
|
page->private == (unsigned long)eb) {
|
|
|
|
BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
|
|
|
|
BUG_ON(PageDirty(page));
|
|
|
|
BUG_ON(PageWriteback(page));
|
2013-08-07 18:54:37 +00:00
|
|
|
/*
|
2015-02-09 09:31:45 +00:00
|
|
|
* We need to make sure we haven't been attached
|
|
|
|
* to a new eb.
|
2013-08-07 18:54:37 +00:00
|
|
|
*/
|
2020-06-02 04:47:45 +00:00
|
|
|
detach_page_private(page);
|
2013-08-07 18:54:37 +00:00
|
|
|
}
|
2015-02-09 09:31:45 +00:00
|
|
|
|
|
|
|
if (mapped)
|
|
|
|
spin_unlock(&page->mapping->private_lock);
|
|
|
|
|
2016-05-20 01:18:45 +00:00
|
|
|
/* One for when we allocated the page */
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely that it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in the page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
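The first two rewrite rules above rely on the shift difference being zero. A minimal sketch of why the old and new forms agree, assuming 4 KiB pages and a PAGE_CACHE_SHIFT defined as equal to PAGE_SHIFT (both values are set here for illustration; `index_old()` and `index_new()` are hypothetical helper names):

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB pages, and a page cache unit that is
 * exactly one page, as the message above describes. */
#define PAGE_SHIFT       12
#define PAGE_CACHE_SHIFT PAGE_SHIFT

static_assert(PAGE_CACHE_SHIFT == PAGE_SHIFT,
	      "the rewrite is only valid when the shift difference is zero");

/* Old style: convert a page index to a page cache index. */
static unsigned long index_old(unsigned long page_index)
{
	return page_index >> (PAGE_CACHE_SHIFT - PAGE_SHIFT);
}

/* New style, after applying "<foo> >> (...) -> <foo>". */
static unsigned long index_new(unsigned long page_index)
{
	return page_index;
}
```

Because the shift amount is zero, both helpers return their argument unchanged for every input.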
This patch contains automated changes generated with coccinelle using the
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
put_page(page);
|
2018-06-27 13:38:22 +00:00
|
|
|
}
|
2013-08-07 18:54:37 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Helper for releasing the extent buffer.
|
|
|
|
*/
|
|
|
|
static inline void btrfs_release_extent_buffer(struct extent_buffer *eb)
|
|
|
|
{
|
2018-07-19 15:24:32 +00:00
|
|
|
btrfs_release_extent_buffer_pages(eb);
|
2020-02-14 21:11:42 +00:00
|
|
|
btrfs_leak_debug_del(&eb->fs_info->eb_leak_lock, &eb->leak_list);
|
2013-08-07 18:54:37 +00:00
|
|
|
__free_extent_buffer(eb);
|
|
|
|
}
|
|
|
|
|
2013-12-16 18:24:27 +00:00
|
|
|
static struct extent_buffer *
|
|
|
|
__alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
|
2014-06-15 00:55:29 +00:00
|
|
|
unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb = NULL;
|
|
|
|
|
2015-08-19 12:17:40 +00:00
|
|
|
eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
|
2008-01-24 21:13:08 +00:00
|
|
|
eb->start = start;
|
|
|
|
eb->len = len;
|
2013-12-16 18:24:27 +00:00
|
|
|
eb->fs_info = fs_info;
|
2012-05-16 15:00:02 +00:00
|
|
|
eb->bflags = 0;
|
btrfs: switch extent buffer tree lock to rw_semaphore
Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree. For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon. At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.
However now the opposite is the case. There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order. This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases. The rw_semaphores actually do this properly, and also
have adaptive spinning that works relatively well.
The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.
So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything. This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions. The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree. However, on my test system which is an 80 core single
socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
following results (with all debug options off):
dbench 200 baseline
Throughput 216.056 MB/sec 200 clients 200 procs max_latency=1471.197 ms
dbench 200 with patch
Throughput 737.188 MB/sec 200 clients 200 procs max_latency=714.346 ms
Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there are not enough
tasks to really stress the locking:
fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16
baseline
Average Files/sec: 160166.7
p50 Files/sec: 165832
p90 Files/sec: 123886
p99 Files/sec: 123495
real 3m26.527s
user 2m19.223s
sys 48m21.856s
patched
Average Files/sec: 164135.7
p50 Files/sec: 171095
p90 Files/sec: 122889
p99 Files/sec: 113819
real 3m29.660s
user 2m19.990s
sys 44m12.259s
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-20 15:46:09 +00:00
|
|
|
init_rwsem(&eb->lock);
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop.
Most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-02-04 14:25:08 +00:00
|
|
|
|
2020-02-14 21:11:40 +00:00
|
|
|
btrfs_leak_debug_add(&fs_info->eb_leak_lock, &eb->leak_list,
|
|
|
|
&fs_info->allocated_ebs);
|
2013-04-22 16:12:31 +00:00
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_lock_init(&eb->refs_lock);
|
2008-01-24 21:13:08 +00:00
|
|
|
atomic_set(&eb->refs, 1);
|
2012-03-13 13:38:00 +00:00
|
|
|
atomic_set(&eb->io_pages, 0);
|
2010-08-06 17:21:20 +00:00
|
|
|
|
2013-02-28 14:54:18 +00:00
|
|
|
/*
|
|
|
|
* Sanity checks, currently the maximum is 64k covered by 16x 4k pages
|
|
|
|
*/
|
|
|
|
BUILD_BUG_ON(BTRFS_MAX_METADATA_BLOCKSIZE
|
|
|
|
> MAX_INLINE_EXTENT_BUFFER_SIZE);
|
|
|
|
BUG_ON(len > MAX_INLINE_EXTENT_BUFFER_SIZE);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
return eb;
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
struct extent_buffer *btrfs_clone_extent_buffer(const struct extent_buffer *src)
|
2012-05-16 15:00:02 +00:00
|
|
|
{
|
2018-03-01 17:20:27 +00:00
|
|
|
int i;
|
2012-05-16 15:00:02 +00:00
|
|
|
struct page *p;
|
|
|
|
struct extent_buffer *new;
|
2018-03-01 17:20:27 +00:00
|
|
|
int num_pages = num_extent_pages(src);
|
2012-05-16 15:00:02 +00:00
|
|
|
|
2014-06-15 01:20:26 +00:00
|
|
|
new = __alloc_extent_buffer(src->fs_info, src->start, src->len);
|
2012-05-16 15:00:02 +00:00
|
|
|
if (new == NULL)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
for (i = 0; i < num_pages; i++) {
|
2013-08-07 20:57:23 +00:00
|
|
|
p = alloc_page(GFP_NOFS);
|
2013-08-07 18:54:37 +00:00
|
|
|
if (!p) {
|
|
|
|
btrfs_release_extent_buffer(new);
|
|
|
|
return NULL;
|
|
|
|
}
|
2012-05-16 15:00:02 +00:00
|
|
|
attach_extent_buffer_page(new, p);
|
|
|
|
WARN_ON(PageDirty(p));
|
|
|
|
SetPageUptodate(p);
|
|
|
|
new->pages[i] = p;
|
2016-11-08 16:56:24 +00:00
|
|
|
copy_page(page_address(p), page_address(src->pages[i]));
|
2012-05-16 15:00:02 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
set_bit(EXTENT_BUFFER_UPTODATE, &new->bflags);
|
2018-06-27 13:38:24 +00:00
|
|
|
set_bit(EXTENT_BUFFER_UNMAPPED, &new->bflags);
|
2012-05-16 15:00:02 +00:00
|
|
|
|
|
|
|
return new;
|
|
|
|
}
|
|
|
|
|
2015-09-30 03:50:31 +00:00
|
|
|
struct extent_buffer *__alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
|
|
|
|
u64 start, unsigned long len)
|
2012-05-16 15:00:02 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb;
|
2018-03-01 17:20:27 +00:00
|
|
|
int num_pages;
|
|
|
|
int i;
|
2012-05-16 15:00:02 +00:00
|
|
|
|
2014-06-15 01:20:26 +00:00
|
|
|
eb = __alloc_extent_buffer(fs_info, start, len);
|
2012-05-16 15:00:02 +00:00
|
|
|
if (!eb)
|
|
|
|
return NULL;
|
|
|
|
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2012-05-16 15:00:02 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2013-08-07 20:57:23 +00:00
|
|
|
eb->pages[i] = alloc_page(GFP_NOFS);
|
2012-05-16 15:00:02 +00:00
|
|
|
if (!eb->pages[i])
|
|
|
|
goto err;
|
|
|
|
}
|
|
|
|
set_extent_buffer_uptodate(eb);
|
|
|
|
btrfs_set_header_nritems(eb, 0);
|
2018-06-27 13:38:24 +00:00
|
|
|
set_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
|
2012-05-16 15:00:02 +00:00
|
|
|
|
|
|
|
return eb;
|
|
|
|
err:
|
2012-10-11 13:25:16 +00:00
|
|
|
for (; i > 0; i--)
|
|
|
|
__free_page(eb->pages[i - 1]);
|
2012-05-16 15:00:02 +00:00
|
|
|
__free_extent_buffer(eb);
|
|
|
|
return NULL;
|
|
|
|
}
|
|
|
|
|
2015-09-30 03:50:31 +00:00
|
|
|
struct extent_buffer *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
|
2016-06-15 13:22:56 +00:00
|
|
|
u64 start)
|
2015-09-30 03:50:31 +00:00
|
|
|
{
|
2016-06-15 13:22:56 +00:00
|
|
|
return __alloc_dummy_extent_buffer(fs_info, start, fs_info->nodesize);
|
2015-09-30 03:50:31 +00:00
|
|
|
}
|
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
static void check_buffer_tree_ref(struct extent_buffer *eb)
|
|
|
|
{
|
2013-01-29 22:49:37 +00:00
|
|
|
int refs;
|
btrfs: fix fatal extent_buffer readahead vs releasepage race
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing significant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
(each step is prefixed with the CPU that executes it)
CPU0 and CPU1: reada_for_search
CPU0 and CPU1: readahead_tree_block
CPU0 and CPU1: find_create_tree_block
CPU0 and CPU1: alloc_extent_buffer
CPU1: find_extent_buffer // not found
CPU1: allocates eb
CPU1: lock pages
CPU1: associate pages to eb
CPU1: insert eb into radix tree
CPU1: set TREE_REF, refs == 2
CPU1: unlock pages
CPU1: read_extent_buffer_pages // WAIT_NONE
CPU1: not uptodate (brand new eb)
CPU2: lock_page
CPU1: if !trylock_page
CPU1: goto unlock_exit // not an error
CPU1: free_extent_buffer
CPU1: release_extent_buffer
CPU1: atomic_dec_and_test refs to 1
CPU0: find_extent_buffer // found
CPU2: try_release_extent_buffer
CPU2: take refs_lock
CPU2: reads refs == 1; no io
CPU0: atomic_inc_not_zero refs to 2
CPU0: mark_buffer_accessed
CPU0: check_buffer_tree_ref
CPU0: // not STALE, won't take refs_lock
CPU0: refs == 2; TREE_REF set // no action
CPU0: read_extent_buffer_pages // WAIT_NONE
CPU2: clear TREE_REF
CPU2: release_extent_buffer
CPU2: atomic_dec_and_test refs to 1
CPU2: unlock_page
CPU0: still not uptodate (CPU1 read failed on trylock_page)
CPU0: locks pages
CPU0: set io_pages > 0
CPU0: submit io
CPU0: return
CPU0: free_extent_buffer
CPU0: release_extent_buffer
CPU0: dec refs to 0
CPU0: delete from radix tree
CPU0: btrfs_release_extent_buffer_pages
CPU0: BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_buffer_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
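The ordering argument in the paragraph above can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions: `eb_model`, `model_check_tree_ref()`, `model_try_release()` and `model_submit_read()` are hypothetical names, plain integers stand in for the kernel's atomics and flag bits, and all locking (such as refs_lock) is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of an extent buffer's reference state. */
struct eb_model {
	int refs;       /* total references held on the buffer */
	int io_pages;   /* pages with I/O still in flight */
	bool tree_ref;  /* models the TREE_REF flag */
};

/* Take the tree reference if it is not already held. */
static void model_check_tree_ref(struct eb_model *eb)
{
	if (!eb->tree_ref) {
		eb->tree_ref = true;
		eb->refs++;
	}
}

/*
 * Release path: like the code paths named in the message, it checks for
 * outstanding I/O before it is allowed to clear TREE_REF.
 */
static bool model_try_release(struct eb_model *eb)
{
	if (eb->io_pages > 0 || !eb->tree_ref)
		return false;
	eb->tree_ref = false;
	eb->refs--;
	return true;
}

/*
 * Read submission with the ordering from the fix: io_pages is raised
 * first, and only then is the tree reference (re)established, so the
 * release path can no longer drop it until the I/O completes.
 */
static void model_submit_read(struct eb_model *eb, int nr_pages)
{
	assert(nr_pages > 0);
	eb->io_pages = nr_pages;
	model_check_tree_ref(eb);
}
```

In this model, a release attempt made after `model_submit_read()` fails for as long as `io_pages` is non-zero and succeeds once it drops back to zero.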
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-17 18:35:19 +00:00
	/*
	 * The TREE_REF bit is first set when the extent_buffer is added
	 * to the radix tree. It is also reset, if unset, when a new reference
	 * is created by find_extent_buffer.
2012-03-13 13:38:00 +00:00
	 *
2020-06-17 18:35:19 +00:00
	 * It is only cleared in two cases: freeing the last non-tree
	 * reference to the extent_buffer when its STALE bit is set or
	 * calling releasepage when the tree reference is the only reference.
2012-03-13 13:38:00 +00:00
	 *
2020-06-17 18:35:19 +00:00
	 * In both cases, care is taken to ensure that the extent_buffer's
	 * pages are not under io. However, releasepage can be concurrently
	 * called with creating new references, which is prone to race
	 * conditions between the calls to check_buffer_tree_ref in those
	 * codepaths and clearing TREE_REF in try_release_extent_buffer.
2012-03-13 13:38:00 +00:00
	 *
2020-06-17 18:35:19 +00:00
	 * The actual lifetime of the extent_buffer in the radix tree is
	 * adequately protected by the refcount, but the TREE_REF bit and
	 * its corresponding reference are not. To protect against this
	 * class of races, we call check_buffer_tree_ref from the codepaths
	 * which trigger io after they set eb->io_pages. Note that once io is
	 * initiated, TREE_REF can no longer be cleared, so that is the
	 * moment at which any such race is best fixed.
2012-03-13 13:38:00 +00:00
	 */
2013-01-29 22:49:37 +00:00
	refs = atomic_read(&eb->refs);
	if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
		return;

2012-07-20 20:11:08 +00:00
	spin_lock(&eb->refs_lock);
	if (!test_and_set_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
2012-03-13 13:38:00 +00:00
		atomic_inc(&eb->refs);
2012-07-20 20:11:08 +00:00
	spin_unlock(&eb->refs_lock);
2012-03-13 13:38:00 +00:00
}

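The ordering argument in the comment above can be tried out in a small userspace model. This is only a sketch under stated assumptions: `struct eb_model`, `model_check_tree_ref`, `model_try_release` and `model_start_io` are hypothetical names, C11 atomics and a toy spinlock stand in for the kernel primitives, and the release side is reduced to the two conditions the comment names (tree reference is the only reference, no io pending).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the struct extent_buffer fields used here. */
struct eb_model {
	atomic_int refs;	/* reference count */
	atomic_bool tree_ref;	/* models the EXTENT_BUFFER_TREE_REF bit */
	atomic_int io_pages;	/* pages with io in flight */
	atomic_flag refs_lock;	/* toy spinlock standing in for eb->refs_lock */
};

static void model_lock(struct eb_model *eb)
{
	while (atomic_flag_test_and_set(&eb->refs_lock))
		;	/* spin until the flag was previously clear */
}

static void model_unlock(struct eb_model *eb)
{
	atomic_flag_clear(&eb->refs_lock);
}

static void model_init(struct eb_model *eb)
{
	atomic_init(&eb->refs, 1);	/* the caller's own reference */
	atomic_init(&eb->tree_ref, false);
	atomic_init(&eb->io_pages, 0);
	atomic_flag_clear(&eb->refs_lock);
}

/* Same shape as check_buffer_tree_ref: ensure the tree holds a reference. */
static void model_check_tree_ref(struct eb_model *eb)
{
	/* Fast path: the bit is set and there are at least two references. */
	if (atomic_load(&eb->refs) >= 2 && atomic_load(&eb->tree_ref))
		return;

	model_lock(eb);
	if (!atomic_exchange(&eb->tree_ref, true))
		atomic_fetch_add(&eb->refs, 1);
	model_unlock(eb);
}

/*
 * Simplified release side: drop the tree reference only when it is the
 * last reference and no io is pending. Returns true if it was dropped.
 */
static bool model_try_release(struct eb_model *eb)
{
	bool released = false;

	model_lock(eb);
	if (atomic_load(&eb->refs) == 1 && atomic_load(&eb->io_pages) == 0 &&
	    atomic_exchange(&eb->tree_ref, false)) {
		atomic_fetch_sub(&eb->refs, 1);
		released = true;
	}
	model_unlock(eb);
	return released;
}

/* Fixed read path: publish io_pages first, then re-check TREE_REF. */
static void model_start_io(struct eb_model *eb, int num_pages)
{
	atomic_store(&eb->io_pages, num_pages);
	model_check_tree_ref(eb);
}
```

In this model, once `model_start_io` has run, `model_try_release` refuses to clear the bit until `io_pages` drops back to zero, which is the property the fix relies on. The model is single-structure and does not reproduce the three-CPU interleaving itself.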
2014-06-04 23:10:31 +00:00
static void mark_extent_buffer_accessed(struct extent_buffer *eb,
		struct page *accessed)
2012-03-15 22:24:42 +00:00
{
2018-03-01 17:20:27 +00:00
	int num_pages, i;
2012-03-15 22:24:42 +00:00

2012-03-13 13:38:00 +00:00
	check_buffer_tree_ref(eb);

2018-06-29 08:56:49 +00:00
	num_pages = num_extent_pages(eb);
2012-03-15 22:24:42 +00:00
	for (i = 0; i < num_pages; i++) {
2014-07-30 23:03:53 +00:00
		struct page *p = eb->pages[i];

2014-06-04 23:10:31 +00:00
		if (p != accessed)
			mark_page_accessed(p);
2012-03-15 22:24:42 +00:00
	}
}

2013-12-16 18:24:27 +00:00
struct extent_buffer *find_extent_buffer(struct btrfs_fs_info *fs_info,
					 u64 start)
2013-10-07 15:45:25 +00:00
{
	struct extent_buffer *eb;

	rcu_read_lock();
2013-12-16 18:24:27 +00:00
	eb = radix_tree_lookup(&fs_info->buffer_radix,
2020-10-21 06:25:05 +00:00
			       start >> fs_info->sectorsize_bits);
2013-10-07 15:45:25 +00:00
	if (eb && atomic_inc_not_zero(&eb->refs)) {
		rcu_read_unlock();
2015-04-23 10:28:48 +00:00
		/*
		 * Lock our eb's refs_lock to avoid races with
		 * free_extent_buffer. When we get our eb it might be flagged
		 * with EXTENT_BUFFER_STALE and another task running
		 * free_extent_buffer might have seen that flag set,
		 * eb->refs == 2, that the buffer isn't under IO (dirty and
		 * writeback flags not set) and it's still in the tree (flag
		 * EXTENT_BUFFER_TREE_REF set), therefore being in the process
		 * of decrementing the extent buffer's reference count twice.
		 * So here we could race and increment the eb's reference count,
		 * clear its stale flag, mark it as dirty and drop our reference
		 * before the other task finishes executing free_extent_buffer,
		 * which would later result in an attempt to free an extent
		 * buffer that is dirty.
		 */
		if (test_bit(EXTENT_BUFFER_STALE, &eb->bflags)) {
			spin_lock(&eb->refs_lock);
			spin_unlock(&eb->refs_lock);
		}
2014-06-04 23:10:31 +00:00
		mark_extent_buffer_accessed(eb, NULL);
2013-10-07 15:45:25 +00:00
		return eb;
	}
	rcu_read_unlock();

	return NULL;
}

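The lookup above only hands out a buffer if `atomic_inc_not_zero` succeeds, so a buffer whose count has already reached zero is never revived. A minimal userspace sketch of that idiom follows, assuming C11 atomics in place of the kernel's `atomic_t`; `ref_inc_not_zero` is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true when a reference was taken, false when the object
 * was already on its way to being freed.
 */
static bool ref_inc_not_zero(atomic_int *refs)
{
	int old = atomic_load(refs);

	while (old != 0) {
		/* On failure, old is updated to the current value and we retry. */
		if (atomic_compare_exchange_weak(refs, &old, old + 1))
			return true;
	}
	return false;
}
```

The compare-and-exchange loop matters because a plain "load, test, increment" sequence would let the count drop to zero between the test and the increment.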
2014-05-07 21:06:09 +00:00
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
2016-06-15 13:22:56 +00:00
					u64 start)
2014-05-07 21:06:09 +00:00
{
	struct extent_buffer *eb, *exists = NULL;
	int ret;

	eb = find_extent_buffer(fs_info, start);
	if (eb)
		return eb;
2016-06-15 13:22:56 +00:00
	eb = alloc_dummy_extent_buffer(fs_info, start);
2014-05-07 21:06:09 +00:00
	if (!eb)
2019-12-03 11:24:58 +00:00
		return ERR_PTR(-ENOMEM);
2014-05-07 21:06:09 +00:00
	eb->fs_info = fs_info;
again:
2016-05-09 12:11:38 +00:00
	ret = radix_tree_preload(GFP_NOFS);
2019-12-03 11:24:58 +00:00
	if (ret) {
		exists = ERR_PTR(ret);
2014-05-07 21:06:09 +00:00
		goto free_eb;
2019-12-03 11:24:58 +00:00
	}
2014-05-07 21:06:09 +00:00
	spin_lock(&fs_info->buffer_lock);
	ret = radix_tree_insert(&fs_info->buffer_radix,
2020-10-21 06:25:05 +00:00
				start >> fs_info->sectorsize_bits, eb);
2014-05-07 21:06:09 +00:00
	spin_unlock(&fs_info->buffer_lock);
	radix_tree_preload_end();
	if (ret == -EEXIST) {
		exists = find_extent_buffer(fs_info, start);
		if (exists)
			goto free_eb;
		else
			goto again;
	}
	check_buffer_tree_ref(eb);
	set_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags);

	return eb;
free_eb:
	btrfs_release_extent_buffer(eb);
	return exists;
}
#endif

2013-12-16 18:24:27 +00:00
struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
2020-11-05 15:45:20 +00:00
					  u64 start, u64 owner_root, int level)
2008-01-24 21:13:08 +00:00
{
2016-06-15 13:22:56 +00:00
	unsigned long len = fs_info->nodesize;
2018-03-01 17:20:27 +00:00
	int num_pages;
	int i;
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
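With the PAGE_CACHE_* macros gone, a byte offset maps to a page-cache index with a plain PAGE_SHIFT shift, as in the `index = start >> PAGE_SHIFT` line this commit touched. The arithmetic can be sketched in userspace as follows. The 4 KiB page size, the `MODEL_PAGE_SHIFT` constant and both helper names are assumptions for illustration, and the page count assumes the buffer length is a multiple of the page size.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumed: 4 KiB pages */

/* Page-cache index of the page containing byte offset start. */
static unsigned long model_page_index(uint64_t start)
{
	return (unsigned long)(start >> MODEL_PAGE_SHIFT);
}

/* Number of pages backing a buffer of len bytes (len a multiple of the page size). */
static int model_num_pages(unsigned long len)
{
	return (int)(len >> MODEL_PAGE_SHIFT);
}
```

For a 16 KiB buffer starting at byte offset 0x40000, this gives four pages at indexes 64 through 67, which is the range a loop of the form `for (i = 0; i < num_pages; i++, index++)` walks.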
2016-04-01 12:29:47 +00:00
	unsigned long index = start >> PAGE_SHIFT;
2008-01-24 21:13:08 +00:00
	struct extent_buffer *eb;
2008-07-22 15:18:07 +00:00
	struct extent_buffer *exists = NULL;
2008-01-24 21:13:08 +00:00
	struct page *p;
2013-12-16 18:24:27 +00:00
	struct address_space *mapping = fs_info->btree_inode->i_mapping;
2008-01-24 21:13:08 +00:00
	int uptodate = 1;
2010-10-27 00:57:29 +00:00
	int ret;
2008-01-24 21:13:08 +00:00

2016-06-15 13:22:56 +00:00
	if (!IS_ALIGNED(start, fs_info->sectorsize)) {
2016-06-06 19:01:23 +00:00
		btrfs_err(fs_info, "bad tree block start %llu", start);
		return ERR_PTR(-EINVAL);
	}

2013-12-16 18:24:27 +00:00
	eb = find_extent_buffer(fs_info, start);
2013-10-07 15:45:25 +00:00
	if (eb)
2008-07-22 15:18:07 +00:00
		return eb;

2014-06-15 00:55:29 +00:00
	eb = __alloc_extent_buffer(fs_info, start, len);
2008-04-01 15:21:40 +00:00
	if (!eb)
2016-06-06 19:01:23 +00:00
		return ERR_PTR(-ENOMEM);
btrfs: set the lockdep class for extent buffers on creation
Both Filipe and Fedora QA recently hit the following lockdep splat:
WARNING: possible recursive locking detected
5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Not tainted
--------------------------------------------
rsync/2610 is trying to acquire lock:
ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
but task is already holding lock:
ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&eb->lock);
lock(&eb->lock);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by rsync/2610:
#0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190
#1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140
stack backtrace:
CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x8b/0xb0
__lock_acquire.cold+0x12d/0x2a4
? kvm_sched_clock_read+0x14/0x30
? sched_clock+0x5/0x10
lock_acquire+0xc8/0x400
? btrfs_tree_read_lock_atomic+0x34/0x140
? read_block_for_search.isra.0+0xdd/0x320
_raw_read_lock+0x3d/0xa0
? btrfs_tree_read_lock_atomic+0x34/0x140
btrfs_tree_read_lock_atomic+0x34/0x140
btrfs_search_slot+0x616/0x9a0
btrfs_lookup_dir_item+0x6c/0xb0
btrfs_lookup_dentry+0xa8/0x520
? lockdep_init_map_waits+0x4c/0x210
btrfs_lookup+0xe/0x30
__lookup_slow+0x10f/0x1e0
walk_component+0x11b/0x190
path_lookupat+0x72/0x1c0
filename_lookup+0x97/0x180
? strncpy_from_user+0x96/0x1e0
? getname_flags.part.0+0x45/0x1a0
vfs_statx+0x64/0x100
? lockdep_hardirqs_on_prepare+0xff/0x180
? _raw_spin_unlock_irqrestore+0x41/0x50
__do_sys_newlstat+0x26/0x40
? lockdep_hardirqs_on_prepare+0xff/0x180
? syscall_enter_from_user_mode+0x27/0x80
? syscall_enter_from_user_mode+0x27/0x80
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
I have also seen a report of lockdep complaining about the lock class
that was looked up being the same as the lock class on the lock we were
using, but I can't find the report.
These are problems that occur because we do not have the lockdep class
set on the extent buffer until _after_ we read the eb in properly. This
is problematic for concurrent readers, because we will create the extent
buffer, lock it, and then attempt to read the extent buffer.
If a second thread comes in and tries to do a search down the same path
they'll get the above lockdep splat because the class isn't set properly
on the extent buffer.
There was a good reason for this: we generally didn't know the real
owner of the eb until we read it, specifically in refcounted roots.
However, all refcounted roots now have the same class name, so we no
longer need to worry about this. For non-refcounted trees we know
which root we're on based on the parent.
Fix this by setting the lockdep class on the eb at creation time instead
of read time. This will fix the splat and the weirdness where the class
changes in the middle of locking the block.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
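The effect of this fix can be illustrated with a toy model in user-space C. This is only a sketch under stated assumptions: `toy_eb`, `toy_class_for()` and `toy_nested_same_class()` are hypothetical names that do not exist in the kernel, and the check is a drastic simplification of lockdep, which reports possible recursive locking when a nested acquisition uses the same lock class without a nesting annotation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for an extent buffer; only the lock class matters. */
struct toy_eb {
	int lock_class;
};

/* Class shared by every buffer whose real class has not been assigned yet. */
enum { TOY_CLASS_GENERIC = 0 };

/* Assumed mapping: one class per (owner root, tree level) pair, level < 8. */
static int toy_class_for(int owner_root, int level)
{
	return 1 + owner_root * 8 + level;
}

/* Old behaviour: the buffer is created and locked before its class is set. */
static void toy_eb_create_late(struct toy_eb *eb)
{
	eb->lock_class = TOY_CLASS_GENERIC;
}

/* New behaviour: the class is assigned at creation time. */
static void toy_eb_create_early(struct toy_eb *eb, int owner_root, int level)
{
	eb->lock_class = toy_class_for(owner_root, level);
}

/* A class-based validator sees "lock(A); lock(A);" when both classes match. */
static bool toy_nested_same_class(const struct toy_eb *held,
				  const struct toy_eb *next)
{
	return held->lock_class == next->lock_class;
}
```

In this model, a parent and child created with `toy_eb_create_late()` both carry the generic class, so taking the child's lock while holding the parent's looks recursive. Created with `toy_eb_create_early()` at levels 1 and 0 of the same root, they carry different classes and the nested acquisition is not flagged.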
2020-11-05 15:45:21 +00:00
|
|
|
btrfs_set_buffer_lockdep_class(owner_root, eb, level);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2010-08-06 17:21:20 +00:00
|
|
|
for (i = 0; i < num_pages; i++, index++) {
|
2015-08-19 12:17:40 +00:00
|
|
|
p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
|
2016-06-06 19:01:23 +00:00
|
|
|
if (!p) {
|
|
|
|
exists = ERR_PTR(-ENOMEM);
|
2008-07-22 15:18:07 +00:00
|
|
|
goto free_eb;
|
2016-06-06 19:01:23 +00:00
|
|
|
}
|
2012-03-07 21:20:05 +00:00
|
|
|
|
|
|
|
spin_lock(&mapping->private_lock);
|
|
|
|
if (PagePrivate(p)) {
|
|
|
|
/*
|
|
|
|
* We could have already allocated an eb for this page
|
|
|
|
* and attached one so lets see if we can get a ref on
|
|
|
|
* the existing eb, and if we can we know it's good and
|
|
|
|
* we can just return that one, else we know we can just
|
|
|
|
* overwrite page->private.
|
|
|
|
*/
|
|
|
|
exists = (struct extent_buffer *)p->private;
|
|
|
|
if (atomic_inc_not_zero(&exists->refs)) {
|
|
|
|
spin_unlock(&mapping->private_lock);
|
|
|
|
unlock_page(p);
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
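As a small illustration of why the first two rewrite rules in the commit message above are safe, the sketch below models the old and new forms of an index conversion. The `DEMO_` prefixes are hypothetical (chosen to avoid clashing with system headers) and the shift value of 12 is an assumption matching 4 KiB pages. The only point is that the shift distance is zero once both constants are equal.

```c
/* Assumed value for illustration: 4 KiB pages. */
#define DEMO_PAGE_SHIFT		12
#define DEMO_PAGE_SIZE		(1UL << DEMO_PAGE_SHIFT)

/* Before the cleanup, the page cache constants were aliases of PAGE_*. */
#define DEMO_PAGE_CACHE_SHIFT	DEMO_PAGE_SHIFT
#define DEMO_PAGE_CACHE_SIZE	DEMO_PAGE_SIZE

/* Old style: the conversion shifts by (PAGE_CACHE_SHIFT - PAGE_SHIFT) == 0. */
static unsigned long demo_index_old(unsigned long index)
{
	return index << (DEMO_PAGE_CACHE_SHIFT - DEMO_PAGE_SHIFT);
}

/* New style after the rewrite: the expression is used as is. */
static unsigned long demo_index_new(unsigned long index)
{
	return index;
}
```

Both helpers return the same value for every input, which is what allows the semantic patch to drop the shift expression entirely.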
2016-04-01 12:29:47 +00:00
|
|
|
put_page(p);
|
2014-06-04 23:10:31 +00:00
|
|
|
mark_extent_buffer_accessed(exists, p);
|
2012-03-07 21:20:05 +00:00
|
|
|
goto free_eb;
|
|
|
|
}
|
2015-02-24 10:47:05 +00:00
|
|
|
exists = NULL;
|
2012-03-07 21:20:05 +00:00
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
WARN_ON(PageDirty(p));
|
2020-11-13 12:51:49 +00:00
|
|
|
detach_page_private(p);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2012-03-07 21:20:05 +00:00
|
|
|
attach_extent_buffer_page(eb, p);
|
|
|
|
spin_unlock(&mapping->private_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
WARN_ON(PageDirty(p));
|
2010-08-06 17:21:20 +00:00
|
|
|
eb->pages[i] = p;
|
2008-01-24 21:13:08 +00:00
|
|
|
if (!PageUptodate(p))
|
|
|
|
uptodate = 0;
|
2011-02-10 17:35:00 +00:00
|
|
|
|
|
|
|
/*
|
2018-07-04 07:24:52 +00:00
|
|
|
* We can't unlock the pages just yet since the extent buffer
|
|
|
|
* hasn't been properly inserted in the radix tree, this
|
|
|
|
* opens a race with btree_releasepage which can free a page
|
|
|
|
* while we are still filling in all pages for the buffer and
|
|
|
|
* we could crash.
|
2011-02-10 17:35:00 +00:00
|
|
|
*/
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
if (uptodate)
|
Btrfs: Change btree locking to use explicit blocking points
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop.
Most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able to get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
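The protocol described in the commit message above can be sketched in user-space C with C11 atomics. This is a minimal model, not the kernel implementation: `toy_eb_lock` and the `toy_*` functions are hypothetical names, the spinlock is a plain atomic flag, and waiting on the blocking bit uses `sched_yield()` where the kernel code uses a wait queue with adaptive spinning.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

struct toy_eb_lock {
	atomic_bool spin;	/* stand-in for the extent buffer spinlock */
	atomic_bool blocking;	/* stand-in for EXTENT_BUFFER_BLOCKING */
};

static void toy_lock_init(struct toy_eb_lock *l)
{
	atomic_init(&l->spin, false);
	atomic_init(&l->blocking, false);
}

static void toy_spin_lock(struct toy_eb_lock *l)
{
	while (atomic_exchange_explicit(&l->spin, true, memory_order_acquire))
		;	/* busy-wait until the flag was previously clear */
}

static void toy_spin_unlock(struct toy_eb_lock *l)
{
	atomic_store_explicit(&l->spin, false, memory_order_release);
}

/* Returns with the spinlock held and the blocking bit clear. */
static void toy_tree_lock(struct toy_eb_lock *l)
{
	for (;;) {
		toy_spin_lock(l);
		if (!atomic_load(&l->blocking))
			return;
		/* Someone holds the buffer in blocking mode: back off and wait. */
		toy_spin_unlock(l);
		while (atomic_load(&l->blocking))
			sched_yield();
	}
}

/* Caller holds the spinlock; the buffer stays logically locked afterwards. */
static void toy_set_lock_blocking(struct toy_eb_lock *l)
{
	atomic_store(&l->blocking, true);
	toy_spin_unlock(l);
}

/* Returns with the spinlock held again and the blocking bit clear. */
static void toy_clear_lock_blocking(struct toy_eb_lock *l)
{
	toy_spin_lock(l);
	atomic_store(&l->blocking, false);
}

/* Works for both modes, based on the blocking bit. */
static void toy_tree_unlock(struct toy_eb_lock *l)
{
	if (atomic_load(&l->blocking))
		atomic_store(&l->blocking, false);	/* real code also wakes waiters */
	else
		toy_spin_unlock(l);
}
```

A holder calls `toy_tree_lock()`, switches to blocking mode with `toy_set_lock_blocking()` before anything that may sleep, and either returns to spinning mode with `toy_clear_lock_blocking()` or releases directly with `toy_tree_unlock()`.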
2009-02-04 14:25:08 +00:00
|
|
|
set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
|
2012-03-09 14:51:43 +00:00
|
|
|
again:
|
2016-05-09 12:11:38 +00:00
|
|
|
ret = radix_tree_preload(GFP_NOFS);
|
2016-06-06 19:01:23 +00:00
|
|
|
if (ret) {
|
|
|
|
exists = ERR_PTR(ret);
|
2010-10-27 00:57:29 +00:00
|
|
|
goto free_eb;
|
2016-06-06 19:01:23 +00:00
|
|
|
}
|
2010-10-27 00:57:29 +00:00
|
|
|
|
2013-12-16 18:24:27 +00:00
|
|
|
spin_lock(&fs_info->buffer_lock);
|
|
|
|
ret = radix_tree_insert(&fs_info->buffer_radix,
|
2020-10-21 06:25:05 +00:00
|
|
|
start >> fs_info->sectorsize_bits, eb);
|
2013-12-16 18:24:27 +00:00
|
|
|
spin_unlock(&fs_info->buffer_lock);
|
2013-10-07 15:45:25 +00:00
|
|
|
radix_tree_preload_end();
|
2010-10-27 00:57:29 +00:00
|
|
|
if (ret == -EEXIST) {
|
2013-12-16 18:24:27 +00:00
|
|
|
exists = find_extent_buffer(fs_info, start);
|
2013-10-07 15:45:25 +00:00
|
|
|
if (exists)
|
|
|
|
goto free_eb;
|
|
|
|
else
|
2012-03-09 14:51:43 +00:00
|
|
|
goto again;
|
2008-07-22 15:18:07 +00:00
|
|
|
}
|
|
|
|
/* add one reference for the tree */
|
2012-03-13 13:38:00 +00:00
|
|
|
check_buffer_tree_ref(eb);
|
2013-12-13 15:41:51 +00:00
|
|
|
set_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags);
|
2011-02-10 17:35:00 +00:00
|
|
|
|
|
|
|
/*
|
2018-07-04 07:24:52 +00:00
|
|
|
* Now it's safe to unlock the pages because any calls to
|
|
|
|
* btree_releasepage will correctly detect that a page belongs to a
|
|
|
|
* live buffer and won't free them prematurely.
|
2011-02-10 17:35:00 +00:00
|
|
|
*/
|
2018-07-04 07:24:51 +00:00
|
|
|
for (i = 0; i < num_pages; i++)
|
|
|
|
unlock_page(eb->pages[i]);
|
2008-01-24 21:13:08 +00:00
|
|
|
return eb;
|
|
|
|
|
2008-07-22 15:18:07 +00:00
|
|
|
free_eb:
|
2015-02-24 10:47:05 +00:00
|
|
|
WARN_ON(!atomic_dec_and_test(&eb->refs));
|
2010-08-06 17:21:20 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
|
|
|
if (eb->pages[i])
|
|
|
|
unlock_page(eb->pages[i]);
|
|
|
|
}
|
2011-02-10 17:35:00 +00:00
|
|
|
|
2010-10-27 00:57:29 +00:00
|
|
|
btrfs_release_extent_buffer(eb);
|
2008-07-22 15:18:07 +00:00
|
|
|
return exists;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
static inline void btrfs_release_extent_buffer_rcu(struct rcu_head *head)
|
|
|
|
{
|
|
|
|
struct extent_buffer *eb =
|
|
|
|
container_of(head, struct extent_buffer, rcu_head);
|
|
|
|
|
|
|
|
__free_extent_buffer(eb);
|
|
|
|
}
|
|
|
|
|
2013-04-26 14:56:29 +00:00
|
|
|
static int release_extent_buffer(struct extent_buffer *eb)
|
2020-02-23 23:16:42 +00:00
|
|
|
__releases(&eb->refs_lock)
|
2012-03-09 21:01:49 +00:00
|
|
|
{
|
2018-06-27 13:38:23 +00:00
|
|
|
lockdep_assert_held(&eb->refs_lock);
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
WARN_ON(atomic_read(&eb->refs) == 0);
|
|
|
|
if (atomic_dec_and_test(&eb->refs)) {
|
2013-12-13 15:41:51 +00:00
|
|
|
if (test_and_clear_bit(EXTENT_BUFFER_IN_TREE, &eb->bflags)) {
|
2013-12-16 18:24:27 +00:00
|
|
|
struct btrfs_fs_info *fs_info = eb->fs_info;
|
2012-03-09 21:01:49 +00:00
|
|
|
|
2012-05-16 15:00:02 +00:00
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-03-09 21:01:49 +00:00
|
|
|
|
2013-12-16 18:24:27 +00:00
|
|
|
spin_lock(&fs_info->buffer_lock);
|
|
|
|
radix_tree_delete(&fs_info->buffer_radix,
|
2020-10-21 06:25:05 +00:00
|
|
|
eb->start >> fs_info->sectorsize_bits);
|
2013-12-16 18:24:27 +00:00
|
|
|
spin_unlock(&fs_info->buffer_lock);
|
2013-12-13 15:41:51 +00:00
|
|
|
} else {
|
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-05-16 15:00:02 +00:00
|
|
|
}
|
2012-03-09 21:01:49 +00:00
|
|
|
|
2020-02-14 21:11:42 +00:00
|
|
|
btrfs_leak_debug_del(&eb->fs_info->eb_leak_lock, &eb->leak_list);
|
2012-03-09 21:01:49 +00:00
|
|
|
/* Should be safe to release our pages at this point */
|
2018-07-19 15:24:32 +00:00
|
|
|
btrfs_release_extent_buffer_pages(eb);
|
2015-03-16 21:38:02 +00:00
|
|
|
#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
|
2018-06-27 13:38:24 +00:00
|
|
|
if (unlikely(test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags))) {
|
2015-03-16 21:38:02 +00:00
|
|
|
__free_extent_buffer(eb);
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
#endif
|
2012-03-09 21:01:49 +00:00
|
|
|
call_rcu(&eb->rcu_head, btrfs_release_extent_buffer_rcu);
|
2012-07-20 20:05:36 +00:00
|
|
|
return 1;
|
2012-03-09 21:01:49 +00:00
|
|
|
}
|
|
|
|
spin_unlock(&eb->refs_lock);
|
2012-07-20 20:05:36 +00:00
|
|
|
|
|
|
|
return 0;
|
2012-03-09 21:01:49 +00:00
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
void free_extent_buffer(struct extent_buffer *eb)
|
|
|
|
{
|
2013-01-29 22:49:37 +00:00
|
|
|
int refs;
|
|
|
|
int old;
|
2008-01-24 21:13:08 +00:00
|
|
|
if (!eb)
|
|
|
|
return;
|
|
|
|
|
2013-01-29 22:49:37 +00:00
|
|
|
while (1) {
|
|
|
|
refs = atomic_read(&eb->refs);
|
2018-10-15 14:04:01 +00:00
|
|
|
if ((!test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) && refs <= 3)
|
|
|
|
|| (test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags) &&
|
|
|
|
refs == 1))
|
2013-01-29 22:49:37 +00:00
|
|
|
break;
|
|
|
|
old = atomic_cmpxchg(&eb->refs, refs, refs - 1);
|
|
|
|
if (old == refs)
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_lock(&eb->refs_lock);
|
|
|
|
if (atomic_read(&eb->refs) == 2 &&
|
|
|
|
test_bit(EXTENT_BUFFER_STALE, &eb->bflags) &&
|
2012-03-13 13:38:00 +00:00
|
|
|
!extent_buffer_under_io(eb) &&
|
2012-03-09 21:01:49 +00:00
|
|
|
test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
|
|
|
|
atomic_dec(&eb->refs);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* I know this is terrible, but it's temporary until we stop tracking
|
|
|
|
* the uptodate bits and such for the extent buffers.
|
|
|
|
*/
|
2013-04-26 14:56:29 +00:00
|
|
|
release_extent_buffer(eb);
|
2012-03-09 21:01:49 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void free_extent_buffer_stale(struct extent_buffer *eb)
|
|
|
|
{
|
|
|
|
if (!eb)
|
2008-01-24 21:13:08 +00:00
|
|
|
return;
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_lock(&eb->refs_lock);
|
|
|
|
set_bit(EXTENT_BUFFER_STALE, &eb->bflags);
|
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
if (atomic_read(&eb->refs) == 2 && !extent_buffer_under_io(eb) &&
|
2012-03-09 21:01:49 +00:00
|
|
|
test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
|
|
|
|
atomic_dec(&eb->refs);
|
2013-04-26 14:56:29 +00:00
|
|
|
release_extent_buffer(eb);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void clear_extent_buffer_dirty(const struct extent_buffer *eb)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2018-03-01 17:20:27 +00:00
|
|
|
int i;
|
|
|
|
int num_pages;
|
2008-01-24 21:13:08 +00:00
|
|
|
struct page *page;
|
|
|
|
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2009-03-13 15:00:37 +00:00
|
|
|
if (!PageDirty(page))
|
2008-11-19 17:44:22 +00:00
|
|
|
continue;
|
|
|
|
|
2008-07-22 15:18:08 +00:00
|
|
|
lock_page(page);
|
2011-02-10 17:35:00 +00:00
|
|
|
WARN_ON(!PagePrivate(page));
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
clear_page_dirty_for_io(page);
|
2018-04-10 23:36:56 +00:00
|
|
|
xa_lock_irq(&page->mapping->i_pages);
|
2017-12-04 15:37:22 +00:00
|
|
|
if (!PageDirty(page))
|
|
|
|
__xa_clear_mark(&page->mapping->i_pages,
|
|
|
|
page_index(page), PAGECACHE_TAG_DIRTY);
|
2018-04-10 23:36:56 +00:00
|
|
|
xa_unlock_irq(&page->mapping->i_pages);
|
2011-11-04 16:29:37 +00:00
|
|
|
ClearPageError(page);
|
2008-07-22 15:18:08 +00:00
|
|
|
unlock_page(page);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
2012-03-13 13:38:00 +00:00
|
|
|
WARN_ON(atomic_read(&eb->refs) == 0);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2018-09-13 17:44:42 +00:00
|
|
|
bool set_extent_buffer_dirty(struct extent_buffer *eb)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2018-03-01 17:20:27 +00:00
|
|
|
int i;
|
|
|
|
int num_pages;
|
2018-09-13 17:44:42 +00:00
|
|
|
bool was_dirty;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
check_buffer_tree_ref(eb);
|
|
|
|
|
2009-03-13 15:00:37 +00:00
|
|
|
was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags);
|
2012-03-13 13:38:00 +00:00
|
|
|
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2012-03-09 21:01:49 +00:00
|
|
|
WARN_ON(atomic_read(&eb->refs) == 0);
|
2012-03-13 13:38:00 +00:00
|
|
|
WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
|
|
|
|
|
2018-09-13 17:44:42 +00:00
|
|
|
if (!was_dirty)
|
|
|
|
for (i = 0; i < num_pages; i++)
|
|
|
|
set_page_dirty(eb->pages[i]);
|
2018-09-13 17:46:08 +00:00
|
|
|
|
|
|
|
#ifdef CONFIG_BTRFS_DEBUG
|
|
|
|
for (i = 0; i < num_pages; i++)
|
|
|
|
ASSERT(PageDirty(eb->pages[i]));
|
|
|
|
#endif
|
|
|
|
|
2009-03-13 15:00:37 +00:00
|
|
|
return was_dirty;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
2015-12-03 12:08:59 +00:00
|
|
|
void clear_extent_buffer_uptodate(struct extent_buffer *eb)
|
2008-05-12 17:39:03 +00:00
|
|
|
{
|
2018-03-01 17:20:27 +00:00
|
|
|
int i;
|
2008-05-12 17:39:03 +00:00
|
|
|
struct page *page;
|
2018-03-01 17:20:27 +00:00
|
|
|
int num_pages;
|
2008-05-12 17:39:03 +00:00
|
|
|
|
2009-02-04 14:25:08 +00:00
|
|
|
clear_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2008-05-12 17:39:03 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2008-07-30 14:29:12 +00:00
|
|
|
if (page)
|
|
|
|
ClearPageUptodate(page);
|
2008-05-12 17:39:03 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-12-03 12:08:59 +00:00
|
|
|
void set_extent_buffer_uptodate(struct extent_buffer *eb)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2018-03-01 17:20:27 +00:00
|
|
|
int i;
|
2008-01-24 21:13:08 +00:00
|
|
|
struct page *page;
|
2018-03-01 17:20:27 +00:00
|
|
|
int num_pages;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2012-03-13 13:38:00 +00:00
|
|
|
set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2008-01-24 21:13:08 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
SetPageUptodate(page);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-04-10 14:24:40 +00:00
|
|
|
int read_extent_buffer_pages(struct extent_buffer *eb, int wait, int mirror_num)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
2018-03-01 17:20:27 +00:00
|
|
|
int i;
|
2008-01-24 21:13:08 +00:00
|
|
|
struct page *page;
|
|
|
|
int err;
|
|
|
|
int ret = 0;
|
2008-04-09 20:28:12 +00:00
|
|
|
int locked_pages = 0;
|
|
|
|
int all_uptodate = 1;
|
2018-03-01 17:20:27 +00:00
|
|
|
int num_pages;
|
2010-08-06 17:21:20 +00:00
|
|
|
unsigned long num_reads = 0;
|
2008-02-07 15:50:54 +00:00
|
|
|
struct bio *bio = NULL;
|
Btrfs: Add zlib compression support
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption nor the
'other' field is currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically single threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
|
|
|
unsigned long bio_flags = 0;
|
2008-02-07 15:50:54 +00:00
|
|
|
|
2009-02-04 14:25:08 +00:00
|
|
|
if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
|
2008-01-24 21:13:08 +00:00
|
|
|
return 0;
|
|
|
|
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(eb);
|
2016-09-02 19:40:03 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2011-06-10 12:06:53 +00:00
|
|
|
if (wait == WAIT_NONE) {
|
2008-08-07 15:19:43 +00:00
|
|
|
if (!trylock_page(page))
|
2008-04-09 20:28:12 +00:00
|
|
|
goto unlock_exit;
|
2008-01-24 21:13:08 +00:00
|
|
|
} else {
|
|
|
|
lock_page(page);
|
|
|
|
}
|
2008-04-09 20:28:12 +00:00
|
|
|
locked_pages++;
|
Btrfs: fix memory leak in reading btree blocks
We can read a btree block via readahead or an intentional read,
and we can end up with a memory leak when the following happens:
1) readahead starts to read block A but does not wait for read
completion,
2) btree_readpage_end_io_hook finds that block A is corrupted
and needs to clear the uptodate bit on all of block A's pages,
3) meanwhile an intentional read kicks in and checks the uptodate
bit of block A's pages to decide which pages need to be read,
4) some pages have the uptodate bit set during 3)'s check, so
3) doesn't count them for eb->io_pages, but they are later
cleared by 2), so we have to readpage on them anyway. We end up
with the wrong eb->io_pages, which results in a memory leak of
this block.
This fixes the problem by first locking all pages and
then checking the pages' uptodate bits.
t1(readahead)                       t2(readahead endio)                         t3(the following read)
read_extent_buffer_pages            end_bio_extent_readpage
  for pg in eb:                       for page 0,1,2 in eb:
    if pg is uptodate:                  btree_readpage_end_io_hook(pg)
      num_reads++                       if uptodate:
  eb->io_pages = num_reads                SetPageUptodate(pg)                   _______________
  for pg in eb:                       for page 3 in eb:                         read_extent_buffer_pages
    if pg is NOT uptodate:              btree_readpage_end_io_hook(pg)            for pg in eb:
      __extent_read_full_page(pg)         sanity check reports something wrong      if pg is uptodate:
                                          clear_extent_buffer_uptodate(eb)            num_reads++
                                            for pg in eb:                         eb->io_pages = num_reads
                                              ClearPageUptodate(page)           _______________
                                                                                  for pg in eb:
                                                                                    if pg is NOT uptodate:
                                                                                      __extent_read_full_page(pg)
So t3's eb->io_pages is not consistent with the number of pages it's reading,
and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative
number so that we're not able to free the eb.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
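The race described in the commit message above can be modeled deterministically in user-space C. This is a sketch under stated assumptions: `toy_page`, `toy_read_unlocked()`, `toy_read_locked()` and the interference hook are hypothetical names, and page locking is represented only by where the hook is allowed to run. The return value is the difference between the count stored in `eb->io_pages` and the number of pages that would actually be submitted for read.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
	bool uptodate;
};

/* Hypothetical hook standing in for the endio path clearing uptodate bits. */
typedef void (*toy_interfere_fn)(struct toy_page *pages, size_t n);

static int toy_count_not_uptodate(const struct toy_page *pages, size_t n)
{
	int count = 0;

	for (size_t i = 0; i < n; i++)
		if (!pages[i].uptodate)
			count++;
	return count;
}

/* Models clear_extent_buffer_uptodate() running on every page. */
static void toy_clear_all(struct toy_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].uptodate = false;
}

/* Pre-fix ordering: a window exists between counting and submitting. */
static int toy_read_unlocked(struct toy_page *pages, size_t n,
			     toy_interfere_fn interfere)
{
	int io_pages = toy_count_not_uptodate(pages, n);

	if (interfere)
		interfere(pages, n);	/* endio clears bits inside the window */
	return io_pages - toy_count_not_uptodate(pages, n);
}

/* Post-fix ordering: all pages are locked first, so the hook runs last. */
static int toy_read_locked(struct toy_page *pages, size_t n,
			   toy_interfere_fn interfere)
{
	int io_pages = toy_count_not_uptodate(pages, n);
	int submitted = toy_count_not_uptodate(pages, n);

	if (interfere)
		interfere(pages, n);	/* only possible after the pages unlock */
	return io_pages - submitted;
}
```

With three uptodate pages and one stale page, the unlocked variant counts one read but submits four once the bits are cleared, which is the mismatch that drives `eb->io_pages` negative during endio. The locked variant keeps both numbers equal.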
2016-08-03 19:33:01 +00:00
|
|
|
}
|
|
|
|
/*
|
|
|
|
* We need to firstly lock all pages to make sure that
|
|
|
|
* the uptodate bit of our pages won't be affected by
|
|
|
|
* clear_extent_buffer_uptodate().
|
|
|
|
*/
|
2016-09-02 19:40:03 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2016-08-03 19:33:01 +00:00
|
|
|
page = eb->pages[i];
|
2010-08-06 17:21:20 +00:00
|
|
|
if (!PageUptodate(page)) {
|
|
|
|
num_reads++;
|
2008-04-09 20:28:12 +00:00
|
|
|
all_uptodate = 0;
|
2010-08-06 17:21:20 +00:00
|
|
|
}
|
2008-04-09 20:28:12 +00:00
|
|
|
}
|
2016-08-03 19:33:01 +00:00
|
|
|
|
2008-04-09 20:28:12 +00:00
|
|
|
if (all_uptodate) {
|
2016-09-02 19:40:03 +00:00
|
|
|
set_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags);
|
2008-04-09 20:28:12 +00:00
|
|
|
goto unlock_exit;
|
|
|
|
}
|
|
|
|
|
Btrfs: be aware of btree inode write errors to avoid fs corruption
While we have a transaction ongoing, the VM might decide at any time
to call btree_inode->i_mapping->a_ops->writepages(), which will start
writeback of dirty pages belonging to btree nodes/leafs. This call
might return an error or the writeback might finish with an error
before we attempt to commit the running transaction. If this happens,
we might have no way of knowing that such error happened when we are
committing the transaction - because the pages might no longer be
marked dirty nor tagged for writeback (if a subsequent modification
to the extent buffer didn't happen before the transaction commit) which
makes filemap_fdata[write|wait]_range unable to find such pages (even
if they're marked with SetPageError).
So if this happens we must abort the transaction, otherwise we commit
a super block with btree roots that point to btree nodes/leafs whose
content on disk is invalid - either garbage or the content of some
node/leaf from a past generation that got cowed or deleted and is no
longer valid (for this latter case we end up getting error messages like
"parent transid verify failed on 10826481664 wanted 25748 found 29562"
when reading btree nodes/leafs from disk).
Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
i_mapping would not be enough because we need to distinguish between
log tree extents (not fatal) vs non-log tree extents (fatal) and
because the next call to filemap_fdatawait_range() will catch and clear
such errors in the mapping - and that call might be from a log sync and
not from a transaction commit, which means we would not know about the
error at transaction commit time. Also, checking for the eb flag
EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
not be completely reliable, as the eb might be removed from memory and
read back when trying to get it, which clears that flag right before
reading the eb's pages from disk, making us not know about the previous
write error.
Using the 3 new flags for the btree inode also achieves the goal of
AS_EIO/AS_ENOSPC in the case where writepages() returns success after
starting writeback for all dirty pages, but the writeback for all of
them has already finished with errors before filemap_fdatawait_range()
is called. Because we were not using AS_EIO/AS_ENOSPC,
filemap_fdatawait_range() would return success, as it could not know
that writeback errors happened (the pages were no longer tagged for
writeback).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2014-09-26 11:25:56 +00:00
|
|
|
clear_bit(EXTENT_BUFFER_READ_ERR, &eb->bflags);
|
2012-04-16 13:42:26 +00:00
|
|
|
eb->read_mirror = 0;
|
2012-03-13 13:38:00 +00:00
|
|
|
atomic_set(&eb->io_pages, num_reads);
|
btrfs: fix fatal extent_buffer readahead vs releasepage race
Under somewhat convoluted conditions, it is possible to attempt to
release an extent_buffer that is under io, which triggers a BUG_ON in
btrfs_release_extent_buffer_pages.
This relies on a few different factors. First, extent_buffer reads done
as readahead for searching use WAIT_NONE, so they free the local extent
buffer reference while the io is outstanding. However, they should still
be protected by TREE_REF. However, if the system is doing significant
reclaim, and simultaneously heavily accessing the extent_buffers, it is
possible for releasepage to race with two concurrent readahead attempts
in a way that leaves TREE_REF unset when the readahead extent buffer is
released.
Essentially, if two tasks race to allocate a new extent_buffer, but the
winner who attempts the first io is rebuffed by a page being locked
(likely by the reclaim itself) then the loser will still go ahead with
issuing the readahead. The loser's call to find_extent_buffer must also
race with the reclaim task reading the extent_buffer's refcount as 1 in
a way that allows the reclaim to re-clear the TREE_REF checked by
find_extent_buffer.
The following represents an example execution demonstrating the race:
CPU0, CPU1: reada_for_search
CPU0, CPU1: readahead_tree_block
CPU0, CPU1: find_create_tree_block
CPU0, CPU1: alloc_extent_buffer
CPU1: find_extent_buffer // not found
CPU1: allocates eb
CPU1: lock pages
CPU1: associate pages to eb
CPU1: insert eb into radix tree
CPU1: set TREE_REF, refs == 2
CPU1: unlock pages
CPU1: read_extent_buffer_pages // WAIT_NONE
CPU1: not uptodate (brand new eb)
CPU2: lock_page
CPU1: if !trylock_page
CPU1: goto unlock_exit // not an error
CPU1: free_extent_buffer
CPU1: release_extent_buffer
CPU1: atomic_dec_and_test refs to 1
CPU0: find_extent_buffer // found
CPU2: try_release_extent_buffer
CPU2: take refs_lock
CPU2: reads refs == 1; no io
CPU0: atomic_inc_not_zero refs to 2
CPU0: mark_buffer_accessed
CPU0: check_buffer_tree_ref
CPU0: // not STALE, won't take refs_lock
CPU0: refs == 2; TREE_REF set // no action
CPU0: read_extent_buffer_pages // WAIT_NONE
CPU2: clear TREE_REF
CPU2: release_extent_buffer
CPU2: atomic_dec_and_test refs to 1
CPU2: unlock_page
CPU0: still not uptodate (CPU1 read failed on trylock_page)
CPU0: locks pages
CPU0: set io_pages > 0
CPU0: submit io
CPU0: return
CPU0: free_extent_buffer
CPU0: release_extent_buffer
CPU0: dec refs to 0
CPU0: delete from radix tree
CPU0: btrfs_release_extent_buffer_pages
CPU0: BUG_ON(io_pages > 0)!!!
We observe this at a very low rate in production and were also able to
reproduce it in a test environment by introducing some spurious delays
and by introducing probabilistic trylock_page failures.
To fix it, we apply check_buffer_tree_ref at a point where it could not
possibly be unset by a competing task: after io_pages has been
incremented. All the codepaths that clear TREE_REF check for io, so they
would not be able to clear it after this point until the io is done.
Stack trace, for reference:
[1417839.424739] ------------[ cut here ]------------
[1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
[1417839.447024] invalid opcode: 0000 [#1] SMP
[1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
[1417839.517008] Code: ed e9 ...
[1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
[1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
[1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
[1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
[1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
[1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
[1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
[1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
[1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1417839.731320] Call Trace:
[1417839.737103] release_extent_buffer+0x39/0x90
[1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
[1417839.758645] btrfs_search_slot+0x260/0x9b0
[1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
[1417839.778427] btrfs_get_extent+0x15f/0x830
[1417839.787665] ? submit_extent_page+0xc4/0x1c0
[1417839.797474] ? __do_readpage+0x299/0x7a0
[1417839.806515] __do_readpage+0x33b/0x7a0
[1417839.815171] ? btrfs_releasepage+0x70/0x70
[1417839.824597] extent_readpages+0x28f/0x400
[1417839.833836] read_pages+0x6a/0x1c0
[1417839.841729] ? startup_64+0x2/0x30
[1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
[1417839.860590] filemap_fault+0x6c7/0x990
[1417839.869252] ? xas_load+0x8/0x80
[1417839.876756] ? xas_find+0x150/0x190
[1417839.884839] ? filemap_map_pages+0x295/0x3b0
[1417839.894652] __do_fault+0x32/0x110
[1417839.902540] __handle_mm_fault+0xacd/0x1000
[1417839.912156] handle_mm_fault+0xaa/0x1c0
[1417839.921004] __do_page_fault+0x242/0x4b0
[1417839.930044] ? page_fault+0x8/0x30
[1417839.937933] page_fault+0x1e/0x30
[1417839.945631] RIP: 0033:0x33c4bae
[1417839.952927] Code: Bad RIP value.
[1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
[1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
[1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
[1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
[1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
[1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-06-17 18:35:19 +00:00
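The interleaving in the trace above can be replayed in a small single-threaded model. This is only a sketch: struct eb_model, model_check_tree_ref() and refs_after_racy_readahead() are hypothetical names, the real code uses atomics and refs_lock, and the model keeps just the three pieces of state the commit message talks about (refs, io_pages, TREE_REF).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of the fields involved in the race. */
struct eb_model {
	int refs;
	int io_pages;
	bool tree_ref;
};

/* Models check_buffer_tree_ref(): re-take the tree reference if it is gone. */
static void model_check_tree_ref(struct eb_model *eb)
{
	if (eb->refs >= 2 && eb->tree_ref)
		return;
	if (!eb->tree_ref) {
		eb->tree_ref = true;
		eb->refs++;
	}
}

/*
 * Replays the interleaving from the trace and returns the reference count
 * left once CPU0 drops its reference while I/O is still in flight.
 * Zero means the buffer would be freed under I/O.
 */
int refs_after_racy_readahead(bool apply_fix)
{
	/* State after CPU1 gave up on trylock_page and dropped its ref. */
	struct eb_model eb = { .refs = 1, .io_pages = 0, .tree_ref = true };

	/* CPU2 (reclaim): sees refs == 1 and no I/O, decides to release. */
	bool reclaim_will_release = (eb.refs == 1 && eb.io_pages == 0);

	/* CPU0: find_extent_buffer() takes a reference; TREE_REF looks fine. */
	eb.refs++;
	model_check_tree_ref(&eb);

	/* CPU2: acts on its stale observation and drops the tree reference. */
	if (reclaim_will_release && eb.tree_ref) {
		eb.tree_ref = false;
		eb.refs--;
	}

	/* CPU0: starts the read. */
	eb.io_pages = 1;
	if (apply_fix)
		model_check_tree_ref(&eb);	/* re-check once io_pages is set */

	/* CPU0: WAIT_NONE readahead drops its local reference immediately. */
	eb.refs--;
	return eb.refs;
}

/* Small self-check that can be called from a test harness or main(). */
void race_model_self_check(void)
{
	assert(refs_after_racy_readahead(false) == 0);
	assert(refs_after_racy_readahead(true) == 1);
}
```

Without the extra check the count reaches zero while io_pages is still positive, which corresponds to the BUG_ON in the trace. With the check placed after io_pages is set, the tree reference is restored and the buffer survives until the I/O completes.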
|
|
|
/*
|
|
|
|
* It is possible for releasepage to clear the TREE_REF bit before we
|
|
|
|
* set io_pages. See check_buffer_tree_ref for a more detailed comment.
|
|
|
|
*/
|
|
|
|
check_buffer_tree_ref(eb);
|
2016-09-02 19:40:03 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2016-07-11 17:39:07 +00:00
|
|
|
|
2008-04-09 20:28:12 +00:00
|
|
|
if (!PageUptodate(page)) {
|
2016-07-11 17:39:07 +00:00
|
|
|
if (ret) {
|
|
|
|
atomic_dec(&eb->io_pages);
|
|
|
|
unlock_page(page);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2008-04-09 20:28:12 +00:00
|
|
|
ClearPageError(page);
|
2020-09-14 09:37:04 +00:00
|
|
|
err = submit_extent_page(REQ_OP_READ | REQ_META, NULL,
|
|
|
|
page, page_offset(page), PAGE_SIZE, 0,
|
|
|
|
&bio, end_bio_extent_readpage,
|
|
|
|
mirror_num, 0, 0, false);
|
2016-07-11 17:39:07 +00:00
|
|
|
if (err) {
|
|
|
|
/*
|
2020-09-14 09:37:04 +00:00
|
|
|
* We failed to submit the bio so it's the
|
|
|
|
* caller's responsibility to perform cleanup
|
|
|
|
* i.e. unlock page/set error bit.
|
2016-07-11 17:39:07 +00:00
|
|
|
*/
|
2020-09-14 09:37:04 +00:00
|
|
|
ret = err;
|
|
|
|
SetPageError(page);
|
|
|
|
unlock_page(page);
|
2016-07-11 17:39:07 +00:00
|
|
|
atomic_dec(&eb->io_pages);
|
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
} else {
|
|
|
|
unlock_page(page);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-10-04 03:23:14 +00:00
|
|
|
if (bio) {
|
2016-06-05 19:31:51 +00:00
|
|
|
err = submit_one_bio(bio, mirror_num, bio_flags);
|
2012-03-12 15:03:00 +00:00
|
|
|
if (err)
|
|
|
|
return err;
|
2011-10-04 03:23:14 +00:00
|
|
|
}
|
2008-02-07 15:50:54 +00:00
|
|
|
|
2011-06-10 12:06:53 +00:00
|
|
|
if (ret || wait != WAIT_COMPLETE)
|
2008-01-24 21:13:08 +00:00
|
|
|
return ret;
|
2009-01-06 02:25:51 +00:00
|
|
|
|
2016-09-02 19:40:03 +00:00
|
|
|
for (i = 0; i < num_pages; i++) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
wait_on_page_locked(page);
|
2009-01-06 02:25:51 +00:00
|
|
|
if (!PageUptodate(page))
|
2008-01-24 21:13:08 +00:00
|
|
|
ret = -EIO;
|
|
|
|
}
|
2009-01-06 02:25:51 +00:00
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
return ret;
|
2008-04-09 20:28:12 +00:00
|
|
|
|
|
|
|
unlock_exit:
|
2009-01-06 02:25:51 +00:00
|
|
|
while (locked_pages > 0) {
|
2008-04-09 20:28:12 +00:00
|
|
|
locked_pages--;
|
2016-09-02 19:40:03 +00:00
|
|
|
page = eb->pages[locked_pages];
|
|
|
|
unlock_page(page);
|
2008-04-09 20:28:12 +00:00
|
|
|
}
|
|
|
|
return ret;
|
2008-01-24 21:13:08 +00:00
|
|
|
}
|
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len checks for extent buffer read/write functions (e.g.
read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len.
While for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanism
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Still modify memory if the request is obviously wrong
In read_extent_buffer(), even if we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause a memory access error
if start + len overflows.
To address the above problems, this patch creates a new common function to
check such accesses, check_eb_range().
- Add overflow check
This function checks start and start + len against eb->len, and also
checks for overflow.
- Unified checks
- Unified error reports
Will always print a btrfs_warn() message, and will additionally call
WARN() if CONFIG_BTRFS_DEBUG is configured.
- Exit ASAP if check fails
No more possible memory corruption.
- Add extra comment for @start @len used in those functions as it's
sometimes confused with the logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
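The overflow problem described in the commit message above can be reproduced in user space. The following is a minimal sketch, not the kernel code: eb_range_ok() and naive_range_ok() are hypothetical names, and the GCC/Clang builtin __builtin_add_overflow() stands in for the kernel's check_add_overflow().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the unified range check. Returns true when
 * [start, start + len) lies inside a buffer of eb_len bytes.
 */
static bool eb_range_ok(unsigned long eb_len, unsigned long start,
			unsigned long len)
{
	unsigned long end;

	/* Reject if start + len wraps around or runs past the buffer. */
	if (__builtin_add_overflow(start, len, &end) || end > eb_len)
		return false;
	return true;
}

/* The older style of check: no overflow detection at all. */
static bool naive_range_ok(unsigned long eb_len, unsigned long start,
			   unsigned long len)
{
	return !(start + len > eb_len);
}

/* Small self-check that can be called from a test harness or main(). */
void eb_range_self_check(void)
{
	const unsigned long eb_len = 16384;

	/* start = 1024, len = -1024 wraps to 0 and fools the naive check. */
	assert(naive_range_ok(eb_len, 1024, (unsigned long)-1024));
	assert(!eb_range_ok(eb_len, 1024, (unsigned long)-1024));

	/* An in-range access passes. */
	assert(eb_range_ok(eb_len, 1024, 1024));

	/* The end of the range may equal eb_len, but not exceed it. */
	assert(eb_range_ok(eb_len, 0, eb_len));
	assert(!eb_range_ok(eb_len, 1, eb_len));
}
```

The first pair of assertions is the "start = 1024 len = -1024" case from the commit message: the sum wraps to zero, so only the overflow-aware check rejects it.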
|
|
|
static bool report_eb_range(const struct extent_buffer *eb, unsigned long start,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
|
|
|
btrfs_warn(eb->fs_info,
|
|
|
|
"access to eb bytenr %llu len %lu out of range start %lu len %lu",
|
|
|
|
eb->start, eb->len, start, len);
|
|
|
|
WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if the [start, start + len) range is valid before reading/writing
|
|
|
|
* the eb.
|
|
|
|
* NOTE: @start and @len are offset inside the eb, not logical address.
|
|
|
|
*
|
|
|
|
* Caller should not touch the dst/src memory if this function returns error.
|
|
|
|
*/
|
|
|
|
static inline int check_eb_range(const struct extent_buffer *eb,
|
|
|
|
unsigned long start, unsigned long len)
|
|
|
|
{
|
|
|
|
unsigned long offset;
|
|
|
|
|
|
|
|
/* start, start + len should not go beyond eb->len nor overflow */
|
|
|
|
if (unlikely(check_add_overflow(start, len, &offset) || offset > eb->len))
|
|
|
|
return report_eb_range(eb, start, len);
|
|
|
|
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
2017-06-29 03:56:53 +00:00
|
|
|
void read_extent_buffer(const struct extent_buffer *eb, void *dstv,
|
|
|
|
unsigned long start, unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
struct page *page;
|
|
|
|
char *kaddr;
|
|
|
|
char *dst = (char *)dstv;
|
2020-04-29 21:41:20 +00:00
|
|
|
unsigned long i = start >> PAGE_SHIFT;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(eb, start, len))
|
Btrfs: fix out of bounds array access while reading extent buffer
There is a corner case that slips through the checkers in functions
reading an extent buffer, i.e.
if (start < eb->len) and (start + len > eb->len),
then
a) map_private_extent_buffer() returns immediately because
it's thinking the range spans across two pages,
b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
and WARN_ON(start + len > eb->start + eb->len), both are OK in this
corner case, but it'd actually try to access the eb->pages out of
bounds because of (start + len > eb->len).
The case is found by switching extent inline ref type from shared data
ref to non-shared data ref, which is a kind of metadata corruption.
It'd use the wrong helper to access the eb,
e.g. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passed
here is "struct btrfs_shared_data_ref". And if the extent item
happens to be the first item in the eb, then offset/length will go
over eb->len, which ends up as an invalid memory access.
This adds proper checks in order to avoid invalid memory access,
i.e. 'general protection fault', before it's too late.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-08-09 17:10:16 +00:00
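The corner case from the commit message above can be checked with concrete numbers. This is a sketch under stated assumptions: struct eb_model, old_checks_warn() and last_page_index() are hypothetical stand-ins, the two conditions are the WARN_ON() expressions quoted in the message, and a 16 KiB buffer with 4 KiB pages (page shift 12) is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few extent_buffer fields used here. */
struct eb_model {
	uint64_t start;		/* logical address of the buffer */
	unsigned long len;	/* size of the buffer in bytes */
};

/* The two old WARN_ON() conditions: true means "would have warned". */
static bool old_checks_warn(const struct eb_model *eb, unsigned long start,
			    unsigned long len)
{
	return start > eb->len || start + len > eb->start + eb->len;
}

/* Index of the last page touched by [start, start + len), for len > 0. */
static unsigned long last_page_index(unsigned long start, unsigned long len,
				     unsigned int page_shift)
{
	return (start + len - 1) >> page_shift;
}

/* Small self-check that can be called from a test harness or main(). */
void eb_corner_case_self_check(void)
{
	const struct eb_model eb = { .start = 10826481664ULL, .len = 16384 };
	const unsigned long start = 16000, len = 1000;
	const unsigned long num_pages = eb.len >> 12;	/* 4 pages of 4 KiB */

	/* start < eb.len, so neither old condition fires ... */
	assert(!old_checks_warn(&eb, start, len));
	/* ... although the range runs past the end of the buffer ... */
	assert(start + len > eb.len);
	/* ... and the copy loop would index pages[4] of a 4-entry array. */
	assert(last_page_index(start, len, 12) == 4);
	assert(last_page_index(start, len, 12) >= num_pages);
}
```

The second old condition compares an offset inside the buffer against a logical address, so it practically never fires for a large eb->start; that is why the range slips through.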
|
|
|
return;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-04-29 21:41:20 +00:00
|
|
|
offset = offset_in_page(start);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
|
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
ago with the promise that one day it would be possible to implement the
page cache with bigger chunks than PAGE_SIZE.
This promise never materialized, and it is unlikely it ever will.
We have many places where PAGE_CACHE_SIZE is assumed to be equal to
PAGE_SIZE. And it's a constant source of confusion on whether a
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is a revert of the changes to the
PAGE_CACHE_ALIGN definition: we are going to drop it later.
There are a few places in the code that coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation will
also be addressed in a separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-01 12:29:47 +00:00
|
|
|
cur = min(len, (PAGE_SIZE - offset));
|
2011-07-19 16:04:14 +00:00
|
|
|
kaddr = page_address(page);
|
2008-01-24 21:13:08 +00:00
|
|
|
memcpy(dst, kaddr + offset, cur);
|
|
|
|
|
|
|
|
dst += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-08-10 15:42:27 +00:00
|
|
|
int read_extent_buffer_to_user_nofault(const struct extent_buffer *eb,
|
|
|
|
void __user *dstv,
|
|
|
|
unsigned long start, unsigned long len)
|
2014-01-30 15:24:01 +00:00
|
|
|
{
|
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
struct page *page;
|
|
|
|
char *kaddr;
|
|
|
|
char __user *dst = (char __user *)dstv;
|
2020-04-29 21:41:20 +00:00
|
|
|
unsigned long i = start >> PAGE_SHIFT;
|
2014-01-30 15:24:01 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
|
|
|
WARN_ON(start > eb->len);
|
|
|
|
WARN_ON(start + len > eb->start + eb->len);
|
|
|
|
|
2020-04-29 21:41:20 +00:00
|
|
|
offset = offset_in_page(start);
|
2014-01-30 15:24:01 +00:00
|
|
|
|
|
|
|
while (len > 0) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2014-01-30 15:24:01 +00:00
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
cur = min(len, (PAGE_SIZE - offset));
|
2014-01-30 15:24:01 +00:00
|
|
|
kaddr = page_address(page);
|
2020-08-10 15:42:27 +00:00
|
|
|
if (copy_to_user_nofault(dst, kaddr + offset, cur)) {
|
2014-01-30 15:24:01 +00:00
|
|
|
ret = -EFAULT;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
dst += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2017-06-29 03:56:53 +00:00
|
|
|
int memcmp_extent_buffer(const struct extent_buffer *eb, const void *ptrv,
|
|
|
|
unsigned long start, unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
struct page *page;
|
|
|
|
char *kaddr;
|
|
|
|
char *ptr = (char *)ptrv;
|
2020-04-29 21:41:20 +00:00
|
|
|
unsigned long i = start >> PAGE_SHIFT;
|
2008-01-24 21:13:08 +00:00
|
|
|
int ret = 0;
|
|
|
|
|
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(eb, start, len))
|
|
|
|
return -EINVAL;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-04-29 21:41:20 +00:00
|
|
|
offset = offset_in_page(start);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
cur = min(len, (PAGE_SIZE - offset));
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2011-07-19 16:04:14 +00:00
|
|
|
kaddr = page_address(page);
|
2008-01-24 21:13:08 +00:00
|
|
|
ret = memcmp(ptr, kaddr + offset, cur);
|
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
|
|
|
|
ptr += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void write_extent_buffer_chunk_tree_uuid(const struct extent_buffer *eb,
|
2016-11-09 16:43:38 +00:00
|
|
|
const void *srcv)
|
|
|
|
{
|
|
|
|
char *kaddr;
|
|
|
|
|
|
|
|
WARN_ON(!PageUptodate(eb->pages[0]));
|
|
|
|
kaddr = page_address(eb->pages[0]);
|
|
|
|
memcpy(kaddr + offsetof(struct btrfs_header, chunk_tree_uuid), srcv,
|
|
|
|
BTRFS_FSID_SIZE);
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void write_extent_buffer_fsid(const struct extent_buffer *eb, const void *srcv)
|
2016-11-09 16:43:38 +00:00
|
|
|
{
|
|
|
|
char *kaddr;
|
|
|
|
|
|
|
|
WARN_ON(!PageUptodate(eb->pages[0]));
|
|
|
|
kaddr = page_address(eb->pages[0]);
|
|
|
|
memcpy(kaddr + offsetof(struct btrfs_header, fsid), srcv,
|
|
|
|
BTRFS_FSID_SIZE);
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void write_extent_buffer(const struct extent_buffer *eb, const void *srcv,
|
2008-01-24 21:13:08 +00:00
|
|
|
unsigned long start, unsigned long len)
|
|
|
|
{
|
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
struct page *page;
|
|
|
|
char *kaddr;
|
|
|
|
char *src = (char *)srcv;
|
2020-04-29 21:41:20 +00:00
|
|
|
unsigned long i = start >> PAGE_SHIFT;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(eb, start, len))
|
|
|
|
return;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-04-29 21:41:20 +00:00
|
|
|
offset = offset_in_page(start);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
cur = min(len, PAGE_SIZE - offset);
|
2011-07-19 16:04:14 +00:00
|
|
|
kaddr = page_address(page);
|
2008-01-24 21:13:08 +00:00
|
|
|
memcpy(kaddr + offset, src, cur);
|
|
|
|
|
|
|
|
src += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
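The write loop above walks the buffer one page at a time: it derives the first page index and the offset inside that page from the byte offset, then copies at most the remainder of the current page on each iteration. A minimal user-space sketch of the same pattern follows. The names demo_write_pages, DEMO_PAGE_SHIFT and DEMO_PAGE_SIZE are hypothetical stand-ins, the 4 KiB page size is an assumption, and plain char arrays replace struct page; it is an illustration, not the kernel implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's PAGE_SHIFT / PAGE_SIZE (assumed 4 KiB). */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  ((size_t)1 << DEMO_PAGE_SHIFT)

/*
 * Copy len bytes from srcv into a buffer made of separately allocated
 * pages, starting at byte offset start. The caller is assumed to have
 * validated that [start, start + len) lies inside the buffer.
 */
static void demo_write_pages(char **pages, const void *srcv,
                             size_t start, size_t len)
{
    const char *src = srcv;
    size_t i = start >> DEMO_PAGE_SHIFT;          /* index of the first page */
    size_t offset = start & (DEMO_PAGE_SIZE - 1); /* offset inside that page */

    while (len > 0) {
        /* Never copy past the end of the current page. */
        size_t cur = DEMO_PAGE_SIZE - offset;

        if (cur > len)
            cur = len;
        memcpy(pages[i] + offset, src, cur);

        src += cur;
        len -= cur;
        offset = 0; /* every page after the first starts at offset 0 */
        i++;
    }
}
```

Only the first iteration can start in the middle of a page, which is why offset is reset to 0 after each copy.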
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void memzero_extent_buffer(const struct extent_buffer *eb, unsigned long start,
|
2016-11-08 17:09:03 +00:00
|
|
|
unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
struct page *page;
|
|
|
|
char *kaddr;
|
2020-04-29 21:41:20 +00:00
|
|
|
unsigned long i = start >> PAGE_SHIFT;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len checks for the extent buffer read/write functions
(e.g. read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024 len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len,
while for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanisms
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Memory is still modified if the request is obviously wrong
In read_extent_buffer(), even if we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause a memory access error
if start + len overflows.
To address the above problems, this patch creates a new common function to
check such accesses, check_eb_range().
- Add overflow check
This function checks start and start + len against eb->len, and also
checks for overflow.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured,
and also emit a btrfs_warn() message for non-debug builds.
- Exit ASAP if the check fails
No more possible memory corruption.
- Add an extra comment for @start and @len used in those functions, as they
are sometimes confused with logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(eb, start, len))
|
|
|
|
return;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-04-29 21:41:20 +00:00
|
|
|
offset = offset_in_page(start);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = eb->pages[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
cur = min(len, PAGE_SIZE - offset);
|
2011-07-19 16:04:14 +00:00
|
|
|
kaddr = page_address(page);
|
2016-11-08 17:09:03 +00:00
|
|
|
memset(kaddr + offset, 0, cur);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
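Both functions above return early when check_eb_range() reports a bad range. The commit message attached to that check describes its intent: reject a range whose end lies past the buffer, and also reject one where start + len wraps around. The sketch below illustrates that idea in user space under stated assumptions. demo_range_invalid is a hypothetical name, and __builtin_add_overflow is the GCC/Clang builtin (the commit note "use check_add_overflow" refers to a kernel helper which, to my understanding, wraps the same builtin). It is not the kernel function itself.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true when [start, start + len) is NOT fully inside a buffer of
 * buf_len bytes, including the case where start + len wraps around.
 * demo_range_invalid() is a hypothetical name used for illustration.
 */
static bool demo_range_invalid(size_t buf_len, size_t start, size_t len)
{
    size_t end;

    /* GCC/Clang builtin: returns true if the addition wrapped around. */
    if (__builtin_add_overflow(start, len, &end))
        return true;

    return end > buf_len;
}
```

With start = 1024 and len = (size_t)-1024, the sum wraps to 0, so a plain (start + len) > buf_len comparison would accept the request; the overflow test rejects it.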
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void copy_extent_buffer_full(const struct extent_buffer *dst,
|
|
|
|
const struct extent_buffer *src)
|
2016-11-08 17:30:31 +00:00
|
|
|
{
|
|
|
|
int i;
|
2018-03-01 17:20:27 +00:00
|
|
|
int num_pages;
|
2016-11-08 17:30:31 +00:00
|
|
|
|
|
|
|
ASSERT(dst->len == src->len);
|
|
|
|
|
2018-06-29 08:56:49 +00:00
|
|
|
num_pages = num_extent_pages(dst);
|
2016-11-08 17:30:31 +00:00
|
|
|
for (i = 0; i < num_pages; i++)
|
|
|
|
copy_page(page_address(dst->pages[i]),
|
|
|
|
page_address(src->pages[i]));
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void copy_extent_buffer(const struct extent_buffer *dst,
|
|
|
|
const struct extent_buffer *src,
|
2008-01-24 21:13:08 +00:00
|
|
|
unsigned long dst_offset, unsigned long src_offset,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
|
|
|
u64 dst_len = dst->len;
|
|
|
|
size_t cur;
|
|
|
|
size_t offset;
|
|
|
|
struct page *page;
|
|
|
|
char *kaddr;
|
2020-04-29 21:41:20 +00:00
|
|
|
unsigned long i = dst_offset >> PAGE_SHIFT;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(dst, dst_offset, len) ||
|
|
|
|
check_eb_range(src, src_offset, len))
|
|
|
|
return;
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
WARN_ON(src->len != dst_len);
|
|
|
|
|
2020-04-29 21:41:20 +00:00
|
|
|
offset = offset_in_page(dst_offset);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
2014-07-30 23:03:53 +00:00
|
|
|
page = dst->pages[i];
|
2008-01-24 21:13:08 +00:00
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
cur = min(len, (unsigned long)(PAGE_SIZE - offset));
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2011-07-19 16:04:14 +00:00
|
|
|
kaddr = page_address(page);
|
2008-01-24 21:13:08 +00:00
|
|
|
read_extent_buffer(src, kaddr + offset, src_offset, cur);
|
|
|
|
|
|
|
|
src_offset += cur;
|
|
|
|
len -= cur;
|
|
|
|
offset = 0;
|
|
|
|
i++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-09-30 03:50:30 +00:00
|
|
|
/*
|
|
|
|
* eb_bitmap_offset() - calculate the page and offset of the byte containing the
|
|
|
|
* given bit number
|
|
|
|
* @eb: the extent buffer
|
|
|
|
* @start: offset of the bitmap item in the extent buffer
|
|
|
|
* @nr: bit number
|
|
|
|
* @page_index: return index of the page in the extent buffer that contains the
|
|
|
|
* given bit number
|
|
|
|
* @page_offset: return offset into the page given by page_index
|
|
|
|
*
|
|
|
|
* This helper hides the ugliness of finding the byte in an extent buffer which
|
|
|
|
* contains a given bit.
|
|
|
|
*/
|
2020-04-29 01:04:10 +00:00
|
|
|
static inline void eb_bitmap_offset(const struct extent_buffer *eb,
|
2015-09-30 03:50:30 +00:00
|
|
|
unsigned long start, unsigned long nr,
|
|
|
|
unsigned long *page_index,
|
|
|
|
size_t *page_offset)
|
|
|
|
{
|
|
|
|
size_t byte_offset = BIT_BYTE(nr);
|
|
|
|
size_t offset;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The byte we want is the offset of the extent buffer + the offset of
|
|
|
|
* the bitmap item in the extent buffer + the offset of the byte in the
|
|
|
|
* bitmap item.
|
|
|
|
*/
|
2020-04-29 21:41:20 +00:00
|
|
|
offset = start + byte_offset;
|
2015-09-30 03:50:30 +00:00
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
*page_index = offset >> PAGE_SHIFT;
|
2018-12-05 14:23:03 +00:00
|
|
|
*page_offset = offset_in_page(offset);
|
2015-09-30 03:50:30 +00:00
|
|
|
}
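The helper above turns a bit number into a byte offset and then splits that offset into a page index and an offset within the page. The arithmetic can be tried in isolation with the user-space sketch below. demo_bitmap_offset, DEMO_PAGE_SHIFT and DEMO_PAGE_SIZE are hypothetical names, and a 4 KiB page is assumed.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for PAGE_SHIFT / PAGE_SIZE (assumed 4 KiB). */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  ((size_t)1 << DEMO_PAGE_SHIFT)

/*
 * start: byte offset of the bitmap inside the buffer
 * nr:    bit number inside the bitmap
 * Outputs the page that holds the byte containing bit nr, and the
 * offset of that byte inside the page.
 */
static void demo_bitmap_offset(size_t start, size_t nr,
                               size_t *page_index, size_t *page_offset)
{
    size_t offset = start + nr / CHAR_BIT; /* byte holding bit nr */

    *page_index = offset >> DEMO_PAGE_SHIFT;
    *page_offset = offset & (DEMO_PAGE_SIZE - 1);
}
```

For example, a bitmap starting at byte 4000 with bit number 800 resolves to byte 4100, which is page 1 at offset 4 under the assumed page size.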
|
|
|
|
|
|
|
|
/**
|
|
|
|
* extent_buffer_test_bit - determine whether a bit in a bitmap item is set
|
|
|
|
* @eb: the extent buffer
|
|
|
|
* @start: offset of the bitmap item in the extent buffer
|
|
|
|
* @nr: bit number to test
|
|
|
|
*/
|
2020-04-29 01:04:10 +00:00
|
|
|
int extent_buffer_test_bit(const struct extent_buffer *eb, unsigned long start,
|
2015-09-30 03:50:30 +00:00
|
|
|
unsigned long nr)
|
|
|
|
{
|
2016-09-23 00:24:20 +00:00
|
|
|
u8 *kaddr;
|
2015-09-30 03:50:30 +00:00
|
|
|
struct page *page;
|
|
|
|
unsigned long i;
|
|
|
|
size_t offset;
|
|
|
|
|
|
|
|
eb_bitmap_offset(eb, start, nr, &i, &offset);
|
|
|
|
page = eb->pages[i];
|
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
kaddr = page_address(page);
|
|
|
|
return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* extent_buffer_bitmap_set - set an area of a bitmap
|
|
|
|
* @eb: the extent buffer
|
|
|
|
* @start: offset of the bitmap item in the extent buffer
|
|
|
|
* @pos: bit number of the first bit
|
|
|
|
* @len: number of bits to set
|
|
|
|
*/
|
2020-04-29 01:04:10 +00:00
|
|
|
void extent_buffer_bitmap_set(const struct extent_buffer *eb, unsigned long start,
|
2015-09-30 03:50:30 +00:00
|
|
|
unsigned long pos, unsigned long len)
|
|
|
|
{
|
2016-09-23 00:24:20 +00:00
|
|
|
u8 *kaddr;
|
2015-09-30 03:50:30 +00:00
|
|
|
struct page *page;
|
|
|
|
unsigned long i;
|
|
|
|
size_t offset;
|
|
|
|
const unsigned int size = pos + len;
|
|
|
|
int bits_to_set = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
|
2016-09-23 00:24:20 +00:00
|
|
|
u8 mask_to_set = BITMAP_FIRST_BYTE_MASK(pos);
|
2015-09-30 03:50:30 +00:00
|
|
|
|
|
|
|
eb_bitmap_offset(eb, start, pos, &i, &offset);
|
|
|
|
page = eb->pages[i];
|
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
kaddr = page_address(page);
|
|
|
|
|
|
|
|
while (len >= bits_to_set) {
|
|
|
|
kaddr[offset] |= mask_to_set;
|
|
|
|
len -= bits_to_set;
|
|
|
|
bits_to_set = BITS_PER_BYTE;
|
2016-10-12 08:33:21 +00:00
|
|
|
mask_to_set = ~0;
|
2016-04-01 12:29:47 +00:00
|
|
|
if (++offset >= PAGE_SIZE && len > 0) {
|
2015-09-30 03:50:30 +00:00
|
|
|
offset = 0;
|
|
|
|
page = eb->pages[++i];
|
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
kaddr = page_address(page);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (len) {
|
|
|
|
mask_to_set &= BITMAP_LAST_BYTE_MASK(size);
|
|
|
|
kaddr[offset] |= mask_to_set;
|
|
|
|
}
|
|
|
|
}
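The bit-setting loop above handles a possibly partial first byte, then whole bytes, then a possibly partial last byte. The sketch below reproduces that structure on a flat byte array so the masking can be followed without the page handling. The DEMO_* macros are my own definitions, written to mirror what BITMAP_FIRST_BYTE_MASK and BITMAP_LAST_BYTE_MASK appear to compute; treat them as assumptions rather than the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Bits from position (pos % 8) up to bit 7 of a byte (assumed semantics). */
#define DEMO_FIRST_BYTE_MASK(pos) ((uint8_t)(0xffu << ((pos) & 7u)))
/* Bits from bit 0 up to the last bit before end within its byte (assumed semantics). */
#define DEMO_LAST_BYTE_MASK(end)  ((uint8_t)(0xffu >> ((0u - (end)) & 7u)))

/* Set len bits starting at bit pos in a flat, byte-addressed bitmap. */
static void demo_bitmap_set(uint8_t *map, size_t pos, size_t len)
{
    const size_t end = pos + len;
    size_t byte = pos / 8;
    size_t bits_to_set = 8 - (pos % 8); /* bits left in the first byte */
    uint8_t mask = DEMO_FIRST_BYTE_MASK(pos);

    /* First byte (possibly partial), then whole bytes. */
    while (len >= bits_to_set) {
        map[byte++] |= mask;
        len -= bits_to_set;
        bits_to_set = 8;
        mask = 0xff;
    }
    /* Trailing partial byte: trim the mask at the end position. */
    if (len) {
        mask &= DEMO_LAST_BYTE_MASK(end);
        map[byte] |= mask;
    }
}
```

When the whole range fits inside one byte, the loop body never runs and the final branch intersects the first-byte and last-byte masks, which yields exactly the requested bits.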
|
|
|
|
|
|
|
|
|
|
|
|
/**
|
|
|
|
* extent_buffer_bitmap_clear - clear an area of a bitmap
|
|
|
|
* @eb: the extent buffer
|
|
|
|
* @start: offset of the bitmap item in the extent buffer
|
|
|
|
* @pos: bit number of the first bit
|
|
|
|
* @len: number of bits to clear
|
|
|
|
*/
|
2020-04-29 01:04:10 +00:00
|
|
|
void extent_buffer_bitmap_clear(const struct extent_buffer *eb,
|
|
|
|
unsigned long start, unsigned long pos,
|
|
|
|
unsigned long len)
|
2015-09-30 03:50:30 +00:00
|
|
|
{
|
2016-09-23 00:24:20 +00:00
|
|
|
u8 *kaddr;
|
2015-09-30 03:50:30 +00:00
|
|
|
struct page *page;
|
|
|
|
unsigned long i;
|
|
|
|
size_t offset;
|
|
|
|
const unsigned int size = pos + len;
|
|
|
|
int bits_to_clear = BITS_PER_BYTE - (pos % BITS_PER_BYTE);
|
2016-09-23 00:24:20 +00:00
|
|
|
u8 mask_to_clear = BITMAP_FIRST_BYTE_MASK(pos);
|
2015-09-30 03:50:30 +00:00
|
|
|
|
|
|
|
eb_bitmap_offset(eb, start, pos, &i, &offset);
|
|
|
|
page = eb->pages[i];
|
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
kaddr = page_address(page);
|
|
|
|
|
|
|
|
while (len >= bits_to_clear) {
|
|
|
|
kaddr[offset] &= ~mask_to_clear;
|
|
|
|
len -= bits_to_clear;
|
|
|
|
bits_to_clear = BITS_PER_BYTE;
|
2016-10-12 08:33:21 +00:00
|
|
|
mask_to_clear = ~0;
|
2016-04-01 12:29:47 +00:00
|
|
|
if (++offset >= PAGE_SIZE && len > 0) {
|
2015-09-30 03:50:30 +00:00
|
|
|
offset = 0;
|
|
|
|
page = eb->pages[++i];
|
|
|
|
WARN_ON(!PageUptodate(page));
|
|
|
|
kaddr = page_address(page);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (len) {
|
|
|
|
mask_to_clear &= BITMAP_LAST_BYTE_MASK(size);
|
|
|
|
kaddr[offset] &= ~mask_to_clear;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2011-04-11 21:52:52 +00:00
|
|
|
static inline bool areas_overlap(unsigned long src, unsigned long dst, unsigned long len)
|
|
|
|
{
|
|
|
|
unsigned long distance = (src > dst) ? src - dst : dst - src;
|
|
|
|
return distance < len;
|
|
|
|
}
|
|
|
|
|
2008-01-24 21:13:08 +00:00
|
|
|
static void copy_pages(struct page *dst_page, struct page *src_page,
|
|
|
|
unsigned long dst_off, unsigned long src_off,
|
|
|
|
unsigned long len)
|
|
|
|
{
|
2011-07-19 16:04:14 +00:00
|
|
|
char *dst_kaddr = page_address(dst_page);
|
2008-01-24 21:13:08 +00:00
|
|
|
char *src_kaddr;
|
2010-08-06 17:21:20 +00:00
|
|
|
int must_memmove = 0;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2011-04-11 21:52:52 +00:00
|
|
|
if (dst_page != src_page) {
|
2011-07-19 16:04:14 +00:00
|
|
|
src_kaddr = page_address(src_page);
|
2011-04-11 21:52:52 +00:00
|
|
|
} else {
|
2008-01-24 21:13:08 +00:00
|
|
|
src_kaddr = dst_kaddr;
|
2010-08-06 17:21:20 +00:00
|
|
|
if (areas_overlap(src_off, dst_off, len))
|
|
|
|
must_memmove = 1;
|
2011-04-11 21:52:52 +00:00
|
|
|
}
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2010-08-06 17:21:20 +00:00
|
|
|
if (must_memmove)
|
|
|
|
memmove(dst_kaddr + dst_off, src_kaddr + src_off, len);
|
|
|
|
else
|
|
|
|
memcpy(dst_kaddr + dst_off, src_kaddr + src_off, len);
|
2008-01-24 21:13:08 +00:00
|
|
|
}
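copy_pages() above uses memmove() only when source and destination are on the same page and the two ranges overlap; otherwise it uses memcpy(). The user-space sketch below shows the same decision on a single buffer. demo_areas_overlap and demo_copy_within are hypothetical names for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Two ranges of equal length overlap when their starts are closer than len. */
static bool demo_areas_overlap(size_t src, size_t dst, size_t len)
{
    size_t distance = (src > dst) ? src - dst : dst - src;

    return distance < len;
}

/* Copy len bytes inside one buffer, choosing memmove() only when required. */
static void demo_copy_within(char *buf, size_t dst_off, size_t src_off,
                             size_t len)
{
    if (demo_areas_overlap(src_off, dst_off, len))
        memmove(buf + dst_off, buf + src_off, len);
    else
        memcpy(buf + dst_off, buf + src_off, len);
}
```

The distinction matters because memcpy() has undefined behavior for overlapping ranges, while memmove() is specified to handle them.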
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void memcpy_extent_buffer(const struct extent_buffer *dst,
|
|
|
|
unsigned long dst_offset, unsigned long src_offset,
|
|
|
|
unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
size_t cur;
|
|
|
|
size_t dst_off_in_page;
|
|
|
|
size_t src_off_in_page;
|
|
|
|
unsigned long dst_i;
|
|
|
|
unsigned long src_i;
|
|
|
|
|
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(dst, dst_offset, len) ||
|
|
|
|
check_eb_range(dst, src_offset, len))
|
|
|
|
return;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
2020-04-29 21:41:20 +00:00
|
|
|
dst_off_in_page = offset_in_page(dst_offset);
|
|
|
|
src_off_in_page = offset_in_page(src_offset);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-04-29 21:41:20 +00:00
|
|
|
dst_i = dst_offset >> PAGE_SHIFT;
|
|
|
|
src_i = src_offset >> PAGE_SHIFT;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2016-04-01 12:29:47 +00:00
|
|
|
cur = min(len, (unsigned long)(PAGE_SIZE -
|
2008-01-24 21:13:08 +00:00
|
|
|
src_off_in_page));
|
|
|
|
cur = min_t(unsigned long, cur,
|
2016-04-01 12:29:47 +00:00
|
|
|
(unsigned long)(PAGE_SIZE - dst_off_in_page));
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2014-07-30 23:03:53 +00:00
|
|
|
copy_pages(dst->pages[dst_i], dst->pages[src_i],
|
2008-01-24 21:13:08 +00:00
|
|
|
dst_off_in_page, src_off_in_page, cur);
|
|
|
|
|
|
|
|
src_offset += cur;
|
|
|
|
dst_offset += cur;
|
|
|
|
len -= cur;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2020-04-29 01:04:10 +00:00
|
|
|
void memmove_extent_buffer(const struct extent_buffer *dst,
|
|
|
|
unsigned long dst_offset, unsigned long src_offset,
|
|
|
|
unsigned long len)
|
2008-01-24 21:13:08 +00:00
|
|
|
{
|
|
|
|
size_t cur;
|
|
|
|
size_t dst_off_in_page;
|
|
|
|
size_t src_off_in_page;
|
|
|
|
unsigned long dst_end = dst_offset + len - 1;
|
|
|
|
unsigned long src_end = src_offset + len - 1;
|
|
|
|
unsigned long dst_i;
|
|
|
|
unsigned long src_i;
|
|
|
|
|
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start and len checks in the extent buffer read/write
functions (e.g. read_extent_buffer()), these checks have limitations:
- No overflow check
Values like start = 1024, len = -1024 can still pass the basic
(start + len) > eb->len check.
- Checks are not consistent
For read_extent_buffer() we only check (start + len) against eb->len,
while for memcmp_extent_buffer() we also check start against eb->len.
- Different error reporting mechanisms
We use WARN() in read_extent_buffer() but BUG() in
memcpy_extent_buffer().
- Memory is still modified if the request is obviously wrong
In read_extent_buffer(), even if we find (start + len) > eb->len, we still
call memset(dst, 0, len), which can easily cause a memory access error
if start + len overflows.
To address the above problems, this patch creates a new common function to
check such accesses, check_eb_range().
- Add overflow check
This function checks start and start + len against eb->len, and also
checks for overflow.
- Unified checks
- Unified error reports
Will call WARN() if CONFIG_BTRFS_DEBUG is configured,
and also print a btrfs_warn() message for non-debug builds.
- Exit ASAP if check fails
No more possible memory corruption.
- Add an extra comment for @start and @len used in those functions, as they
are sometimes confused with logical addressing instead of a range
inside the eb space
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817
[ Inspired by above report, the report itself is already addressed ]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use check_add_overflow ]
Signed-off-by: David Sterba <dsterba@suse.com>
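The overflow problem described above can be sketched in userspace C. This is not the kernel's check_eb_range(); struct demo_eb and demo_check_range() are hypothetical names, and the GCC/Clang builtin __builtin_add_overflow stands in for the kernel's check_add_overflow() helper mentioned in the trailer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an extent buffer; only the length matters here. */
struct demo_eb {
	unsigned long len;
};

/*
 * Return true when [start, start + len) does NOT fit inside the buffer,
 * so callers can bail out early on a non-zero result.
 */
static bool demo_check_range(const struct demo_eb *eb,
			     unsigned long start, unsigned long len)
{
	unsigned long end;

	/* A start beyond the buffer is always wrong. */
	if (start > eb->len)
		return true;
	/* The builtin returns true if start + len wrapped around. */
	if (__builtin_add_overflow(start, len, &end))
		return true;
	return end > eb->len;
}

/* Usage example, including the start = 1024, len = -1024 case. */
void demo_check_range_example(void)
{
	struct demo_eb eb = { .len = 4096 };

	assert(!demo_check_range(&eb, 0, 4096));
	/* start + len wraps to 0, which a naive (start + len) > eb->len test accepts. */
	assert(demo_check_range(&eb, 1024, (unsigned long)-1024));
}
```

Checking the sum with an overflow-aware helper, rather than comparing the wrapped result, is what closes the gap the first bullet describes.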
2020-08-19 06:35:47 +00:00
|
|
|
if (check_eb_range(dst, dst_offset, len) ||
|
|
|
|
check_eb_range(dst, src_offset, len))
|
|
|
|
return;
|
2010-08-06 17:21:20 +00:00
|
|
|
if (dst_offset < src_offset) {
|
2008-01-24 21:13:08 +00:00
|
|
|
memcpy_extent_buffer(dst, dst_offset, src_offset, len);
|
|
|
|
return;
|
|
|
|
}
|
2009-01-06 02:25:51 +00:00
|
|
|
while (len > 0) {
|
2020-04-29 21:41:20 +00:00
|
|
|
dst_i = dst_end >> PAGE_SHIFT;
|
|
|
|
src_i = src_end >> PAGE_SHIFT;
|
2008-01-24 21:13:08 +00:00
|
|
|
|
2020-04-29 21:41:20 +00:00
|
|
|
dst_off_in_page = offset_in_page(dst_end);
|
|
|
|
src_off_in_page = offset_in_page(src_end);
|
2008-01-24 21:13:08 +00:00
|
|
|
|
|
|
|
cur = min_t(unsigned long, len, src_off_in_page + 1);
|
|
|
|
cur = min(cur, dst_off_in_page + 1);
|
2014-07-30 23:03:53 +00:00
|
|
|
copy_pages(dst->pages[dst_i], dst->pages[src_i],
|
2008-01-24 21:13:08 +00:00
|
|
|
dst_off_in_page - cur + 1,
|
|
|
|
src_off_in_page - cur + 1, cur);
|
|
|
|
|
|
|
|
dst_end -= cur;
|
|
|
|
src_end -= cur;
|
|
|
|
len -= cur;
|
|
|
|
}
|
|
|
|
}
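The backward loop in memmove_extent_buffer() can be sketched in userspace as follows. This is a minimal model, assuming tiny 16-byte "pages" held in a plain array; struct demo_buf, the DEMO_* macros and demo_memmove_backward() are hypothetical, and memmove() stands in for copy_pages(). It covers only the case where the destination lies above the source; the function above hands the other case to memcpy_extent_buffer().

```c
#include <assert.h>
#include <string.h>

#define DEMO_PAGE_SHIFT 4                         /* assumption: 16-byte pages */
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_NR_PAGES   4

/* Hypothetical buffer made of fixed-size pages. */
struct demo_buf {
	unsigned char pages[DEMO_NR_PAGES][DEMO_PAGE_SIZE];
};

/* Copy len bytes from src_offset to dst_offset, walking from the end. */
static void demo_memmove_backward(struct demo_buf *b, unsigned long dst_offset,
				  unsigned long src_offset, unsigned long len)
{
	unsigned long dst_end = dst_offset + len - 1;
	unsigned long src_end = src_offset + len - 1;

	while (len > 0) {
		unsigned long dst_i = dst_end >> DEMO_PAGE_SHIFT;
		unsigned long src_i = src_end >> DEMO_PAGE_SHIFT;
		unsigned long dst_off = dst_end & (DEMO_PAGE_SIZE - 1);
		unsigned long src_off = src_end & (DEMO_PAGE_SIZE - 1);
		unsigned long cur = len;

		/* Never cross a page boundary on either side. */
		if (cur > src_off + 1)
			cur = src_off + 1;
		if (cur > dst_off + 1)
			cur = dst_off + 1;

		memmove(&b->pages[dst_i][dst_off - cur + 1],
			&b->pages[src_i][src_off - cur + 1], cur);

		dst_end -= cur;
		src_end -= cur;
		len -= cur;
	}
}

/* Usage example: compare against memmove() on a flat copy of the data. */
void demo_memmove_example(void)
{
	struct demo_buf b;
	unsigned char ref[DEMO_NR_PAGES * DEMO_PAGE_SIZE];
	unsigned long i;

	for (i = 0; i < sizeof(ref); i++) {
		ref[i] = (unsigned char)i;
		b.pages[i >> DEMO_PAGE_SHIFT][i & (DEMO_PAGE_SIZE - 1)] = (unsigned char)i;
	}
	memmove(ref + 10, ref + 3, 40);
	demo_memmove_backward(&b, 10, 3, 40);
	assert(memcmp(ref, b.pages, sizeof(ref)) == 0);
}
```

Walking from the end means each chunk is written at or above the source bytes that are still unread, so an overlapping move with the destination above the source never clobbers data it has yet to copy.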
|
2008-07-22 15:18:07 +00:00
|
|
|
|
2013-04-26 14:56:29 +00:00
|
|
|
int try_release_extent_buffer(struct page *page)
|
2010-10-27 00:57:29 +00:00
|
|
|
{
|
2008-07-22 15:18:07 +00:00
|
|
|
struct extent_buffer *eb;
|
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
/*
|
2016-05-20 01:18:45 +00:00
|
|
|
* We need to make sure nobody is attaching this page to an eb right
|
2012-03-09 21:01:49 +00:00
|
|
|
* now.
|
|
|
|
*/
|
|
|
|
spin_lock(&page->mapping->private_lock);
|
|
|
|
if (!PagePrivate(page)) {
|
|
|
|
spin_unlock(&page->mapping->private_lock);
|
2012-03-07 21:20:05 +00:00
|
|
|
return 1;
|
2010-11-22 03:27:44 +00:00
|
|
|
}
|
2008-07-22 15:18:07 +00:00
|
|
|
|
2012-03-09 21:01:49 +00:00
|
|
|
eb = (struct extent_buffer *)page->private;
|
|
|
|
BUG_ON(!eb);
|
2010-10-27 00:57:29 +00:00
|
|
|
|
|
|
|
/*
|
2012-03-09 21:01:49 +00:00
|
|
|
* This is a little awful but should be ok, we need to make sure that
|
|
|
|
* the eb doesn't disappear out from under us while we're looking at
|
|
|
|
* this page.
|
2010-10-27 00:57:29 +00:00
|
|
|
*/
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_lock(&eb->refs_lock);
|
2012-03-13 13:38:00 +00:00
|
|
|
if (atomic_read(&eb->refs) != 1 || extent_buffer_under_io(eb)) {
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_unlock(&eb->refs_lock);
|
|
|
|
spin_unlock(&page->mapping->private_lock);
|
|
|
|
return 0;
|
2009-03-13 15:00:37 +00:00
|
|
|
}
|
2012-03-09 21:01:49 +00:00
|
|
|
spin_unlock(&page->mapping->private_lock);
|
2010-10-27 00:57:29 +00:00
|
|
|
|
2010-10-27 00:57:29 +00:00
|
|
|
/*
|
2012-03-09 21:01:49 +00:00
|
|
|
* If tree ref isn't set then we know the ref on this eb is a real ref,
|
|
|
|
* so just return, this page will likely be freed soon anyway.
|
2010-10-27 00:57:29 +00:00
|
|
|
*/
|
2012-03-09 21:01:49 +00:00
|
|
|
if (!test_and_clear_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags)) {
|
|
|
|
spin_unlock(&eb->refs_lock);
|
|
|
|
return 0;
|
2009-03-13 15:00:37 +00:00
|
|
|
}
|
2010-10-27 00:57:29 +00:00
|
|
|
|
2013-04-26 14:56:29 +00:00
|
|
|
return release_extent_buffer(eb);
|
2008-07-22 15:18:07 +00:00
|
|
|
}
|
2020-11-05 15:45:09 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* btrfs_readahead_tree_block - attempt to readahead a child block
|
|
|
|
* @fs_info: the fs_info
|
|
|
|
* @bytenr: bytenr to read
|
2020-11-05 15:45:20 +00:00
|
|
|
* @owner_root: objectid of the root that owns this eb
|
2020-11-05 15:45:09 +00:00
|
|
|
* @gen: generation for the uptodate check, can be 0
|
2020-11-05 15:45:20 +00:00
|
|
|
* @level: level for the eb
|
2020-11-05 15:45:09 +00:00
|
|
|
*
|
|
|
|
* Attempt to readahead a tree block at @bytenr. If @gen is 0 then we do a
|
|
|
|
* normal uptodate check of the eb, without checking the generation. If we have
|
|
|
|
* to read the block we will not block on anything.
|
|
|
|
*/
|
|
|
|
void btrfs_readahead_tree_block(struct btrfs_fs_info *fs_info,
|
2020-11-05 15:45:20 +00:00
|
|
|
u64 bytenr, u64 owner_root, u64 gen, int level)
|
2020-11-05 15:45:09 +00:00
|
|
|
{
|
|
|
|
struct extent_buffer *eb;
|
|
|
|
int ret;
|
|
|
|
|
2020-11-05 15:45:20 +00:00
|
|
|
eb = btrfs_find_create_tree_block(fs_info, bytenr, owner_root, level);
|
2020-11-05 15:45:09 +00:00
|
|
|
if (IS_ERR(eb))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (btrfs_buffer_uptodate(eb, gen, 1)) {
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
ret = read_extent_buffer_pages(eb, WAIT_NONE, 0);
|
|
|
|
if (ret < 0)
|
|
|
|
free_extent_buffer_stale(eb);
|
|
|
|
else
|
|
|
|
free_extent_buffer(eb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* btrfs_readahead_node_child - readahead a node's child block
|
|
|
|
* @node: parent node we're reading from
|
|
|
|
* @slot: slot in the parent node for the child we want to read
|
|
|
|
*
|
|
|
|
* A helper for btrfs_readahead_tree_block, we simply read the bytenr pointed at
|
|
|
|
* by the slot in the node provided.
|
|
|
|
*/
|
|
|
|
void btrfs_readahead_node_child(struct extent_buffer *node, int slot)
|
|
|
|
{
|
|
|
|
btrfs_readahead_tree_block(node->fs_info,
|
|
|
|
btrfs_node_blockptr(node, slot),
|
2020-11-05 15:45:20 +00:00
|
|
|
btrfs_header_owner(node),
|
|
|
|
btrfs_node_ptr_generation(node, slot),
|
|
|
|
btrfs_header_level(node) - 1);
|
2020-11-05 15:45:09 +00:00
|
|
|
}
|