linux/fs/btrfs/extent_map.c

1343 lines
37 KiB
C
Raw Normal View History

License cleanup: add SPDX GPL-2.0 license identifier to files with no license Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-01 14:07:57 +00:00
// SPDX-License-Identifier: GPL-2.0
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include "messages.h"
#include "ctree.h"
#include "extent_map.h"
#include "compression.h"
#include "btrfs_inode.h"
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
#include "disk-io.h"
static struct kmem_cache *extent_map_cache;
int __init extent_map_init(void)
{
extent_map_cache = kmem_cache_create("btrfs_extent_map",
sizeof(struct extent_map), 0, 0, NULL);
if (!extent_map_cache)
return -ENOMEM;
return 0;
}
void __cold extent_map_exit(void)
{
kmem_cache_destroy(extent_map_cache);
}
/*
* Initialize the extent tree @tree. Should be called for each new inode or
* other user of the extent_map interface.
*/
void extent_map_tree_init(struct extent_map_tree *tree)
{
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
tree->root = RB_ROOT;
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 17:14:17 +00:00
INIT_LIST_HEAD(&tree->modified_extents);
rwlock_init(&tree->lock);
}
/*
* Allocate a new extent_map structure. The new structure is returned with a
* reference count of one and needs to be freed using free_extent_map()
*/
struct extent_map *alloc_extent_map(void)
{
struct extent_map *em;
em = kmem_cache_zalloc(extent_map_cache, GFP_NOFS);
if (!em)
return NULL;
RB_CLEAR_NODE(&em->rb_node);
refcount_set(&em->refs, 1);
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 17:14:17 +00:00
INIT_LIST_HEAD(&em->list);
return em;
}
/*
* Drop the reference out on @em by one and free the structure if the reference
* count hits zero.
*/
void free_extent_map(struct extent_map *em)
{
if (!em)
return;
if (refcount_dec_and_test(&em->refs)) {
WARN_ON(extent_map_in_tree(em));
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 17:14:17 +00:00
WARN_ON(!list_empty(&em->list));
kmem_cache_free(extent_map_cache, em);
}
}
/* Do the math around the end of an extent, handling wrapping. */
static u64 range_end(u64 start, u64 len)
{
if (start + len < start)
return (u64)-1;
return start + len;
}
static void dec_evictable_extent_maps(struct btrfs_inode *inode)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(inode->root)))
percpu_counter_dec(&fs_info->evictable_extent_maps);
}
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
static int tree_insert(struct rb_root *root, struct extent_map *em)
{
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
struct rb_node **p = &root->rb_node;
struct rb_node *parent = NULL;
struct extent_map *entry = NULL;
struct rb_node *orig_parent = NULL;
u64 end = range_end(em->start, em->len);
while (*p) {
parent = *p;
entry = rb_entry(parent, struct extent_map, rb_node);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
if (em->start < entry->start)
p = &(*p)->rb_left;
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
else if (em->start >= extent_map_end(entry))
p = &(*p)->rb_right;
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
else
return -EEXIST;
}
orig_parent = parent;
while (parent && em->start >= extent_map_end(entry)) {
parent = rb_next(parent);
entry = rb_entry(parent, struct extent_map, rb_node);
}
if (parent)
if (end > entry->start && em->start < extent_map_end(entry))
return -EEXIST;
parent = orig_parent;
entry = rb_entry(parent, struct extent_map, rb_node);
while (parent && em->start < entry->start) {
parent = rb_prev(parent);
entry = rb_entry(parent, struct extent_map, rb_node);
}
if (parent)
if (end > entry->start && em->start < extent_map_end(entry))
return -EEXIST;
rb_link_node(&em->rb_node, orig_parent, p);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
rb_insert_color(&em->rb_node, root);
return 0;
}
/*
* Search through the tree for an extent_map with a given offset. If it can't
* be found, try to find some neighboring extents
*/
static struct rb_node *__tree_search(struct rb_root *root, u64 offset,
struct rb_node **prev_or_next_ret)
{
struct rb_node *n = root->rb_node;
struct rb_node *prev = NULL;
struct rb_node *orig_prev = NULL;
struct extent_map *entry;
struct extent_map *prev_entry = NULL;
ASSERT(prev_or_next_ret);
while (n) {
entry = rb_entry(n, struct extent_map, rb_node);
prev = n;
prev_entry = entry;
if (offset < entry->start)
n = n->rb_left;
else if (offset >= extent_map_end(entry))
n = n->rb_right;
else
return n;
}
orig_prev = prev;
while (prev && offset >= extent_map_end(prev_entry)) {
prev = rb_next(prev);
prev_entry = rb_entry(prev, struct extent_map, rb_node);
}
/*
* Previous extent map found, return as in this case the caller does not
* care about the next one.
*/
if (prev) {
*prev_or_next_ret = prev;
return NULL;
}
prev = orig_prev;
prev_entry = rb_entry(prev, struct extent_map, rb_node);
while (prev && offset < prev_entry->start) {
prev = rb_prev(prev);
prev_entry = rb_entry(prev, struct extent_map, rb_node);
}
*prev_or_next_ret = prev;
return NULL;
}
static inline u64 extent_map_block_len(const struct extent_map *em)
{
if (extent_map_is_compressed(em))
return em->disk_num_bytes;
return em->len;
}
static inline u64 extent_map_block_end(const struct extent_map *em)
{
const u64 block_start = extent_map_block_start(em);
const u64 block_end = block_start + extent_map_block_len(em);
if (block_end < block_start)
return (u64)-1;
return block_end;
}
static bool can_merge_extent_map(const struct extent_map *em)
{
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
if (em->flags & EXTENT_FLAG_PINNED)
return false;
/* Don't merge compressed extents, we need to know their actual size. */
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
if (extent_map_is_compressed(em))
return false;
Btrfs: Add zlib compression support This is a large change for adding compression on reading and writing, both for inline and regular extents. It does some fairly large surgery to the writeback paths. Compression is off by default and enabled by mount -o compress. Even when the -o compress mount option is not used, it is possible to read compressed extents off the disk. If compression for a given set of pages fails to make them smaller, the file is flagged to avoid future compression attempts later. * While finding delalloc extents, the pages are locked before being sent down to the delalloc handler. This allows the delalloc handler to do complex things such as cleaning the pages, marking them writeback and starting IO on their behalf. * Inline extents are inserted at delalloc time now. This allows us to compress the data before inserting the inline extent, and it allows us to insert an inline extent that spans multiple pages. * All of the in-memory extent representations (extent_map.c, ordered-data.c etc) are changed to record both an in-memory size and an on disk size, as well as a flag for compression. From a disk format point of view, the extent pointers in the file are changed to record the on disk size of a given extent and some encoding flags. Space in the disk format is allocated for compression encoding, as well as encryption and a generic 'other' field. Neither the encryption or the 'other' field are currently used. In order to limit the amount of data read for a single random read in the file, the size of a compressed extent is limited to 128k. This is a software only limit, the disk format supports u64 sized compressed extents. In order to limit the ram consumed while processing extents, the uncompressed size of a compressed extent is limited to 256k. This is a software only limit and will be subject to tuning later. Checksumming is still done on compressed extents, and it is done on the uncompressed version of the data. This way additional encodings can be layered on without having to figure out which encoding to checksum. Compression happens at delalloc time, which is basically singled threaded because it is usually done by a single pdflush thread. This makes it tricky to spread the compression load across all the cpus on the box. We'll have to look at parallel pdflush walks of dirty inodes at a later time. Decompression is hooked into readpages and it does spread across CPUs nicely. Signed-off-by: Chris Mason <chris.mason@oracle.com>
2008-10-29 18:49:59 +00:00
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
if (em->flags & EXTENT_FLAG_LOGGING)
return false;
2013-04-05 20:51:15 +00:00
/*
* We don't want to merge stuff that hasn't been written to the log yet
* since it may not reflect exactly what is on disk, and that would be
* bad.
*/
if (!list_empty(&em->list))
return false;
return true;
}
2013-04-05 20:51:15 +00:00
/* Check to see if two extent_map structs are adjacent and safe to merge. */
static bool mergeable_maps(const struct extent_map *prev, const struct extent_map *next)
{
if (extent_map_end(prev) != next->start)
return false;
if (prev->flags != next->flags)
return false;
if (next->disk_bytenr < EXTENT_MAP_LAST_BYTE - 1)
return extent_map_block_start(next) == extent_map_block_end(prev);
/* HOLES and INLINE extents. */
return next->disk_bytenr == prev->disk_bytenr;
}
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
/*
* Handle the on-disk data extents merge for @prev and @next.
*
* Only touches disk_bytenr/disk_num_bytes/offset/ram_bytes.
* For now only uncompressed regular extent can be merged.
*
* @prev and @next will be both updated to point to the new merged range.
* Thus one of them should be removed by the caller.
*/
static void merge_ondisk_extents(struct extent_map *prev, struct extent_map *next)
{
u64 new_disk_bytenr;
u64 new_disk_num_bytes;
u64 new_offset;
/* @prev and @next should not be compressed. */
ASSERT(!extent_map_is_compressed(prev));
ASSERT(!extent_map_is_compressed(next));
/*
* There are two different cases where @prev and @next can be merged.
*
* 1) They are referring to the same data extent:
*
* |<----- data extent A ----->|
* |<- prev ->|<- next ->|
*
* 2) They are referring to different data extents but still adjacent:
*
* |<-- data extent A -->|<-- data extent B -->|
* |<- prev ->|<- next ->|
*
* The calculation here always merges the data extents first, then updates
* @offset using the new data extents.
*
* For case 1), the merged data extent would be the same.
* For case 2), we just merge the two data extents into one.
*/
new_disk_bytenr = min(prev->disk_bytenr, next->disk_bytenr);
new_disk_num_bytes = max(prev->disk_bytenr + prev->disk_num_bytes,
next->disk_bytenr + next->disk_num_bytes) -
new_disk_bytenr;
new_offset = prev->disk_bytenr + prev->offset - new_disk_bytenr;
prev->disk_bytenr = new_disk_bytenr;
prev->disk_num_bytes = new_disk_num_bytes;
prev->ram_bytes = new_disk_num_bytes;
prev->offset = new_offset;
next->disk_bytenr = new_disk_bytenr;
next->disk_num_bytes = new_disk_num_bytes;
next->ram_bytes = new_disk_num_bytes;
next->offset = new_offset;
}
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
static void dump_extent_map(struct btrfs_fs_info *fs_info, const char *prefix,
struct extent_map *em)
{
if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
return;
btrfs_crit(fs_info,
"%s, start=%llu len=%llu disk_bytenr=%llu disk_num_bytes=%llu ram_bytes=%llu offset=%llu flags=0x%x",
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
prefix, em->start, em->len, em->disk_bytenr, em->disk_num_bytes,
em->ram_bytes, em->offset, em->flags);
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
ASSERT(0);
}
/* Internal sanity checks for btrfs debug builds. */
static void validate_extent_map(struct btrfs_fs_info *fs_info, struct extent_map *em)
{
if (!IS_ENABLED(CONFIG_BTRFS_DEBUG))
return;
if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
if (em->disk_num_bytes == 0)
dump_extent_map(fs_info, "zero disk_num_bytes", em);
if (em->offset + em->len > em->ram_bytes)
dump_extent_map(fs_info, "ram_bytes too small", em);
if (em->offset + em->len > em->disk_num_bytes &&
!extent_map_is_compressed(em))
dump_extent_map(fs_info, "disk_num_bytes too small", em);
if (!extent_map_is_compressed(em) &&
em->ram_bytes != em->disk_num_bytes)
dump_extent_map(fs_info,
"ram_bytes mismatch with disk_num_bytes for non-compressed em",
em);
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
} else if (em->offset) {
dump_extent_map(fs_info, "non-zero offset for hole/inline", em);
}
}
static void try_merge_map(struct btrfs_inode *inode, struct extent_map *em)
{
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_map_tree *tree = &inode->extent_tree;
struct extent_map *merge = NULL;
struct rb_node *rb;
Btrfs: fix race between using extent maps and merging them We have a few cases where we allow an extent map that is in an extent map tree to be merged with other extents in the tree. Such cases include the unpinning of an extent after the respective ordered extent completed or after logging an extent during a fast fsync. This can lead to subtle and dangerous problems because when doing the merge some other task might be using the same extent map and as consequence see an inconsistent state of the extent map - for example sees the new length but has seen the old start offset. With luck this triggers a BUG_ON(), and not some silent bug, such as the following one in __do_readpage(): $ cat -n fs/btrfs/extent_io.c 3061 static int __do_readpage(struct extent_io_tree *tree, 3062 struct page *page, (...) 3127 em = __get_extent_map(inode, page, pg_offset, cur, 3128 end - cur + 1, get_extent, em_cached); 3129 if (IS_ERR_OR_NULL(em)) { 3130 SetPageError(page); 3131 unlock_extent(tree, cur, end); 3132 break; 3133 } 3134 extent_offset = cur - em->start; 3135 BUG_ON(extent_map_end(em) <= cur); (...) Consider the following example scenario, where we end up hitting the BUG_ON() in __do_readpage(). We have an inode with a size of 8KiB and 2 extent maps: extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by a previous transaction extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet persisted but writeback started for it already. The extent map is pinned since there's writeback and an ordered extent in progress, so it can not be merged with extent map A yet The following sequence of steps leads to the BUG_ON(): 1) The ordered extent for extent B completes, the respective page gets its writeback bit cleared and the extent map is unpinned, at that point it is not yet merged with extent map A because it's in the list of modified extents; 2) Due to memory pressure, or some other reason, the MM subsystem releases the page corresponding to extent B - btrfs_releasepage() is called and returns 1, meaning the page can be released as it's not dirty, not under writeback anymore and the extent range is not locked in the inode's iotree. However the extent map is not released, either because we are not in a context that allows memory allocations to block or because the inode's size is smaller than 16MiB - in this case our inode has a size of 8KiB; 3) Task B needs to read extent B and ends up __do_readpage() through the btrfs_readpage() callback. At __do_readpage() it gets a reference to extent map B; 4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B while holding the write lock on the inode's extent map tree - this results in try_merge_map() being called and since it's possible to merge extent map B with extent map A now (the extent map B was removed from the list of modified extents), the merging begins - it sets extent map B's start offset to 0 (was 4KiB), but before it increments the map's length to 8KiB (4kb + 4KiB), task A is at: BUG_ON(extent_map_end(em) <= cur); The call to extent_map_end() sees the extent map has a start of 0 and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so the BUG_ON() is triggered. So it's dangerous to modify an extent map that is in the tree, because some other task might have got a reference to it before and still using it, and needs to see a consistent map while using it. Generally this is very rare since most paths that lookup and use extent maps also have the file range locked in the inode's iotree. The fsync path is pretty much the only exception where we don't do it to avoid serialization with concurrent reads. Fix this by not allowing an extent map do be merged if if it's being used by tasks other then the one attempting to merge the extent map (when the reference count of the extent map is greater than 2). Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp> Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211 CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-31 14:06:07 +00:00
/*
* We can't modify an extent map that is in the tree and that is being
* used by another task, as it can cause that other task to see it in
* inconsistent state during the merging. We always have 1 reference for
* the tree and 1 for this task (which is unpinning the extent map or
* clearing the logging flag), so anything > 2 means it's being used by
* other tasks too.
*/
if (refcount_read(&em->refs) > 2)
return;
if (!can_merge_extent_map(em))
return;
if (em->start != 0) {
rb = rb_prev(&em->rb_node);
if (rb)
merge = rb_entry(rb, struct extent_map, rb_node);
if (rb && can_merge_extent_map(merge) && mergeable_maps(merge, em)) {
em->start = merge->start;
em->len += merge->len;
em->generation = max(em->generation, merge->generation);
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
merge_ondisk_extents(merge, em);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
em->flags |= EXTENT_FLAG_MERGED;
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 17:14:17 +00:00
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
validate_extent_map(fs_info, em);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
rb_erase(&merge->rb_node, &tree->root);
RB_CLEAR_NODE(&merge->rb_node);
free_extent_map(merge);
dec_evictable_extent_maps(inode);
}
}
rb = rb_next(&em->rb_node);
if (rb)
merge = rb_entry(rb, struct extent_map, rb_node);
if (rb && can_merge_extent_map(merge) && mergeable_maps(em, merge)) {
em->len += merge->len;
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
merge_ondisk_extents(em, merge);
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
validate_extent_map(fs_info, em);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
rb_erase(&merge->rb_node, &tree->root);
RB_CLEAR_NODE(&merge->rb_node);
em->generation = max(em->generation, merge->generation);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
em->flags |= EXTENT_FLAG_MERGED;
free_extent_map(merge);
dec_evictable_extent_maps(inode);
}
}
/*
* Unpin an extent from the cache.
*
* @inode: the inode from which we are unpinning an extent range
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 17:14:17 +00:00
* @start: logical offset in the file
* @len: length of the extent
* @gen: generation that this extent has been modified in
*
* Called after an extent has been written to disk properly. Set the generation
* to the generation that actually added the file item to the inode so we know
* we need to sync this extent when we call fsync().
*
* Returns: 0 on success
* -ENOENT when the extent is not found in the tree
* -EUCLEAN if the found extent does not match the expected start
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 17:14:17 +00:00
*/
int unpin_extent_cache(struct btrfs_inode *inode, u64 start, u64 len, u64 gen)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_map_tree *tree = &inode->extent_tree;
int ret = 0;
struct extent_map *em;
write_lock(&tree->lock);
em = lookup_extent_mapping(tree, start, len);
if (WARN_ON(!em)) {
btrfs_warn(fs_info,
"no extent map found for inode %llu (root %lld) when unpinning extent range [%llu, %llu), generation %llu",
btrfs_ino(inode), btrfs_root_id(inode->root),
start, start + len, gen);
ret = -ENOENT;
goto out;
}
if (WARN_ON(em->start != start)) {
btrfs_warn(fs_info,
"found extent map for inode %llu (root %lld) with unexpected start offset %llu when unpinning extent range [%llu, %llu), generation %llu",
btrfs_ino(inode), btrfs_root_id(inode->root),
em->start, start, start + len, gen);
ret = -EUCLEAN;
goto out;
}
Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2012-08-17 17:14:17 +00:00
em->generation = gen;
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
em->flags &= ~EXTENT_FLAG_PINNED;
try_merge_map(inode, em);
out:
write_unlock(&tree->lock);
free_extent_map(em);
return ret;
}
void clear_em_logging(struct btrfs_inode *inode, struct extent_map *em)
{
lockdep_assert_held_write(&inode->extent_tree.lock);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
em->flags &= ~EXTENT_FLAG_LOGGING;
if (extent_map_in_tree(em))
try_merge_map(inode, em);
}
static inline void setup_extent_mapping(struct btrfs_inode *inode,
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
struct extent_map *em,
int modified)
{
refcount_inc(&em->refs);
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
ASSERT(list_empty(&em->list));
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
if (modified)
list_add(&em->list, &inode->extent_tree.modified_extents);
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
else
try_merge_map(inode, em);
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
}
/*
* Add a new extent map to an inode's extent map tree.
*
* @inode: the target inode
* @em: map to insert
* @modified: indicate whether the given @em should be added to the
* modified list, which indicates the extent needs to be logged
*
* Insert @em into the @inode's extent map tree or perform a simple
* forward/backward merge with existing mappings. The extent_map struct passed
* in will be inserted into the tree directly, with an additional reference
* taken, or a reference dropped if the merge attempt was successful.
*/
static int add_extent_mapping(struct btrfs_inode *inode,
struct extent_map *em, int modified)
{
struct extent_map_tree *tree = &inode->extent_tree;
struct btrfs_root *root = inode->root;
struct btrfs_fs_info *fs_info = root->fs_info;
int ret;
lockdep_assert_held_write(&tree->lock);
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
validate_extent_map(fs_info, em);
ret = tree_insert(&tree->root, em);
if (ret)
return ret;
setup_extent_mapping(inode, em, modified);
if (!btrfs_is_testing(fs_info) && is_fstree(btrfs_root_id(root)))
percpu_counter_inc(&fs_info->evictable_extent_maps);
return 0;
}
static struct extent_map *
__lookup_extent_mapping(struct extent_map_tree *tree,
u64 start, u64 len, int strict)
{
struct extent_map *em;
struct rb_node *rb_node;
struct rb_node *prev_or_next = NULL;
u64 end = range_end(start, len);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
rb_node = __tree_search(&tree->root, start, &prev_or_next);
if (!rb_node) {
if (prev_or_next)
rb_node = prev_or_next;
else
return NULL;
}
em = rb_entry(rb_node, struct extent_map, rb_node);
if (strict && !(end > em->start && start < extent_map_end(em)))
return NULL;
refcount_inc(&em->refs);
return em;
}
/*
* Lookup extent_map that intersects @start + @len range.
*
* @tree: tree to lookup in
* @start: byte offset to start the search
* @len: length of the lookup range
*
* Find and return the first extent_map struct in @tree that intersects the
* [start, len] range. There may be additional objects in the tree that
* intersect, so check the object returned carefully to make sure that no
* additional lookups are needed.
*/
struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
u64 start, u64 len)
{
return __lookup_extent_mapping(tree, start, len, 1);
}
/*
* Find a nearby extent map intersecting @start + @len (not an exact search).
*
* @tree: tree to lookup in
* @start: byte offset to start the search
* @len: length of the lookup range
*
* Find and return the first extent_map struct in @tree that intersects the
* [start, len] range.
*
* If one can't be found, any nearby extent may be returned
*/
struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
u64 start, u64 len)
{
return __lookup_extent_mapping(tree, start, len, 0);
}
/*
* Remove an extent_map from its inode's extent tree.
*
* @inode: the inode the extent map belongs to
* @em: extent map being removed
*
* Remove @em from the extent tree of @inode. No reference counts are dropped,
* and no checks are done to see if the range is in use.
*/
void remove_extent_mapping(struct btrfs_inode *inode, struct extent_map *em)
{
struct extent_map_tree *tree = &inode->extent_tree;
lockdep_assert_held_write(&tree->lock);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
WARN_ON(em->flags & EXTENT_FLAG_PINNED);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
rb_erase(&em->rb_node, &tree->root);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
if (!(em->flags & EXTENT_FLAG_LOGGING))
list_del_init(&em->list);
RB_CLEAR_NODE(&em->rb_node);
dec_evictable_extent_maps(inode);
}
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
static void replace_extent_mapping(struct btrfs_inode *inode,
struct extent_map *cur,
struct extent_map *new,
int modified)
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
{
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
struct btrfs_fs_info *fs_info = inode->root->fs_info;
struct extent_map_tree *tree = &inode->extent_tree;
lockdep_assert_held_write(&tree->lock);
btrfs: introduce extra sanity checks for extent maps Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:03 +00:00
validate_extent_map(fs_info, new);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
WARN_ON(cur->flags & EXTENT_FLAG_PINNED);
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
ASSERT(extent_map_in_tree(cur));
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
if (!(cur->flags & EXTENT_FLAG_LOGGING))
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
list_del_init(&cur->list);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
rb_replace_node(&cur->rb_node, &new->rb_node, &tree->root);
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
RB_CLEAR_NODE(&cur->rb_node);
setup_extent_mapping(inode, new, modified);
Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 | 5.231 - 187.392: 87 | 187.392 - 585.911: 206 | 585.911 - 1827.438: 623 | 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
2014-02-25 14:15:13 +00:00
}
static struct extent_map *next_extent_map(const struct extent_map *em)
{
struct rb_node *next;
next = rb_next(&em->rb_node);
if (!next)
return NULL;
return container_of(next, struct extent_map, rb_node);
}
static struct extent_map *prev_extent_map(struct extent_map *em)
{
struct rb_node *prev;
prev = rb_prev(&em->rb_node);
if (!prev)
return NULL;
return container_of(prev, struct extent_map, rb_node);
}
/*
* Helper for btrfs_get_extent. Given an existing extent in the tree,
* the existing extent is the nearest extent to map_start,
* and an extent that you want to insert, deal with overlap and insert
* the best fitted new extent into the tree.
*/
static noinline int merge_extent_mapping(struct btrfs_inode *inode,
struct extent_map *existing,
struct extent_map *em,
u64 map_start)
{
struct extent_map *prev;
struct extent_map *next;
u64 start;
u64 end;
u64 start_diff;
if (map_start < em->start || map_start >= extent_map_end(em))
return -EINVAL;
if (existing->start > map_start) {
next = existing;
prev = prev_extent_map(next);
} else {
prev = existing;
next = next_extent_map(prev);
}
start = prev ? extent_map_end(prev) : em->start;
start = max_t(u64, start, em->start);
end = next ? next->start : extent_map_end(em);
end = min_t(u64, end, extent_map_end(em));
start_diff = start - em->start;
em->start = start;
em->len = end - start;
btrfs: fix corrupt read due to bad offset of a compressed extent map If we attempt to insert a compressed extent map that has a range that overlaps another extent map we have in the inode's extent map tree, we can end up with an incorrect offset after adjusting the new extent map at merge_extent_mapping() because we don't update the extent map's offset. For example consider the following scenario: 1) We have a file extent item for a compressed extent covering the file range [108K, 144K) and currently there's no corresponding extent map in the inode's extent map tree; 2) The inode's size is 141K; 3) We have an encoded write (compressed) into the file range [120K, 128K), which overlaps the existing file extent item. The encoded write creates a matching extent map, adds it to the inode's extent map tree and creates an ordered extent for it. Note that the corresponding file extent item is added to the subvolume tree only when the ordered extent completes (when executing btrfs_finish_one_ordered()); 4) We have a write into the file range [160K, 164K). This writes increases the i_size of the file, and there's a hole between the current i_size (141K) and the start offset of this write, and since the old i_size is in the middle of the block [140K, 144K), we have to write zeroes to the range [141K, 144K) (3072 bytes) and therefore dirty that page. We then call btrfs_set_extent_delalloc() with a start offset of 140K. We then end up at btrfs_find_new_delalloc_bytes() which will call btrfs_get_extent() for the range [140K, 144K); 5) The btrfs_get_extent() doesn't find any extent map in the inode's extent map tree covering the range [140K, 144K), so it searches the subvolume tree for any file extent items covering that range. There it finds the file extent item for the range [108K, 144K), creates a compressed extent map for that range and then calls btrfs_add_extent_mapping() with that extent map and passes the range [140K, 144K) via the "start" and "len" parameters; 6) The call to add_extent_mapping() done by btrfs_add_extent_mapping() fails with -EEXIST because there's an extent map, created at step 2 for the [120K, 128K) range, that covers that overlaps with the range of the given extent map ([108K, 144K)). Then it does a lookup for extent map from step 2 add calls merge_extent_mapping() to adjust the input extent map ([108K, 144K)). That adjust the extent map to a start offset of 128K and a length of 16K (starting just after the extent map from step 2), but it does not update the offset field of the extent map, leaving it with a value of zero instead of updating to a value of 20K (128K - 108K = 20K). As a result any read for the range [128K, 144K) can return incorrect data since we read from a wrong section of the extent (unless both the correct and incorrect ranges happen to have the same data). So fix this by changing merge_extent_mapping() to update the extent map's offset even if it's compressed. Also add a test case to the self tests. This didn't happen before the patchset that does big changes in the extent map structure (which includes the commit in the Fixes tag below) because we kept track of the original start offset in the extent map (member "orig_start") so we could always calculate the correct offset by subtracting that offset from the start offset. A test case for fstests that triggered this problem using send/receive with compressed writes will be added soon. Fixes: 3d2ac9922465 ("btrfs: introduce new members for extent_map") Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-11 12:33:23 +00:00
if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE)
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
em->offset += start_diff;
return add_extent_mapping(inode, em, 0);
}
/*
* Add extent mapping into an inode's extent map tree.
*
* @inode: target inode
* @em_in: extent we are inserting
* @start: start of the logical range btrfs_get_extent() is requesting
* @len: length of the logical range btrfs_get_extent() is requesting
*
* Note that @em_in's range may be different from [start, start+len),
* but they must be overlapped.
*
* Insert @em_in into the inode's extent map tree. In case there is an
* overlapping range, handle the -EEXIST by either:
* a) Returning the existing extent in @em_in if @start is within the
* existing em.
* b) Merge the existing extent with @em_in passed in.
*
* Return 0 on success, otherwise -EEXIST.
*
*/
int btrfs_add_extent_mapping(struct btrfs_inode *inode,
struct extent_map **em_in, u64 start, u64 len)
{
int ret;
struct extent_map *em = *em_in;
struct btrfs_fs_info *fs_info = inode->root->fs_info;
/*
* Tree-checker should have rejected any inline extent with non-zero
* file offset. Here just do a sanity check.
*/
if (em->disk_bytenr == EXTENT_MAP_INLINE)
ASSERT(em->start == 0);
ret = add_extent_mapping(inode, em, 0);
/* it is possible that someone inserted the extent into the tree
* while we had the lock dropped. It is also possible that
* an overlapping map exists in the tree
*/
if (ret == -EEXIST) {
struct extent_map *existing;
existing = search_extent_mapping(&inode->extent_tree, start, len);
trace_btrfs_handle_em_exist(fs_info, existing, em, start, len);
/*
* existing will always be non-NULL, since there must be
* extent causing the -EEXIST.
*/
if (start >= existing->start &&
start < extent_map_end(existing)) {
free_extent_map(em);
*em_in = existing;
ret = 0;
} else {
u64 orig_start = em->start;
u64 orig_len = em->len;
/*
* The existing extent map is the one nearest to
* the [start, start + len) range which overlaps
*/
ret = merge_extent_mapping(inode, existing, em, start);
if (WARN_ON(ret)) {
free_extent_map(em);
*em_in = NULL;
btrfs_warn(fs_info,
"extent map merge error existing [%llu, %llu) with em [%llu, %llu) start %llu",
existing->start, extent_map_end(existing),
orig_start, orig_start + orig_len, start);
}
free_extent_map(existing);
}
}
ASSERT(ret == 0 || ret == -EEXIST);
return ret;
}
/*
* Drop all extent maps from a tree in the fastest possible way, rescheduling
* if needed. This avoids searching the tree, from the root down to the first
* extent map, before each deletion.
*/
static void drop_all_extent_maps_fast(struct btrfs_inode *inode)
{
struct extent_map_tree *tree = &inode->extent_tree;
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
struct rb_node *node;
write_lock(&tree->lock);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
node = rb_first(&tree->root);
while (node) {
struct extent_map *em;
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
struct rb_node *next = rb_next(node);
em = rb_entry(node, struct extent_map, rb_node);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING);
remove_extent_mapping(inode, em);
free_extent_map(em);
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
if (cond_resched_rwlock_write(&tree->lock))
node = rb_first(&tree->root);
else
node = next;
}
write_unlock(&tree->lock);
}
/*
* Drop all extent maps in a given range.
*
* @inode: The target inode.
* @start: Start offset of the range.
* @end: End offset of the range (inclusive value).
* @skip_pinned: Indicate if pinned extent maps should be ignored or not.
*
* This drops all the extent maps that intersect the given range [@start, @end].
* Extent maps that partially overlap the range and extend behind or beyond it,
* are split.
* The caller should have locked an appropriate file range in the inode's io
* tree before calling this function.
*/
void btrfs_drop_extent_map_range(struct btrfs_inode *inode, u64 start, u64 end,
bool skip_pinned)
{
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
struct extent_map *split;
struct extent_map *split2;
struct extent_map *em;
struct extent_map_tree *em_tree = &inode->extent_tree;
u64 len = end - start + 1;
WARN_ON(end < start);
if (end == (u64)-1) {
if (start == 0 && !skip_pinned) {
drop_all_extent_maps_fast(inode);
return;
}
len = (u64)-1;
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
} else {
/* Make end offset exclusive for use in the loop below. */
end++;
}
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
/*
* It's ok if we fail to allocate the extent maps, see the comment near
* the bottom of the loop below. We only need two spare extent maps in
* the worst case, where the first extent map that intersects our range
* starts before the range and the last extent map that intersects our
* range ends after our range (and they might be the same extent map),
* because we need to split those two extent maps at the boundaries.
*/
split = alloc_extent_map();
split2 = alloc_extent_map();
write_lock(&em_tree->lock);
em = lookup_extent_mapping(em_tree, start, len);
while (em) {
/* extent_map_end() returns exclusive value (last byte + 1). */
const u64 em_end = extent_map_end(em);
struct extent_map *next_em = NULL;
u64 gen;
unsigned long flags;
bool modified;
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
if (em_end < end) {
next_em = next_extent_map(em);
if (next_em) {
if (next_em->start < end)
refcount_inc(&next_em->refs);
else
next_em = NULL;
}
}
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
if (skip_pinned && (em->flags & EXTENT_FLAG_PINNED)) {
start = em_end;
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
goto next;
}
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
flags = em->flags;
/*
* In case we split the extent map, we want to preserve the
* EXTENT_FLAG_LOGGING flag on our extent map, but we don't want
* it on the new extent maps.
*/
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
em->flags &= ~(EXTENT_FLAG_PINNED | EXTENT_FLAG_LOGGING);
modified = !list_empty(&em->list);
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
/*
* The extent map does not cross our target range, so no need to
* split it, we can remove it directly.
*/
if (em->start >= start && em_end <= end)
goto remove_em;
gen = em->generation;
if (em->start < start) {
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
if (!split) {
split = split2;
split2 = NULL;
if (!split)
goto remove_em;
}
split->start = em->start;
split->len = start - em->start;
if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split->disk_bytenr = em->disk_bytenr;
split->disk_num_bytes = em->disk_num_bytes;
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split->offset = em->offset;
split->ram_bytes = em->ram_bytes;
} else {
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split->disk_bytenr = em->disk_bytenr;
split->disk_num_bytes = 0;
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split->offset = 0;
split->ram_bytes = split->len;
}
split->generation = gen;
split->flags = flags;
replace_extent_mapping(inode, em, split, modified);
free_extent_map(split);
split = split2;
split2 = NULL;
}
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
if (em_end > end) {
if (!split) {
split = split2;
split2 = NULL;
if (!split)
goto remove_em;
}
btrfs: fix incorrect splitting in btrfs_drop_extent_map_range In production we were seeing a variety of WARN_ON()'s in the extent_map code, specifically in btrfs_drop_extent_map_range() when we have to call add_extent_mapping() for our second split. Consider the following extent map layout PINNED [0 16K) [32K, 48K) and then we call btrfs_drop_extent_map_range for [0, 36K), with skip_pinned == true. The initial loop will have start = 0 end = 36K len = 36K we will find the [0, 16k) extent, but since we are pinned we will skip it, which has this code start = em_end; if (end != (u64)-1) len = start + len - em_end; em_end here is 16K, so now the values are start = 16K len = 16K + 36K - 16K = 36K len should instead be 20K. This is a problem when we find the next extent at [32K, 48K), we need to split this extent to leave [36K, 48k), however the code for the split looks like this split->start = start + len; split->len = em_end - (start + len); In this case we have em_end = 48K split->start = 16K + 36K // this should be 16K + 20K split->len = 48K - (16K + 36K) // this overflows as 16K + 36K is 52K and now we have an invalid extent_map in the tree that potentially overlaps other entries in the extent map. Even in the non-overlapping case we will have split->start set improperly, which will cause problems with any block related calculations. We don't actually need len in this loop, we can simply use end as our end point, and only adjust start up when we find a pinned extent we need to skip. Adjust the logic to do this, which keeps us from inserting an invalid extent map. We only skip_pinned in the relocation case, so this is relatively rare, except in the case where you are running relocation a lot, which can happen with auto relocation on. Fixes: 55ef68990029 ("Btrfs: Fix btrfs_drop_extent_cache for skip pinned case") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-08-17 20:57:30 +00:00
split->start = end;
split->len = em_end - end;
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split->disk_bytenr = em->disk_bytenr;
split->flags = flags;
split->generation = gen;
if (em->disk_bytenr < EXTENT_MAP_LAST_BYTE) {
split->disk_num_bytes = em->disk_num_bytes;
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split->offset = em->offset + end - em->start;
split->ram_bytes = em->ram_bytes;
} else {
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split->disk_num_bytes = 0;
split->offset = 0;
split->ram_bytes = split->len;
}
if (extent_map_in_tree(em)) {
replace_extent_mapping(inode, em, split, modified);
} else {
int ret;
ret = add_extent_mapping(inode, split, modified);
/* Logic error, shouldn't happen. */
ASSERT(ret == 0);
if (WARN_ON(ret != 0) && modified)
btrfs_set_inode_full_sync(inode);
}
free_extent_map(split);
split = NULL;
}
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
remove_em:
if (extent_map_in_tree(em)) {
/*
* If the extent map is still in the tree it means that
* either of the following is true:
*
* 1) It fits entirely in our range (doesn't end beyond
* it or starts before it);
*
* 2) It starts before our range and/or ends after our
* range, and we were not able to allocate the extent
* maps for split operations, @split and @split2.
*
* If we are at case 2) then we just remove the entire
* extent map - this is fine since if anyone needs it to
* access the subranges outside our range, will just
* load it again from the subvolume tree's file extent
* item. However if the extent map was in the list of
* modified extents, then we must mark the inode for a
* full fsync, otherwise a fast fsync will miss this
* extent if it's new and needs to be logged.
*/
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
if ((em->start < start || em_end > end) && modified) {
ASSERT(!split);
btrfs_set_inode_full_sync(inode);
}
remove_extent_mapping(inode, em);
}
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
/*
* Once for the tree reference (we replaced or removed the
* extent map from the tree).
*/
free_extent_map(em);
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
next:
/* Once for us (for our lookup reference). */
free_extent_map(em);
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
em = next_em;
}
btrfs: drop extent map range more efficiently Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-19 14:06:40 +00:00
write_unlock(&em_tree->lock);
free_extent_map(split);
free_extent_map(split2);
}
/*
* Replace a range in the inode's extent map tree with a new extent map.
*
* @inode: The target inode.
* @new_em: The new extent map to add to the inode's extent map tree.
* @modified: Indicate if the new extent map should be added to the list of
* modified extents (for fast fsync tracking).
*
* Drops all the extent maps in the inode's extent map tree that intersect the
* range of the new extent map and adds the new extent map to the tree.
* The caller should have locked an appropriate file range in the inode's io
* tree before calling this function.
*/
int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
struct extent_map *new_em,
bool modified)
{
const u64 end = new_em->start + new_em->len - 1;
struct extent_map_tree *tree = &inode->extent_tree;
int ret;
ASSERT(!extent_map_in_tree(new_em));
/*
* The caller has locked an appropriate file range in the inode's io
* tree, but getting -EEXIST when adding the new extent map can still
* happen in case there are extents that partially cover the range, and
* this is due to two tasks operating on different parts of the extent.
* See commit 18e83ac75bfe67 ("Btrfs: fix unexpected EEXIST from
* btrfs_get_extent") for an example and details.
*/
do {
btrfs_drop_extent_map_range(inode, new_em->start, end, false);
write_lock(&tree->lock);
ret = add_extent_mapping(inode, new_em, modified);
write_unlock(&tree->lock);
} while (ret == -EEXIST);
return ret;
}
/*
* Split off the first pre bytes from the extent_map at [start, start + len],
* and set the block_start for it to new_logical.
*
* This function is used when an ordered_extent needs to be split.
*/
int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre,
u64 new_logical)
{
struct extent_map_tree *em_tree = &inode->extent_tree;
struct extent_map *em;
struct extent_map *split_pre = NULL;
struct extent_map *split_mid = NULL;
int ret = 0;
unsigned long flags;
ASSERT(pre != 0);
ASSERT(pre < len);
split_pre = alloc_extent_map();
if (!split_pre)
return -ENOMEM;
split_mid = alloc_extent_map();
if (!split_mid) {
ret = -ENOMEM;
goto out_free_pre;
}
lock_extent(&inode->io_tree, start, start + len - 1, NULL);
write_lock(&em_tree->lock);
em = lookup_extent_mapping(em_tree, start, len);
if (!em) {
ret = -EIO;
goto out_unlock;
}
ASSERT(em->len == len);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
ASSERT(!extent_map_is_compressed(em));
ASSERT(em->disk_bytenr < EXTENT_MAP_LAST_BYTE);
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
ASSERT(em->flags & EXTENT_FLAG_PINNED);
ASSERT(!(em->flags & EXTENT_FLAG_LOGGING));
ASSERT(!list_empty(&em->list));
flags = em->flags;
btrfs: use the flags of an extent map to identify the compression type Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2023-12-04 16:20:33 +00:00
em->flags &= ~EXTENT_FLAG_PINNED;
/* First, replace the em with a new extent_map starting from * em->start */
split_pre->start = em->start;
split_pre->len = pre;
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split_pre->disk_bytenr = new_logical;
split_pre->disk_num_bytes = split_pre->len;
split_pre->offset = 0;
split_pre->ram_bytes = split_pre->len;
split_pre->flags = flags;
split_pre->generation = em->generation;
replace_extent_mapping(inode, em, split_pre, 1);
/*
* Now we only have an extent_map at:
* [em->start, em->start + pre]
*/
/* Insert the middle extent_map. */
split_mid->start = em->start + pre;
split_mid->len = em->len - pre;
split_mid->disk_bytenr = extent_map_block_start(em) + pre;
btrfs: introduce new members for extent_map Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents |<----- data extent A ----->| |<- em 1 ->|<- em 2 ->| * Both extent maps are referring to different data extents |<-- data extent A -->|<-- data extent B -->| |<- em 1 ->|<- em 2 ->| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-29 22:23:02 +00:00
split_mid->disk_num_bytes = split_mid->len;
split_mid->offset = 0;
split_mid->ram_bytes = split_mid->len;
split_mid->flags = flags;
split_mid->generation = em->generation;
add_extent_mapping(inode, split_mid, 1);
/* Once for us */
free_extent_map(em);
/* Once for the tree */
free_extent_map(em);
out_unlock:
write_unlock(&em_tree->lock);
unlock_extent(&inode->io_tree, start, start + len - 1, NULL);
free_extent_map(split_mid);
out_free_pre:
free_extent_map(split_pre);
return ret;
}
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
struct btrfs_em_shrink_ctx {
long nr_to_scan;
long scanned;
u64 last_ino;
u64 last_root;
};
static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_ctx *ctx)
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
{
const u64 cur_fs_gen = btrfs_get_fs_generation(inode->root->fs_info);
struct extent_map_tree *tree = &inode->extent_tree;
long nr_dropped = 0;
struct rb_node *node;
/*
* Take the mmap lock so that we serialize with the inode logging phase
* of fsync because we may need to set the full sync flag on the inode,
* in case we have to remove extent maps in the tree's list of modified
* extents. If we set the full sync flag in the inode while an fsync is
* in progress, we may risk missing new extents because before the flag
* is set, fsync decides to only wait for writeback to complete and then
* during inode logging it sees the flag set and uses the subvolume tree
* to find new extents, which may not be there yet because ordered
* extents haven't completed yet.
*
* We also do a try lock because otherwise we could deadlock. This is
* because the shrinker for this filesystem may be invoked while we are
* in a path that is holding the mmap lock in write mode. For example in
* a reflink operation while COWing an extent buffer, when allocating
* pages for a new extent buffer and under memory pressure, the shrinker
* may be invoked, and therefore we would deadlock by attempting to read
* lock the mmap lock while we are holding already a write lock on it.
*/
if (!down_read_trylock(&inode->i_mmap_lock))
return 0;
btrfs: stop extent map shrinker if reschedule is needed The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-08 14:42:44 +00:00
/*
btrfs: only run the extent map shrinker from kswapd tasks Currently the extent map shrinker can be run by any task when attempting to allocate memory and there's enough memory pressure to trigger it. To avoid too much latency we stop iterating over extent maps and removing them once the task needs to reschedule. This logic was introduced in commit b3ebb9b7e92a ("btrfs: stop extent map shrinker if reschedule is needed"). While that solved high latency problems for some use cases, it's still not enough because with a too high number of tasks entering the extent map shrinker code, either due to memory allocations or because they are a kswapd task, we end up having a very high level of contention on some spin locks, namely: 1) The fs_info->fs_roots_radix_lock spin lock, which we need to find roots to iterate over their inodes; 2) The spin lock of the xarray used to track open inodes for a root (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to be a red black tree and the spin lock was root->inode_lock; 3) The fs_info->delayed_iput_lock spin lock since the shrinker adds delayed iputs (calls btrfs_add_delayed_iput()). Instead of allowing the extent map shrinker to be run by any task, make it run only by kswapd tasks. This still solves the problem of running into OOM situations due to an unbounded extent map creation, which is simple to trigger by direct IO writes, as described in the changelog of commit 956a17d9d050 ("btrfs: add a shrinker for extent maps"), and by a similar case when doing buffered IO on files with a very large number of holes (keeping the file open and creating many holes, whose extent maps are only released when the file is closed). Reported-by: kzd <kzd@56709.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121 Reported-by: Octavia Togami <octavia.togami@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/ Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Tested-by: kzd <kzd@56709.net> Tested-by: Octavia Togami <octavia.togami@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-11 10:53:42 +00:00
* We want to be fast so if the lock is busy we don't want to spend time
btrfs: stop extent map shrinker if reschedule is needed The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-08 14:42:44 +00:00
* waiting for it - either some task is about to do IO for the inode or
* we may have another task shrinking extent maps, here in this code, so
* skip this inode.
*/
if (!write_trylock(&tree->lock)) {
up_read(&inode->i_mmap_lock);
return 0;
}
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
node = rb_first(&tree->root);
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
while (node) {
btrfs: use a regular rb_root instead of cached rb_root for extent_map_tree We are currently using a cached rb_root (struct rb_root_cached) for the rb root of struct extent_map_tree. This doesn't offer much of an advantage here because: 1) It's only advantage over the regular rb_root is that it caches a pointer to the left most node (first node), so a call to rb_first_cached() doesn't have to chase pointers until it reaches the left most node; 2) We only have two scenarios that access left most node with rb_first_cached(): When dropping all extent maps from an inode, during inode eviction; When iterating over extent maps during the extent map shrinker; 3) In both cases we keep removing extent maps, which causes deletion of the left most node so rb_erase_cached() has to call rb_next() to find out what's the next left most node and assign it to struct rb_root_cached::rb_leftmost; 4) We can do that ourselves in those two uses cases and stop using a rb_root_cached rb tree and use instead a regular rb_root rb tree. This reduces the size of struct extent_map_tree by 8 bytes and, since this structure is embedded in struct btrfs_inode, it also reduces the size of that structure by 8 bytes. So on a 64 bits platform the size of btrfs_inode is reduced from 1032 bytes down to 1024 bytes. This means we will be able to have 4 inodes per 4K page instead of 3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-05-10 16:41:04 +00:00
struct rb_node *next = rb_next(node);
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
struct extent_map *em;
em = rb_entry(node, struct extent_map, rb_node);
ctx->scanned++;
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
if (em->flags & EXTENT_FLAG_PINNED)
goto next;
/*
* If the inode is in the list of modified extents (new) and its
* generation is the same (or is greater than) the current fs
* generation, it means it was not yet persisted so we have to
* set the full sync flag so that the next fsync will not miss
* it.
*/
if (!list_empty(&em->list) && em->generation >= cur_fs_gen)
btrfs_set_inode_full_sync(inode);
remove_extent_mapping(inode, em);
trace_btrfs_extent_map_shrinker_remove_em(inode, em);
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
/* Drop the reference for the tree. */
free_extent_map(em);
nr_dropped++;
next:
if (ctx->scanned >= ctx->nr_to_scan)
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
break;
/*
btrfs: stop extent map shrinker if reschedule is needed The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-08 14:42:44 +00:00
* Stop if we need to reschedule or there's contention on the
* lock. This is to avoid slowing other tasks trying to take the
btrfs: only run the extent map shrinker from kswapd tasks Currently the extent map shrinker can be run by any task when attempting to allocate memory and there's enough memory pressure to trigger it. To avoid too much latency we stop iterating over extent maps and removing them once the task needs to reschedule. This logic was introduced in commit b3ebb9b7e92a ("btrfs: stop extent map shrinker if reschedule is needed"). While that solved high latency problems for some use cases, it's still not enough because with a too high number of tasks entering the extent map shrinker code, either due to memory allocations or because they are a kswapd task, we end up having a very high level of contention on some spin locks, namely: 1) The fs_info->fs_roots_radix_lock spin lock, which we need to find roots to iterate over their inodes; 2) The spin lock of the xarray used to track open inodes for a root (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to be a red black tree and the spin lock was root->inode_lock; 3) The fs_info->delayed_iput_lock spin lock since the shrinker adds delayed iputs (calls btrfs_add_delayed_iput()). Instead of allowing the extent map shrinker to be run by any task, make it run only by kswapd tasks. This still solves the problem of running into OOM situations due to an unbounded extent map creation, which is simple to trigger by direct IO writes, as described in the changelog of commit 956a17d9d050 ("btrfs: add a shrinker for extent maps"), and by a similar case when doing buffered IO on files with a very large number of holes (keeping the file open and creating many holes, whose extent maps are only released when the file is closed). Reported-by: kzd <kzd@56709.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121 Reported-by: Octavia Togami <octavia.togami@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/ Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Tested-by: kzd <kzd@56709.net> Tested-by: Octavia Togami <octavia.togami@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-11 10:53:42 +00:00
* lock.
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
*/
btrfs: stop extent map shrinker if reschedule is needed The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-08 14:42:44 +00:00
if (need_resched() || rwlock_needbreak(&tree->lock))
break;
for-6.11-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaVN3MACgkQxWXV+ddt WDtpIRAAl+1NjsEj8e5V/UYn8Jr06ujTOnrkR3PCTICxDHbUaMLkQEw21H0K/ogQ 3fOiEVpSlZOfKdYXtXaMQbC0jd/Af2eA10Uht96nAEjAtxu1uJ4cFZGu2meNdXZP xUioivJ/CElMPH2aluG6FaQvUTqmhrEr8tSoYbxzQmUd434q9kqqyjtw1tfzYDG1 VDn2f7ykhpB/8P0aoqgWSshWTmaCzG0GkuI28o1o0iZUIF/P9TKdzxlLRW6BVHE7 T2oGLEQjN1GQbCH75L4IeNJDkCBVfcDcbZkUDJ/ae4Pt/jJQTFY53YIP9wXFZQnd mdfHmK7Atpsk75ATftYSq+ENkbQ5fsuut5CD63u54gAqA4M1FncDXTAWS1Y30F76 P8juSCmsSy0o3gTflDIo/IMdntoh/JmncwwStF6oKzmyUZZzzarsqM8mc1P03ZNt 3ttlnbY7lC1TDAlD5J2wXE0INCT2pN+4C9IToWdRypeuLu6qrI7cQ0oylyp9OVQM t9umTXm0B6s1cyqEDjJf0xJZS/JTHYwu7S4EmAJwicgiLpOjABVTmO8021rVmDJy TAUu6yEhSsrTT6Dxm7/2Et1EEOKFF5hhsG1SiGD9oUIZK6B5+0waT+rbkEWl7osR 4/TAv2zX6tuCc7HIW0fQloM/6/Gyd5wcDVaQNDUzFA075uKstwY= =k5d3 -----END PGP SIGNATURE----- Merge tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "The highlights are new logic behind background block group reclaim, automatic removal of qgroup after removing a subvolume and new 'rescue=' mount options. The rest is optimizations, cleanups and refactoring. User visible features: - dynamic block group reclaim: - tunable framework to avoid situations where eager data allocations prevent creating new metadata chunks due to lack of unallocated space - reuse sysfs knob bg_reclaim_threshold (otherwise used only in zoned mode) for a fixed value threshold - new on/off sysfs knob "dynamic_reclaim" calculating the value based on heuristics, aiming to keep spare working space for relocating chunks but not to needlessly relocate partially utilized block groups or reclaim newly allocated ones - stats are exported in sysfs per block group type, files "reclaim_*" - this may increase IO load at unexpected times but the corner case of no allocatable block groups is known to be worse - automatically remove qgroup of deleted subvolumes: - adjust qgroup removal conditions, make sure all related subvolume data are already removed, or return EBUSY, also take into account setting of sysfs drop_subtree_threshold - also works in squota mode - mount option updates: new modes of 'rescue=' that allow to mount images (read-only) that could have been partially converted by user space tools - ignoremetacsums - invalid metadata checksums are ignored - ignoresuperflags - super block flags that track conversion in progress (like UUID or checksums) Core: - size of struct btrfs_inode is now below 1024 (on a release config), improved memory packing and other secondary effects - switch tracking of open inodes from rb-tree to xarray, minor performance improvement - reduce number of empty transaction commits when there are no dirty data/metadata - memory allocation optimizations (reduced numbers, reordering out of critical sections) - extent map structure optimizations and refactoring, more sanity checks - more subpage in zoned mode preparations or fixes - general snapshot code cleanups, improvements and documentation - tree-checker updates: more file extent ram_bytes fixes, continued - raid-stripe-tree update (not backward compatible): - remove extent encoding field from the structure, can be inferred from other information - requires btrfs-progs 6.9.1 or newer - cleanups and refactoring - error message updates - error handling improvements - return type and parameter cleanups and improvements" * tag 'for-6.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (152 commits) btrfs: fix extent map use-after-free when adding pages to compressed bio btrfs: fix bitmap leak when loading free space cache on duplicate entry btrfs: remove the BUG_ON() inside extent_range_clear_dirty_for_io() btrfs: move extent_range_clear_dirty_for_io() into inode.c btrfs: enhance compression error messages btrfs: fix data race when accessing the last_trans field of a root btrfs: rename the extra_gfp parameter of btrfs_alloc_page_array() btrfs: remove the extra_gfp parameter from btrfs_alloc_folio_array() btrfs: introduce new "rescue=ignoresuperflags" mount option btrfs: introduce new "rescue=ignoremetacsums" mount option btrfs: output the unrecognized super block flags as hex btrfs: remove unused Opt enums btrfs: tree-checker: add extra ram_bytes and disk_num_bytes check btrfs: fix the ram_bytes assignment for truncated ordered extents btrfs: make validate_extent_map() catch ram_bytes mismatch btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map() btrfs: fix typo in error message in btrfs_validate_super() btrfs: move the direct IO code into its own file btrfs: pass a btrfs_inode to btrfs_set_prop() ...
2024-07-17 19:38:04 +00:00
node = next;
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
}
write_unlock(&tree->lock);
up_read(&inode->i_mmap_lock);
return nr_dropped;
}
static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx *ctx)
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
{
struct btrfs_inode *inode;
long nr_dropped = 0;
u64 min_ino = ctx->last_ino + 1;
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
inode = btrfs_find_first_inode(root, min_ino);
while (inode) {
nr_dropped += btrfs_scan_inode(inode, ctx);
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
min_ino = btrfs_ino(inode) + 1;
ctx->last_ino = btrfs_ino(inode);
btrfs: use delayed iput during extent map shrinking When putting an inode during extent map shrinking we're doing a standard iput() but that may take a long time in case the inode is dirty and we are doing the final iput that triggers eviction - the VFS will have to wait for writeback before calling the btrfs evict callback (see fs/inode.c:evict()). This slows down the task running the shrinker which may have been triggered while updating some tree for example, meaning locks are held as well as an open transaction handle. Also if the iput() ends up triggering eviction and the inode has no links anymore, then we trigger item truncation which requires flushing delayed items, space reservation to start a transaction and that may trigger the space reclaim task and wait for it, resulting in deadlocks in case the reclaim task needs for example to commit a transaction and the shrinker is being triggered from a path holding a transaction handle. Syzbot reported such a case with the following stack traces: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted ------------------------------------------------------ kswapd0/111 is trying to acquire lock: ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 but task is already holding lock: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: __fs_reclaim_acquire mm/page_alloc.c:3783 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911 btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114 btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373 __btrfs_balance fs/btrfs/volumes.c:4157 [inline] btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534 btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline] btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742 __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #1 (btrfs_trans_num_writers){++++}-{0:0}: join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #0 (sb_internal#3){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Chain exists of: sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(btrfs_trans_num_extwriters); lock(fs_reclaim); rlock(sb_internal#3); *** DEADLOCK *** 2 locks held by kswapd0/111: #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline] #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196 stack backtrace: CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 </TASK> So fix this by using btrfs_add_delayed_iput() so that the final iput is delegated to the cleaner kthread. Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/ Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-03 10:07:26 +00:00
btrfs_add_delayed_iput(inode);
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
if (ctx->scanned >= ctx->nr_to_scan)
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
break;
btrfs: only run the extent map shrinker from kswapd tasks Currently the extent map shrinker can be run by any task when attempting to allocate memory and there's enough memory pressure to trigger it. To avoid too much latency we stop iterating over extent maps and removing them once the task needs to reschedule. This logic was introduced in commit b3ebb9b7e92a ("btrfs: stop extent map shrinker if reschedule is needed"). While that solved high latency problems for some use cases, it's still not enough because with a too high number of tasks entering the extent map shrinker code, either due to memory allocations or because they are a kswapd task, we end up having a very high level of contention on some spin locks, namely: 1) The fs_info->fs_roots_radix_lock spin lock, which we need to find roots to iterate over their inodes; 2) The spin lock of the xarray used to track open inodes for a root (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to be a red black tree and the spin lock was root->inode_lock; 3) The fs_info->delayed_iput_lock spin lock since the shrinker adds delayed iputs (calls btrfs_add_delayed_iput()). Instead of allowing the extent map shrinker to be run by any task, make it run only by kswapd tasks. This still solves the problem of running into OOM situations due to an unbounded extent map creation, which is simple to trigger by direct IO writes, as described in the changelog of commit 956a17d9d050 ("btrfs: add a shrinker for extent maps"), and by a similar case when doing buffered IO on files with a very large number of holes (keeping the file open and creating many holes, whose extent maps are only released when the file is closed). Reported-by: kzd <kzd@56709.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121 Reported-by: Octavia Togami <octavia.togami@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/ Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Tested-by: kzd <kzd@56709.net> Tested-by: Octavia Togami <octavia.togami@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-11 10:53:42 +00:00
cond_resched();
btrfs: stop extent map shrinker if reschedule is needed The extent map shrinker can be called in a variety of contexts where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-08 14:42:44 +00:00
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
inode = btrfs_find_first_inode(root, min_ino);
}
if (inode) {
/*
* There are still inodes in this root or we happened to process
* the last one and reached the scan limit. In either case set
* the current root to this one, so we'll resume from the next
* inode if there is one or we will find out this was the last
* one and move to the next root.
*/
ctx->last_root = btrfs_root_id(root);
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
} else {
/*
* No more inodes in this root, set extent_map_shrinker_last_ino to 0 so
* that when processing the next root we start from its first inode.
*/
ctx->last_ino = 0;
ctx->last_root = btrfs_root_id(root) + 1;
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
}
return nr_dropped;
}
long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan)
{
struct btrfs_em_shrink_ctx ctx;
u64 start_root_id;
u64 next_root_id;
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
bool cycled = false;
long nr_dropped = 0;
ctx.scanned = 0;
ctx.nr_to_scan = nr_to_scan;
/*
* In case we have multiple tasks running this shrinker, make the next
* one start from the next inode in case it starts before we finish.
*/
spin_lock(&fs_info->extent_map_shrinker_lock);
ctx.last_ino = fs_info->extent_map_shrinker_last_ino;
fs_info->extent_map_shrinker_last_ino++;
ctx.last_root = fs_info->extent_map_shrinker_last_root;
spin_unlock(&fs_info->extent_map_shrinker_lock);
start_root_id = ctx.last_root;
next_root_id = ctx.last_root;
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) {
s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan,
nr, ctx.last_root,
ctx.last_ino);
}
btrfs: only run the extent map shrinker from kswapd tasks Currently the extent map shrinker can be run by any task when attempting to allocate memory and there's enough memory pressure to trigger it. To avoid too much latency we stop iterating over extent maps and removing them once the task needs to reschedule. This logic was introduced in commit b3ebb9b7e92a ("btrfs: stop extent map shrinker if reschedule is needed"). While that solved high latency problems for some use cases, it's still not enough because with a too high number of tasks entering the extent map shrinker code, either due to memory allocations or because they are a kswapd task, we end up having a very high level of contention on some spin locks, namely: 1) The fs_info->fs_roots_radix_lock spin lock, which we need to find roots to iterate over their inodes; 2) The spin lock of the xarray used to track open inodes for a root (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to be a red black tree and the spin lock was root->inode_lock; 3) The fs_info->delayed_iput_lock spin lock since the shrinker adds delayed iputs (calls btrfs_add_delayed_iput()). Instead of allowing the extent map shrinker to be run by any task, make it run only by kswapd tasks. This still solves the problem of running into OOM situations due to an unbounded extent map creation, which is simple to trigger by direct IO writes, as described in the changelog of commit 956a17d9d050 ("btrfs: add a shrinker for extent maps"), and by a similar case when doing buffered IO on files with a very large number of holes (keeping the file open and creating many holes, whose extent maps are only released when the file is closed). Reported-by: kzd <kzd@56709.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121 Reported-by: Octavia Togami <octavia.togami@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/ Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Tested-by: kzd <kzd@56709.net> Tested-by: Octavia Togami <octavia.togami@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-11 10:53:42 +00:00
while (ctx.scanned < ctx.nr_to_scan) {
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
struct btrfs_root *root;
unsigned long count;
btrfs: only run the extent map shrinker from kswapd tasks Currently the extent map shrinker can be run by any task when attempting to allocate memory and there's enough memory pressure to trigger it. To avoid too much latency we stop iterating over extent maps and removing them once the task needs to reschedule. This logic was introduced in commit b3ebb9b7e92a ("btrfs: stop extent map shrinker if reschedule is needed"). While that solved high latency problems for some use cases, it's still not enough because with a too high number of tasks entering the extent map shrinker code, either due to memory allocations or because they are a kswapd task, we end up having a very high level of contention on some spin locks, namely: 1) The fs_info->fs_roots_radix_lock spin lock, which we need to find roots to iterate over their inodes; 2) The spin lock of the xarray used to track open inodes for a root (struct btrfs_root::inodes) - on 6.10 kernels and below, it used to be a red black tree and the spin lock was root->inode_lock; 3) The fs_info->delayed_iput_lock spin lock since the shrinker adds delayed iputs (calls btrfs_add_delayed_iput()). Instead of allowing the extent map shrinker to be run by any task, make it run only by kswapd tasks. This still solves the problem of running into OOM situations due to an unbounded extent map creation, which is simple to trigger by direct IO writes, as described in the changelog of commit 956a17d9d050 ("btrfs: add a shrinker for extent maps"), and by a similar case when doing buffered IO on files with a very large number of holes (keeping the file open and creating many holes, whose extent maps are only released when the file is closed). Reported-by: kzd <kzd@56709.net> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219121 Reported-by: Octavia Togami <octavia.togami@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAHPNGSSt-a4ZZWrtJdVyYnJFscFjP9S7rMcvEMaNSpR556DdLA@mail.gmail.com/ Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Tested-by: kzd <kzd@56709.net> Tested-by: Octavia Togami <octavia.togami@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-08-11 10:53:42 +00:00
cond_resched();
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
spin_lock(&fs_info->fs_roots_radix_lock);
count = radix_tree_gang_lookup(&fs_info->fs_roots_radix,
(void **)&root,
(unsigned long)next_root_id, 1);
if (count == 0) {
spin_unlock(&fs_info->fs_roots_radix_lock);
if (start_root_id > 0 && !cycled) {
next_root_id = 0;
ctx.last_root = 0;
ctx.last_ino = 0;
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
cycled = true;
continue;
}
break;
}
next_root_id = btrfs_root_id(root) + 1;
root = btrfs_grab_root(root);
spin_unlock(&fs_info->fs_roots_radix_lock);
if (!root)
continue;
if (is_fstree(btrfs_root_id(root)))
nr_dropped += btrfs_scan_root(root, &ctx);
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
btrfs_put_root(root);
}
/*
* In case of multiple tasks running this extent map shrinking code this
* isn't perfect but it's simple and silences things like KCSAN. It's
* not possible to know which task made more progress because we can
* cycle back to the first root and first inode if it's not the first
* time the shrinker ran, see the above logic. Also a task that started
* later may finish ealier than another task and made less progress. So
* make this simple and update to the progress of the last task that
* finished, with the occasional possiblity of having two consecutive
* runs of the shrinker process the same inodes.
*/
spin_lock(&fs_info->extent_map_shrinker_lock);
fs_info->extent_map_shrinker_last_ino = ctx.last_ino;
fs_info->extent_map_shrinker_last_root = ctx.last_root;
spin_unlock(&fs_info->extent_map_shrinker_lock);
if (trace_btrfs_extent_map_shrinker_scan_exit_enabled()) {
s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps);
trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped,
nr, ctx.last_root,
ctx.last_ino);
}
btrfs: add a shrinker for extent maps Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written and the respective file extent items are created when the ordered extent completes. We currently don't have any limit for how many extent maps we can have, neither per inode nor globally. Most of the time this not too noticeable because extent maps are removed in the following situations: 1) When evicting an inode; 2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback. However we won't release extent maps in the folio range if the folio is either dirty or under writeback or if the inode's i_size is less than or equals to 16M (see try_release_extent_mapping(). This 16M i_size constraint was added back in 2008 with commit 70dec8079d78 ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation about why we have it or why the 16M value. This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens: 1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, specially if they keep the files open for very long periods, therefore preventing inode eviction. This requires a really high number of such files, and having many non mergeable extent maps (due to random 4K writes for example) and a machine with very little memory; 2) There's a set tasks constantly doing random write IO (therefore creating many non mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there's always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps. This second case was actually reported in the thread pointed by the Link tag below, and it requires a very large file under heavy IO and a machine with very little amount of RAM, which is probably hard to happen in practice in a real world use case. However when using direct IO this is not so hard to happen, because the page cache is not used, and therefore btrfs_release_folio() is never called. Which means extent maps are dropped only when evicting the inode, and that means that if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded amount of extent maps. This is especially easy to happen if we have a huge file with millions of small extents and their extent maps are not mergeable (non contiguous offsets and disk locations). This was reported in that thread with the following fio test: $ cat test.sh #!/bin/bash DEV=/dev/sdj MNT=/mnt/sdj MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="" cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bs=4K direct=1 numjobs=16 fallocate=none time_based runtime=90000 [file1] size=300G ioengine=libaio iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Monitoring the btrfs_extent_map slab while running the test with: $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \ /sys/kernel/slab/btrfs_extent_map/total_objects' Shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a short amount of memory it's easy and quick to get into an OOM situation, as reported in that thread. So to avoid this issue add a shrinker that will remove extents maps, as long as they are not pinned, and takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync). This shrinker is triggered through the callbacks nr_cached_objects and free_cached_objects of struct super_operations. The shrinker will iterate over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (implements those callbacks, and cycles through inodes by starting from where it ended last time). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-15 16:09:26 +00:00
return nr_dropped;
}