linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-29 07:31:29 +00:00

Author	SHA1	Message	Date
Filipe Manana	c48545debf	btrfs: send: use the lru cache to implement the name cache The name cache in send is basically a lru cache implemented with a radix tree and linked lists, very similar to the lru cache module which is used for the send backref cache and the cache of previously created directories during a send operation. So remove all the custom caching code for the name cache and make it use the lru cache instead. One particular detail to note is that the current cache behaves a bit differently when it comes to eviction of entries. Namely when after inserting a new name in the cache, if the cache now has 256 entries, we evict the last 128 LRU entries. The lru_cache.{c,h} module behaves a bit differently in that once we reach the cache limit, we evict a single LRU entry. In practice this doesn't make much difference, but it's actually better to evict just one entry instead of half of the entries, as there's always a chance we will need a name stored in one of that last 128 removed entries. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-15 19:36:32 +01:00
Filipe Manana	d588adae3b	btrfs: add an api to delete a specific entry from the lru cache In order to replace the open coded name cache in send with the lru cache, we need an API for the lru cache to delete a specific entry for which we did a previous lookup. This adds the API for it, and a next patch in the series will use it. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:36 +01:00
Filipe Manana	0da0c5605e	btrfs: allow a generation number to be associated with lru cache entries This allows an optional generation number to be associated to each entry of the lru cache. Entries with the same key but different generations, are stored in the linked list to which the maple tree points to. This is meant to be used when there's a small number of different generations, so the impact of searching a linked list is negligible. The goal is to get rid of the open coded name cache in the send code (which uses a radix tree and a similar linked list of values/entries) and use instead the lru cache module. For that particular use case we have at most 2 generations that are associated to each key (inode number): one generation for the send root and another generation for the parent root. The actual migration of the send name cache is done in the next patch in the series. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:36 +01:00
Filipe Manana	e8a7f49d9b	btrfs: send: cache information about created directories During an incremental send, when processing the reference for an inode we need to check if the directory where the new reference is located was already created before creating the new reference. This check, which is done by the helper did_create_dir(), can be expensive if the directory has many entries, since it consists in searching the send root's b+tree and visiting every single dir index key until we either find one which points to an inode with a number smaller than the current inode's number or until we visited all index keys. So it doesn't scale well for very large directories. So improve on this by caching created directories using a lru cache, and limiting its size to 64 entries, which results in using at most 4096 bytes of memory. The caching is optional, if we fail to allocate memory, we just proceed as before and use the existing slower path. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:36 +01:00
Filipe Manana	6273ee621f	btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems The lru cache is backed by a maple tree, which uses the unsigned long type for keys, and that type has a width of 32 bits on 32 bits systems and a width of 64 bits on 64 bits systems. Currently there is only one user of the lru cache, the send backref cache, which uses a sector number as a key, a logical address right shifted by fs_info->sectorsize_bits, so a 32 bits width is not yet a problem (the same happens with the radix tree we use to track extent buffers, fs_info->buffer_radix). However the next patches in the series will start using the lru cache for cases where inode numbers are the keys, and the inode numbers are always 64 bits, even if we are running on a 32 bits system. So adapt the lru cache to allow multiple values under the same key, by having the maple tree store a head entry that points to a list of entries instead of pointing to a single entry. This is a similar approach to what we currently do for the name cache in send (which uses a radix tree that has indexes with an unsigned long type as well), and will allow later to use the lru cache for the send name cache as well. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:36 +01:00
Filipe Manana	90b90d4ac0	btrfs: send: genericize the backref cache to allow it to be reused The backref cache is a cache backed by a maple tree and a linked list to keep track of temporal access to cached entries (the LRU entry always at the head of the list). This type of caching method is going to be useful in other scenarios, so make the cache implementation more generic and move it into its own header and source files. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	d307d2f35c	btrfs: send: initialize all the red black trees earlier After we allocate the send context object and before we initialize all the red black trees, we can jump to the 'out' label if some errors happen, and then under the 'out' label we use RB_EMPTY_ROOT() against some of the those trees, which we have not yet initialized. This happens to work out ok because the send context object was initialized to zeroes with kzalloc and the RB_ROOT initializer just happens to have the following definition: #define RB_ROOT (struct rb_root) { NULL, } But it's really neither clean nor a good practice as RB_ROOT is supposed to be opaque and in case it changes or we change those red black trees to some other data structure, it leaves us in a precarious situation. So initialize all the red black trees immediately after allocating the send context and before any jump into the 'out' label. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	8c139e1d78	btrfs: send: iterate waiting dir move rbtree only once when processing refs When processing the new references for an inode, we unnecessarily iterate twice the waiting dir moves rbtree, once with is_waiting_for_move() and if we found an entry in the rbtree, we iterate it again with a call to get_waiting_dir_move(). This is pointless, we can make this simpler and more efficient by calling only get_waiting_dir_move(), so just do that. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	474e4761f6	btrfs: send: reduce searches on parent root when checking if dir can be removed During an incremental send, every time we remove a reference (dentry) for an inode and the parent directory does not exists anymore in the send root, we go check if we can remove the directory by making a call to can_rmdir(). This helper can only return true (value 1) if all dentries were already removed, and for that it always does a search on the parent root for dir index keys - if it finds any dentry referring to an inode with a number higher then the inode currently being processed, then the directory can not be removed and it must return false (value 0). However that means if a directory that was deleted had 1000 dentries, and each one pointed to an inode with a number higher then the number of the directory's inode, we end up doing 1000 searches on the parent root. Typically files are created in a directory after the directory was created and therefore they get an higher inode number than the directory. It's also common to have the each dentry pointing to an inode with a higher number then the inodes the previous dentries point to, for example when creating a series of files inside a directory, a very common pattern. So improve on that by having the first call to can_rmdir() for a directory to check the number of the inode that the last dentry points to and cache that inode number in the orphan dir structure. Then every subsequent call to can_rmdir() can avoid doing a search on the parent root if the number of the inode currently being processed is smaller than cached inode number at the directory's orphan dir structure. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	78cf1a954d	btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() At can_rmdir() we start by searching the orphan dirs rbtree for an orphan dir object for the target directory. Later when iterating over the dir index keys, if we find that any dir entry points to inode for which there is a pending dir move or the inode was not yet processed, we exit because we can't remove the directory yet. However we end up always calling add_orphan_dir_info(), which will iterate again the rbtree and if there is already an orphan dir object (created by the first call to can_rmdir()), it returns the existing object. This is unnecessary work because in case there is already an existing orphan dir object, we got a reference to it at the start of can_rmdir(). So skip the call to add_orphan_dir_info() if we already have a reference for an orphan dir object. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	d921b9cf91	btrfs: send: avoid duplicated orphan dir allocation and initialization At can_rmdir() we are allocating and initializing an orphan dir object twice. This can be deduplicated outside of the loop that iterates over the dir index keys. So deduplicate that code, even because other patch in the series will need to add more initialization code and another one will add one more condition. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	24970ccb24	btrfs: send: remove send_progress argument from can_rmdir() All callers of can_rmdir() pass sctx->cur_ino as the value for the send_progress argument, so remove the argument and directly use sctx->cur_ino. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	498581f33c	btrfs: send: avoid extra b+tree searches when checking reference overrides During an incremental send, when processing the new references of an inode (either it's a new inode or an existing one renamed/moved), he will search the b+tree of the send or parent roots in order to find out the inode item of the parent directory and extract its generation. However we are doing that search twice, once with is_inode_existent() -> get_cur_inode_state() and then again at did_overwrite_ref() or will_overwrite_ref(). So avoid that and get the generation at get_cur_inode_state() and then propagate it up to did_overwrite_ref() and will_overwrite_ref(). This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	b3047a42f5	btrfs: send: directly return from will_overwrite_ref() and simplify it There are no resources to release before will_overwrite_ref() returns, so we don't really need the 'out' label and jumping to it when conditions are met - we can directly return and get rid of the label and jumps. Also we can deal with -ENOENT and other errors in a single if-else logic, as it's more straightforward. This helps the next patch in the series to be more simple as well. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	cb68948194	btrfs: send: avoid unnecessary generation search at did_overwrite_ref() At did_overwrite_ref() we always call get_inode_gen() to find out the generation of the inode 'ow_inode'. However we don't always need to use that generation, and in fact it's very common to not use it, so we end up doing a b+tree search on the send root, allocating a path, etc, for nothing. So improve on this by getting the generation only if we need to use it. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Filipe Manana	e739ba307f	btrfs: send: directly return from did_overwrite_ref() and simplify it There are no resources to release before did_overwrite_ref() returns, so we don't really need the 'out' label and jumping to it when conditions are met - we can directly return and get rid of the label and jumps. Also we can deal with -ENOENT and other errors in a single if-else logic, as it's more straightforward. This helps the next patch in the series to be more simple as well. This patch is part of a larger patchset and the changelog of the last patch in the series contains a sample performance test and results. The patches that comprise the patchset are the following: btrfs: send: directly return from did_overwrite_ref() and simplify it btrfs: send: avoid unnecessary generation search at did_overwrite_ref() btrfs: send: directly return from will_overwrite_ref() and simplify it btrfs: send: avoid extra b+tree searches when checking reference overrides btrfs: send: remove send_progress argument from can_rmdir() btrfs: send: avoid duplicated orphan dir allocation and initialization btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() btrfs: send: reduce searches on parent root when checking if dir can be removed btrfs: send: iterate waiting dir move rbtree only once when processing refs btrfs: send: initialize all the red black trees earlier btrfs: send: genericize the backref cache to allow it to be reused btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems btrfs: send: cache information about created directories btrfs: allow a generation number to be associated with lru cache entries btrfs: add an api to delete a specific entry from the lru cache btrfs: send: use the lru cache to implement the name cache btrfs: send: update size of roots array for backref cache entries btrfs: send: cache utimes operations for directories if possible Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
Qu Wenruo	b7625f461d	btrfs: sysfs: update fs features directory asynchronously [BUG] Since the introduction of per-fs feature sysfs interface (/sys/fs/btrfs/<UUID>/features/), the content of that directory is never updated. Thus for the following case, that directory will not show the new features like RAID56: # mkfs.btrfs -f $dev1 $dev2 $dev3 # mount $dev1 $mnt # btrfs balance start -f -mconvert=raid5 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes skinny_metadata While after unmount and mount, we got the correct features: # umount $mnt # mount $dev1 $mnt # ls /sys/fs/btrfs/$uuid/features/ extended_iref free_space_tree no_holes raid56 skinny_metadata [CAUSE] Because we never really try to update the content of per-fs features/ directory. We had an attempt to update the features directory dynamically in commit `14e46e0495` ("btrfs: synchronize incompat feature bits with sysfs files"), but unfortunately it get reverted in commit `e410e34fad` ("Revert "btrfs: synchronize incompat feature bits with sysfs files""). The problem in the original patch is, in the context of btrfs_create_chunk(), we can not afford to update the sysfs group. The exported but never utilized function, btrfs_sysfs_feature_update() is the leftover of such attempt. As even if we go sysfs_update_group(), new files will need extra memory allocation, and we have no way to specify the sysfs update to go GFP_NOFS. [FIX] This patch will address the old problem by doing asynchronous sysfs update in the cleaner thread. This involves the following changes: - Make __btrfs_(set\|clear)_fs_(incompat\|compat_ro) helpers to set BTRFS_FS_FEATURE_CHANGED flag when needed - Update btrfs_sysfs_feature_update() to use sysfs_update_group() And drop unnecessary arguments. - Call btrfs_sysfs_feature_update() in cleaner_kthread If we have the BTRFS_FS_FEATURE_CHANGED flag set. - Wake up cleaner_kthread in btrfs_commit_transaction if we have BTRFS_FS_FEATURE_CHANGED flag By this, all the previously dangerous call sites like btrfs_create_chunk() need no new changes, as above helpers would have already set the BTRFS_FS_FEATURE_CHANGED flag. The real work happens at cleaner_kthread, thus we pay the cost of delaying the update to sysfs directory, but the delayed time should be small enough that end user can not distinguish though it might get delayed if the cleaner thread is busy with removing subvolumes or defrag. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:35 +01:00
ye xingchen	58e36c2a01	btrfs: remove duplicate include header in extent-tree.c extent-tree.h is included more than once, added in `a0231804af` ("btrfs: move extent-tree helpers into their own header file"). Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Qu Wenruo	28232909ba	btrfs: scrub: improve tree block error reporting [BUG] When debugging a scrub related metadata error, it turns out that our metadata error reporting is not ideal. The only 3 error messages are: - BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1 Showing we have metadata generation mismatch errors. - BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1 Showing which tree blocks are corrupted. - BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5 Showing which physical range the corrupted metadata is at. We have to combine the above 3 to know we have a corrupted metadata with generation mismatch. And this is already the better case, if we have other problems, like fsid mismatch, we can not even know the cause. [CAUSE] The problem is caused by the fact that, scrub_checksum_tree_block() never outputs any error message. It just return two bits for scrub: sblock->header_error, and sblock->generation_error. And later we report error in scrub_print_warning(), but unfortunately we only have two bits, there is not really much thing we can done to print any detailed errors. [FIX] This patch will do the following to enhance the error reporting of metadata scrub: - Add extra warning (ratelimited) for every error we hit This can help us to distinguish the different types of errors. Some errors can help us to know what's going wrong immediately, like bytenr mismatch. - Re-order the checks Currently we check bytenr first, then immediately generation. This can lead to false generation mismatch reports, while the fsid mismatches. Here is the new output for the bug I'm debugging (we forgot to writeback tree blocks for commit roots): BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc Now we can immediately know it's some tree blocks didn't even get written back, other than the original confusing generation mismatch. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Boris Burkov	cb0922f264	btrfs: don't use size classes for zoned file systems When a file system has ZNS devices which are constrained by a maximum number of active block groups, then not being able to use all the block groups for every allocation is not ideal, and could cause us to loop a ton with mixed size allocations. In general, since zoned doesn't write into gaps behind where block groups are writing, it is not susceptible to the same sort of fragmentation that size classes are designed to solve, so we can skip size classes for zoned file systems in general, even though there would probably be no harm for SMR devices. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Boris Burkov	c7eec3d9aa	btrfs: load block group size class when caching Since the size class is an artifact of an arbitrary anti fragmentation strategy, it doesn't really make sense to persist it. Furthermore, most of the size class logic assumes fresh block groups. That is of course not a reasonable assumption -- we will be upgrading kernels with existing filesystems whose block groups are not classified. To work around those issues, implement logic to compute the size class of the block groups as we cache them in. To perfectly assess the state of a block group, we would have to read the entire extent tree (since the free space cache mashes together contiguous extent items) which would be prohibitively expensive for larger file systems with more extents. We can do it relatively cheaply by implementing a simple heuristic of sampling a handful of extents and picking the smallest one we see. In the happy case where the block group was classified, we will only see extents of the correct size. In the unhappy case, we will hopefully find one of the smaller extents, but there is no perfect answer anyway. Autorelocation will eventually churn up the block group if there is significant freeing anyway. There was no regression in mount performance at end state of the fsperf test suite, and the delay until the block group is marked cached is minimized by the constant number of extent samples. Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Boris Burkov	52bb7a2166	btrfs: introduce size class to block group allocator The aim of this patch is to reduce the fragmentation of block groups under certain unhappy workloads. It is particularly effective when the size of extents correlates with their lifetime, which is something we have observed causing fragmentation in the fleet at Meta. This patch categorizes extents into size classes: - x < 128KiB: "small" - 128KiB < x < 8MiB: "medium" - x > 8MiB: "large" and as much as possible reduces allocations of extents into block groups that don't match the size class. This takes advantage of any (possible) correlation between size and lifetime and also leaves behind predictable re-usable gaps when extents are freed; small writes don't gum up bigger holes. Size classes are implemented in the following way: - Mark each new block group with a size class of the first allocation that goes into it. - Add two new passes to ffe: "unset size class" and "wrong size class". First, try only matching block groups, then try unset ones, then allow allocation of new ones, and finally allow mismatched block groups. - Filtering is done just by skipping inappropriate ones, there is no special size class indexing. Other solutions I considered were: - A best fit allocator with an rb-tree. This worked well, as small writes didn't leak big holes from large freed extents, but led to regressions in ffe and write performance due to lock contention on the rb-tree with every allocation possibly updating it in parallel. Perhaps something clever could be done to do the updates in the background while being "right enough". - A fixed size "working set". This prevents freeing an extent drastically changing where writes currently land, and seems like a good option too. Doesn't take advantage of size in any way. - The same size class idea, but implemented with xarray marks. This turned out to be slower than looping the linked list and skipping wrong block groups, and is also less flexible since we must have only 3 size classes (max #marks). With the current approach we can have as many as we like. Performance testing was done via: https://github.com/josefbacik/fsperf Of particular relevance are the new fragmentation specific tests. A brief summary of the testing results: - Neutral results on existing tests. There are some minor regressions and improvements here and there, but nothing that truly stands out as notable. - Improvement on new tests where size class and extent lifetime are correlated. Fragmentation in these cases is completely eliminated and write performance is generally a little better. There is also significant improvement where extent sizes are just a bit larger than the size class boundaries. - Regression on one new tests: where the allocations are sized intentionally a hair under the borders of the size classes. Results are neutral on the test that intentionally attacks this new scheme by mixing extent size and lifetime. The full dump of the performance results can be found here: https://bur.io/fsperf/size-class-2022-11-15.txt (there are ANSI escape codes, so best to curl and view in terminal) Here is a snippet from the full results for a new test which mixes buffered writes appending to a long lived set of files and large short lived fallocates: bufferedappendvsfallocate results metric baseline current stdev diff ====================================================================================== avg_commit_ms 31.13 29.20 2.67 -6.22% bg_count 14 15.60 0 11.43% commits 11.10 12.20 0.32 9.91% elapsed 27.30 26.40 2.98 -3.30% end_state_mount_ns 11122551.90 10635118.90 851143.04 -4.38% end_state_umount_ns 1.36e+09 1.35e+09 12248056.65 -1.07% find_free_extent_calls 116244.30 114354.30 964.56 -1.63% find_free_extent_ns_max 599507.20 1047168.20 103337.08 74.67% find_free_extent_ns_mean 3607.19 3672.11 101.20 1.80% find_free_extent_ns_min 500 512 6.67 2.40% find_free_extent_ns_p50 2848 2876 37.65 0.98% find_free_extent_ns_p95 4916 5000 75.45 1.71% find_free_extent_ns_p99 20734.49 20920.48 1670.93 0.90% frag_pct_max 61.67 0 8.05 -100.00% frag_pct_mean 43.59 0 6.10 -100.00% frag_pct_min 25.91 0 16.60 -100.00% frag_pct_p50 42.53 0 7.25 -100.00% frag_pct_p95 61.67 0 8.05 -100.00% frag_pct_p99 61.67 0 8.05 -100.00% fragmented_bg_count 6.10 0 1.45 -100.00% max_commit_ms 49.80 46 5.37 -7.63% sys_cpu 2.59 2.62 0.29 1.39% write_bw_bytes 1.62e+08 1.68e+08 17975843.50 3.23% write_clat_ns_mean 57426.39 54475.95 2292.72 -5.14% write_clat_ns_p50 46950.40 42905.60 2101.35 -8.62% write_clat_ns_p99 148070.40 143769.60 2115.17 -2.90% write_io_kbytes 4194304 4194304 0 0.00% write_iops 2476.15 2556.10 274.29 3.23% write_lat_ns_max 2101667.60 2251129.50 370556.59 7.11% write_lat_ns_mean 59374.91 55682.00 2523.09 -6.22% write_lat_ns_min 17353.10 16250 1646.08 -6.36% There are some mixed improvements/regressions in most metrics along with an elimination of fragmentation in this workload. On the balance, the drastic 1->0 improvement in the happy cases seems worth the mix of regressions and improvements we do observe. Some considerations for future work: - Experimenting with more size classes - More hinting/search ordering work to approximate a best-fit allocator Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Boris Burkov	854c2f365d	btrfs: add more find_free_extent tracepoints find_free_extent is a complicated function. It consists (at least) of: - a hint that jumps into the middle of a for loop macro - a middle loop trying every raid level - an outer loop ascending through ffe loop levels - complicated logic for skipping some of those ffe loop levels - multiple underlying in-bg allocators (zoned, cluster, no cluster) Which is all to say that more tracing is helpful for debugging its behavior. Add two new tracepoints: at the entrance to the block_groups loop (hit for every raid level and every ffe_ctl loop) and at the point we seriously consider a block_group for allocation. This way we can see the whole path through the algorithm, including hints, multiple loops, etc. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Boris Burkov	cfc2de0fce	btrfs: pass find_free_extent_ctl to allocator tracepoints The allocator tracepoints currently have a pile of values from ffe_ctl. In modifying the allocator and adding more tracepoints, I found myself adding to the already long argument list of the tracepoints. It makes it a lot simpler to just send in the ffe_ctl itself. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Christoph Hellwig	36d4556745	btrfs: remove the wait argument to btrfs_start_ordered_extent Given that wait is always set to 1, so remove the argument. Last use of wait with 0 was in `0c304304fe` ("Btrfs: remove csum_bytes_left"). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Filipe Manana	235e1c7b87	btrfs: use a single variable to track return value for log_dir_items() We currently use 'ret' and 'err' to track the return value for log_dir_items(), which is confusing and likely the cause for previous bugs where log_dir_items() did not return an error when it should, fixed in previous patches. So change this and use only a single variable, 'ret', to track the return value. This is simpler and makes it similar to most of the existing code. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Filipe Manana	5cce1780dc	btrfs: use a negative value for BTRFS_LOG_FORCE_COMMIT Currently we use the value 1 for BTRFS_LOG_FORCE_COMMIT, but that value has a few inconveniences: 1) If it's ever used by btrfs_log_inode(), or any function down the call chain, we have to remember to btrfs_set_log_full_commit(), which is repetitive and has a chance to be forgotten in future use cases. btrfs_log_inode_parent() only calls btrfs_set_log_full_commit() when it gets a negative value from btrfs_log_inode(); 2) Down the call chain of btrfs_log_inode(), we may have functions that need to force a log commit, but can return either an error (negative value), false (0) or true (1). So they are forced to return some random negative to force a log commit - using BTRFS_LOG_FORCE_COMMIT would make the intention more clear. Currently the only example is flush_dir_items_batch(). So turn BTRFS_LOG_FORCE_COMMIT into a negative value. The chosen value is -(MAX_ERRNO + 1), so that it does not overlap any errno value and makes it easier to debug. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Yushan Zhou	ce394a7f39	btrfs: use PAGE_{ALIGN, ALIGNED, ALIGN_DOWN} macro The header file linux/mm.h provides PAGE_ALIGN, PAGE_ALIGNED, PAGE_ALIGN_DOWN macros. Use these macros to make code more concise. Signed-off-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:34 +01:00
Peng Hao	d31de37850	btrfs: go to matching label when cleaning em in btrfs_submit_direct When btrfs_get_chunk_map fails to allocate a new em the cleanup does not need to be done so the goto target is out_err, which is consistent with current coding style. Signed-off-by: Peng Hao <flyingpeng@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Josef Bacik	1ec49744ba	btrfs: turn on -Wmaybe-uninitialized We had a recent bug that would have been caught by a newer compiler with -Wmaybe-uninitialized and would have saved us a month of failing tests that I didn't have time to investigate. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Josef Bacik	a6ca692ec2	btrfs: fix uninitialized variable warning in run_one_async_start With -Wmaybe-uninitialized compiler complains about ret being possibly uninitialized, which isn't possible as the WQ_ constants are set only from our code, however we can handle the default case and get rid of the warning. The value is set to BLK_STS_IOERR so it does not issue any IO and could be potentially detected, but this is basically a "cannot happen" error. To catch any problems during development use the assert. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ set the error in default: ] Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Naohiro Aota	cd30d3bc78	btrfs: zoned: fix uninitialized variable warning in btrfs_get_dev_zones Fix an uninitialized warning we get with -Wmaybe-uninitialized where it thought zno may have been uninitialized, in both cases it depends on zinfo->zone_cache but we know the value won't change between checks. Reported-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/linux-btrfs/af6c527cbd8bdc782e50bd33996ee83acc3a16fb.1671221596.git.josef@toxicpanda.com/ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Josef Bacik	12adffe6cf	btrfs: fix uninitialized variable warning in btrfs_sb_log_location We only have 3 possible mirrors, and we have ASSERT()'s to make sure we're not passing in an invalid super mirror into this function, so technically this value isn't uninitialized. However -Wmaybe-uninitialized will complain, so set it to U64_MAX so if we don't have ASSERT()'s turned on it'll error out later on when it see's the zone is beyond our maximum zones. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Josef Bacik	598643250c	btrfs: fix uninitialized variable warnings in __set_extent_bit and convert_extent_bit We will pass in the parent and p pointer into our tree_search function to avoid doing a second search when inserting a new extent state into the tree. However because this is conditional upon passing in these pointers the compiler seems to think these values can be uninitialized if we're using -Wmaybe-uninitialized. Fix this by initializing these values. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Josef Bacik	efbf35a102	btrfs: fix uninitialized variable warning in btrfs_update_block_group reclaim isn't set in the alloc case, however we only care about reclaim in the !alloc case. This isn't an actual problem, however -Wmaybe-uninitialized will complain, so initialize reclaim to quiet the compiler. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Josef Bacik	ab19901359	btrfs: fix uninitialized variable warning in get_inode_gen Anybody that calls get_inode_gen() can have an uninitialized gen if there's an error. This isn't a big deal because all the users just exit if they get an error, however it makes -Wmaybe-uninitialized complain, so fix this up to always initialize the passed in gen, this quiets all of the uninitialized warnings in send.c. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Josef Bacik	0e47b25caf	btrfs: fix uninitialized variable warning in btrfs_cleanup_ordered_extents We can conditionally pass in a locked page, and then we'll use that page range to skip marking errors as that will happen in another layer. However this causes the compiler to complain because it doesn't understand we only use these values when we have the page. Make the compiler stop complaining by setting these values to 0. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Josef Bacik	fccf0c842e	btrfs: move btrfs_abort_transaction to transaction.c While trying to sync messages.[ch] I ended up with this dependency on messages.h in the rest of btrfs-progs code base because it's where btrfs_abort_transaction() was now held. We want to keep messages.[ch] limited to the kernel code, and the btrfs_abort_transaction() code better fits in the transaction code and not in messages. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ move the __cold attributes ] Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Johannes Thumshirn	0c555c97ef	btrfs: directly pass in fs_info to btrfs_merge_delayed_refs Now that none of the functions called by btrfs_merge_delayed_refs() needs a btrfs_trans_handle, directly pass in a btrfs_fs_info to btrfs_merge_delayed_refs(). Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:33 +01:00
Johannes Thumshirn	afe2d748b0	btrfs: drop trans parameter of insert_delayed_ref Now that drop_delayed_ref() doesn't need a btrfs_trans_handle, drop it from insert_delayed_ref() as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:32 +01:00
Johannes Thumshirn	f09f7851b7	btrfs: remove trans parameter of merge_ref Now that drop_delayed_ref() doesn't get the btrfs_trans_handle passed in anymore, we can get rid of it in merge_ref() as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:32 +01:00
Johannes Thumshirn	4c89493f35	btrfs: drop unused trans parameter of drop_delayed_ref drop_delayed_ref() doesn't use the btrfs_trans_handle it gets passed in, so remove it. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2023-02-13 17:50:32 +01:00
Linus Torvalds	ceaa837f96	Linux 6.2-rc8	2023-02-12 14:10:17 -08:00
John Paul Adrian Glaubitz	80510b63f7	MAINTAINERS: Add myself as maintainer for arch/sh (SUPERH) Both Rich Felker and Yoshinori Sato haven't done any work on arch/sh for a while. As I have been maintaining Debian's sh4 port since 2014, I am interested to keep the architecture alive. Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Acked-by: Yoshinori Sato <ysato@users.sourceforge.jp> Acked-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-02-12 13:57:09 -08:00
Linus Torvalds	5e98e916f9	tracing: Fix showing of TASK_COMM_LEN instead of its value The TASK_COMM_LEN was converted from a macro into an enum so that BTF would have access to it. But this unfortunately caused TASK_COMM_LEN to display in the format fields of trace events, as they are created by the TRACE_EVENT() macro and such, macros convert to their values, where as enums do not. To handle this, instead of using the field itself to be display, save the value of the array size as another field in the trace_event_fields structure, and use that instead. Not only does this fix the issue, but also converts the other trace events that have this same problem (but were not breaking tooling). With this change, the original work around `b3bc8547d3` ("tracing: Have TRACE_DEFINE_ENUM affect trace event types as well") could be reverted (but that should be done in the merge window). -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCY+lOqxQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6quYPAQD+9j+RPIUm9Ms4XCIEOXkFI04yjsbd rQSRcpYBWyP59AEAnZNYNwE11vDsKBGxPrOPgGYe4Pzfr5yOWY84mgaxgwo= =iYsE -----END PGP SIGNATURE----- Merge tag 'trace-v6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: "Fix showing of TASK_COMM_LEN instead of its value The TASK_COMM_LEN was converted from a macro into an enum so that BTF would have access to it. But this unfortunately caused TASK_COMM_LEN to display in the format fields of trace events, as they are created by the TRACE_EVENT() macro and such, macros convert to their values, where as enums do not. To handle this, instead of using the field itself to be display, save the value of the array size as another field in the trace_event_fields structure, and use that instead. Not only does this fix the issue, but also converts the other trace events that have this same problem (but were not breaking tooling). With this change, the original work around `b3bc8547d3` ("tracing: Have TRACE_DEFINE_ENUM affect trace event types as well") could be reverted (but that should be done in the merge window)" * tag 'trace-v6.2-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: Fix TASK_COMM_LEN in trace event format file	2023-02-12 13:52:17 -08:00
Linus Torvalds	711e9a4d52	for-6.2-rc7-tag -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmPo41YACgkQxWXV+ddt WDsPXA/8DPCp1PEvmkJ998wBCgSuoVvG9b4l1HOI0aFWC/giJWYsTdBF/+rFP/83 +UFBmxDsbG8tMoq73Dw8XxTvmYwRUyCdtn/AmKkGpu/l9KF4fnM+RTIh94e4DaH7 O1R5zPVOX34ScgL/bR6Hmcrw8a7q6yUmW9xORR40AAbYOccUld4nvUZOI+hVUbtN 84pphG+U4KowtX2J4fqLWALGU/2hDP9Aiq3aKOdupoiRYJacx3FoMP4aaEblJlMk ViLJYBXrJ+6v71frjT4LgSdDd7+l6QEaHHlQwIxMrf3r7AXUkMerwoiOhasMRXTB WnZjC8XeS9yogY6Ls5/gIEEWB7buz6TFJwm3rwfXMM+0OQ1g0RFvjXQPD8sOLazS X/5ToML8SZYpfkmIMnP+hBnmAMFKpjC06o40cN5/96xkqqMAwL7ws+XIlso/Hx+l Lu01cgnDLluRflWtVwMLmrhOGLStjbiDJKmG4zKl/WsyqGdodjIUyCOjhB0Wy0CN RMrkvOUwngTfAdWQYTHDdxkTdn1+b/nB+N9BvLbD8Dt+Q5H7loGR+0mS5xsRNg4Q jDY0yLDtR6bDxvcp4L2Vz1ezn+dSo8XAR9zqd4pT+7mZ6tLsf0R5F3iedAZkaqQC 1uVkjiHyi1Gq/6iKRwf72rQMNKdDmAgM+sDx0uQK5JyG8ZGqgLA= =KGNk -----END PGP SIGNATURE----- Merge tag 'for-6.2-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - one more fix for a tree-log 'write time corruption' report, update the last dir index directly and don't keep in the log context - do VFS-level inode lock around FIEMAP to prevent a deadlock with concurrent fsync, the extent-level lock is not sufficient - don't cache a single-device filesystem device to avoid cases when a loop device is reformatted and the entry gets stale * tag 'for-6.2-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: free device in btrfs_close_devices for a single device filesystem btrfs: lock the inode in shared mode before starting fiemap btrfs: simplify update of last_dir_index_offset when logging a directory	2023-02-12 11:26:36 -08:00
Linus Torvalds	e2bca0ebf7	USB fixes for 6.2-rc8 Here are 2 small USB driver fixes that resolve some reported regressions and one new device quirk. Specifically these are: - new quirk for Alcor Link AK9563 smartcard reader - revert of u_ether gadget change in 6.2-rc1 that caused problems - typec pin probe fix All of these have been in linux-next with no reported problems. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCY+jGFQ8cZ3JlZ0Brcm9h aC5jb20ACgkQMUfUDdst+ylYkQCgybIKsO7j9Mfi5nTsZktsWJRexu8AoMGuckH6 zko7UCEUrN5mk4xy5w9p =1LuO -----END PGP SIGNATURE----- Merge tag 'usb-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB fixes from Greg KH: "Here are 2 small USB driver fixes that resolve some reported regressions and one new device quirk. Specifically these are: - new quirk for Alcor Link AK9563 smartcard reader - revert of u_ether gadget change in 6.2-rc1 that caused problems - typec pin probe fix All of these have been in linux-next with no reported problems" * tag 'usb-6.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: usb: core: add quirk for Alcor Link AK9563 smartcard reader usb: typec: altmodes/displayport: Fix probe pin assign check Revert "usb: gadget: u_ether: Do not make UDC parent of the net device"	2023-02-12 11:18:57 -08:00
Linus Torvalds	dd78af9fde	Final EFI fix for v6.2 A fix from Darren to widen the SMBIOS match for detecting Ampere Altra machines with problematic firmware. In the mean time, we are working on a more precise check, but this is still work in progress. -----BEGIN PGP SIGNATURE----- iQGzBAABCgAdFiEE+9lifEBpyUIVN1cpw08iOZLZjyQFAmPo134ACgkQw08iOZLZ jyT7Ywv/YrEUEm6dlWDSCkhWdli7B6pcwWhcvbsPcDLZWSDuo4vIJhtav1eeWkjs E2YDewXRSwewDlHeQPX+v25loVNKqbeBUDK6kX/DkNsk+jRVVM4Xt9myJA9XO//v YDO7Srbkbk/GlBkZFNUkVYPaEy5aKVO7l/hQZy+GTYhuA/UXZVtlZrqA0EJNOnT0 xR7s+yXUNd0gBsgvTypnRACviL1qgZkY/yso51Gv/oXxzsVqm8K1XGsRVZblHwHK YfhgytI/kj6mZ4I6WaOYiCt5NTq+GT7g8lMUmHISHNXxl9qzvaZ51jV2Cxf/9Bck 1RyfsIh3JoLHBlwCrfKRqIooitRENXWlIj+8PxZYG2/ONov7MkqEork7mSb1ITJw 0uqb0tClIZE23C+fdHI7fctbNrh+CQLr1RjSz7iNX+HUWsXJRag6bDrjFXzwqQnx tLur+4QbpC8KbpwDoEQu74wveacJ6kn4r0KeKRWTp7IRsdA7NH+wQl6IJkMKr49A 41UnT1x8 =g+5h -----END PGP SIGNATURE----- Merge tag 'efi-fixes-for-v6.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fix from Ard Biesheuvel: "A fix from Darren to widen the SMBIOS match for detecting Ampere Altra machines with problematic firmware. In the mean time, we are working on a more precise check, but this is still work in progress" * tag 'efi-fixes-for-v6.2-4' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: arm64: efi: Force the use of SetVirtualAddressMap() on eMAG and Altra Max machines	2023-02-12 11:13:29 -08:00
Linus Torvalds	49a0bdb0a3	powerpc fixes for 6.2 #5 - Fix interrupt exit race with security mitigation switching. - Don't select ARCH_WANTS_NO_INSTR until warnings are fixed. - Build fix for CONFIG_NUMA=n. Thanks to: Nicholas Piggin, Randy Dunlap, Sachin Sant. -----BEGIN PGP SIGNATURE----- iQJHBAABCAAxFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmPojjYTHG1wZUBlbGxl cm1hbi5pZC5hdQAKCRBR6+o8yOGlgPn4D/9opcq75nPUVTBmYnM5ar4MK4VtSp5b iOe12LV8W8bT0wy5A7RRoOkV+bBNHr3057S8W78Y/darkbiD+F29/Hf2pMiwxhLf l+Kl4GQbrFrerqIK6tgIZrtm/kooUOtSfzpYRduOTHnt8N6uXtg43glbr1/uIKZ8 X21wJtnlCfhiysuf9zoa1VFB43QoFa6zE1zyTgo91Ofr+4tZJN5v66YYzO2fK+co PTLjrv45tpo/srnxGFbfY6yZmQkGu0j7Z17V9as1HZ8od9y7jCbIBS+hHz6Drc2K 835VRc5pfeEO9RHobQoGOEgGxqvMCW0fbgvr2sMqaqc4z6ddtqJc9IDV/ed3q95C 5BlxNqwT2KquC/PwDem1qg0KzebP2Q3r8sKLlenEsxh1CFYdG+cBOAyQ3w6bY8Gh rRhKTDJjmOIABVQf5ZCAMKvbdOMz2peA7tnbWU/PMVKkpE7GYtkrFsme83lfq+L6 u7Kjvw1GJHcz1EhA58xYk2vSzJXyO6c8f4hbRXSuhQaVHLFmiLKp6l7dZ+2DJKkX CPRTpm7xCegYJoSUNyT8UrP1XPoln1ECEZHTo8U4L1o1wUWmoWafaXdpzbyhl36i mfyOp0JT2HwzdziFd6iCWDPovOGmqSdTgEH2PkCYl+gofKRGSYR7D1qs+e+B/zn3 Bnw/bgl5bGausQ== =pSxf -----END PGP SIGNATURE----- Merge tag 'powerpc-6.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: - Fix interrupt exit race with security mitigation switching. - Don't select ARCH_WANTS_NO_INSTR until warnings are fixed. - Build fix for CONFIG_NUMA=n. Thanks to Nicholas Piggin, Randy Dunlap, and Sachin Sant. * tag 'powerpc-6.2-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s/interrupt: Fix interrupt exit race with security mitigation switch powerpc/kexec_file: fix implicit decl error powerpc: Don't select ARCH_WANTS_NO_INSTR	2023-02-12 11:08:15 -08:00
David Chen	462a8e08e0	Fix page corruption caused by racy check in __free_pages When we upgraded our kernel, we started seeing some page corruption like the following consistently: BUG: Bad page state in process ganesha.nfsd pfn:1304ca page:0000000022261c55 refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x1304ca flags: 0x17ffffc0000000() raw: 0017ffffc0000000 ffff8a513ffd4c98 ffffeee24b35ec08 0000000000000000 raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000 page dumped because: nonzero mapcount CPU: 0 PID: 15567 Comm: ganesha.nfsd Kdump: loaded Tainted: P B O 5.10.158-1.nutanix.20221209.el7.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016 Call Trace: dump_stack+0x74/0x96 bad_page.cold+0x63/0x94 check_new_page_bad+0x6d/0x80 rmqueue+0x46e/0x970 get_page_from_freelist+0xcb/0x3f0 ? _cond_resched+0x19/0x40 __alloc_pages_nodemask+0x164/0x300 alloc_pages_current+0x87/0xf0 skb_page_frag_refill+0x84/0x110 ... Sometimes, it would also show up as corruption in the free list pointer and cause crashes. After bisecting the issue, we found the issue started from commit `e320d3012d` ("mm/page_alloc.c: fix freeing non-compound pages"): if (put_page_testzero(page)) free_the_page(page, order); else if (!PageHead(page)) while (order-- > 0) free_the_page(page + (1 << order), order); So the problem is the check PageHead is racy because at this point we already dropped our reference to the page. So even if we came in with compound page, the page can already be freed and PageHead can return false and we will end up freeing all the tail pages causing double free. Fixes: `e320d3012d` ("mm/page_alloc.c: fix freeing non-compound pages") Link: https://lore.kernel.org/lkml/BYAPR02MB448855960A9656EEA81141FC94D99@BYAPR02MB4488.namprd02.prod.outlook.com/ Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2023-02-12 10:30:05 -08:00

1 2 3 4 5 ...

1155524 Commits