linux

Author	SHA1	Message	Date
David Sterba	ea14b57fd1	btrfs: fix spelling of snapshotting Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
Nikolay Borisov	e38ae7a086	btrfs: Make flush_space return void The return value of flush_space was used to have significance in the early days when the code was first introduced and before the ticketed enospc rework. Since the latter got introduced the return value lost any significance whatsoever to its callers. So let's remove it. While at it also remove the unused ticket variable in btrfs_async_reclaim_metadata_space. It was used in the initial version of the ticketed ENOSPC work, however Wang Xiaoguang detected a problem with this and fixed it in `ce129655c9` ("btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress"). Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:04 +02:00
David Sterba	a4f78750ef	btrfs: get fs_info from eb in btrfs_print_leaf, remove argument Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
Nikolay Borisov	e4ff5fb5dc	btrfs: Remove unused parameters from volume.c functions This also adjusts the respective callers in other files. Those were found with -Wunused-parameter. btrfs_full_stripe_len's mapping_tree - introduced by `53b381b3ab` ("Btrfs: RAID5 and RAID6") but it was never really used even in that commit btrfs_is_parity_mirror's mirror_num - same as above chunk_drange_filter's chunk_offset - introduced by `94e60d5a5c` ("Btrfs: devid subset filter") and never used. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:03 +02:00
David Sterba	913e153572	btrfs: drop newlines from strings when using btrfs_* helpers The helpers append "\n" so we can keep the actual strings shorter. The extra newline will print an empty line. Some messages have been slightly modified to be more consistent with the rest (lowercase first letter). Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Nikolay Borisov	1174cade81	btrfs: Remove redundant checks from btrfs_alloc_data_chunk_ondemand Many commits ago the data space_info in alloc_data_chunk_ondemand used to be acquired from the inode. At that point commit `33b4d47f5e` ("Btrfs: deal with NULL space info") got introduced to deal with spurios cases where the space info could be null, following a rebalance. Nowadays, however, the space info is referenced directly from the btrfs_fs_info struct which is initialised at filesystem mount time. This makes the null checks redundant, so remove them. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:02 +02:00
Nikolay Borisov	7bdd6277e0	btrfs: Remove redundant argument of flush_space All callers of flush_space pass the same number for orig/num_bytes arguments. Let's remove one of the numbers and also modify the trace point to show only a single number - bytes requested. Seems that last point where the two parameters were treated differently is before the ticketed enospc rework. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 16:12:01 +02:00
Nikolay Borisov	23d1f73788	btrfs: remove unused sectorsize member The sectorsize member of btrfs_block_group_cache is unused. So remove it, this reduces the number of holes in the struct. With patch: /* size: 856, cachelines: 14, members: 40 / / sum members: 837, holes: 4, sum holes: 19 / / bit holes: 1, sum bit holes: 29 bits / / last cacheline: 24 bytes / Without patch: / size: 864, cachelines: 14, members: 41 / / sum members: 841, holes: 5, sum holes: 23 / / bit holes: 1, sum bit holes: 29 bits / / last cacheline: 32 bytes */ Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-08-16 14:19:53 +02:00
Omar Sandoval	17024ad0a0	Btrfs: fix early ENOSPC due to delalloc If a lot of metadata is reserved for outstanding delayed allocations, we rely on shrink_delalloc() to reclaim metadata space in order to fulfill reservation tickets. However, shrink_delalloc() has a shortcut where if it determines that space can be overcommitted, it will stop early. This made sense before the ticketed enospc system, but now it means that shrink_delalloc() will often not reclaim enough space to fulfill any tickets, leading to an early ENOSPC. (Reservation tickets don't care about being able to overcommit, they need every byte accounted for.) Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims all of the metadata it is supposed to. This fixes early ENOSPCs we were seeing when doing a btrfs receive to populate a new filesystem, as well as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs. Fixes: `957780eb27` ("Btrfs: introduce ticketed enospc infrastructure") Tested-by: Christoph Anton Mitterer <mail@christoph.anton.mitterer.name> Cc: stable@vger.kernel.org Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-07-24 16:04:26 +02:00
Jeff Mahoney	144439376b	btrfs: fix lockup in find_free_extent with read-only block groups If we have a block group that is all of the following: 1) uncached in memory 2) is read-only 3) has a disk cache state that indicates we need to recreate the cache AND the file system has enough free space fragmentation such that the request for an extent of a given size can't be honored; AND have a single CPU core; AND it's the block group with the highest starting offset such that there are no opportunities (like reading from disk) for the loop to yield the CPU; We can end up with a lockup. The root cause is simple. Once we're in the position that we've read in all of the other block groups directly and none of those block groups can honor the request, there are no more opportunities to sleep. We end up trying to start a caching thread which never gets run if we only have one core. This should present as a hung task waiting on the caching thread to make some progress, but it doesn't. Instead, it degrades into a busy loop because of the placement of the read-only check. During the first pass through the loop, block_group->cached will be set to BTRFS_CACHE_STARTED and have_caching_bg will be set. Then we hit the read-only check and short circuit the loop. We're not yet in LOOP_CACHING_WAIT, so we skip that loop back before going through the loop again for other raid groups. Then we move to LOOP_CACHING_WAIT state. During the this pass through the loop, ->cached will still be BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter cache_block_group, do a lot of nothing, and return, and also set have_caching_bg again. Then we hit the read-only check and short circuit the loop. The same thing happens as before except now we DO trigger the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the top. We do this forever. There are two fixes in this patch since they address the same underlying bug. The first is to add a cond_resched to the end of the loop to ensure that the caching thread always has an opportunity to run. This will fix the soft lockup issue, but find_free_extent will still loop doing nothing until the thread has completed. The second is to move the read-only check to the top of the loop. We're never going to return an allocation within a read-only block group so we may as well skip it early. The check for ->cached == BTRFS_CACHE_ERROR would cause the same problem except that BTRFS_CACHE_ERROR is considered a "done" state and we won't re-set have_caching_bg again. Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in the testing process. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-07-24 16:04:02 +02:00
Chris Mason	6374e57ad8	btrfs: fix integer overflow in calc_reclaim_items_nr Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with v4.12-rc6. This was because commit `70e7af244` made it possible for calc_reclaim_items_nr() to return a negative number. It's not really a bug in that commit, it just didn't go far enough down the stack to find all the possible 64->32 bit overflows. This switches calc_reclaim_items_nr() to return a u64 and changes everyone that uses the results of that math to u64 as well. Reported-by: Dave Jones <davej@codemonkey.org.uk> Fixes: `70e7af2` ("Btrfs: fix delalloc accounting leak caused by u32 overflow") Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:02 +02:00
Qu Wenruo	bc42bda223	btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges [BUG] For the following case, btrfs can underflow qgroup reserved space at an error path: (Page size 4K, function name without "btrfs_" prefix) Task A \| Task B ---------------------------------------------------------------------- Buffered_write [0, 2K) \| \|- check_data_free_space() \| \| \|- qgroup_reserve_data() \| \| Range aligned to page \| \| range [0, 4K) <<< \| \| 4K bytes reserved <<< \| \|- copy pages to page cache \| \| Buffered_write [2K, 4K) \| \|- check_data_free_space() \| \| \|- qgroup_reserved_data() \| \| Range alinged to page \| \| range [0, 4K) \| \| Already reserved by A <<< \| \| 0 bytes reserved <<< \| \|- delalloc_reserve_metadata() \| \| And it FAILED (Maybe EQUOTA) \| \|- free_reserved_data_space() \|- qgroup_free_data() Range aligned to page range [0, 4K) Freeing 4K (Special thanks to Chandan for the detailed report and analyse) [CAUSE] Above Task B is freeing reserved data range [0, 4K) which is actually reserved by Task A. And at writeback time, page dirty by Task A will go through writeback routine, which will free 4K reserved data space at file extent insert time, causing the qgroup underflow. [FIX] For btrfs_qgroup_free_data(), add @reserved parameter to only free data ranges reserved by previous btrfs_qgroup_reserve_data(). So in above case, Task B will try to free 0 byte, so no underflow. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Tested-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:02 +02:00
Qu Wenruo	364ecf3651	btrfs: qgroup: Introduce extent changeset for qgroup reserve functions Introduce a new parameter, struct extent_changeset for btrfs_qgroup_reserved_data() and its callers. Such extent_changeset was used in btrfs_qgroup_reserve_data() to record which range it reserved in current reserve, so it can free it in error paths. The reason we need to export it to callers is, at buffered write error path, without knowing what exactly which range we reserved in current allocation, we can free space which is not reserved by us. This will lead to qgroup reserved space underflow. Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:02 +02:00
Qu Wenruo	7bc329c183	btrfs: qgroup: Return actually freed bytes for qgroup release or free data btrfs_qgroup_release/free_data() only returns 0 or a negative error number (ENOMEM is the only possible error). This is normally good enough, but sometimes we need the exact byte count it freed/released. Change it to return actually released/freed bytenr number instead of 0 for success. And slightly modify related extent_changeset structure, since in btrfs one no-hole data extent won't be larger than 128M, so "unsigned int" is large enough for the use case. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:02 +02:00
Omar Sandoval	d7eae3403f	Btrfs: rework delayed ref total_bytes_pinned accounting The total_bytes_pinned counter is completely broken when accounting delayed refs: - If two drops for the same extent are merged, we will decrement total_bytes_pinned twice but only increment it once. - If an add is merged into a drop or vice versa, we will decrement the total_bytes_pinned counter but never increment it. - If multiple references to an extent are dropped, we will account it multiple times, potentially vastly over-estimating the number of bytes that will be freed by a commit and doing unnecessary work when we're close to ENOSPC. The last issue is relatively minor, but the first two make the total_bytes_pinned counter leak or underflow very often. These accounting issues were introduced in `b150a4f10d` ("Btrfs: use a percpu to keep track of possibly pinned bytes"), but they were papered over by zeroing out the counter on every commit until `d288db5dc0` ("Btrfs: fix race of using total_bytes_pinned"). We need to make sure that an extent is accounted as pinned exactly once if and only if we will drop references to it when when the transaction is committed. Ideally we would only add to total_bytes_pinned when the last reference is dropped, but this information isn't readily available for data extents. Again, this over-estimation can lead to extra commits when we're close to ENOSPC, but it's not as bad as before. The fix implemented here is to increment total_bytes_pinned when the total refmod count for an extent goes negative and decrement it if the refmod count goes back to non-negative or after we've run all of the delayed refs for that extent. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	7be07912b3	Btrfs: return old and new total ref mods when adding delayed refs We need this to decide when to account pinned bytes. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	0a16c7d7ae	Btrfs: always account pinned bytes when dropping a tree block ref Currently, we only increment total_bytes_pinned in btrfs_free_tree_block() when dropping the last reference on the block. However, when the delayed ref is run later, we will decrement total_bytes_pinned regardless of whether it was the last reference or not. This causes the counter to underflow when the reference we dropped was not the last reference. Fix it by incrementing the counter unconditionally, which is what btrfs_free_extent() does. This makes total_bytes_pinned an overestimate when references to shared extents are dropped, but in the worst case this will just make us try to commit the transaction to try to free up space and find we didn't free enough. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	4da8b76d34	Btrfs: update total_bytes_pinned when pinning down extents The extents marked in pin_down_extent() will be unpinned later in unpin_extent_range(), which decrements total_bytes_pinned. pin_down_extent() must increment the counter to avoid underflowing it. Also adjust btrfs_free_tree_block() to avoid accounting for the same extent twice. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	55e8196a57	Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() The value of flags is one of DATA/METADATA/SYSTEM, they must exist at when add_pinned_bytes is called. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: David Sterba <dsterba@suse.com> [ added changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	0d9f824df3	Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 There are a few places where we pass in a negative num_bytes, so make it signed for clarity. Also move it up in the file since later patches will need it there. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Nikolay Borisov	0eee8a494e	btrfs: Use btrfs_space_info_used instead of opencoding it Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
Nikolay Borisov	d2006e6d28	btrfs: Refactor update_space_info Following the factoring out of the creation code udpate_space_info can only be called for already-existing space_info structs. As such it cannot fail. Remove superfluous error handling and make the function return void. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Nikolay Borisov	2be12ef79f	btrfs: Separate space_info create/update Currently the struct space_info creation code is intermixed in the udpate_space_info function. There are well-defined points at which the we actually want to create brand-new space_info structs (e.g. during mount of the filesystem as well as sometimes when adding/initialising new chunks). In such cases update_space_info is called with 0 as the bytes parameter. All of this makes for spaghetti code. Fix it by factoring out the creation code in a separate create_space_info structure. This also allows to simplify the internals. Also remove BUG_ON from do_alloc_chunk since the callers handle errors. Furthermore it will make the update_space_info function not fail, allowing us to remove error handling in callers. This will come in a follow up patch. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Liu Bo	28785f70ef	Btrfs: skip commit transaction if we don't have enough pinned bytes We commit transaction in order to reclaim space from pinned bytes because it could process delayed refs, and in may_commit_transaction(), we check first if pinned bytes are enough for the required space, we then check if that plus bytes reserved for delayed insert are enough for the required space. This changes the code to the above logic. Fixes: `b150a4f10d` ("Btrfs: use a percpu to keep track of possibly pinned bytes") Tested-by: Nikolay Borisov <nborisov@suse.com> Reported-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Jeff Mahoney	c1c4919b11	btrfs: remove root usage from can_overcommit can_overcommit using the root to determine the allocation profile is the only use of a root in the call graph below reserve_metadata_bytes. It turns out that we only need to know whether the allocation is for the chunk root or not -- and we can pass that around as a bool instead. This allows us to pull root usage out of the reservation path all the way up to reserve_metadata_bytes itself, which uses it only to compare against fs_info->chunk_root to set the bool. In turn, this eliminates a bunch of races where we use a particular root too early in the mount process. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:00 +02:00
Jeff Mahoney	1b86826d12	btrfs: cleanup root usage by btrfs_get_alloc_profile There are two places where we don't already know what kind of alloc profile we need before calling btrfs_get_alloc_profile, but we need access to a root everywhere we call it. This patch adds helpers for btrfs_{data,metadata,system}_alloc_profile() and relegates btrfs_system_alloc_profile to a static for use in those two cases. The next patch will eliminate one of those. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:59 +02:00
Nikolay Borisov	a5ed45f822	btrfs: Convert fs_info->free_chunk_space to atomic64_t The ->free_chunk_space variable is used to track the unallocated space and access to it is protected by a spinlock, which is not used for anything else. Make the code a bit self-explanatory by switching the variable to an atomic64_t type and kill the spinlock. Signed-off-by: Nikolay Borisov <nborisov@suse.com> [ not a performance critical code, use of atomic type is ok ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:25:58 +02:00
Jeff Mahoney	a9b3311ef3	btrfs: fix race with relocation recovery and fs_root setup If we have to recover relocation during mount, we'll ultimately have to evict the orphan inode. That goes through the reservation dance, where priority_reclaim_metadata_space and flush_space expect fs_info->fs_root to be valid. That's the next thing to be set up during mount, so we crash, almost always in flush_space trying to join the transaction but priority_reclaim_metadata_space is possible as well. This call path has been problematic in the past WRT whether ->fs_root is valid yet. Commit `957780eb27` (Btrfs: introduce ticketed enospc infrastructure) added new users that are called in the direct path instead of the async path that had already been worked around. The thing is that we don't actually need the fs_root, specifically, for anything. We either use it to determine whether the root is the chunk_root for use in choosing an allocation profile or as a root to pass btrfs_join_transaction before immediately committing it. Anything that isn't the chunk root works in the former case and any root works in the latter. A simple fix is to use a root we know will always be there: the extent_root. Cc: <stable@vger.kernel.org> # v4.8+ Fixes: `957780eb27` (Btrfs: introduce ticketed enospc infrastructure) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-01 16:56:55 +02:00
Jeff Mahoney	896533a7da	btrfs: fix memory leak in update_space_info failure path If we fail to add the space_info kobject, we'll leak the memory for the percpu counter. Fixes: `6ab0a2029c` (btrfs: publish allocation data in sysfs) Cc: <stable@vger.kernel.org> # v3.14+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-01 16:56:31 +02:00
Qu Wenruo	0966a7b130	btrfs: scrub: Introduce full stripe lock for RAID56 Unlike mirror based profiles, RAID5/6 recovery needs to read out the whole full stripe. And if we don't do proper protection, it can easily cause race condition. Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe() for RAID5/6. Which store a rb_tree of mutexes for full stripes, so scrub callers can use them to lock a full stripe to avoid race. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:27 +02:00
David Sterba	f486135eba	btrfs: remove unused qgroup members from btrfs_trans_handle The members have been effectively unused since "Btrfs: rework qgroup accounting" (`fcebe4562d`), there's no substitute for assert_qgroups_uptodate so it's removed as well. Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
Liu Bo	1a79c1f246	Btrfs: update comments in cache_save_setup We also don't bother to flush free space cache while with free space tree. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:25 +02:00
Elena Reshetova	6df8cdf5bd	btrfs: convert btrfs_delayed_ref_node.refs from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:23 +02:00
Elena Reshetova	1e4f4714d5	btrfs: convert btrfs_caching_control.count from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:23 +02:00
Elena Reshetova	9b64f57ddf	btrfs: convert btrfs_transaction.use_count from atomic_t to refcount_t refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-04-18 14:07:23 +02:00
Linus Torvalds	1827adb11a	Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull sched.h split-up from Ingo Molnar: "The point of these changes is to significantly reduce the <linux/sched.h> header footprint, to speed up the kernel build and to have a cleaner header structure. After these changes the new <linux/sched.h>'s typical preprocessed size goes down from a previous ~0.68 MB (~22K lines) to ~0.45 MB (~15K lines), which is around 40% faster to build on typical configs. Not much changed from the last version (-v2) posted three weeks ago: I eliminated quirks, backmerged fixes plus I rebased it to an upstream SHA1 from yesterday that includes most changes queued up in -next plus all sched.h changes that were pending from Andrew. I've re-tested the series both on x86 and on cross-arch defconfigs, and did a bisectability test at a number of random points. I tried to test as many build configurations as possible, but some build breakage is probably still left - but it should be mostly limited to architectures that have no cross-compiler binaries available on kernel.org, and non-default configurations" * 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (146 commits) sched/headers: Clean up <linux/sched.h> sched/headers: Remove #ifdefs from <linux/sched.h> sched/headers: Remove the <linux/topology.h> include from <linux/sched.h> sched/headers, hrtimer: Remove the <linux/wait.h> include from <linux/hrtimer.h> sched/headers, x86/apic: Remove the <linux/pm.h> header inclusion from <asm/apic.h> sched/headers, timers: Remove the <linux/sysctl.h> include from <linux/timer.h> sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> sched/headers: Remove <linux/sched.h> from <linux/sched/init.h> sched/core: Remove unused prefetch_stack() sched/headers: Remove <linux/rculist.h> from <linux/sched.h> sched/headers: Remove the 'init_pid_ns' prototype from <linux/sched.h> sched/headers: Remove <linux/signal.h> from <linux/sched.h> sched/headers: Remove <linux/rwsem.h> from <linux/sched.h> sched/headers: Remove the runqueue_is_locked() prototype sched/headers: Remove <linux/sched.h> from <linux/sched/hotplug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/debug.h> sched/headers: Remove <linux/sched.h> from <linux/sched/nohz.h> sched/headers: Remove <linux/sched.h> from <linux/sched/stat.h> sched/headers: Remove the <linux/gfp.h> include from <linux/sched.h> sched/headers: Remove <linux/rtmutex.h> from <linux/sched.h> ...	2017-03-03 10:16:38 -08:00
Ingo Molnar	f361bf4a66	sched/headers: Prepare for the reduction of <linux/sched.h>'s signal API dependency Instead of including the full <linux/signal.h>, we are going to include the types-only <linux/signal_types.h> header in <linux/sched.h>, to further decouple the scheduler header from the signal headers. This means that various files which relied on the full <linux/signal.h> need to be updated to gain an explicit dependency on it. Update the code that relies on sched.h's inclusion of the <linux/signal.h> header. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-03-02 08:42:37 +01:00
Chris Mason	e9f467d028	Merge branch 'for-chris-4.11-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.11	2017-02-28 14:35:09 -08:00
Nikolay Borisov	73f2e545b6	btrfs: Make btrfs_orphan_add take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:10 +01:00
Nikolay Borisov	691fa05967	btrfs: all btrfs_delalloc_release_metadata take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:07 +01:00
Nikolay Borisov	9f3db423f9	btrfs: Make btrfs_delalloc_reserve_metadata take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:07 +01:00
Nikolay Borisov	703b391a03	btrfs: Make btrfs_orphan_release_metadata take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:07 +01:00
Nikolay Borisov	8ed7a2a0e0	btrfs: Make btrfs_orphan_reserve_metadata take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:07 +01:00
Nikolay Borisov	0e6bf9b13c	btrfs: Make calc_csum_metadata_size take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:07 +01:00
Nikolay Borisov	baa3ba39b9	btrfs: Make drop_outstanding_extent take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:07 +01:00
Nikolay Borisov	04f4f91653	btrfs: make btrfs_alloc_data_chunk_ondemand take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:06 +01:00
Nikolay Borisov	70ddc553b5	btrfs: make btrfs_is_free_space_inode take btrfs_inode Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-28 11:30:06 +01:00
Filipe Manana	5cdd7db6c5	Btrfs: fix assertion failure when freeing block groups at close_ctree() At close_ctree() we free the block groups and then only after we wait for any running worker kthreads to finish and shutdown the workqueues. This behaviour is racy and it triggers an assertion failure when freeing block groups because while we are doing it we can have for example a block group caching kthread running, and in that case the block group's reference count can still be greater than 1 by the time we assert its reference count is 1, leading to an assertion failure: [19041.198004] assertion failed: atomic_read(&block_group->count) == 1, file: fs/btrfs/extent-tree.c, line: 9799 [19041.200584] ------------[ cut here ]------------ [19041.201692] kernel BUG at fs/btrfs/ctree.h:3418! [19041.202830] invalid opcode: 0000 [#1] PREEMPT SMP [19041.203929] Modules linked in: btrfs xor raid6_pq dm_flakey dm_mod crc32c_generic ppdev sg psmouse acpi_cpufreq pcspkr parport_pc evdev tpm_tis parport tpm_tis_core i2c_piix4 i2c_core tpm serio_raw processor button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs] [19041.208082] CPU: 6 PID: 29051 Comm: umount Not tainted 4.9.0-rc7-btrfs-next-36+ #1 [19041.208082] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [19041.208082] task: ffff88015f028980 task.stack: ffffc9000ad34000 [19041.208082] RIP: 0010:[<ffffffffa03e319e>] [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs] [19041.208082] RSP: 0018:ffffc9000ad37d60 EFLAGS: 00010286 [19041.208082] RAX: 0000000000000061 RBX: ffff88015ecb4000 RCX: 0000000000000001 [19041.208082] RDX: ffff88023f392fb8 RSI: ffffffff817ef7ba RDI: 00000000ffffffff [19041.208082] RBP: ffffc9000ad37d60 R08: 0000000000000001 R09: 0000000000000000 [19041.208082] R10: ffffc9000ad37cb0 R11: ffffffff82f2b66d R12: ffff88023431d170 [19041.208082] R13: ffff88015ecb40c0 R14: ffff88023431d000 R15: ffff88015ecb4100 [19041.208082] FS: 00007f44f3d42840(0000) GS:ffff88023f380000(0000) knlGS:0000000000000000 [19041.208082] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [19041.208082] CR2: 00007f65d623b000 CR3: 00000002166f2000 CR4: 00000000000006e0 [19041.208082] Stack: [19041.208082] ffffc9000ad37d98 ffffffffa035989f ffff88015ecb4000 ffff88015ecb5630 [19041.208082] ffff88014f6be000 0000000000000000 00007ffcf0ba6a10 ffffc9000ad37df8 [19041.208082] ffffffffa0368cd4 ffff88014e9658e0 ffffc9000ad37e08 ffffffff811a634d [19041.208082] Call Trace: [19041.208082] [<ffffffffa035989f>] btrfs_free_block_groups+0x17f/0x392 [btrfs] [19041.208082] [<ffffffffa0368cd4>] close_ctree+0x1c5/0x2e1 [btrfs] [19041.208082] [<ffffffff811a634d>] ? evict_inodes+0x132/0x141 [19041.208082] [<ffffffffa034356d>] btrfs_put_super+0x15/0x17 [btrfs] [19041.208082] [<ffffffff8118fc32>] generic_shutdown_super+0x6a/0xeb [19041.208082] [<ffffffff8119004f>] kill_anon_super+0x12/0x1c [19041.208082] [<ffffffffa0343370>] btrfs_kill_super+0x16/0x21 [btrfs] [19041.208082] [<ffffffff8118fad1>] deactivate_locked_super+0x3b/0x68 [19041.208082] [<ffffffff8118fb34>] deactivate_super+0x36/0x39 [19041.208082] [<ffffffff811a9946>] cleanup_mnt+0x58/0x76 [19041.208082] [<ffffffff811a99a2>] __cleanup_mnt+0x12/0x14 [19041.208082] [<ffffffff81071573>] task_work_run+0x6f/0x95 [19041.208082] [<ffffffff81001897>] prepare_exit_to_usermode+0xa3/0xc1 [19041.208082] [<ffffffff81001a23>] syscall_return_slowpath+0x16e/0x1d2 [19041.208082] [<ffffffff814c607d>] entry_SYSCALL_64_fastpath+0xab/0xad [19041.208082] Code: c7 ae a0 3e a0 48 89 e5 e8 4e 74 d4 e0 0f 0b 55 89 f1 48 c7 c2 0b a4 3e a0 48 89 fe 48 c7 c7 a4 a6 3e a0 48 89 e5 e8 30 74 d4 e0 <0f> 0b 55 31 d2 48 89 e5 e8 d5 b9 f7 ff 5d c3 48 63 f6 55 31 c9 [19041.208082] RIP [<ffffffffa03e319e>] assfail.constprop.41+0x1c/0x1e [btrfs] [19041.208082] RSP <ffffc9000ad37d60> [19041.279264] ---[ end trace 23330586f16f064d ]--- This started happening as of kernel 4.8, since commit `f3bca8028b` ("Btrfs: add ASSERT for block group's memory leak") introduced these assertions. So fix this by freeing the block groups only after waiting for all worker kthreads to complete and shutdown the workqueues. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2017-02-24 00:38:27 +00:00
Jeff Mahoney	77ab86bf1c	btrfs: free-space-cache, clean up unnecessary root arguments The free space cache APIs accept a root but always use the tree root. Also, btrfs_truncate_free_space_cache accepts a root AND an inode but the inode always points to the root anyway, so let's just pass the inode. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:56 +01:00
Jeff Mahoney	5e00f1939f	btrfs: convert btrfs_inc_block_group_ro to accept fs_info btrfs_inc_block_group_ro is either passed the extent root or the dev root, but it doesn't do anything with the dev tree. Let's convert to passing an fs_info and using the extent root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:56 +01:00
Jeff Mahoney	0c9ab349c2	btrfs: flush_space always takes fs_info->fs_root We don't need to pass a root to flush_space since it always uses the fs_root. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:55 +01:00
Jeff Mahoney	87bde3cdfc	btrfs: pass fs_info to (more) routines that are only called with extent_root Outside of interactions with qgroups, the roots passed in extent-tree.c are usually passed to ensure that we don't do refcounts on log trees or to get the allocation profile for an allocation request. Otherwise, it operates on the extent root. This patch converts some more routines in extent-tree.c that are always called with the extent root to accept an fs_info instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:55 +01:00
David Sterba	8b74c03e3c	btrfs: remove unused parameter from btrfs_prepare_extent_commit Added but never used. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	7775c8184e	btrfs: remove unused parameter from btrfs_subvolume_release_metadata Unused since qgroup refactoring that split data and metadata accounting, the btrfs_qgroup_free helper. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:52 +01:00
David Sterba	7c302b49dd	btrfs: remove unused parameter from clean_tree_block Added but never needed. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:51 +01:00
Liu Bo	4136135b08	Btrfs: use helper to get used bytes of space_info This uses a helper instead of open code around used byte of space_info everywhere. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:48 +01:00
Liu Bo	0c9b36e0d7	Btrfs: try to avoid acquiring free space ctl's lock We don't need to take the lock if the block group has not been cached. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-17 12:03:48 +01:00
Liu Bo	e4c3b2dcd1	Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist run_delalloc_nocow has used trans in two places where they don't actually need @trans. For btrfs_lookup_file_extent, we search for file extents without COWing anything, and for btrfs_cross_ref_exist, the only place where we need @trans is deferencing it in order to get running_transaction which we could easily get from the global fs_info. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:51:00 +01:00
Liu Bo	f72ad18e99	Btrfs: pass delayed_refs directly to btrfs_find_delayed_ref_head All we need is @delayed_refs, all callers have get it ahead of calling btrfs_find_delayed_ref_head since lock needs to be acquired firstly, there is no reason to deference it again inside the function. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
Jeff Mahoney	003d7c59e8	btrfs: allow unlink to exceed subvolume quota Once a qgroup limit is exceeded, it's impossible to restore normal operation to the subvolume without modifying the limit or removing the subvolume. This is a surprising situation for many users used to the typical workflow with quotas on other file systems where it's possible to remove files until the used space is back under the limit. When we go to unlink a file and start the transaction, we'll hit the qgroup limit while trying to reserve space for the items we'll modify while removing the file. We discussed last month how best to handle this situation and agreed that there is no perfect solution. The best principle-of-least-surprise solution is to handle it similarly to how we already handle ENOSPC when unlinking, which is to allow the operation to succeed with the expectation that it will ultimately release space under most circumstances. This patch modifies the transaction start path to select whether to honor the qgroups limits. btrfs_start_transaction_fallback_global_rsv is the only caller that skips enforcement. The reservation and tracking still happens normally -- it just skips the enforcement step. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:59 +01:00
Omar Sandoval	310712b2f7	Btrfs: constify struct btrfs_{,disk_}key wherever possible In a lot of places, it's unclear when it's safe to reuse a struct btrfs_key after it has been passed to a helper function. Constify these arguments wherever possible to make it obvious. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:58 +01:00
David Sterba	f85b7379cd	btrfs: fix over-80 lines introduced by previous cleanups This goes as a separate patch because fixing that inside the patches caused too many many conflicts. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:57 +01:00
Nikolay Borisov	4a0cc7ca6c	btrfs: Make btrfs_ino take a struct btrfs_inode Currently btrfs_ino takes a struct inode and this causes a lot of internal btrfs functions which consume this ino to take a VFS inode, rather than btrfs' own struct btrfs_inode. In order to fix this "leak" of VFS structs into the internals of btrfs first it's necessary to eliminate all uses of struct inode for the purpose of inode. This patch does that by using BTRFS_I to convert an inode to btrfs_inode. With this problem eliminated subsequent patches will start eliminating the passing of struct inode altogether, eventually resulting in a lot cleaner code. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> [ fix btrfs_get_extent tracepoint prototype ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:51 +01:00
David Sterba	823bb20ab4	btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE The expression is open-coded in several places, this asks for a wrapper. As we know the MAX_EXTENT fits to u32, we can use the appropirate division helper. This cascades to the result type updates. Compiler is clever enough to use shift instead of integer division, so there's no change in the generated assembly. Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:51 +01:00
Jeff Mahoney	fef394f75b	btrfs: drop unused extent_op arg from btrfs_add_delayed_data_ref btrfs_add_delayed_data_ref is always called with a NULL extent_op, so let's drop the argument. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-02-14 15:50:50 +01:00
Liu Bo	e321f8a801	Btrfs: use down_read_nested to make lockdep silent If @block_group is not @used_bg, it'll try to get @used_bg's lock without droping @block_group 's lock and lockdep has throwed a scary deadlock warning about it. Fix it by using down_read_nested. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-01-03 15:19:17 +01:00
Jeff Mahoney	d028099643	btrfs: fix locking when we put back a delayed ref that's too new In __btrfs_run_delayed_refs, when we put back a delayed ref that's too new, we have already dropped the lock on locked_ref when we set ->processing = 0. This patch keeps the lock to cover that assignment. Fixes: `d7df2c796d` (Btrfs: attach delayed ref updates to delayed ref heads) Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-01-03 15:14:21 +01:00
Jeff Mahoney	aa7c8da35d	btrfs: fix error handling when run_delayed_extent_op fails In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op fails sets locked_ref->processing = 0 but doesn't re-increment delayed_refs->num_heads_ready. As a result, we end up triggering the WARN_ON in btrfs_select_ref_head. Fixes: `d7df2c796d` (Btrfs: attach delayed ref updates to delayed ref heads) Reported-by: Jon Nelson <jnelson-suse@jamponi.net> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-01-03 15:14:08 +01:00
David Sterba	34441361c4	btrfs: opencode chunk locking, remove helpers The helpers are trivial and we don't use them consistently. Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:07:00 +01:00
Jeff Mahoney	3a45bb207e	btrfs: remove root parameter from transaction commit/end routines Now we only use the root parameter to print the root objectid in a tracepoint. We can use the root parameter from the transaction handle for that. It's also used to join the transaction with async commits, so we remove the comment that it's just for checking. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:07:00 +01:00
Jeff Mahoney	2ff7e61e0d	btrfs: take an fs_info directly when the root is not used otherwise There are loads of functions in btrfs that accept a root parameter but only use it to obtain an fs_info pointer. Let's convert those to just accept an fs_info pointer directly. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:59 +01:00
Jeff Mahoney	afdb571890	btrfs: simplify btrfs_wait_cache_io prototype With the exception of the one case where btrfs_wait_cache_io is called without a block group, it's called with the same arguments. The root argument is only used in the special case, so let's factor out the core and simplify the call in the normal case to require a trans, block group, and path. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:59 +01:00
Jeff Mahoney	71ff6437c2	btrfs: convert extent-tree tracepoints to use fs_info The extent-tree tracepoints all operate on the extent root, regardless of which root is passed in. Let's just use the extent root objectid instead. If it turns out that nobody is depending on the format of this tracepoint, we can drop the root printing entirely. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:59 +01:00
Jeff Mahoney	0b246afa62	btrfs: root->fs_info cleanup, add fs_info convenience variables In routines where someptr->fs_info is referenced multiple times, we introduce a convenience variable. This makes the code considerably more readable. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:59 +01:00
Jeff Mahoney	6202df6921	btrfs: root->fs_info cleanup, update_block_group{,flags} Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:58 +01:00
Jeff Mahoney	3796d33535	btrfs: root->fs_info cleanup, lock/unlock_chunks Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:58 +01:00
Jeff Mahoney	27965b6c2c	btrfs: root->fs_info cleanup, btrfs_calc_{trans,trunc}_metadata_size Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:58 +01:00
Jeff Mahoney	da17066c40	btrfs: pull node/sector/stripe sizes out of root and into fs_info We track the node sizes per-root, but they never vary from the values in the superblock. This patch messes with the 80-column style a bit, but subsequent patches to factor out root->fs_info into a convenience variable fix it up again. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:58 +01:00
Jeff Mahoney	fb456252d3	btrfs: root->fs_info cleanup, use fs_info->dev_root everywhere Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:58 +01:00
Jeff Mahoney	2b2e27eb92	btrfs: alloc_reserved_file_extent trace point should use extent_root Even though a separate root is passed in, we're still operating on the extent root. Let's use that for the trace point. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:57 +01:00
Jeff Mahoney	6bccf3ab1e	btrfs: call functions that always use the same root with fs_info instead There are many functions that are always called with the same root argument. Rather than passing the same root every time, we can pass an fs_info pointer instead and have the function get the root pointer itself. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:57 +01:00
Jeff Mahoney	5b4aacefb8	btrfs: call functions that overwrite their root parameter with fs_info There are 11 functions that accept a root parameter and immediately overwrite it. We can pass those an fs_info pointer instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-12-06 16:06:57 +01:00
Wang Xiaoguang	1d57ee9416	btrfs: improve delayed refs iterations This issue was found when I tried to delete a heavily reflinked file, when deleting such files, other transaction operation will not have a chance to make progress, for example, start_transaction() will blocked in wait_current_trans(root) for long time, sometimes it even triggers soft lockups, and the time taken to delete such heavily reflinked file is also very large, often hundreds of seconds. Using perf top, it reports that: PerfTop: 7416 irqs/sec kernel:99.8% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs) --------------------------------------------------------------------------------------- 84.37% [btrfs] [k] __btrfs_run_delayed_refs.constprop.80 11.02% [kernel] [k] delay_tsc 0.79% [kernel] [k] _raw_spin_unlock_irq 0.78% [kernel] [k] _raw_spin_unlock_irqrestore 0.45% [kernel] [k] do_raw_spin_lock 0.18% [kernel] [k] __slab_alloc It seems __btrfs_run_delayed_refs() took most cpu time, after some debug work, I found it's select_delayed_ref() causing this issue, for a delayed head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but select_delayed_ref() will firstly try to iterate node list to find BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and waste much time. To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head, then in select_delayed_ref(), if this list is not empty, we can directly use nodes in this list. With this patch, it just took about 10~15 seconds to delte the same file. Now using perf top, it reports that: PerfTop: 2734 irqs/sec kernel:99.5% exact: 0.0% [4000Hz cpu-clock], (all, 4 CPUs) ---------------------------------------------------------------------------------------- 20.74% [kernel] [k] _raw_spin_unlock_irqrestore 16.33% [kernel] [k] __slab_alloc 5.41% [kernel] [k] lock_acquired 4.42% [kernel] [k] lock_acquire 4.05% [kernel] [k] lock_release 3.37% [kernel] [k] _raw_spin_unlock_irq For normal files, this patch also gives help, at least we do not need to iterate whole list to found BTRFS_ADD_DELAYED_REF nodes. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:21 +01:00
Qu Wenruo	33d1f05ccb	btrfs: Export and move leaf/subtree qgroup helpers to qgroup.c Move account_shared_subtree() to qgroup.c and rename it to btrfs_qgroup_trace_subtree(). Do the same thing for account_leaf_items() and rename it to btrfs_qgroup_trace_leaf_items(). Since all these functions are only for qgroup, move them to qgroup.c and export them is more appropriate. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:21 +01:00
Qu Wenruo	50b3e040b7	btrfs: qgroup: Rename functions to make it follow reserve,trace,account steps Rename btrfs_qgroup_insert_dirty_extent(_nolock) to btrfs_qgroup_trace_extent(_nolock), according to the new reserve/trace/account naming schema. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:21 +01:00
Jeff Mahoney	0c476a5d7f	btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount with qgroups enabled. I was able to reproduce this by setting up a small (~500kb) quota limit and writing a file one byte at a time until I hit the limit. The warnings would all hit on umount. The root cause is that we would reserve a block-sized range in both the reservation and the quota in btrfs_check_data_free_space, but if we encountered a problem (like e.g. EDQUOT), we would only release the single byte in the qgroup reservation. That caused an iotree state split, which increased the number of outstanding extents, in turn disallowing releasing the metadata reservation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:19 +01:00
David Sterba	b159fa2808	btrfs: remove constant parameter to memset_extent_buffer and rename it The only memset we do is to 0, so sink the parameter to the function and simplify all calls. Rename the function to reflect the behaviour. Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:17 +01:00
David Sterba	62d1f9fe97	btrfs: remove trivial helper btrfs_find_tree_block During the time, the function has been shrunk to the point that it just calls find_extent_buffer, just passing the parameters. Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:16 +01:00
Xiaoguang Wang	745699ef62	btrfs: remove useless comments Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely") Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-30 13:45:14 +01:00
Wang Xiaoguang	dc1a90c6aa	btrfs: cleanup: use already calculated value in btrfs_should_throttle_delayed_refs() Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-29 14:10:38 +01:00
Christoph Hellwig	cf8cddd38b	btrfs: don't abuse REQ_OP_* flags for btrfs_map_block btrfs_map_block supports different types of mappings, which to a large extent resemble block layer operations. But they don't always do, and currently btrfs dangerously overlays it's own flag over the block layer flags. This is just asking for a conflict, so introduce a different map flags enum inside of btrfs instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-11-29 14:10:38 +01:00
Wang Xiaoguang	9d1032cc49	btrfs: fix WARNING in btrfs_select_ref_head() This issue was found when testing in-band dedupe enospc behaviour, sometimes run_one_delayed_ref() may fail for enospc reason, then __btrfs_run_delayed_refs（）will return, but forget to add num_heads_read back, which will trigger "WARN_ON(delayed_refs->num_heads_ready == 0)" in btrfs_select_ref_head(). Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-10-24 18:20:29 +02:00
Chris Mason	19c4d2f994	Revert "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" This reverts commit `5d8eb6fe51`. When we remove devices, we free the device structures. Delaying btfs_remove_chunk() ends up hitting a use-after-free on them. Signed-off-by: Chris Mason <clm@fb.com>	2016-10-10 13:43:31 -07:00
Josef Bacik	4867268c57	Btrfs: don't BUG() during drop snapshot Really there's lots of things that can go wrong here, kill all the BUG_ON()'s and replace the logic ones with ASSERT()'s and return EIO instead. Signed-off-by: Josef Bacik <jbacik@fb.com> [ switched to btrfs_err, errors go to common label ] Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues	6cea66e544	btrfs: Remove already completed TODO comment Fixes: `7cf5b97650` ("btrfs: qgroup: Cleanup old inaccurate facilities") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Goldwyn Rodrigues	dd12d5b804	btrfs: Do not reassign count in btrfs_run_delayed_refs Code cleanup. count is already (unsgined long)-1. That is the reason run_all was set. Do not reassign it (unsigned long)-1. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Liu Bo	a958eab0ed	Btrfs: fix memory leak in do_walk_down The extent buffer 'next' needs to be free'd conditionally. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:06 +02:00
Jeff Mahoney	ab8d0fc48d	btrfs: convert pr_* to btrfs_* where possible For many printks, we want to know which file system issued the message. This patch converts most pr_* calls to use the btrfs_* versions instead. In some cases, this means adding plumbing to allow call sites access to an fs_info pointer. fs/btrfs/check-integrity.c is left alone for another day. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 19:37:04 +02:00
Jeff Mahoney	62e855771d	btrfs: convert printk(KERN_* to use pr_* calls This patch converts printk(KERN_* style messages to use the pr_* versions. One side effect is that anything that was KERN_DEBUG is now automatically a dynamic debug message. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Jeff Mahoney	5d163e0e68	btrfs: unsplit printed strings CodingStyle chapter 2: "[...] never break user-visible strings such as printk messages, because that breaks the ability to grep for them." This patch unsplits user-visible strings. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Liu Bo	02794222c4	Btrfs: kill BUG_ON in run_delayed_tree_ref In a corrupted btrfs image, we can come across this BUG_ON and get an unreponsive system, but if we return errors instead, its caller can handle everything gracefully by aborting the current transaction. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:44 +02:00
Masahiro Yamada	e2c8990734	btrfs: squash lines for simple wrapper functions Remove unneeded variables and assignments. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 18:08:38 +02:00
Josef Bacik	afcdd129e0	Btrfs: add a flags field to btrfs_fs_info We have a lot of random ints in btrfs_fs_info that can be put into flags. This is mostly equivalent with the exception of how we deal with quota going on or off, now instead we set a flag when we are turning it on or off and deal with that appropriately, rather than just having a pending state that the current quota_enabled gets set to. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Luis Henriques	1f079fa2f8	btrfs: Fix warning "variable ‘blocksize’ set but not used" Variable 'blocksize' in reada_walk_down() is not used since commit `d3e46fea1b` ("btrfs: sink blocksize parameter to readahead_tree_block"). This patch simply removes this variable. Signed-off-by: Luis Henriques <luis.henriques@canonical.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Naohiro Aota	5d8eb6fe51	btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs Currently, btrfs_relocate_chunk() is removing relocated BG by itself. But the work can be done by btrfs_delete_unused_bgs() (and it's better since it trim the BG). Let's dedupe the code. While btrfs_delete_unused_bgs() is already hitting the relocated BG, it skip the BG since the BG has "ro" flag set (to keep balancing BG intact). On the other hand, btrfs cannot drop "ro" flag here to prevent additional writes. So this patch make use of "removed" flag. btrfs_delete_unused_bgs() now detect the flag to distinguish whether a read-only BG is relocating or not. Signed-off-by: Naohiro Aota <naohiro.aota@hgst.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Liu Bo	49303381f1	Btrfs: bail out if block group has different mixed flag Currently we allow inconsistence about mixed flag (BTRFS_BLOCK_GROUP_METADATA \| BTRFS_BLOCK_GROUP_DATA). We'd get ENOSPC if block group has mixed flag and btrfs doesn't. If that happens, we have one space_info with mixed flag and another space_info only with BTRFS_BLOCK_GROUP_METADATA, and global_block_rsv.space_info points to the latter one, but all bytes from block_group contributes to the mixed space_info, thus all the allocation will fail with ENOSPC. This adds a check for the above case. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> [ updated message ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Liu Bo	c79a175175	Btrfs: fix memory leak of block group cache While processing delayed refs, we may update block group's statistics and attach it to cur_trans->dirty_bgs, and later writing dirty block groups will process the list, which happens during btrfs_commit_transaction(). For whatever reason, the transaction is aborted and dirty_bgs is not processed in cleanup_transaction(), we end up with memory leak of these dirty block group cache. Since btrfs_start_dirty_block_groups() doesn't make it go to the commit critical section, this also adds the cleanup work inside it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-26 17:59:49 +02:00
Linus Torvalds	b22734a550	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Josef fixed a problem when quotas are enabled with his latest ENOSPC rework, and Jeff added more checks into the subvol ioctls to avoid tripping up lookup_one_len" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: ensure that file descriptor used with subvol ioctls is a dir Btrfs: handle quota reserve failure properly	2016-09-23 13:39:37 -07:00
Josef Bacik	1e5ec2e709	Btrfs: handle quota reserve failure properly btrfs/022 was spitting a warning for the case that we exceed the quota. If we fail to make our quota reservation we need to clean up our data space reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Tested-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-09-21 17:22:16 -07:00
Linus Torvalds	f4a9c169c2	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I'm not proud of how long it took me to track down that one liner in btrfs_sync_log(), but the good news is the patches I was trying to blame for these problems were actually fine (sorry Filipe)" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns btrfs: do not decrease bytes_may_use when replaying extents	2016-09-09 12:52:31 -07:00
Wang Xiaoguang	ce129655c9	btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress In btrfs_async_reclaim_metadata_space(), we use ticket's address to determine whether asynchronous metadata reclaim work is making progress. ticket = list_first_entry(&space_info->tickets, struct reserve_ticket, list); if (last_ticket == ticket) { flush_state++; } else { last_ticket = ticket; flush_state = FLUSH_DELAYED_ITEMS_NR; if (commit_cycles) commit_cycles--; } But indeed it's wrong, we should not rely on local variable's address to do this check, because addresses may be same. In my test environment, I dd one 168MB file in a 256MB fs, found that for this file, every time wait_reserve_ticket() called, local variable ticket's address is same, For above codes, assume a previous ticket's address is addrA, last_ticket is addrA. Btrfs_async_reclaim_metadata_space() finished this ticket and wake up it, then another ticket is added, but with the same address addrA, now last_ticket will be same to current ticket, then current ticket's flush work will start from current flush_state, not initial FLUSH_DELAYED_ITEMS_NR, which may result in some enospc issues(I have seen this in my test machine). Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-06 16:31:43 +02:00
Wang Xiaoguang	ed7a694839	btrfs: do not decrease bytes_may_use when replaying extents When replaying extents, there is no need to update bytes_may_use in btrfs_alloc_logged_file_extent(), otherwise it'll trigger a WARN_ON about bytes_may_use. Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely") Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-05 17:40:41 +02:00
Linus Torvalds	4b30b6d126	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I'm still prepping a set of fixes for btrfs fsync, just nailing down a hard to trigger memory corruption. For now, these are tested and ready." * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket() Btrfs: fix endless loop in balancing block groups Btrfs: kill invalid ASSERT() in process_all_refs()	2016-09-03 12:40:45 -07:00
Wang Xiaoguang	e0af24849e	btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket() If can_overcommit() in btrfs_calc_reclaim_metadata_size() returns true, btrfs_async_reclaim_metadata_space() will not reclaim metadata space, just return directly and also forget to wake up process which are waiting for their tickets, so these processes will wait endlessly. Fstests case generic/172 with mount option "-o compress=lzo" have revealed this bug in my test machine. Here if we have tickets to handle, we must handle them first. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-09-01 17:23:24 +02:00
Linus Torvalds	28687b935e	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've queued up a few different fixes in here. These range from enospc corners to fsync and quota fixes, and a few targeted at error handling for corrupt metadata/fuzzing" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix lockdep warning on deadlock against an inode's log mutex Btrfs: detect corruption when non-root leaf has zero item Btrfs: check btree node's nritems btrfs: don't create or leak aliased root while cleaning up orphans Btrfs: fix em leak in find_first_block_group btrfs: do not background blkdev_put() Btrfs: clarify do_chunk_alloc()'s return value btrfs: fix fsfreeze hang caused by delayed iputs deal btrfs: update btrfs_space_info's bytes_may_use timely btrfs: divide btrfs_update_reserved_bytes() into two functions btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() btrfs: qgroup: Fix qgroup incorrectness caused by log replay btrfs: relocation: Fix leaking qgroups numbers on data extents btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() btrfs: waiting on qgroup rescan should not always be interruptible btrfs: properly track when rescan worker is running btrfs: flush_space: treat return value of do_chunk_alloc properly Btrfs: add ASSERT for block group's memory leak btrfs: backref: Fix soft lockup in __merge_refs function Btrfs: fix memory leak of reloc_root	2016-08-26 20:22:01 -07:00
Josef Bacik	187ee58c62	Btrfs: fix em leak in find_first_block_group We need to call free_extent_map() on the em we look up. Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-08-25 03:58:29 -07:00
Liu Bo	28b737f6ed	Btrfs: clarify do_chunk_alloc()'s return value Function start_transaction() can return ERR_PTR(1) when flush is BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is start_transaction (return ERR_PTR(1)) -> btrfs_block_rsv_add (return 1) -> reserve_metadata_bytes (return 1) -> flush_space (return 1) -> do_chunk_alloc (return 1) With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the flush_state of ALLOC_CHUNK and it successfully allocates a new chunk, then instead of trying to reserve space again, reserve_metadata_bytes returns 1 immediately. Eventually the callers who call start_transaction() usually just do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get a panic when dereferencing a pointer which is ERR_PTR(1). The following patch fixes the above problem. "btrfs: flush_space: treat return value of do_chunk_alloc properly" https://patchwork.kernel.org/patch/7778651/ This add comments to clarify do_chunk_alloc()'s return value. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-08-25 03:58:27 -07:00
Wang Xiaoguang	18513091af	btrfs: update btrfs_space_info's bytes_may_use timely This patch can fix some false ENOSPC errors, below test script can reproduce one false ENOSPC error: #!/bin/bash dd if=/dev/zero of=fs.img bs=$((10241024)) count=128 dev=$(losetup --show -f fs.img) mkfs.btrfs -f -M $dev mkdir /tmp/mntpoint mount $dev /tmp/mntpoint cd /tmp/mntpoint xfs_io -f -c "falloc 0 $((641024*1024))" testfile Above script will fail for ENOSPC reason, but indeed fs still has free space to satisfy this request. Please see call graph: btrfs_fallocate() \|-> btrfs_alloc_data_chunk_ondemand() \| bytes_may_use += 64M \|-> btrfs_prealloc_file_range() \|-> btrfs_reserve_extent() \|-> btrfs_add_reserved_bytes() \| alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not \| change bytes_may_use, and bytes_reserved += 64M. Now \| bytes_may_use + bytes_reserved == 128M, which is greater \| than btrfs_space_info's total_bytes, false enospc occurs. \| Note, the bytes_may_use decrease operation will be done in \| end of btrfs_fallocate(), which is too late. Here is another simple case for buffered write: CPU 1 \| CPU 2 \| \|-> cow_file_range() \|-> __btrfs_buffered_write() \|-> btrfs_reserve_extent() \| \| \| \| \| \| \| \| \| ..... \| \|-> btrfs_check_data_free_space() \| \| \| \| \|-> extent_clear_unlock_delalloc() \| In CPU 1, btrfs_reserve_extent()->find_free_extent()-> btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease operation will be delayed to be done in extent_clear_unlock_delalloc(). Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's btrfs_check_data_free_space() tries to reserve 100MB data space. If 100MB > data_sinfo->total_bytes - data_sinfo->bytes_used - data_sinfo->bytes_reserved - data_sinfo->bytes_pinned - data_sinfo->bytes_readonly - data_sinfo->bytes_may_use btrfs_check_data_free_space() will try to allcate new data chunk or call btrfs_start_delalloc_roots(), or commit current transaction in order to reserve some free space, obviously a lot of work. But indeed it's not necessary as long as decreasing bytes_may_use timely, we still have free space, decreasing 128M from bytes_may_use. To fix this issue, this patch chooses to update bytes_may_use for both data and metadata in btrfs_add_reserved_bytes(). For compress path, real extent length may not be equal to file content length, so introduce a ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by file content length. Then compress path can update bytes_may_use correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC and RESERVE_FREE. As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as PREALLOC, we also need to update bytes_may_use, but can not pass EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook() to update btrfs_space_info's bytes_may_use. Meanwhile __btrfs_prealloc_file_range() will call btrfs_free_reserved_data_space() internally for both sucessful and failed path, btrfs_prealloc_file_range()'s callers does not need to call btrfs_free_reserved_data_space() any more. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-08-25 03:58:26 -07:00
Wang Xiaoguang	4824f1f412	btrfs: divide btrfs_update_reserved_bytes() into two functions This patch divides btrfs_update_reserved_bytes() into btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and next patch will extend btrfs_add_reserved_bytes（）to fix some false ENOSPC error, please see later patch for detailed info. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-08-25 03:58:25 -07:00
Qu Wenruo	cb93b52cc0	btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions: 1. btrfs_qgroup_insert_dirty_extent_nolock() Almost the same with original code. For delayed_ref usage, which has delayed refs locked. Change the return value type to int, since caller never needs the pointer, but only needs to know if they need to free the allocated memory. 2. btrfs_qgroup_insert_dirty_extent() The more encapsulated version. Will do the delayed_refs lock, memory allocation, quota enabled check and other things. The original design is to keep exported functions to minimal, but since more btrfs hacks exposed, like replacing path in balance, we need to record dirty extents manually, so we have to add such functions. Also, add comment for both functions, to info developers how to keep qgroup correct when doing hacks. Cc: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-and-Tested-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-08-25 03:58:21 -07:00
Alex Lyakas	eecba891d3	btrfs: flush_space: treat return value of do_chunk_alloc properly do_chunk_alloc returns 1 when it succeeds to allocate a new chunk. But flush_space will not convert this to 0, and will also return 1. As a result, reserve_metadata_bytes will think that flush_space failed, and may potentially return this value "1" to the caller (depends how reserve_metadata_bytes was called). The caller will also treat this as an error. For example, btrfs_block_rsv_refill does: int ret = -ENOSPC; ... ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, 0); return 0; } return ret; So it will return -ENOSPC. Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-08-25 03:58:18 -07:00
Liu Bo	f3bca8028b	Btrfs: add ASSERT for block group's memory leak This adds several ASSERT()' s to report memory leak of block group cache. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2016-08-25 03:58:17 -07:00
Linus Torvalds	d58b0d980f	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull more btrfs updates from Chris Mason: "This is part two of my btrfs pull, which is some cleanups and a batch of fixes. Most of the code here is from Jeff Mahoney, making the pointers we pass around internally more consistent and less confusing overall. I noticed a small problem right before I sent this out yesterday, so I fixed it up and re-tested overnight" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (40 commits) Btrfs: fix __MAX_CSUM_ITEMS btrfs: btrfs_abort_transaction, drop root parameter btrfs: add btrfs_trans_handle->fs_info pointer btrfs: btrfs_relocate_chunk pass extent_root to btrfs_end_transaction btrfs: convert nodesize macros to static inlines btrfs: introduce BTRFS_MAX_ITEM_SIZE btrfs: cleanup, remove prototype for btrfs_find_root_ref btrfs: copy_to_sk drop unused root parameter btrfs: simpilify btrfs_subvol_inherit_props btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root btrfs: tests, require fs_info for root btrfs: tests, move initialization into tests/ btrfs: btrfs_test_opt and friends should take a btrfs_fs_info btrfs: prefix fsid to all trace events btrfs: plumb fs_info into btrfs_work btrfs: remove obsolete part of comment in statfs btrfs: hide test-only member under ifdef btrfs: Ratelimit "no csum found" info message btrfs: Add ratelimit to btrfs printing Btrfs: fix unexpected balance crash due to BUG_ON ...	2016-08-04 19:56:16 -04:00
Linus Torvalds	ba929b6646	Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is dedicated to Josef's enospc rework, which we've been testing for a few releases now. It fixes some early enospc problems and is dramatically faster. This also includes an updated fix for the delalloc accounting that happens after a fault in copy_from_user. My patch in v4.7 was almost but not quite enough" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix delalloc accounting after copy_from_user faults Btrfs: avoid deadlocks during reservations in btrfs_truncate_block Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes Btrfs: fill relocation block rsv after allocation Btrfs: always use trans->block_rsv for orphans Btrfs: change how we calculate the global block rsv Btrfs: use root when checking need_async_flush Btrfs: don't bother kicking async if there's nothing to reclaim Btrfs: fix release reserved extents trace points Btrfs: add fsid to some tracepoints Btrfs: add tracepoints for flush events Btrfs: fix delalloc reservation amount tracepoint Btrfs: trace pinned extents Btrfs: introduce ticketed enospc infrastructure Btrfs: add tracepoint for adding block groups Btrfs: warn_on for unaccounted spaces Btrfs: change delayed reservation fallback behavior Btrfs: always reserve metadata for delalloc extents Btrfs: fix callers of btrfs_block_rsv_migrate Btrfs: add bytes_readonly to the spaceinfo at once	2016-07-31 21:27:32 -04:00
Linus Torvalds	d05d7f4079	Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: - the big change is the cleanup from Mike Christie, cleaning up our uses of command types and modified flags. This is what will throw some merge conflicts - regression fix for the above for btrfs, from Vincent - following up to the above, better packing of struct request from Christoph - a 2038 fix for blktrace from Arnd - a few trivial/spelling fixes from Bart Van Assche - a front merge check fix from Damien, which could cause issues on SMR drives - Atari partition fix from Gabriel - convert cfq to highres timers, since jiffies isn't granular enough for some devices these days. From Jan and Jeff - CFQ priority boost fix idle classes, from me - cleanup series from Ming, improving our bio/bvec iteration - a direct issue fix for blk-mq from Omar - fix for plug merging not involving the IO scheduler, like we do for other types of merges. From Tahsin - expose DAX type internally and through sysfs. From Toshi and Yigal * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits) block: Fix front merge check block: do not merge requests without consulting with io scheduler block: Fix spelling in a source code comment block: expose QUEUE_FLAG_DAX in sysfs block: add QUEUE_FLAG_DAX for devices to advertise their DAX support Btrfs: fix comparison in __btrfs_map_block() block: atari: Return early for unsupported sector size Doc: block: Fix a typo in queue-sysfs.txt cfq-iosched: Charge at least 1 jiffie instead of 1 ns cfq-iosched: Fix regression in bonnie++ rewrite performance cfq-iosched: Convert slice_resid from u64 to s64 block: Convert fifo_time from ulong to u64 blktrace: avoid using timespec block/blk-cgroup.c: Declare local symbols static block/bio-integrity.c: Add #include "blk.h" block/partition-generic.c: Remove a set-but-not-used variable block: bio: kill BIO_MAX_SIZE cfq-iosched: temporarily boost queue priority for idle classes block: drbd: avoid to use BIO_MAX_SIZE block: bio: remove BIO_MAX_SECTORS ...	2016-07-26 15:03:07 -07:00
Jeff Mahoney	66642832f0	btrfs: btrfs_abort_transaction, drop root parameter __btrfs_abort_transaction doesn't use its root parameter except to obtain an fs_info pointer. We can obtain that from trans->root->fs_info for now and from trans->fs_info in a later patch. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:54:26 +02:00
Jeff Mahoney	64b6358072	btrfs: add btrfs_trans_handle->fs_info pointer btrfs_trans_handle->root is documented as for use for confirming that the root passed in to start the transaction is the same as the one ending it. It's used in several places when an fs_info pointer is needed, so let's just add an fs_info pointer directly. Eventually, the root pointer can be removed. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:54:26 +02:00
Jeff Mahoney	14a1e067b4	btrfs: introduce BTRFS_MAX_ITEM_SIZE We use BTRFS_LEAF_DATA_SIZE - sizeof(struct btrfs_item) in several places. This introduces a BTRFS_MAX_ITEM_SIZE macro to do the same. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:54:24 +02:00
Jeff Mahoney	f5ee5c9ac5	btrfs: tests, use BTRFS_FS_STATE_DUMMY_FS_INFO instead of dummy root Now that we have a dummy fs_info associated with each test that uses a root, we don't need the DUMMY_ROOT bit anymore. This lets us make choices without needing an actual root like in e.g. btrfs_find_create_tree_block. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:54:19 +02:00
Jeff Mahoney	3cdde2240d	btrfs: btrfs_test_opt and friends should take a btrfs_fs_info btrfs_test_opt and friends only use the root pointer to access the fs_info. Let's pass the fs_info directly in preparation to eliminate similar patterns all over btrfs. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:53:16 +02:00
Jeff Mahoney	bc074524e1	btrfs: prefix fsid to all trace events When using trace events to debug a problem, it's impossible to determine which file system generated a particular event. This patch adds a macro to prefix standard information to the head of a trace event. The extent_state alloc/free events are all that's left without an fs_info available. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:53:16 +02:00
David Sterba	05653ef386	btrfs: hide test-only member under ifdef Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:52:25 +02:00
Wang Xiaoguang	39581a3a1a	btrfs: fix free space calculation in dump_space_info() In btrfs, btrfs_space_info's bytes_may_use is treated as fs used space, as what we do in reserve_metadata_bytes() or btrfs_alloc_data_chunk_ondemand(), so in dump_space_info(), when calculating free space, we should also subtract btrfs_space_info's bytes_may_use. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:52:25 +02:00
Liu Bo	6fb37b756a	Btrfs: check inconsistence between chunk and block group With btrfs-corrupt-block, one can drop one chunk item and mounting will end up with a panic in btrfs_full_stripe_len(). This doesn't not remove the BUG_ON, but instead checks it a bit earlier when we find the block group item. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-26 13:52:25 +02:00
Josef Bacik	bac357dcec	Btrfs: avoid deadlocks during reservations in btrfs_truncate_block The new enospc code makes it possible to deadlock if we don't use FLUSH_LIMIT during reservations inside a transaction. This enforces the correct flush type to avoid both deadlocks and assertions Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2016-07-20 16:58:04 -07:00
Josef Bacik	8ca17f0f59	Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes We used to allow you to set FLUSH_ALL and then just wouldn't do things like commit transactions or wait on ordered extents if we noticed you were in a transaction. However now that all the flushing for FLUSH_ALL is asynchronous we've lost the ability to tell, and we could end up deadlocking. So instead use FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if we error out to preserve the previous behavior. I've also added an ASSERT() to catch anybody else who tries to do this. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	40acc3eede	Btrfs: always use trans->block_rsv for orphans This is the case all the time anyway except for relocation which could be doing a reloc root for a non ref counted root, in which case we'd end up with some random block rsv rather than the one we have our reservation in. If there isn't enough space in the block rsv we are trying to steal from we'll BUG() because we expect there to be space for the orphan to make its reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	ae2e472881	Btrfs: change how we calculate the global block rsv Traditionally we've calculated the global block rsv by guessing how much of the metadata used amount was the extent tree, and then taking the data size and figuring out how large the csum tree would have to be to hold that much data. This is imprecise and falls down on MIXED file systems as we can't trust the data used amount. This resulted in failures for xfstests generic/333 because it creates lots of clones, which explodes out the extent tree. Our global reserve calculations were woefully inaccurate in this case which meant we got into a situation where we did not have enough reserved to do our work. We know we only use the global block rsv for the extent, csum, and root trees, so just get the bytes used for these trees and use that as the basis of our global reserve. Since these are not reference counted trees the bytes_used value will be accurate. This fixed the transaction aborts seen with generic/333. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	87241c2e68	Btrfs: use root when checking need_async_flush Instead of doing fs_info->fs_root in need_async_flush, which may not be set during recovery when mounting, just pass the root itself in, which makes more sense as thats what btrfs_calc_reclaim_metadata_size takes. Signed-off-by: Josef Bacik <jbacik@fb.com> Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	d38b349c39	Btrfs: don't bother kicking async if there's nothing to reclaim We do this check when we start the async reclaimer thread, might as well check before we kick it off to save us some cycles. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	31bada7c4e	Btrfs: fix release reserved extents trace points We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which isn't quite right because we will go through and free that extent later when we unpin, so it messes up apps that are accounting for the reservation space. We were also unconditionally doing it in __btrfs_free_reserved_extent(), when we only actually free the reservation instead of pinning the extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	f376df2b7d	Btrfs: add tracepoints for flush events We want to track when we're triggering flushing from our reservation code and what flushing is being done when we start flushing. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	f485c9ee32	Btrfs: fix delalloc reservation amount tracepoint We can sometimes drop the reservation we had for our inode, so we need to remove that amount from to_reserve so that our tracepoint reports a valid amount of space. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	c51e7bb184	Btrfs: trace pinned extents Pinned extents are an important metric to keep track of for enospc. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	957780eb27	Btrfs: introduce ticketed enospc infrastructure Our enospc flushing sucks. It is born from a time where we were early enospc'ing constantly because multiple threads would race in for the same reservation and randomly starve other ones out. So I came up with this solution to block any other reservations from happening while one guy tried to flush stuff to satisfy his reservation. This gives us pretty good correctness, but completely crap latency. The solution I've come up with is ticketed reservations. Basically we try to make our reservation, and if we can't we put a ticket on a list in order and kick off an async flusher thread. This async flusher thread does the same old flushing we always did, just asynchronously. As space is freed and added back to the space_info it checks and sees if we have any tickets that need satisfying, and adds space to the tickets and wakes up anything we've satisfied. Once the flusher thread stops making progress it wakes up all the current tickets and tells them to take a hike. There is a priority list for things that can't flush, since the async flusher could do anything we need to avoid deadlocks. These guys get priority for having their reservation made, and will still do manual flushing themselves in case the async flusher isn't running. This patch gives us significantly better latencies. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	c83f8effef	Btrfs: add tracepoint for adding block groups I'm writing a tool to visualize the enospc system inside btrfs, I need this tracepoint in order to keep track of the block groups in the system. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	d555b6c380	Btrfs: warn_on for unaccounted spaces These were hidden behind enospc_debug, which isn't helpful as they indicate actual bugs, unlike the rest of the enospc_debug stuff which is really debug information. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	48c3d480e4	Btrfs: always reserve metadata for delalloc extents There are a few races in the metadata reservation stuff. First we add the bytes to the block_rsv well after we've set the bit on the inode saying that we have space for it and after we've reserved the bytes. So use the normal btrfs_block_rsv_add helper for this case. Secondly we can flush delalloc extents when we try to reserve space for our write, which means that we could have used up the space for the inode and we wouldn't know because we only check before the reservation. So instead make sure we are always reserving space for the inode update, and then if we don't need it release those bytes afterward. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	25d609f86d	Btrfs: fix callers of btrfs_block_rsv_migrate So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes. Not only this but it unconditionally changes the size of the block_rsv. This isn't a bug strictly speaking, but it makes truncate block rsv's look funny because every time we migrate bytes over its size grows, even though we only want it to be a specific size. So collapse this into one function that takes an update_size argument and make truncate and evict not update the size for consistency sake. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00
Josef Bacik	e40edf2da4	Btrfs: add bytes_readonly to the spaceinfo at once For some reason we're adding bytes_readonly to the space info after we update the space info with the block group info. This creates a tiny race where we could over-reserve space because we haven't yet taken out the bytes_readonly bit. Since we already know this information at the time we call update_space_info, just pass it along so it can be updated all at once. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>	2016-07-07 18:45:53 +02:00

1 2 3 4 5 ...

1304 Commits