linux

Author	SHA1	Message	Date
Qu Wenruo	d1b8b94a2b	btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function Quite a lot of qgroup corruption happens due to wrong time of calling btrfs_qgroup_prepare_account_extents(). Since the safest time is to call it just before btrfs_qgroup_account_extents(), there is no need to separate these 2 functions. Merging them will make code cleaner and less bug prone. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> [ changelog and comment adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:02 +02:00
Qu Wenruo	5edfd9fdc6	btrfs: qgroup: Add quick exit for non-fs extents Modify btrfs_qgroup_account_extent() to exit quicker for non-fs extents. The quick exit condition is: 1) The extent belongs to a non-fs tree Only fs-tree extents can affect qgroup numbers and is the only case where extent can be shared between different trees. Although strictly speaking extent in data-reloc or tree-reloc tree can be shared, data/tree-reloc root won't appear in the result of btrfs_find_all_roots(), so we can ignore such case. So we can check the first root in old_roots/new_roots ulist. - if we find the 1st root is a not a fs/subvol root, then we can skip the extent - if we find the 1st root is a fs/subvol root, then we must continue calculation OR 2) both 'nr_old_roots' and 'nr_new_roots' are 0 This means either such extent got allocated then freed in current transaction or it's a new reloc tree extent, whose nr_new_roots is 0. Either way it won't affect qgroup accounting and can be skipped safely. Such quick exit can make trace output more quite and less confusing: (example with fs uuid and time stamp removed) Before: ------ add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0 btrfs_qgroup_account_extent: bytenr=29556736 num_bytes=16384 nr_old_roots=0 nr_new_roots=1 ------ Extent tree block will trigger btrfs_qgroup_account_extent() trace point while no qgroup number is changed, as extent tree won't affect qgroup accounting. After: ------ add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0 ------ Now such unrelated extent won't trigger btrfs_qgroup_account_extent() trace point, making the trace less noisy. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> [ changelog and comment adjustments ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:02 +02:00
Omar Sandoval	d7eae3403f	Btrfs: rework delayed ref total_bytes_pinned accounting The total_bytes_pinned counter is completely broken when accounting delayed refs: - If two drops for the same extent are merged, we will decrement total_bytes_pinned twice but only increment it once. - If an add is merged into a drop or vice versa, we will decrement the total_bytes_pinned counter but never increment it. - If multiple references to an extent are dropped, we will account it multiple times, potentially vastly over-estimating the number of bytes that will be freed by a commit and doing unnecessary work when we're close to ENOSPC. The last issue is relatively minor, but the first two make the total_bytes_pinned counter leak or underflow very often. These accounting issues were introduced in `b150a4f10d` ("Btrfs: use a percpu to keep track of possibly pinned bytes"), but they were papered over by zeroing out the counter on every commit until `d288db5dc0` ("Btrfs: fix race of using total_bytes_pinned"). We need to make sure that an extent is accounted as pinned exactly once if and only if we will drop references to it when when the transaction is committed. Ideally we would only add to total_bytes_pinned when the last reference is dropped, but this information isn't readily available for data extents. Again, this over-estimation can lead to extra commits when we're close to ENOSPC, but it's not as bad as before. The fix implemented here is to increment total_bytes_pinned when the total refmod count for an extent goes negative and decrement it if the refmod count goes back to non-negative or after we've run all of the delayed refs for that extent. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	7be07912b3	Btrfs: return old and new total ref mods when adding delayed refs We need this to decide when to account pinned bytes. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	0a16c7d7ae	Btrfs: always account pinned bytes when dropping a tree block ref Currently, we only increment total_bytes_pinned in btrfs_free_tree_block() when dropping the last reference on the block. However, when the delayed ref is run later, we will decrement total_bytes_pinned regardless of whether it was the last reference or not. This causes the counter to underflow when the reference we dropped was not the last reference. Fix it by incrementing the counter unconditionally, which is what btrfs_free_extent() does. This makes total_bytes_pinned an overestimate when references to shared extents are dropped, but in the worst case this will just make us try to commit the transaction to try to free up space and find we didn't free enough. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	4da8b76d34	Btrfs: update total_bytes_pinned when pinning down extents The extents marked in pin_down_extent() will be unpinned later in unpin_extent_range(), which decrements total_bytes_pinned. pin_down_extent() must increment the counter to avoid underflowing it. Also adjust btrfs_free_tree_block() to avoid accounting for the same extent twice. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	55e8196a57	Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() The value of flags is one of DATA/METADATA/SYSTEM, they must exist at when add_pinned_bytes is called. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: David Sterba <dsterba@suse.com> [ added changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
Omar Sandoval	0d9f824df3	Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 There are a few places where we pass in a negative num_bytes, so make it signed for clarity. Also move it up in the file since later patches will need it there. Signed-off-by: Omar Sandoval <osandov@fb.com> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:17:01 +02:00
David Sterba	1164a9fb9c	btrfs: fix validation of XATTR_ITEM dir items The XATTR_ITEM is a type of a directory item so we use the common validator helper. Unlike other dir items, it can have data. The way the name len validation is currently implemented does not reflect that. We'd have to adjust by the data_len when comparing the read and item limits. However, this will not work for multi-item xattr dir items. Example from tree dump of generic/337: item 7 key (257 XATTR_ITEM 751495445) itemoff 15667 itemsize 147 location key (0 UNKNOWN.0 0) type XATTR transid 8 data_len 3 name_len 11 name: user.foobar data 123 location key (0 UNKNOWN.0 0) type XATTR transid 8 data_len 6 name_len 13 name: user.WvG1c1Td data qwerty location key (0 UNKNOWN.0 0) type XATTR transid 8 data_len 5 name_len 19 name: user.J3__T_Km3dVsW_ data hello At the point of btrfs_is_name_len_valid call we don't have access to the data_len value of the 2nd and 3rd sub-item. So simple btrfs_dir_data_len(leaf, di) would always return 3, although we'd need to get 6 and 5 respectively to get the claculations right. (read_end + name_len + data_len vs item_end) We'd have to also pass data_len externally, which is not point of the name validation. The last check is supposed to test if there's at least one dir item space after the one we're processing. I don't think this is particularly useful, validation of the next item would catch that too. So the check is removed and we don't weaken the validation. Now tests btrfs/048, btrfs/053, generic/273 and generic/337 pass. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-29 20:06:11 +02:00
Su Yue	fbc326159a	btrfs: Verify dir_item in iterate_object_props Call verify_dir_item before memcmp_extent_buffer reading name from dir_item. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Su Yue	64c7b01446	btrfs: Check name_len before in btrfs_del_root_ref btrfs_del_root_ref calls btrfs_search_slot and reads name from root_ref. Call btrfs_is_name_len_valid before memcmp. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Su Yue	488d7c4566	btrfs: Check name_len before reading btrfs_get_name In btrfs_get_name, there's btrfs_search_slot and reads name from inode_ref/root_ref. Call btrfs_is_name_len_valid in btrfs_get_name. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Su Yue	59b0a7f2c7	btrfs: Check name_len before read in iterate_dir_item Since iterate_dir_item checks name_len in its own way, so use btrfs_is_name_len_valid not 'verify_dir_item' to make more strict name_len check. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ switched ENAMETOOLONG to EIO ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Su Yue	3c1d418448	btrfs: Check name_len in btrfs_check_ref_name_override In btrfs_log_inode, btrfs_search_forward gets the buffer and then btrfs_check_ref_name_override will read name from ref/extref for the first time. Call btrfs_is_name_len_valid before reading name. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Su Yue	8ee8c2d62d	btrfs: Verify dir_item in replay_xattr_deletes replay_xattr_deletes calls btrfs_search_slot to get buffer and reads name. Call verify_dir_item to check name_len in replay_xattr_deletes to avoid reading out of boundary. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Su Yue	26a836cec2	btrfs: Check name_len on add_inode_ref call path replay_one_buffer first reads buffers and dispatches items accroding to the item type. In this patch, add_inode_ref handles inode_ref and inode_extref. Then add_inode_ref calls ref_get_fields and extref_get_fields to read ref/extref name for the first time. So checking name_len before reading those two is fine. add_inode_ref also calls inode_in_dir to match ref/extref in parent_dir. The call graph includes btrfs_match_dir_item_name to read dir_item name in the parent dir. Checking first dir_item is not enough. Change it to verify every dir_item while doing matches. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Su Yue	e79a33270d	btrfs: Check name_len with boundary in verify dir_item Originally, verify_dir_item verifies name_len of dir_item with fixed values but not item boundary. If corrupted name_len was not bigger than the fixed value, for example 255, the function will think the dir_item is fine. And then reading beyond boundary will cause crash. Example: 1. Corrupt one dir_item name_len to be 255. 2. Run 'ls -lar /mnt/test/ > /dev/null' dmesg: [ 48.451449] BTRFS info (device vdb1): disk space caching is enabled [ 48.451453] BTRFS info (device vdb1): has skinny extents [ 48.489420] general protection fault: 0000 [#1] SMP [ 48.489571] Modules linked in: ext4 jbd2 mbcache btrfs xor raid6_pq [ 48.489716] CPU: 1 PID: 2710 Comm: ls Not tainted 4.10.0-rc1 #5 [ 48.489853] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 [ 48.490008] task: ffff880035df1bc0 task.stack: ffffc90004800000 [ 48.490008] RIP: 0010:read_extent_buffer+0xd2/0x190 [btrfs] [ 48.490008] RSP: 0018:ffffc90004803d98 EFLAGS: 00010202 [ 48.490008] RAX: 000000000000001b RBX: 000000000000001b RCX: 0000000000000000 [ 48.490008] RDX: ffff880079dbf36c RSI: 0005080000000000 RDI: ffff880079dbf368 [ 48.490008] RBP: ffffc90004803dc8 R08: ffff880078e8cc48 R09: ffff880000000000 [ 48.490008] R10: 0000160000000000 R11: 0000000000001000 R12: ffff880079dbf288 [ 48.490008] R13: ffff880078e8ca88 R14: 0000000000000003 R15: ffffc90004803e20 [ 48.490008] FS: 00007fef50c60800(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000 [ 48.490008] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 48.490008] CR2: 000055f335ac2ff8 CR3: 000000007356d000 CR4: 00000000001406e0 [ 48.490008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 48.490008] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 48.490008] Call Trace: [ 48.490008] btrfs_real_readdir+0x3b7/0x4a0 [btrfs] [ 48.490008] iterate_dir+0x181/0x1b0 [ 48.490008] SyS_getdents+0xa7/0x150 [ 48.490008] ? fillonedir+0x150/0x150 [ 48.490008] entry_SYSCALL_64_fastpath+0x18/0xad [ 48.490008] RIP: 0033:0x7fef5032546b [ 48.490008] RSP: 002b:00007ffeafcdb830 EFLAGS: 00000206 ORIG_RAX: 000000000000004e [ 48.490008] RAX: ffffffffffffffda RBX: 00007fef5061db38 RCX: 00007fef5032546b [ 48.490008] RDX: 0000000000008000 RSI: 000055f335abaff0 RDI: 0000000000000003 [ 48.490008] RBP: 00007fef5061dae0 R08: 00007fef5061db48 R09: 0000000000000000 [ 48.490008] R10: 000055f335abafc0 R11: 0000000000000206 R12: 00007fef5061db38 [ 48.490008] R13: 0000000000008040 R14: 00007fef5061db38 R15: 000000000000270e [ 48.490008] RIP: read_extent_buffer+0xd2/0x190 [btrfs] RSP: ffffc90004803d98 [ 48.499455] ---[ end trace 321920d8e8339505 ]--- Fix it by adding a parameter @slot and check name_len with item boundary by calling btrfs_is_name_len_valid. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> rev Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
Su Yue	19c6dcbfa7	btrfs: Introduce btrfs_is_name_len_valid to avoid reading beyond boundary Introduce function btrfs_is_name_len_valid. The function compares parameter @name_len with item boundary then returns true if name_len is valid. Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> [ s/btrfs_leaf_data/BTRFS_LEAF_DATA_OFFSET/ ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:16:04 +02:00
David Sterba	66b4993e95	btrfs: move dev stats accounting out of wait_dev_flush We should really just wait in wait_dev_flush and let the caller decide what to do with the error value. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:03:39 +02:00
David Sterba	2980d5745f	btrfs: account as waiting for IO, while waiting fot the flush bio completion Similar to what submit_bio_wait does, we should account for IO while waiting for a bio completion. This has marginal visible effects, flush bio is short-lived. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:03:39 +02:00
David Sterba	e0ae999414	btrfs: preallocate device flush bio For devices that support flushing, we allocate a bio, submit, wait for it and then free it. The bio allocation does not fail so ENOMEM is not a problem but we still may unnecessarily stress the allocation subsystem. Instead, we can allocate the bio at the same time we allocate the device and reuse it each time we need to flush the barriers. The bio is reset before each use. Reference counting is simplified to just device allocation (get) and freeing (put). The bio used to be submitted through the integrity checker which will find out that bio has no data attached and call submit_bio. Status of the bio in flight needs to be tracked separately in case the device caches get switched off between write and wait. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 19:03:38 +02:00
Filipe Manana	fdb1388994	Btrfs: incremental send, fix invalid path for unlink commands An incremental send can contain unlink operations with an invalid target path when we rename some directory inode A, then rename some file inode B to the old name of inode A and directory inode A is an ancestor of inode B in the parent snapshot (but not anymore in the send snapshot). Consider the following example scenario where this issue happens. Parent snapshot: . (ino 256) \| \|--- dir1/ (ino 257) \|--- dir2/ (ino 258) \| \|--- file1 (ino 259) \| \|--- file3 (ino 261) \| \|--- dir3/ (ino 262) \|--- file22 (ino 260) \|--- dir4/ (ino 263) Send snapshot: . (ino 256) \| \|--- dir1/ (ino 257) \|--- dir2/ (ino 258) \|--- dir3 (ino 260) \|--- file3/ (ino 262) \|--- dir4/ (ino 263) \|--- file11 (ino 269) \|--- file33 (ino 261) When attempting to apply the corresponding incremental send stream, an unlink operation contains an invalid path which makes the receiver fail. The following is verbose output of the btrfs receive command: receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...) utimes utimes dir1 utimes dir1/dir2 link dir1/dir3/dir4/file11 -> dir1/dir2/file1 unlink dir1/dir2/file1 utimes dir1/dir2 truncate dir1/dir3/dir4/file11 size=0 utimes dir1/dir3/dir4/file11 rename dir1/dir3 -> o262-7-0 link dir1/dir3 -> o262-7-0/file22 unlink dir1/dir3/file22 ERROR: unlink dir1/dir3/file22 failed. Not a directory The following steps happen during the computation of the incremental send stream the lead to this issue: 1) Before we start processing the new and deleted references for inode 260, we compute the full path of the deleted reference ("dir1/dir3/file22") and cache it in the list of deleted references for our inode. 2) We then start processing the new references for inode 260, for which there is only one new, located at "dir1/dir3". When processing this new reference, we check that inode 262, which was not yet processed, collides with the new reference and because of that we orphanize inode 262 so its new full path becomes "o262-7-0". 3) After the orphanization of inode 262, we create the new reference for inode 260 by issuing a link command with a target path of "dir1/dir3" and a source path of "o262-7-0/file22". 4) We then start processing the deleted references for inode 260, for which there is only one with the base name of "file22", and issue an unlink operation containing the target path computed at step 1, which is wrong because that path no longer exists and should be replaced with "o262-7-0/file22". So fix this issue by recomputing the full path of deleted references if when we processed the new references for an inode we ended up orphanizing any other inode that is an ancestor of our inode in the parent snapshot. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> [ adjusted after prev patch removed fs_path::dir_path and dir_path_len ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 16:53:10 +02:00
Filipe Manana	72c3668fed	Btrfs: send, fix invalid path after renaming and linking file Currently an incremental snapshot can generate link operations which contain an invalid target path. Such case happens when in the send snapshot a file was renamed, a new hard link added for it and some other inode (with a lower number) got renamed to the former name of that file. Example: Parent snapshot . (ino 256) \| \|--- f1 (ino 257) \|--- f2 (ino 258) \|--- f3 (ino 259) Send snapshot . (ino 256) \| \|--- f2 (ino 257) \|--- f3 (ino 258) \|--- f4 (ino 259) \|--- f5 (ino 258) The following steps happen when computing the incremental send stream: 1) When processing inode 257, inode 258 is orphanized (renamed to "o258-7-0"), because its current reference has the same name as the new reference for inode 257; 2) When processing inode 258, we iterate over all its new references, which have the names "f3" and "f5". The first iteration sees name "f5" and renames the inode from its orphan name ("o258-7-0") to "f5", while the second iteration sees the name "f3" and, incorrectly, issues a link operation with a target name matching the orphan name, which no longer exists. The first iteration had reset the current valid path of the inode to "f5", but in the second iteration we lost it because we found another inode, with a higher number of 259, which has a reference named "f3" as well, so we orphanized inode 259 and recomputed the current valid path of inode 258 to its old orphan name because inode 259 could be an ancestor of inode 258 and therefore the current valid path could contain the pre-orphanization name of inode 259. However in this case inode 259 is not an ancestor of inode 258 so the current valid path should not be recomputed. This makes the receiver fail with the following error: ERROR: link f3 -> o258-7-0 failed: No such file or directory So fix this by not recomputing the current valid path for an inode whenever we find a colliding reference from some not yet processed inode (inode number higher then the one currently being processed), unless that other inode is an ancestor of the one we are currently processing. A test case for fstests will follow soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 16:53:03 +02:00
Filipe Manana	609805d809	Btrfs: fix invalid extent maps due to hole punching While punching a hole in a range that is not aligned with the sector size (currently the same as the page size) we can end up leaving an extent map in memory with a length that is smaller then the sector size or with a start offset that is not aligned to the sector size. Both cases are not expected and can lead to problems. This issue is easily detected after the patch from commit `a7e3b975a0` ("Btrfs: fix reported number of inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the following for example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo $ xfs_io -c "fpunch 60K 90K" /mnt/foo $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo $ umount /mnt After the unmount operation we can see several warnings emmitted due to underflows related to space reservation counters: [ 2837.443299] ------------[ cut here ]------------ [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs] [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1 [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 2837.462379] Call Trace: [ 2837.462379] dump_stack+0x68/0x92 [ 2837.462379] __warn+0xc2/0xdd [ 2837.462379] warn_slowpath_null+0x1d/0x1f [ 2837.462379] btrfs_destroy_inode+0xe8/0x27e [btrfs] [ 2837.462379] destroy_inode+0x3d/0x55 [ 2837.462379] evict+0x177/0x17e [ 2837.462379] dispose_list+0x50/0x71 [ 2837.462379] evict_inodes+0x132/0x141 [ 2837.462379] generic_shutdown_super+0x3f/0xeb [ 2837.462379] kill_anon_super+0x12/0x1c [ 2837.462379] btrfs_kill_super+0x16/0x21 [btrfs] [ 2837.462379] deactivate_locked_super+0x30/0x68 [ 2837.462379] deactivate_super+0x36/0x39 [ 2837.462379] cleanup_mnt+0x58/0x76 [ 2837.462379] __cleanup_mnt+0x12/0x14 [ 2837.462379] task_work_run+0x77/0x9b [ 2837.462379] prepare_exit_to_usermode+0x9d/0xc5 [ 2837.462379] syscall_return_slowpath+0x196/0x1b9 [ 2837.462379] entry_SYSCALL_64_fastpath+0xab/0xad [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7 [ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7 [ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910 [ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015 [ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64 [ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0 [ 2837.519355] ---[ end trace e79345fe24b30b8d ]--- [ 2837.596256] ------------[ cut here ]------------ [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs] [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1 [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 2837.663359] Call Trace: [ 2837.663359] dump_stack+0x68/0x92 [ 2837.663359] __warn+0xc2/0xdd [ 2837.663359] warn_slowpath_null+0x1d/0x1f [ 2837.663359] btrfs_free_block_groups+0x246/0x3eb [btrfs] [ 2837.663359] close_ctree+0x1dd/0x2e1 [btrfs] [ 2837.663359] ? evict_inodes+0x132/0x141 [ 2837.663359] btrfs_put_super+0x15/0x17 [btrfs] [ 2837.663359] generic_shutdown_super+0x6a/0xeb [ 2837.663359] kill_anon_super+0x12/0x1c [ 2837.663359] btrfs_kill_super+0x16/0x21 [btrfs] [ 2837.663359] deactivate_locked_super+0x30/0x68 [ 2837.663359] deactivate_super+0x36/0x39 [ 2837.663359] cleanup_mnt+0x58/0x76 [ 2837.663359] __cleanup_mnt+0x12/0x14 [ 2837.663359] task_work_run+0x77/0x9b [ 2837.663359] prepare_exit_to_usermode+0x9d/0xc5 [ 2837.663359] syscall_return_slowpath+0x196/0x1b9 [ 2837.663359] entry_SYSCALL_64_fastpath+0xab/0xad [ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7 [ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7 [ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910 [ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015 [ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64 [ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0 [ 2837.739445] ---[ end trace e79345fe24b30b8e ]--- [ 2837.745595] ------------[ cut here ]------------ [ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs] [ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1 [ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 2837.758526] Call Trace: [ 2837.758925] dump_stack+0x68/0x92 [ 2837.759383] __warn+0xc2/0xdd [ 2837.759383] warn_slowpath_null+0x1d/0x1f [ 2837.759383] btrfs_free_block_groups+0x261/0x3eb [btrfs] [ 2837.759383] close_ctree+0x1dd/0x2e1 [btrfs] [ 2837.759383] ? evict_inodes+0x132/0x141 [ 2837.759383] btrfs_put_super+0x15/0x17 [btrfs] [ 2837.759383] generic_shutdown_super+0x6a/0xeb [ 2837.759383] kill_anon_super+0x12/0x1c [ 2837.759383] btrfs_kill_super+0x16/0x21 [btrfs] [ 2837.759383] deactivate_locked_super+0x30/0x68 [ 2837.759383] deactivate_super+0x36/0x39 [ 2837.759383] cleanup_mnt+0x58/0x76 [ 2837.759383] __cleanup_mnt+0x12/0x14 [ 2837.759383] task_work_run+0x77/0x9b [ 2837.759383] prepare_exit_to_usermode+0x9d/0xc5 [ 2837.759383] syscall_return_slowpath+0x196/0x1b9 [ 2837.759383] entry_SYSCALL_64_fastpath+0xab/0xad [ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7 [ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7 [ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910 [ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015 [ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64 [ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0 [ 2837.777063] ---[ end trace e79345fe24b30b8f ]--- [ 2837.778235] ------------[ cut here ]------------ [ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs] [ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G W 4.10.0-rc8-btrfs-next-43+ #1 [ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 2837.800118] Call Trace: [ 2837.800515] dump_stack+0x68/0x92 [ 2837.801015] __warn+0xc2/0xdd [ 2837.801471] warn_slowpath_null+0x1d/0x1f [ 2837.801698] btrfs_free_block_groups+0x348/0x3eb [btrfs] [ 2837.801698] close_ctree+0x1dd/0x2e1 [btrfs] [ 2837.801698] ? evict_inodes+0x132/0x141 [ 2837.801698] btrfs_put_super+0x15/0x17 [btrfs] [ 2837.801698] generic_shutdown_super+0x6a/0xeb [ 2837.801698] kill_anon_super+0x12/0x1c [ 2837.801698] btrfs_kill_super+0x16/0x21 [btrfs] [ 2837.801698] deactivate_locked_super+0x30/0x68 [ 2837.801698] deactivate_super+0x36/0x39 [ 2837.801698] cleanup_mnt+0x58/0x76 [ 2837.801698] __cleanup_mnt+0x12/0x14 [ 2837.801698] task_work_run+0x77/0x9b [ 2837.801698] prepare_exit_to_usermode+0x9d/0xc5 [ 2837.801698] syscall_return_slowpath+0x196/0x1b9 [ 2837.801698] entry_SYSCALL_64_fastpath+0xab/0xad [ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7 [ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 [ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7 [ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910 [ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015 [ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64 [ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0 [ 2837.818441] ---[ end trace e79345fe24b30b90 ]--- [ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full [ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0 What happens in the above example is the following: 1) When punching the hole, at btrfs_punch_hole(), the variable tail_len is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb). This results in the creation of an extent map with a length of 2Kb starting at file offset 148Kb, through find_first_non_hole() -> btrfs_get_extent(). 2) The second write (first write after the hole punch operation), sets the range [50Kb, 152Kb[ to delalloc. 3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent map covering the range [148Kb, 150Kb[ and ends up calling set_extent_bit() for the same range, which results in splitting an existing extent state record, covering the range [148Kb, 152Kb[ into two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. 4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook() callback being invoked against the two 2Kb extent state records that cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against the first 2Kb extent state, it calls btrfs_delalloc_release_metadata() with a length argument of 2048 bytes. That function rounds up the length to a sector size aligned length, so it ends up considering a length of 4096 bytes, and then calls calc_csum_metadata_size() which results in decrementing the inode's csum_bytes counter by 4096 bytes, so after it stays a value of 0 bytes. Then the same happens when btrfs_clear_bit_hook() is called against the second extent state that has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is rounded up to 4096 and calc_csum_metadata_size() ends up being called to decrement 4096 bytes from the inode's csum_bytes counter, which at that time has a value of 0, leading to an underflow, which is exactly what triggers the first warning, at btrfs_destroy_inode(). All the other warnings relate to several space accounting counters that underflow as well due to similar reasons. A similar case but where the hole punching operation creates an extent map with a start offset not aligned to the sector size is the following: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar $ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar $ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar $ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar $ umount /mnt During the unmount operation we get similar traces for the same reasons as in the first example. So fix the hole punching operation to make sure it never creates extent maps with a length that is not aligned to the sector size nor with a start offset that is not aligned to the sector size, as this breaks all assumptions and it's a land mine. Fixes: `d77815461f` ("btrfs: Avoid trucating page or punching hole in a already existed hole.") Cc: <stable@vger.kernel.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 16:52:45 +02:00
Jeff Mahoney	cddf3b2cb3	btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items On an uncontended system, we can end up hitting soft lockups while doing replace_path. At the core, and frequently called is btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched there. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-21 15:48:01 +02:00
Nikolay Borisov	7dfb8be11b	btrfs: Round down values which are written for total_bytes_size We got an internal report about a file system not wanting to mount following `99e3ecfcb9` ("Btrfs: add more validation checks for superblock"). BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with fs_devices total_rw_bytes 1000203820544 Subtracting the numbers we get a difference of less than a 4kb. Upon closer inspection it became apparent that mkfs actually rounds down the size of the device to a multiple of sector size. However, the same cannot be said for various functions which modify the total size and are called from btrfs_balance as well as when adding a new device. So this patch ensures that values being saved into on-disk data structures are always rounded down to a multiple of sectorsize. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-20 14:22:48 +02:00
Nikolay Borisov	eca152edf5	btrfs: Manually implement device_total_bytes getter/setter The device->total_bytes member needs to always be rounded down to sectorsize so that it corresponds to the value of super->total_bytes. However, there are multiple places where the setter is fed a value which is not rounded which can cause a fs to be unmountable due to the check introduced in `99e3ecfcb9` ("Btrfs: add more validation checks for superblock"). This patch implements the getter/setter manually so that in a later patch I can add necessary code to catch offenders. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-20 14:22:48 +02:00
David Sterba	0d0c71b317	btrfs: obsolete and remove mount option alloc_start The mount option alloc_start was used in the past for debugging and stressing the chunk allocator. Not meant to be used by users, so we're not breaking anybody's setup. There was some added complexity handling changes of the value and when it was not same as default. Such code has likely been untested and I think it's better to remove it. This patch kills all use of alloc_start, and by doing that also fixes a bug when alloc_size is set, potentially called from statfs: in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU protection is temporarily dropped so btrfs_account_dev_extents_size can be called and then RCU is locked again! Doing that inside list_for_each_entry_rcu is just asking for trouble, but unlikely to be observed in practice. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-20 14:22:48 +02:00
David Sterba	fac03c8dae	btrfs: move fs_info::fs_frozen to the flags We can keep the state among the other fs_info flags, there's no reason why fs_frozen would need to be separate. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-20 14:22:42 +02:00
David Sterba	79b4f4c605	btrfs: cleanup duplicate return value in insert_inline_extent The pattern when err is used for function exit and ret is used for return values of callees is not used here. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-20 14:22:12 +02:00
David Sterba	6165572c11	btrfs: use GFP_KERNEL in btrfs_init_dev_replace_tgtdev The function is called from ioctl context and we don't hold any locks that take part in writeback. Right now it's only fs_info::volume_mutex. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
David Sterba	6a44517d79	btrfs: use GFP_KERNEL in btrfs_calc_avail_data_space We don't hold any locks here. Inidirectly called from statfs. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
Nikolay Borisov	0eee8a494e	btrfs: Use btrfs_space_info_used instead of opencoding it Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
Anand Jain	4fc6441aac	btrfs: wait part of the write_dev_flush() can be separated out Submit and wait parts of write_dev_flush() can be split into two separate functions for better readability. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
Anand Jain	cea7c8bf77	btrfs: remove redundant null bdev counting during flush submission There is no extra benefit to count null bdev during the submit loop, as these null devices will be anyway checked during command completion device loop just after the submit loop. We are holding the device_list_mutex, the device->bdev status won't change in between. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
Anand Jain	12b9bf0b94	btrfs: write_dev_flush does not return ENOMEM anymore Since commit "btrfs: btrfs_io_bio_alloc never fails, skip error handling" write_dev_flush will not return ENOMEM in the sending part. We do not need to check for it in the callers. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ updated changelog ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
Timofey Titovets	170607ebd9	Btrfs: compression must free at least one sector size We already skip storing data where compression does not make the result at least one byte less. Let's make the logic better and check that compression frees at least one sector size of bytes, otherwise it's not that useful. Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> [ changelog updated ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
David Sterba	c5e4c3d750	btrfs: sink gfp parameter to btrfs_io_bio_alloc We can hardcode GFP_NOFS to btrfs_io_bio_alloc, although it means we change it back from GFP_KERNEL in scrub. I'd rather save a few stack bytes from not passing the gfp flags in the remaining, more imporatant, contexts and the bio allocating API now looks more consistent. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:04 +02:00
David Sterba	184f999e12	btrfs: add helper to initialize the non-bio part of btrfs_io_bio We use btrfs_bioset for bios and ask to allocate the entire size of btrfs_io_bio from btrfs bio_alloc_bioset. The member 'bio' is initialized but the bytes from 0 to offset of 'bio' are left uninitialized. Although we initialize some of the members in our helpers, we should initialize the whole structures. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:03 +02:00
David Sterba	fa1bcbe0a5	btrfs: document mandatory order of bio in btrfs_io_bio Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:03 +02:00
Liu Bo	ef7cdac101	Btrfs: skip checksum verification if IO error occurs Currently dio read also goes to verify checksum if -EIO has been returned, although it usually fails on checksum, it's not necessary at all, we could directly check if there is another copy to read. And with this, the behavior of dio read is now consistent with that of buffered read. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use bool for uptodate ] Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:03 +02:00
Liu Bo	e3d37faba2	Btrfs: tolerate errors if we have retried successfully With raid1 profile, dio read isn't tolerating IO errors if read length is less than the stripe length (64K). Our bio didn't get split in btrfs_submit_direct_hook() if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) is true and that happens when the read length is less than 64k. In this case, if the underlying device returns error somehow, bio->bi_error has recorded that error. If we could recover the correct data from another copy in profile raid1/10/5/6, with btrfs_subio_endio_read() returning 0, bio would have the correct data in its vector, but bio->bi_error is not updated accordingly so that the following dio_end_io(dio_bio, bio->bi_error) makes directIO think this read has failed. This fixes the problem by setting bio's error to 0 if a good copy has been found. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:03 +02:00
David Sterba	c821e7f3da	btrfs: pass bytes to btrfs_bio_alloc Most callers of btrfs_bio_alloc convert from bytes to sectors. Hide that in the helper and simplify the logic in the callsers. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:03 +02:00
David Sterba	9886b17433	btrfs: opencode trivial compressed_bio_alloc, simplify error handling compressed_bio_alloc is now a trivial wrapper around btrfs_bio_alloc, no point keeping it. The error handling can be simplified, as we know btrfs_bio_alloc will never fail. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:03 +02:00
David Sterba	9f2179a5e7	btrfs: remove redundant parameters from btrfs_bio_alloc All callers pass gfp_flags=GFP_NOFS and nr_vecs=BIO_MAX_PAGES. submit_extent_page adds __GFP_HIGH that does not make a difference in our case as it allows access to memory reserves but otherwise does not change the constraints. Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:03 +02:00
David Sterba	8b6c1d56f2	btrfs: sink gfp parameter to btrfs_bio_clone All callers pass GFP_NOFS. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:03 +02:00
David Sterba	e4f5690386	btrfs: btrfs_io_bio_alloc never fails, skip error handling Update direct callers of btrfs_io_bio_alloc that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:02 +02:00
David Sterba	3aa8e074ab	btrfs: btrfs_bio_clone never fails, skip error handling Update direct callers of btrfs_bio_clone that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:02 +02:00
David Sterba	0c4dd97c5e	btrfs: btrfs_bio_alloc never fails, skip error handling Update direct callers of btrfs_bio_alloc that do error handling, that we can now remove. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:02 +02:00
David Sterba	6e707bcd1f	btrfs: bioset allocations will never fail, adapt our helpers Christoph pointed out that bio allocations backed by a bioset will never fail. As we always use a bioset for all bio allocations, we can skip the error handling. This patch adjusts our low-level helpers, the cascaded changes to all callers will come next. CC: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>	2017-06-19 18:26:02 +02:00

1 2 3 4 5 ...

6238 Commits