linux

Author	SHA1	Message	Date
Gerhard Heift	ba346b357d	btrfs: tree_search, search_ioctl: direct copy to userspace By copying each found item seperatly to userspace, we do not need extra buffer in the kernel. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>	2014-06-12 18:22:05 -07:00
Gerhard Heift	550ac1d85e	btrfs: new function read_extent_buffer_to_user This new function reads the content of an extent directly to user memory. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>	2014-06-12 18:21:56 -07:00
Gerhard Heift	9b6e817d02	btrfs: tree_search, copy_to_sk: return needed size on EOVERFLOW If an item in tree_search is too large to be stored in the given buffer, return the needed size (including the header). Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>	2014-06-12 18:21:47 -07:00
Gerhard Heift	8f5f6178f3	btrfs: tree_search, copy_to_sk: return EOVERFLOW for too small buffer In copy_to_sk, if an item is too large for the given buffer, it now returns -EOVERFLOW instead of copying a search_header with len = 0. For backward compatibility for the first item it still copies such a header to the buffer, but not any other following items, which could have fitted. tree_search changes -EOVERFLOW back to 0 to behave similiar to the way it behaved before this patch. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>	2014-06-12 18:21:39 -07:00
Gerhard Heift	1254444288	btrfs: tree_search, search_ioctl: accept varying buffer rewrite search_ioctl to accept a buffer with varying size Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>	2014-06-12 18:21:26 -07:00
Gerhard Heift	25c9bc2e2b	btrfs: tree_search: eliminate redundant nr_items check If the amount of items reached the given limit of nr_items, we can leave copy_to_sk without updating the key. Also by returning 1 we leave the loop in search_ioctl without rechecking if we reached the given limit. Signed-off-by: Gerhard Heift <Gerhard@Heift.Name> Signed-off-by: Chris Mason <clm@fb.com> Acked-by: David Sterba <dsterba@suse.cz>	2014-06-12 18:20:39 -07:00
Linus Torvalds	16b9057804	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "This the bunch that sat in -next + lock_parent() fix. This is the minimal set; there's more pending stuff. In particular, I really hope to get acct.c fixes merged this cycle - we need that to deal sanely with delayed-mntput stuff. In the next pile, hopefully - that series is fairly short and localized (kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more iov_iter work. Most of prereqs for ->splice_write with sane locking order are there and Kent's dio rewrite would also fit nicely on top of this pile" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits) lock_parent: don't step on stale ->d_parent of all-but-freed one kill generic_file_splice_write() ceph: switch to iter_file_splice_write() shmem: switch to iter_file_splice_write() nfs: switch to iter_splice_write_file() fs/splice.c: remove unneeded exports ocfs2: switch to iter_file_splice_write() ->splice_write() via ->write_iter() bio_vec-backed iov_iter optimize copy_page_{to,from}_iter() bury generic_file_aio_{read,write} lustre: get rid of messing with iovecs ceph: switch to ->write_iter() ceph_sync_direct_write: stop poking into iov_iter guts ceph_sync_read: stop poking into iov_iter guts new helper: copy_page_from_iter() fuse: switch to ->write_iter() btrfs: switch to ->write_iter() ocfs2: switch to ->write_iter() xfs: switch to ->write_iter() ...	2014-06-12 10:30:18 -07:00
Al Viro	9c1d5284c7	Merge commit '9f12600fe425bc28f0ccba034a77783c09c15af4' into for-linus Backmerge of dcache.c changes from mainline. It's that, or complete rebase... Conflicts: fs/splice.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-06-12 00:28:09 -04:00
Linus Torvalds	859862ddd2	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "The biggest change here is Josef's rework of the btrfs quota accounting, which improves the in-memory tracking of delayed extent operations. I had been working on Btrfs stack usage for a while, mostly because it had become impossible to do long stress runs with slab, lockdep and pagealloc debugging turned on without blowing the stack. Even though you upgraded us to a nice king sized stack, I kept most of the patches. We also have some very hard to find corruption fixes, an awesome sysfs use after free, and the usual assortment of optimizations, cleanups and other fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (80 commits) Btrfs: convert smp_mb__{before,after}_clear_bit Btrfs: fix scrub_print_warning to handle skinny metadata extents Btrfs: make fsync work after cloning into a file Btrfs: use right type to get real comparison Btrfs: don't check nodes for extent items Btrfs: don't release invalid page in btrfs_page_exists_in_range() Btrfs: make sure we retry if page is a retriable exception Btrfs: make sure we retry if we couldn't get the page btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56 trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/ Btrfs: fix leaf corruption after __btrfs_drop_extents Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled btrfs: free delayed node outside of root->inode_lock btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX Btrfs: fix transaction leak during fsync call btrfs: Avoid trucating page or punching hole in a already existed hole. Btrfs: update commit root on snapshot creation after orphan cleanup Btrfs: ioctl, don't re-lock extent range when not necessary Btrfs: avoid visiting all extent items when cloning a range ...	2014-06-11 09:22:21 -07:00
Chris Mason	c7548af69d	Btrfs: convert smp_mb__{before,after}_clear_bit The new call is smp_mb__{before,after}_atomic. The __ gives us extra protection from the atomic rays. Signed-off-by: Chris Mason <clm@fb.com>	2014-06-10 13:10:47 -07:00
Liu Bo	6eda71d0c0	Btrfs: fix scrub_print_warning to handle skinny metadata extents The skinny extents are intepreted incorrectly in scrub_print_warning(), and end up hitting the BUG() in btrfs_extent_inline_ref_size. Reported-by: Konstantinos Skarlatos <k.skarlatos@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:17 -07:00
Filipe Manana	7ffbb598a0	Btrfs: make fsync work after cloning into a file When cloning into a file, we were correctly replacing the extent items in the target range and removing the extent maps. However we weren't replacing the extent maps with new ones that point to the new extents - as a consequence, an incremental fsync (when the inode doesn't have the full sync flag) was a NOOP, since it relies on the existence of extent maps in the modified list of the inode's extent map tree, which was empty. Therefore add new extent maps to reflect the target clone range. A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:16 -07:00
Liu Bo	cd857dd6bc	Btrfs: use right type to get real comparison We want to make sure the point is still within the extent item, not to verify the memory it's pointing to. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:15 -07:00
Josef Bacik	8a56457f5f	Btrfs: don't check nodes for extent items The backref code was looking at nodes as well as leaves when we tried to populate extent item entries. This is not good, and although we go away with it for the most part because we'd skip where disk_bytenr != random_memory, sometimes random_memory would match and suddenly boom. This fixes that problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:14 -07:00
Filipe Manana	6fdef6d43c	Btrfs: don't release invalid page in btrfs_page_exists_in_range() In inode.c:btrfs_page_exists_in_range(), if the page we got from the radix tree is an exception entry, which can't be retried, we exit the loop with a non-NULL page and then call page_cache_release against it, which is not ok since it's not a valid page. This could also make us return true when we shouldn't. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:14 -07:00
Filipe Manana	809f901625	Btrfs: make sure we retry if page is a retriable exception In inode.c:btrfs_page_exists_in_range(), if the page we get from the radix tree is an exception which should make us retry, set page to NULL in order to really retry, because otherwise we don't get another loop iteration executed (page != NULL makes the while loop exit). This also was making us call page_cache_release after exiting the loop, which isn't correct because page doesn't point to a valid page, and possibly return true from the function when we shouldn't. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:13 -07:00
Filipe Manana	91405151eb	Btrfs: make sure we retry if we couldn't get the page In inode.c:btrfs_page_exists_in_range(), if we can't get the page we need to retry. However we weren't retrying because we weren't setting page to NULL, which makes the while loop exit immediately and will make us call page_cache_release after exiting the loop which is incorrect because our page get didn't succeed. This could also make us return true when we shouldn't. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:12 -07:00
Gui Hecheng	c81d57679e	btrfs: replace EINVAL with EOPNOTSUPP for dev_replace raid56 To return EOPNOTSUPP is more user friendly than to return EINVAL, and then user-space tool will show that the dev_replace operation for raid56 is not currently supported rather than showing that there is an invalid argument. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:12 -07:00
Antonio Ospite	9391558411	trivial: fs/btrfs/ioctl.c: fix typo s/substract/subtract/ Signed-off-by: Antonio Ospite <ao2@ao2.it> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:11 -07:00
Liu Bo	0b43e04f70	Btrfs: fix leaf corruption after __btrfs_drop_extents Several reports about leaf corruption has been floating on the list, one of them points to __btrfs_drop_extents(), and we find that the leaf becomes corrupted after __btrfs_drop_extents(), it's really a rare case but it does exist. The problem turns out to be btrfs_next_leaf() called in __btrfs_drop_extents(). So in btrfs_next_leaf(), we release the current path to re-search the last key of the leaf for locating next leaf, and we've taken it into account that there might be balance operations between leafs during this 'unlock and re-lock' dance, so we check the path again and advance it if there are now more items available. But things are a bit different if that last key happens to be removed and balance gets a bigger key as the last one, and btrfs_search_slot will return it with ret > 0, IOW, nothing change in this leaf except the new last key, then we think we're okay because there is no more item balanced in, fine, we thinks we can go to the next leaf. However, we should return that bigger key, otherwise we deserve leaf corruption, for example, in endio, skipping that key means that __btrfs_drop_extents() thinks it has dropped all extent matched the required range and finish_ordered_io can safely insert a new extent, but it actually doesn't and ends up a leaf corruption. One may be asking that why our locking on extent io tree doesn't work as expected, ie. it should avoid this kind of race situation. But in __btrfs_drop_extents(), we don't always find extents which are included within our locking range, IOW, extents can start before our searching start, in this case locking on extent io tree doesn't protect us from the race. This takes the special case into account. Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:10 -07:00
Filipe Manana	337c6f6830	Btrfs: ensure btrfs_prev_leaf doesn't miss 1 item We might have had an item with the previous key in the tree right before we released our path. And after we released our path, that item might have been pushed to the first slot (0) of the leaf we were holding due to a tree balance. Alternatively, an item with the previous key can exist as the only element of a leaf (big fat item). Therefore account for these 2 cases, so that our callers (like btrfs_previous_item) don't miss an existing item with a key matching the previous key we computed above. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:09 -07:00
Filipe Manana	f82a9901b0	Btrfs: fix clone to deal with holes when NO_HOLES feature is enabled If the NO_HOLES feature is enabled holes don't have file extent items in the btree that represent them anymore. This made the clone operation ignore the gaps that exist between consecutive file extent items and therefore not create the holes at the destination. When not using the NO_HOLES feature, the holes were created at the destination. A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:09 -07:00
Jeff Mahoney	964930312a	btrfs: free delayed node outside of root->inode_lock On heavy workloads, we're seeing soft lockup warnings on root->inode_lock in __btrfs_release_delayed_node. The low hanging fruit is to reduce the size of the critical section. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:08 -07:00
Gui Hecheng	902c68a4da	btrfs: replace EINVAL with ERANGE for resize when ULLONG_MAX To be accurate about the error case, if the new size is beyond ULLONG_MAX, return ERANGE instead of EINVAL. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:07 -07:00
Filipe Manana	b05fd8742f	Btrfs: fix transaction leak during fsync call If btrfs_log_dentry_safe() returns an error, we set ret to 1 and fall through with the goal of committing the transaction. However, in the case where the inode doesn't need a full sync, we would call btrfs_wait_ordered_range() against the target range for our inode, and if it returned an error, we would return without commiting or ending the transaction. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:06 -07:00
Qu Wenruo	d77815461f	btrfs: Avoid trucating page or punching hole in a already existed hole. btrfs_punch_hole() will truncate unaligned pages or punch hole on a already existed hole. This will cause unneeded zero page or holes splitting the original huge hole. This patch will skip already existed holes before any page truncating or hole punching. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:06 -07:00
Filipe Manana	3821f34888	Btrfs: update commit root on snapshot creation after orphan cleanup On snapshot creation (either writable or read-only), we do orphan cleanup against the root of the snapshot. If the cleanup did remove any orphans, then the current root node will be different from the commit root node until the next transaction commit happens. A send operation always uses the commit root of a snapshot - this means it will see the orphans if it starts computing the send stream before the next transaction commit happens (triggered by a timer or sync() for .e.g), which is when the commit root gets assigned a reference to current root, where the orphans are not visible anymore. The consequence of send seeing the orphans is explained below. For example: mkfs.btrfs -f /dev/sdd mount -o commit=999 /dev/sdd /mnt # open a file with O_TMPFILE and leave it open # write some data to the file btrfs subvolume snapshot -r /mnt /mnt/snap1 btrfs send /mnt/snap1 -f /tmp/send.data The send operation will fail with the following error: ERROR: send ioctl failed with -116: Stale file handle What happens here is that our snapshot has an orphan inode still visible through the commit root, that corresponds to the tmpfile. However send will attempt to call inode.c:btrfs_iget(), with the goal of reading the file's data, which will return -ESTALE because it will use the current root (and not the commit root) of the snapshot. Of course, there are other cases where we can get orphans, but this example using a tmpfile makes it much easier to reproduce the issue. Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if the commit root is different from the current root, just commit the transaction associated with the snapshot's root (if it exists), so that a send will not see any orphans that don't exist anymore. This also guarantees a send will always see the same content regardless of whether a transaction commit happened already before the send was requested and after the orphan cleanup (meaning the commit root and current roots are the same) or it hasn't happened yet (commit and current roots are different). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:05 -07:00
Filipe Manana	ff5df9b884	Btrfs: ioctl, don't re-lock extent range when not necessary In ioctl.c:lock_extent_range(), after locking our target range, the ordered extent that btrfs_lookup_first_ordered_extent() returns us may not overlap our target range at all. In this case we would just unlock our target range, wait for any new ordered extents that overlap the range to complete, lock again the range and repeat all these steps until we don't get any ordered extent and the delalloc flag isn't set in the io tree for our target range. Therefore just stop if we get an ordered extent that doesn't overlap our target range and the dealalloc flag isn't set for the range in the inode's io tree. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:04 -07:00
Filipe Manana	2c463823cb	Btrfs: avoid visiting all extent items when cloning a range When cloning a range of a file, we were visiting all the extent items in the btree that belong to our source inode. We don't need to visit those extent items that don't overlap the range we are cloning, as doing so only makes us waste time and do unnecessary btree navigations (btrfs_next_leaf) for inodes that have a large number of file extent items in the btree. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:04 -07:00
Filipe Manana	c55bfa67e9	Btrfs: set dead flag on the right root when destroying snapshot We were setting the BTRFS_ROOT_SUBVOL_DEAD flag on the root of the parent of our target snapshot, instead of setting it in the target snapshot's root. This is easy to observe by running the following scenario: mkfs.btrfs -f /dev/sdd mount /dev/sdd /mnt btrfs subvolume create /mnt/first_subvol btrfs subvolume snapshot -r /mnt /mnt/mysnap1 btrfs subvolume delete /mnt/first_subvol btrfs subvolume snapshot -r /mnt /mnt/mysnap2 btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data The send command failed because the send ioctl returned -EPERM. A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:03 -07:00
Filipe Manana	c125b8bff1	Btrfs: ensure readers see new data after a clone operation We were cleaning the clone target file range from the page cache before we did replace the file extent items in the fs tree. This was racy, as right after cleaning the relevant range from the page cache and before replacing the file extent items, a read against that range could be performed by another task and populate again the page cache with stale data (stale after the cloning finishes). This would result in reads after the clone operation successfully finishes to get old data (and potentially for a very long time). Therefore evict the pages after replacing the file extent items, so that subsequent reads will always get the new data. Similarly, we were prone to races while cloning the file extent items because we weren't locking the target range and wait for any existing ordered extents against that range to complete. It was possible that after cloning the extent items, a write operation that was performed before the clone operation and overlaps the same range, would end up undoing all or part of the work the clone operation did (a worker task running inode.c:btrfs_finish_ordered_io). Therefore lock the target range in the io tree, wait for all pending ordered extents against that range to finish and then safely perform the cloning. The issue of reading stale data after the clone operation is easy to reproduce by running the following C program in a loop until it exits with return value 1. #include <unistd.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <errno.h> #include <pthread.h> #include <fcntl.h> #include <assert.h> #include <asm/types.h> #include <linux/ioctl.h> #include <sys/stat.h> #include <sys/types.h> #include <sys/ioctl.h> #define SRC_FILE "/mnt/sdd/foo" #define DST_FILE "/mnt/sdd/bar" #define FILE_SIZE (16 * 1024) #define PATTERN_SRC 'X' #define PATTERN_DST 'Y' struct btrfs_ioctl_clone_range_args { __s64 src_fd; __u64 src_offset, src_length; __u64 dest_offset; }; #define BTRFS_IOCTL_MAGIC 0x94 #define BTRFS_IOC_CLONE_RANGE _IOW(BTRFS_IOCTL_MAGIC, 13, \ struct btrfs_ioctl_clone_range_args) static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; static int clone_done = 0; static int reader_ready = 0; static int stale_data = 0; static void reader_loop(void arg) { char buf[4096], want_buf[4096]; memset(want_buf, PATTERN_SRC, 4096); pthread_mutex_lock(&mutex); reader_ready = 1; pthread_mutex_unlock(&mutex); while (1) { int done, fd, ret; fd = open(DST_FILE, O_RDONLY); assert(fd != -1); pthread_mutex_lock(&mutex); done = clone_done; pthread_mutex_unlock(&mutex); ret = read(fd, buf, 4096); assert(ret == 4096); close(fd); if (done) { ret = memcmp(buf, want_buf, 4096); if (ret == 0) { printf("Found new content\n"); } else { printf("Found old content\n"); pthread_mutex_lock(&mutex); stale_data = 1; pthread_mutex_unlock(&mutex); } break; } } return NULL; } int main(int argc, char *argv[]) { pthread_t reader; int ret, i, fd; struct btrfs_ioctl_clone_range_args clone_args; int fd1, fd2; ret = remove(SRC_FILE); if (ret == -1 && errno != ENOENT) { fprintf(stderr, "Error deleting src file: %s\n", strerror(errno)); return 1; } ret = remove(DST_FILE); if (ret == -1 && errno != ENOENT) { fprintf(stderr, "Error deleting dst file: %s\n", strerror(errno)); return 1; } fd = open(SRC_FILE, O_CREAT \| O_WRONLY \| O_TRUNC, S_IRWXU); assert(fd != -1); for (i = 0; i < FILE_SIZE; i++) { char c = PATTERN_SRC; ret = write(fd, &c, 1); assert(ret == 1); } close(fd); fd = open(DST_FILE, O_CREAT \| O_WRONLY \| O_TRUNC, S_IRWXU); assert(fd != -1); for (i = 0; i < FILE_SIZE; i++) { char c = PATTERN_DST; ret = write(fd, &c, 1); assert(ret == 1); } close(fd); sync(); ret = pthread_create(&reader, NULL, reader_loop, NULL); assert(ret == 0); while (1) { int r; pthread_mutex_lock(&mutex); r = reader_ready; pthread_mutex_unlock(&mutex); if (r) break; } fd1 = open(SRC_FILE, O_RDONLY); if (fd1 < 0) { fprintf(stderr, "Error open src file: %s\n", strerror(errno)); return 1; } fd2 = open(DST_FILE, O_RDWR); if (fd2 < 0) { fprintf(stderr, "Error open dst file: %s\n", strerror(errno)); return 1; } clone_args.src_fd = fd1; clone_args.src_offset = 0; clone_args.src_length = 4096; clone_args.dest_offset = 0; ret = ioctl(fd2, BTRFS_IOC_CLONE_RANGE, &clone_args); assert(ret == 0); close(fd1); close(fd2); pthread_mutex_lock(&mutex); clone_done = 1; pthread_mutex_unlock(&mutex); ret = pthread_join(reader, NULL); assert(ret == 0); pthread_mutex_lock(&mutex); ret = stale_data ? 1 : 0; pthread_mutex_unlock(&mutex); return ret; } Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:02 -07:00
Rickard Strandqvist	8321cf2596	fs: btrfs: volumes.c: Fix for possible null pointer dereference There is otherwise a risk of a possible null pointer dereference. Was largely found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:01 -07:00
Jeff Mahoney	c1895442be	btrfs: allocate raid type kobjects dynamically We are currently allocating space_info objects in an array when we allocate space_info. When a user does something like: # btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt # btrfs balance start -mconvert=single -dconvert=single /mnt -f # btrfs balance start -mconvert=raid1 -dconvert=raid1 / We can end up with memory corruption since the kobject hasn't been reinitialized properly and the name pointer was left set. The rationale behind allocating them statically was to avoid creating a separate kobject container that just contained the raid type. It used the index in the array to determine the index. Ultimately, though, this wastes more memory than it saves in all but the most complex scenarios and introduces kobject lifetime questions. This patch allocates the kobjects dynamically instead. Note that we also remove the kobject_get/put of the parent kobject since kobject_add and kobject_del do that internally. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:01 -07:00
Filipe Manana	7e3ae33efa	Btrfs: send, use the right limits for xattr names and values We were limiting the sum of the xattr name and value lengths to PATH_MAX, which is not correct, specially on filesystems created with btrfs-progs v3.12 or higher, where the default leaf size is max(16384, PAGE_SIZE), or systems with page sizes larger than 4096 bytes. Xattrs have their own specific maximum name and value lengths, which depend on the leaf size, therefore use these limits to be able to send xattrs with sizes larger than PATH_MAX. A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:21:00 -07:00
Filipe Manana	1af56070e3	Btrfs: send, don't error in the presence of subvols/snapshots If we are doing an incremental send and the base snapshot has a directory with name X that doesn't exist anymore in the second snapshot and a new subvolume/snapshot exists in the second snapshot that has the same name as the directory (name X), the incremental send would fail with -ENOENT error. This is because it attempts to lookup for an inode with a number matching the objectid of a root, which doesn't exist. Steps to reproduce: mkfs.btrfs -f /dev/sdd mount /dev/sdd /mnt mkdir /mnt/testdir btrfs subvolume snapshot -r /mnt /mnt/mysnap1 rmdir /mnt/testdir btrfs subvolume create /mnt/testdir btrfs subvolume snapshot -r /mnt /mnt/mysnap2 btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data A test case for xfstests follows. Reported-by: Robert White <rwhite@pobox.com> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:59 -07:00
Chris Mason	a79b7d4b3e	Btrfs: async delayed refs Delayed extent operations are triggered during transaction commits. The goal is to queue up a healthly batch of changes to the extent allocation tree and run through them in bulk. This farms them off to async helper threads. The goal is to have the bulk of the delayed operations being done in the background, but this is also important to limit our stack footprint. Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:58 -07:00
Chris Mason	40f765805f	Btrfs: split up __extent_writepage to lower stack usage __extent_writepage has two unrelated parts. First it does the delayed allocation dance and second it does the mapping and IO for the page we're actually writing. This splits it up into those two parts so the stack from one doesn't impact the stack from the other. Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:58 -07:00
Alex Gartrell	fc4adbff82	btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking In these instances, we are trying to determine if a page has been accessed since we began the operation for the sake of retry. This is easily accomplished by doing a gang lookup in the page mapping radix tree, and it saves us the dependency on the flag (so that we might eventually delete it). btrfs_page_exists_in_range borrows heavily from find_get_page, replacing the radix tree look up with a gang lookup of 1, so that we can find the next highest page >= index and see if it falls into our lock range. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Alex Gartrell <agartrell@fb.com>	2014-06-09 17:20:57 -07:00
Chris Mason	0e378df15c	Btrfs: cut down stack usage in btree_write_cache_pages This adds noinline_for_stack to two helpers used by btree_write_cache_pages. It shaves us down from 424 bytes on the stack to 280. Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:56 -07:00
Chris Mason	d4452bc526	Btrfs: break up __btrfs_write_out_cache to cut down stack usage __btrfs_write_out_cache was one of our stack pigs. This breaks it up into helper functions and slims it down to 194 bytes. Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:55 -07:00
Josef Bacik	2a10840945	Btrfs: free tmp ulist for qgroup rescan Memory leaks are bad mmkay? Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:55 -07:00
Anand Jain	402a0f4759	btrfs: usage error should not be logged into system log I have an opinion that system logs /var/log/messages are valuable info to investigate the real system issues at the data center. People handling data center issues do spend a lot time and efforts analyzing messages files. Having usage error logged into /var/log/messages is something we should avoid. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:54 -07:00
David Sterba	67a77eb147	btrfs: remove newline from inode cache kthread name Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:53 -07:00
David Sterba	351fd35321	btrfs: remove stale newlines from log messages I've noticed an extra line after "use no compression", but search revealed much more in messages of more critical levels and rare errors. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:53 -07:00
Chris Mason	7d78874273	Btrfs: fix double free in find_lock_delalloc_range We need to NULL the cached_state after freeing it, otherwise we might free it again if find_delalloc_range doesn't find anything. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org	2014-06-09 17:20:52 -07:00
ZhangZhen	58dfae6365	btrfs: replace simple_strtoull() with kstrtoull() use the newer and more pleasant kstrtoull() to replace simple_strtoull(), because simple_strtoull() is marked for obsoletion. Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:51 -07:00
Wang Shilong	298658414a	Btrfs: set right total device count for seeding support Seeding device support allows us to create a new filesystem based on existed filesystem. However newly created filesystem's @total_devices should include seed devices. This patch fix the following problem: # mkfs.btrfs -f /dev/sdb # btrfstune -S 1 /dev/sdb # mount /dev/sdb /mnt # btrfs device add -f /dev/sdc /mnt --->fs_devices->total_devices = 1 # umount /mnt # mount /dev/sdc /mnt --->fs_devices->total_devices = 2 This is because we record right @total_devices in superblock, but @fs_devices->total_devices is reset to be 0 in btrfs_prepare_sprout(). Fix this problem by not resetting @fs_devices->total_devices. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:50 -07:00
Guangliang Zhao	45ff35d6b9	Btrfs: remove OPT_acl parse when acl disabled Even CONFIG_BTRFS_FS_POSIX_ACL is not defined, the acl still could been enabled using a mount option, and now fs/btrfs/acl.o is not built, so the mount options will appear to be supported but will be silently ignored. Signed-off-by: Guangliang Zhao <lucienchao@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:50 -07:00
Josef Bacik	faa2dbf004	Btrfs: add sanity tests for new qgroup accounting code This exercises the various parts of the new qgroup accounting code. We do some basic stuff and do some things with the shared refs to make sure all that code works. I had to add a bunch of infrastructure because I needed to be able to insert items into a fake tree without having to do all the hard work myself, hopefully this will be usefull in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:49 -07:00
Josef Bacik	fcebe4562d	Btrfs: rework qgroup accounting Currently qgroups account for space by intercepting delayed ref updates to fs trees. It does this by adding sequence numbers to delayed ref updates so that it can figure out how the tree looked before the update so we can adjust the counters properly. The problem with this is that it does not allow delayed refs to be merged, so if you say are defragging an extent with 5k snapshots pointing to it we will thrash the delayed ref lock because we need to go back and manually merge these things together. Instead we want to process quota changes when we know they are going to happen, like when we first allocate an extent, we free a reference for an extent, we add new references etc. This patch accomplishes this by only adding qgroup operations for real ref changes. We only modify the sequence number when we need to lookup roots for bytenrs, this reduces the amount of churn on the sequence number and allows us to merge delayed refs as we add them most of the time. This patch encompasses a bunch of architectural changes 1) qgroup ref operations: instead of tracking qgroup operations through the delayed refs we simply add new ref operations whenever we notice that we need to when we've modified the refs themselves. 2) tree mod seq: we no longer have this separation of major/minor counters. this makes the sequence number stuff much more sane and we can remove some locking that was needed to protect the counter. 3) delayed ref seq: we now read the tree mod seq number and use that as our sequence. This means each new delayed ref doesn't have it's own unique sequence number, rather whenever we go to lookup backrefs we inc the sequence number so we can make sure to keep any new operations from screwing up our world view at that given point. This allows us to merge delayed refs during runtime. With all of these changes the delayed ref stuff is a little saner and the qgroup accounting stuff no longer goes negative in some cases like it was before. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:48 -07:00
Liu Bo	5dca6eea91	Btrfs: mark mapping with error flag to report errors to userspace According to commit `865ffef379` (fs: fix fsync() error reporting), it's not stable to just check error pages because pages can be truncated or invalidated, we should also mark mapping with error flag so that a later fsync can catch the error. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:47 -07:00
Liu Bo	29cc83f69c	Btrfs: fix NULL pointer crash of deleting a seed device Same as normal devices, seed devices should be initialized with fs_info->dev_root as well, otherwise we'll get a NULL pointer crash. Cc: Chris Murphy <lists@colorremedies.com> Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:47 -07:00
Wang Shilong	f017f15f7c	Btrfs: fix joining same transaction handle more than twice We hit something like the following function call flows: \|->run_delalloc_range() \|->btrfs_join_transaction() \|->cow_file_range() \|->btrfs_join_transaction() \|->find_free_extent() \|->btrfs_join_transaction() Trace infomation can be seen as: [ 7411.127040] ------------[ cut here ]------------ [ 7411.127060] WARNING: CPU: 0 PID: 11557 at fs/btrfs/transaction.c:383 start_transaction+0x561/0x580 [btrfs]() [ 7411.127079] CPU: 0 PID: 11557 Comm: kworker/u8:9 Tainted: G O 3.13.0+ #4 [ 7411.127080] Hardware name: LENOVO QiTianM4350/ , BIOS F1KT52AUS 05/24/2013 [ 7411.127085] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-5) [ 7411.127092] Call Trace: [ 7411.127097] [<ffffffff815b87b0>] dump_stack+0x45/0x56 [ 7411.127101] [<ffffffff81051ffd>] warn_slowpath_common+0x7d/0xa0 [ 7411.127102] [<ffffffff810520da>] warn_slowpath_null+0x1a/0x20 [ 7411.127109] [<ffffffffa0444fb1>] start_transaction+0x561/0x580 [btrfs] [ 7411.127115] [<ffffffffa0445027>] btrfs_join_transaction+0x17/0x20 [btrfs] [ 7411.127120] [<ffffffffa0431c91>] find_free_extent+0xa21/0xb50 [btrfs] [ 7411.127126] [<ffffffffa0431f68>] btrfs_reserve_extent+0xa8/0x1a0 [btrfs] [ 7411.127131] [<ffffffffa04322ce>] btrfs_alloc_free_block+0xee/0x440 [btrfs] [ 7411.127137] [<ffffffffa043bd6e>] ? btree_set_page_dirty+0xe/0x10 [btrfs] [ 7411.127142] [<ffffffffa041da51>] __btrfs_cow_block+0x121/0x530 [btrfs] [ 7411.127146] [<ffffffffa041dfff>] btrfs_cow_block+0x11f/0x1c0 [btrfs] [ 7411.127151] [<ffffffffa0421b74>] btrfs_search_slot+0x1d4/0x9c0 [btrfs] [ 7411.127157] [<ffffffffa0438567>] btrfs_lookup_file_extent+0x37/0x40 [btrfs] [ 7411.127163] [<ffffffffa0456bfc>] __btrfs_drop_extents+0x16c/0xd90 [btrfs] [ 7411.127169] [<ffffffffa0444ae3>] ? start_transaction+0x93/0x580 [btrfs] [ 7411.127171] [<ffffffff811663e2>] ? kmem_cache_alloc+0x132/0x140 [ 7411.127176] [<ffffffffa041cd9a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [ 7411.127182] [<ffffffffa044aa61>] cow_file_range_inline+0x181/0x2e0 [btrfs] [ 7411.127187] [<ffffffffa044aead>] cow_file_range+0x2ed/0x440 [btrfs] [ 7411.127194] [<ffffffffa0464d7f>] ? free_extent_buffer+0x4f/0xb0 [btrfs] [ 7411.127200] [<ffffffffa044b38f>] run_delalloc_nocow+0x38f/0xa60 [btrfs] [ 7411.127207] [<ffffffffa0461600>] ? test_range_bit+0x30/0x180 [btrfs] [ 7411.127212] [<ffffffffa044bd48>] run_delalloc_range+0x2e8/0x350 [btrfs] [ 7411.127219] [<ffffffffa04618f9>] ? find_lock_delalloc_range+0x1a9/0x1e0 [btrfs] [ 7411.127222] [<ffffffff812a1e71>] ? blk_queue_bio+0x2c1/0x330 [ 7411.127228] [<ffffffffa0462ad4>] __extent_writepage+0x2f4/0x760 [btrfs] Here we fix it by avoiding joining transaction again if we have held a transaction handle when allocating chunk in find_free_extent(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:46 -07:00
Miao Xie	995946dd29	Btrfs: use helpers for last_trans_log_full_commit instead of opencode Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:45 -07:00
Filipe Manana	1f21ef0a34	Btrfs: check if items are ordered when a leaf is marked dirty To ease finding bugs during development related to modifying btree leaves in such a way that it makes its items not sorted by key anymore. Since this is an expensive check, it's only enabled if CONFIG_BTRFS_FS_CHECK_INTEGRITY is set, which isn't meant to be enabled for regular users. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:45 -07:00
Filipe Manana	35045bf2fd	Btrfs: don't access non-existent key when csum tree is empty When the csum tree is empty, our leaf (path->nodes[0]) has a number of items equal to 0 and since btrfs_header_nritems() returns an unsigned integer (and so is our local nritems variable) the following comparison always evaluates to false: if (path->slots[0] >= nritems - 1) { As the casting rules lead to: if ((u32)0 >= (u32)4294967295) { This makes us access key at slot paths->slots[0] + 1 (1) of the empty leaf some lines below: btrfs_item_key_to_cpu(path->nodes[0], &found_key, slot); if (found_key.objectid != BTRFS_EXTENT_CSUM_OBJECTID \|\| found_key.type != BTRFS_EXTENT_CSUM_KEY) { found_next = 1; goto insert; } So just don't access such non-existent slot and don't set found_next to 1 when the tree is empty. It's very unlikely we'll get a random key with the objectid and type values above, which is where we could go into trouble. If nritems is 0, just set found_next to 1 anyway as it will make us insert a csum item covering our whole extent (or the whole leaf) when the tree is empty. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:44 -07:00
Wang Shilong	de348ee022	Btrfs: make sure there are not any read requests before stopping workers In close_ctree(), after we have stopped all workers,there maybe still some read requests(for example readahead) to submit and this maybe trigger an oops that user reported before: kernel BUG at fs/btrfs/async-thread.c:619! By hacking codes, i can reproduce this problem with one cpu available. We fix this potential problem by invalidating all btree inode pages before stopping all workers. Thanks to Miao for pointing out this problem. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:43 -07:00
Tsutomu Itoh	59885b3930	Btrfs: fix possible memory leak in btrfs_create_tree() In btrfs_create_tree(), if btrfs_insert_root() fails, we should free root->commit_root. Reported-by: Alex Lyakas <alex@zadarastorage.com> Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:42 -07:00
ZhangZhen	776e4aae55	btrfs: remove useless ACL check posix_acl_xattr_set() already does the check, and it's the only way to feed in an ACL from userspace. So the check here is useless, remove it. Signed-off-by: zhang zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:42 -07:00
Anand Jain	4d90d28b1c	btrfs: btrfs_rm_device() should zero mirror SB as well This fix will ensure all SB copies on the disk is zeroed when the disk is intentionally removed. This helps to better manage disks in the user land. This version of patch also merges the Zach patch as below. btrfs: don't double brelse on device rm Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:41 -07:00
Miao Xie	27cdeb7096	Btrfs: use bitfield instead of integer data type for the some variants in btrfs_root Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:40 -07:00
Filipe Manana	f959492fc1	Btrfs: send, fix more issues related to directory renames This is a continuation of the previous changes titled: Btrfs: fix incremental send's decision to delay a dir move/rename Btrfs: part 2, fix incremental send's decision to delay a dir move/rename There's a few more cases where a directory rename/move must be delayed which was previously overlooked. If our immediate ancestor has a lower inode number than ours and it doesn't have a delayed rename/move operation associated to it, it doesn't mean there isn't any non-direct ancestor of our current inode that needs to be renamed/moved before our current inode (i.e. with a higher inode number than ours). So we can't stop the search if our immediate ancestor has a lower inode number than ours, we need to navigate the directory hierarchy upwards until we hit the root or: 1) find an ancestor with an higher inode number that was renamed/moved in the send root too (or already has a pending rename/move registered); 2) find an ancestor that is a new directory (higher inode number than ours and exists only in the send root). Reproducer for case 1) $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b $ mkdir -p /mnt/a/c/d $ mkdir /mnt/a/b/e $ mkdir /mnt/a/c/d/f $ mv /mnt/a/b /mnt/a/c/d/2b $ mkdir /mnt/a/x $ mkdir /mnt/a/y $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/a/x /mnt/a/y $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e $ mv /mnt/a/c/d /mnt/a/h/2d $ mv /mnt/a/c /mnt/a/h/2d/2b/2c $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send Simple reproducer for case 2) $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b $ mkdir /mnt/a/c $ mv /mnt/a/b /mnt/a/c/b2 $ mkdir /mnt/a/e $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/a/c/b2 /mnt/a/e/b3 $ mkdir /mnt/a/e/b3/f $ mkdir /mnt/a/h $ mv /mnt/a/c /mnt/a/e/b3/f/c2 $ mv /mnt/a/e /mnt/a/h/e2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send Another simple reproducer for case 2) $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b $ mkdir /mnt/a/c $ mkdir /mnt/a/b/d $ mkdir /mnt/a/c/e $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mkdir /mnt/a/b/d/f $ mkdir /mnt/a/b/g $ mv /mnt/a/c/e /mnt/a/b/g/e2 $ mv /mnt/a/c /mnt/a/b/d/f/c2 $ mv /mnt/a/b/d/f /mnt/a/b/g/e2/f2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send More complex reproducer for case 2) $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b $ mkdir -p /mnt/a/c/d $ mkdir /mnt/a/b/e $ mkdir /mnt/a/c/d/f $ mv /mnt/a/b /mnt/a/c/d/2b $ mkdir /mnt/a/x $ mkdir /mnt/a/y $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/a/x /mnt/a/y $ mv /mnt/a/c/d/2b/e /mnt/a/c/d/2b/2e $ mv /mnt/a/c/d /mnt/a/h/2d $ mv /mnt/a/c /mnt/a/h/2d/2b/2c $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send For both cases the incremental send would enter an infinite loop when building path strings. While solving these cases, this change also re-implements the code to detect when directory moves/renames should be delayed. Instead of dealing with several specific cases separately, it's now more generic handling all cases with a simple detection algorithm and if when applying a delayed move/rename there's a path loop detected, it further delays the move/rename registering a new ancestor inode as the dependency inode (so our rename happens after that ancestor is renamed). Tests for these cases is being added to xfstests too. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:40 -07:00
Filipe Manana	a10c40766c	Btrfs: send, remove dead code from __get_cur_name_and_parent Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:39 -07:00
Filipe Manana	c992ec94f2	Btrfs: send, account for orphan directories when building path strings If we have directories with a pending move/rename operation, we must take into account any orphan directories that got created before executing the pending move/rename. Those orphan directories are directories with an inode number higher then the current send progress and that don't exist in the parent snapshot, they are created before current progress reaches their inode number, with a generated name of the form oN-M-I and at the root of the filesystem tree, and later when progress matches their inode number, moved/renamed to their final location. Reproducer: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b/c/d $ mkdir /mnt/a/b/e $ mv /mnt/a/b/c /mnt/a/b/e/CC $ mkdir /mnt/a/b/e/CC/d/f $ mkdir /mnt/a/g $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mkdir /mnt/a/g/h $ mv /mnt/a/b/e /mnt/a/g/h/EE $ mv /mnt/a/g/h/EE/CC/d /mnt/a/g/h/EE/DD $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send The second receive command failed with the following error: ERROR: rename a/b/e/CC/d -> o264-7-0/EE/DD failed. No such file or directory A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:38 -07:00
Filipe Manana	b46ab97bcd	Btrfs: send, avoid unnecessary inode item lookup in the btree Regardless of whether the caller is interested or not in knowing the inode's generation (dir_gen != NULL), get_first_ref always does a btree lookup to get the inode item. Avoid this useless lookup if dir_gen parameter is NULL (which is in some cases). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:37 -07:00
Gui Hecheng	23f8f9b7ca	btrfs: add dev maxs limit for __btrfs_alloc_chunk in kernel space For RAID0,5,6,10, For system chunk, there shouldn't be too many stripes to make a btrfs_chunk that exceeds BTRFS_SYSTEM_CHUNK_ARRAY_SIZE For data/meta chunk, there shouldn't be too many stripes to make a btrfs_chunk that exceeds a leaf. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:36 -07:00
Gui Hecheng	5f43f86e3f	btrfs: fix wrong max system array size check in kernel space For system chunk array, We copy a "disk_key" and an chunk item each time, so there should be enough space to hold both of them, not only the chunk item. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:36 -07:00
Qu Wenruo	65d33fd7a6	btrfs: Add check to avoid cleanup roots already in fs_info->dead_roots. Current btrfs_orphan_cleanup will also cleanup roots which is already in fs_info->dead_roots without protection. This will have conditional race with fs_info->cleaner_kthread. This patch will use refs in root->root_item to detect roots in dead_roots and avoid conflicts. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:35 -07:00
Miao Xie	21c7e75654	Btrfs: reclaim the reserved metadata space at background Before applying this patch, the task had to reclaim the metadata space by itself if the metadata space was not enough. And When the task started the space reclamation, all the other tasks which wanted to reserve the metadata space were blocked. At some cases, they would be blocked for a long time, it made the performance fluctuate wildly. So we introduce the background metadata space reclamation, when the space is about to be exhausted, we insert a reclaim work into the workqueue, the worker of the workqueue helps us to reclaim the reserved space at the background. By this way, the tasks needn't reclaim the space by themselves at most cases, and even if the tasks have to reclaim the space or are blocked for the space reclamation, they will get enough space more quickly. Here is my test result(Tested by compilebench): Memory: 2GB CPU: 2Cores * 1CPU Partition: 40GB(SSD) Test command: # compilebench -D <mnt> -m Without this patch: intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s) compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s) read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s) delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s) With this patch: intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s) compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s) read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s) delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s) Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:34 -07:00
Miao Xie	32d6b47fe6	Btrfs: output warning instead of error when loading free space cache failed If we fail to load a free space cache, we can rebuild it from the extent tree, so it is not a serious error, we should not output a error message that would make the users uncomfortable. This patch uses warning message instead of it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:33 -07:00
Qu Wenruo	5a1972bd9f	btrfs: Add ctime/mtime update for btrfs device add/remove. Btrfs will send uevent to udev inform the device change, but ctime/mtime for the block device inode is not udpated, which cause libblkid used by btrfs-progs unable to detect device change and use old cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an error message. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Cc: Karel Zak <kzak@redhat.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:33 -07:00
David Sterba	61155aa04e	btrfs: assert that send is not in progres before root deletion CC: Miao Xie <miaox@cn.fujitsu.com> CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:32 -07:00
David Sterba	521e0546c9	btrfs: protect snapshots from deleting during send The patch "Btrfs: fix protection between send and root deletion" (`18f687d538`) does not actually prevent to delete the snapshot and just takes care during background cleaning, but this seems rather user unfriendly, this patch implements the idea presented in http://www.spinics.net/lists/linux-btrfs/msg30813.html - add an internal root_item flag to denote a dead root - check if the send_in_progress is set and refuse to delete, otherwise set the flag and proceed - check the flag in send similar to the btrfs_root_readonly checks, for all involved roots The root lookup in send via btrfs_read_fs_root_no_name will check if the root is really dead or not. If it is, ENOENT, aborted send. If it's alive, it's protected by send_in_progress, send can continue. CC: Miao Xie <miaox@cn.fujitsu.com> CC: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:31 -07:00
Daeseok Youn	944a4515b2	btrfs: remove redundant null check in btrfs_dentry_release() It doesn't need to check NULL for kfree() Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:31 -07:00
Filipe Manana	ef3b9af50b	Btrfs: implement inode_operations callback tmpfile This implements the tmpfile callback of struct inode_operations, introduced in the linux kernel 3.11, and implemented already by some filesystems. This callback is invoked by the VFS when the flag O_TMPFILE is passed to the open system call. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: David Sterba <dsterba@suse.cz>	2014-06-09 17:20:30 -07:00
David Sterba	e4ef90ff61	btrfs: make FS_INFO ioctl available to anyone This ioctl provides basic info about the filesystem that can be obtained in other ways (eg. sysfs), there's no reason to restrict it to CAP_SYSADMIN. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:29 -07:00
David Sterba	7d6213c5a7	btrfs: make DEV_INFO ioctl available to anyone This ioctl provides basic info about the devices that can be obtained in other ways (eg. sysfs), there's no reason to restrict it to CAP_SYSADMIN. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:28 -07:00
David Sterba	df93589a17	btrfs: export more from FS_INFO to sysfs Similar to the FS_INFO updates, export the basic filesystem info through sysfs: node size, sector size and clone alignment. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:28 -07:00
David Sterba	80a773fbfc	btrfs: retrieve more info from FS_INFO ioctl Provide the basic information about filesystem through the ioctl: * b-tree node size (same as leaf size) * sector size * expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments Backward compatibility: if the values are 0, kernel does not provide this information, the applications should ignore them. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:27 -07:00
David Sterba	7d824b6f9c	btrfs: balance filter: add limit of processed chunks This started as debugging helper, to watch the effects of converting between raid levels on multiple devices, but could be useful standalone. In my case the usage filter was not finegrained enough and led to converting too many chunks at once. Another example use is in connection with drange+devid or vrange filters that allow to work with a specific chunk or even with a chunk on a given device. The limit filter applies last, the value of 0 means no limiting. CC: Ilya Dryomov <idryomov@gmail.com> CC: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:26 -07:00
Filipe Manana	fc19c5e736	Btrfs: fix leaf corruption caused by ENOSPC while hole punching While running a stress test with multiple threads writing to the same btrfs file system, I ended up with a situation where a leaf was corrupted in that it had 2 file extent item keys that had the same exact key. I was able to detect this quickly thanks to the following patch which triggers an assertion as soon as a leaf is marked dirty if there are duplicated keys or out of order keys: Btrfs: check if items are ordered when a leaf is marked dirty (https://patchwork.kernel.org/patch/3955431/) Basically while running the test, I got the following in dmesg: [28877.415877] WARNING: CPU: 2 PID: 10706 at fs/btrfs/file.c:553 btrfs_drop_extent_cache+0x435/0x440 [btrfs]() (...) [28877.415917] Call Trace: [28877.415922] [<ffffffff816f1189>] dump_stack+0x4e/0x68 [28877.415926] [<ffffffff8104a32c>] warn_slowpath_common+0x8c/0xc0 [28877.415929] [<ffffffff8104a37a>] warn_slowpath_null+0x1a/0x20 [28877.415944] [<ffffffffa03775a5>] btrfs_drop_extent_cache+0x435/0x440 [btrfs] [28877.415949] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0 [28877.415962] [<ffffffffa03777d9>] fill_holes+0x229/0x3e0 [btrfs] [28877.415972] [<ffffffffa0345865>] ? block_rsv_add_bytes+0x55/0x80 [btrfs] [28877.415984] [<ffffffffa03792cb>] btrfs_fallocate+0xb6b/0xc20 [btrfs] (...) [29854.132560] BTRFS critical (device sdc): corrupt leaf, bad key order: block=955232256,root=1, slot=24 [29854.132565] BTRFS info (device sdc): leaf 955232256 total ptrs 40 free space 778 (...) [29854.132637] item 23 key (3486 108 667648) itemoff 2694 itemsize 53 [29854.132638] extent data disk bytenr 14574411776 nr 286720 [29854.132639] extent data offset 0 nr 286720 ram 286720 [29854.132640] item 24 key (3486 108 954368) itemoff 2641 itemsize 53 [29854.132641] extent data disk bytenr 0 nr 0 [29854.132643] extent data offset 0 nr 0 ram 0 [29854.132644] item 25 key (3486 108 954368) itemoff 2588 itemsize 53 [29854.132645] extent data disk bytenr 8699670528 nr 77824 [29854.132646] extent data offset 0 nr 77824 ram 77824 [29854.132647] item 26 key (3486 108 1146880) itemoff 2535 itemsize 53 [29854.132648] extent data disk bytenr 8699670528 nr 77824 [29854.132649] extent data offset 0 nr 77824 ram 77824 (...) [29854.132707] kernel BUG at fs/btrfs/ctree.h:3901! (...) [29854.132771] Call Trace: [29854.132779] [<ffffffffa0342b5c>] setup_items_for_insert+0x2dc/0x400 [btrfs] [29854.132791] [<ffffffffa0378537>] __btrfs_drop_extents+0xba7/0xdd0 [btrfs] [29854.132794] [<ffffffff8109c0d6>] ? trace_hardirqs_on_caller+0x16/0x1d0 [29854.132797] [<ffffffff8109c29d>] ? trace_hardirqs_on+0xd/0x10 [29854.132800] [<ffffffff8118e7be>] ? kmem_cache_alloc+0xfe/0x1c0 [29854.132810] [<ffffffffa036783b>] insert_reserved_file_extent.constprop.66+0xab/0x310 [btrfs] [29854.132820] [<ffffffffa036a6c6>] __btrfs_prealloc_file_range+0x116/0x340 [btrfs] [29854.132830] [<ffffffffa0374d53>] btrfs_prealloc_file_range+0x23/0x30 [btrfs] (...) So this is caused by getting an -ENOSPC error while punching a file hole, more specifically, we get -ENOSPC error from __btrfs_drop_extents in the while loop of file.c:btrfs_punch_hole() when it's unable to modify the btree to delete one or more file extent items due to lack of enough free space. When this happens, in btrfs_punch_hole(), we attempt to reclaim free space by switching our transaction block reservation object to root->fs_info->trans_block_rsv, end our transaction and start a new transaction basically - and, we keep increasing our current offset (cur_offset) as long as it's smaller than the end of the target range (lockend) - this makes use leave the loop with cur_offset == drop_end which in turn makes us call fill_holes() for inserting a file extent item that represents a 0 bytes range hole (and this insertion succeeds, as in the meanwhile more space became available). This 0 bytes file hole extent item is a problem because any subsequent caller of __btrfs_drop_extents (regular file writes, or fallocate calls for e.g.), with a start file offset that is equal to the offset of the hole, will not remove this extent item due to the following conditional in the while loop of __btrfs_drop_extents: if (extent_end <= search_start) { path->slots[0]++; goto next_slot; } This later makes the call to setup_items_for_insert() (at the very end of __btrfs_drop_extents), insert a new file extent item with the same offset as the 0 bytes file hole extent item that follows it. Needless is to say that this causes chaos, either when reading the leaf from disk (btree_readpage_end_io_hook), where we perform leaf sanity checks or in subsequent operations that manipulate file extent items, as in the fallocate call as shown by the dmesg trace above. Without my other patch to perform the leaf sanity checks once a leaf is marked as dirty (if the integrity checker is enabled), it would have been much harder to debug this issue. This change might fix a few similar issues reported by users in the mailing list regarding assertion failures in btrfs_set_item_key_safe calls performed by __btrfs_drop_extents, such as the following report: http://comments.gmane.org/gmane.comp.file-systems.btrfs/32938 Asking fill_holes() to create a 0 bytes wide file hole item also produced the first warning in the trace above, as we passed a range to btrfs_drop_extent_cache that has an end smaller (by -1) than its start. On 3.14 kernels this issue manifests itself through leaf corruption, as we get duplicated file extent item keys in a leaf when calling setup_items_for_insert(), but on older kernels, setup_items_for_insert() isn't called by __btrfs_drop_extents(), instead we have callers of __btrfs_drop_extents(), namely the functions inode.c:insert_inline_extent() and inode.c:insert_reserved_file_extent(), calling btrfs_insert_empty_item() to insert the new file extent item, which would fail with error -EEXIST, instead of inserting a duplicated key - which is still a serious issue as it would make all similar file extent item replace operations keep failing if they target the same file range. Cc: stable@vger.kernel.org Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:26 -07:00
Liu Bo	d2cbf2a260	Btrfs: do not increment on bio_index one by one 'bio_index' is just a index, it's really not necessary to do increment one by one. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:25 -07:00
Filipe Manana	a1a50f60a6	Btrfs: read inode size after acquiring the mutex when punching a hole In a previous change, commit `12870f1c9b`, I accidentally moved the roundup of inode->i_size to outside of the critical section delimited by the inode mutex, which is not atomic and not correct since the size can be changed by other task before we acquire the mutex. Therefore fix it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:24 -07:00
Tobias Klauser	7fb18a0664	btrfs: Remove unnecessary check for NULL iput() already checks for the inode being NULL, thus it's unnecessary to check before calling. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:23 -07:00
Zach Brown	166ae5a418	btrfs: fix inline compressed read err corruption uncompress_inline() is dropping the error from btrfs_decompress() after testing it and zeroing the page that was supposed to hold decompressed data. This can silently turn compressed inline data in to zeros if decompression fails due to corrupt compressed data or memory allocation failure. I verified this by manually forcing the error from btrfs_decompress() for a silly named copy of od: if (!strcmp(current->comm, "failod")) ret = -ENOMEM; # od -x /mnt/btrfs/dir/80 \| head -1 0000000 3031 3038 310a 2d30 6f70 6e69 0a74 3031 # echo 3 > /proc/sys/vm/drop_caches # cp $(which od) /tmp/failod # /tmp/failod -x /mnt/btrfs/dir/80 \| head -1 0000000 0000 0000 0000 0000 0000 0000 0000 0000 The fix is to pass the error to its caller. Which still has a BUG_ON(). So we fix that too. There seems to be no reason for the zeroing of the page on the error from btrfs_decompress() but not from the allocation error a few lines above. So the page zeroing is removed. Signed-off-by: Zach Brown <zab@redhat.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:23 -07:00
Zach Brown	774bcb35f0	btrfs: return ptr error from compression workspace The btrfs compression wrappers translated errors from workspace allocation to either -ENOMEM or -1. The compression type workspace allocators are already returning a ERR_PTR(-ENOMEM). Just return that and get rid of the magical -1. This helps a future patch return errors from the compression wrappers. Signed-off-by: Zach Brown <zab@redhat.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:22 -07:00
Zach Brown	60e1975acb	btrfs: return errno instead of -1 from compression The compression layer seems to have been built to return -1 and have callers make up errors that make sense. This isn't great because there are different errors that originate down in the compression layer. Let's return real negative errnos from the compression layer so that callers can pass on the error without having to guess what happened. ENOMEM for allocation failure, E2BIG when compression exceeds the uncompressed input, and EIO for everything else. This helps a future path return errors from btrfs_decompress(). Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:21 -07:00
Stefan Behrens	98806b446d	btrfs: check_int: propagate out-of-memory error upwards This issue was not causing any harm but IMO (and in the opinion of the static code checker) it is better to propagate this error status upwards. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:21 -07:00
Filipe Manana	61391d5622	Btrfs: fix hang on error (such as ENOSPC) when writing extent pages When running low on available disk space and having several processes doing buffered file IO, I got the following trace in dmesg: [ 4202.720152] INFO: task kworker/u8:1:5450 blocked for more than 120 seconds. [ 4202.720401] Not tainted 3.13.0-fdm-btrfs-next-26+ #1 [ 4202.720596] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4202.720874] kworker/u8:1 D 0000000000000001 0 5450 2 0x00000000 [ 4202.720904] Workqueue: btrfs-flush_delalloc normal_work_helper [btrfs] [ 4202.720908] ffff8801f62ddc38 0000000000000082 ffff880203ac2490 00000000001d3f40 [ 4202.720913] ffff8801f62ddfd8 00000000001d3f40 ffff8800c4f0c920 ffff880203ac2490 [ 4202.720918] 00000000001d4a40 ffff88020fe85a40 ffff88020fe85ab8 0000000000000001 [ 4202.720922] Call Trace: [ 4202.720931] [<ffffffff816a3cb9>] schedule+0x29/0x70 [ 4202.720950] [<ffffffffa01ec48d>] btrfs_start_ordered_extent+0x6d/0x110 [btrfs] [ 4202.720956] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0 [ 4202.720972] [<ffffffffa01ec559>] btrfs_run_ordered_extent_work+0x29/0x40 [btrfs] [ 4202.720988] [<ffffffffa0201987>] normal_work_helper+0x137/0x2c0 [btrfs] [ 4202.720994] [<ffffffff810680e5>] process_one_work+0x1f5/0x530 (...) [ 4202.721027] 2 locks held by kworker/u8:1/5450: [ 4202.721028] #0: (%s-%s){++++..}, at: [<ffffffff81068083>] process_one_work+0x193/0x530 [ 4202.721037] #1: ((&work->normal_work)){+.+...}, at: [<ffffffff81068083>] process_one_work+0x193/0x530 [ 4202.721054] INFO: task btrfs:7891 blocked for more than 120 seconds. [ 4202.721258] Not tainted 3.13.0-fdm-btrfs-next-26+ #1 [ 4202.721444] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 4202.721699] btrfs D 0000000000000001 0 7891 7890 0x00000001 [ 4202.721704] ffff88018c2119e8 0000000000000086 ffff8800a33d2490 00000000001d3f40 [ 4202.721710] ffff88018c211fd8 00000000001d3f40 ffff8802144b0000 ffff8800a33d2490 [ 4202.721714] ffff8800d8576640 ffff88020fe85bc0 ffff88020fe85bc8 7fffffffffffffff [ 4202.721718] Call Trace: [ 4202.721723] [<ffffffff816a3cb9>] schedule+0x29/0x70 [ 4202.721727] [<ffffffff816a2ebc>] schedule_timeout+0x1dc/0x270 [ 4202.721732] [<ffffffff8109bd79>] ? mark_held_locks+0xb9/0x140 [ 4202.721736] [<ffffffff816a90c0>] ? _raw_spin_unlock_irq+0x30/0x40 [ 4202.721740] [<ffffffff8109bf0d>] ? trace_hardirqs_on_caller+0x10d/0x1d0 [ 4202.721744] [<ffffffff816a488f>] wait_for_completion+0xdf/0x120 [ 4202.721749] [<ffffffff8107fa90>] ? try_to_wake_up+0x310/0x310 [ 4202.721765] [<ffffffffa01ebee4>] btrfs_wait_ordered_extents+0x1f4/0x280 [btrfs] [ 4202.721781] [<ffffffffa020526e>] btrfs_mksubvol.isra.62+0x30e/0x5a0 [btrfs] [ 4202.721786] [<ffffffff8108e620>] ? bit_waitqueue+0xc0/0xc0 [ 4202.721799] [<ffffffffa02056a9>] btrfs_ioctl_snap_create_transid+0x1a9/0x1b0 [btrfs] [ 4202.721813] [<ffffffffa020583a>] btrfs_ioctl_snap_create_v2+0x10a/0x170 [btrfs] (...) It turns out that extent_io.c:__extent_writepage(), which ends up being called through filemap_fdatawrite_range() in btrfs_start_ordered_extent(), was getting -ENOSPC when calling the fill_delalloc callback. In this situation, it returned without the writepage_end_io_hook callback (inode.c:btrfs_writepage_end_io_hook) ever being called for the respective page, which prevents the ordered extent's bytes_left count from ever reaching 0, and therefore a finish_ordered_fn work is never queued into the endio_write_workers queue. This makes the task that called btrfs_start_ordered_extent() hang forever on the wait queue of the ordered extent. This is fairly easy to reproduce using a small filesystem and fsstress on a quad core vm: mkfs.btrfs -f -b `expr 2100 \* 1024 \* 1024` /dev/sdd mount /dev/sdd /mnt fsstress -p 6 -d /mnt -n 100000 -x \ "btrfs subvolume snapshot -r /mnt /mnt/mysnap" \ -f allocsp=0 \ -f bulkstat=0 \ -f bulkstat1=0 \ -f chown=0 \ -f creat=1 \ -f dread=0 \ -f dwrite=0 \ -f fallocate=1 \ -f fdatasync=0 \ -f fiemap=0 \ -f freesp=0 \ -f fsync=0 \ -f getattr=0 \ -f getdents=0 \ -f link=0 \ -f mkdir=0 \ -f mknod=0 \ -f punch=1 \ -f read=0 \ -f readlink=0 \ -f rename=0 \ -f resvsp=0 \ -f rmdir=0 \ -f setxattr=0 \ -f stat=0 \ -f symlink=0 \ -f sync=0 \ -f truncate=1 \ -f unlink=0 \ -f unresvsp=0 \ -f write=4 So just ensure that if an error happens while writing the extent page we call the writepage_end_io_hook callback. Also make it return the error code and ensure the caller (extent_write_cache_pages) processes all pages in the page vector even if an error happens only for some of them, so that ordered extents end up released. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-09 17:20:20 -07:00
Linus Torvalds	3f17ea6dea	Merge branch 'next' (accumulated 3.16 merge window patches) into master Now that 3.15 is released, this merges the 'next' branch into 'master', bringing us to the normal situation where my 'master' branch is the merge window. * accumulated work in next: (6809 commits) ufs: sb mutex merge + mutex_destroy powerpc: update comments for generic idle conversion cris: update comments for generic idle conversion idle: remove cpu_idle() forward declarations nbd: zero from and len fields in NBD_CMD_DISCONNECT. mm: convert some level-less printks to pr_* MAINTAINERS: adi-buildroot-devel is moderated MAINTAINERS: add linux-api for review of API/ABI changes mm/kmemleak-test.c: use pr_fmt for logging fs/dlm/debug_fs.c: replace seq_printf by seq_puts fs/dlm/lockspace.c: convert simple_str to kstr fs/dlm/config.c: convert simple_str to kstr mm: mark remap_file_pages() syscall as deprecated mm: memcontrol: remove unnecessary memcg argument from soft limit functions mm: memcontrol: clean up memcg zoneinfo lookup mm/memblock.c: call kmemleak directly from memblock_(alloc\|free) mm/mempool.c: update the kmemleak stack trace for mempool allocations lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations mm: introduce kmemleak_update_trace() mm/kmemleak.c: use %u to print ->checksum ...	2014-06-08 11:31:16 -07:00
Filipe Manana	01a9a8a9e2	Btrfs: send, fix corrupted path strings for long paths If a path has more than 230 characters, we allocate a new buffer to use for the path, but we were forgotting to copy the contents of the previous buffer into the new one, which has random content from the kmalloc call. Test: mkfs.btrfs -f /dev/sdd mount /dev/sdd /mnt TEST_PATH="/mnt/fdmanana/.config/google-chrome-mysetup/Default/Pepper_Data/Shockwave_Flash/WritableRoot/#SharedObjects/JSHJ4ZKN/s.wsj.net/[[IMPORT]]/players.edgesuite.net/flash/plugins/osmf/advanced-streaming-plugin/v2.7/osmf1.6/Ak#" mkdir -p $TEST_PATH echo "hello world" > $TEST_PATH/amaiAdvancedStreamingPlugin.txt btrfs subvolume snapshot -r /mnt /mnt/mysnap1 btrfs send /mnt/mysnap1 -f /tmp/1.snap A test for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Cc: Marc Merlin <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Chris Mason <clm@fb.com>	2014-06-06 12:00:46 -07:00
Mel Gorman	2457aec637	mm: non-atomically mark page accessed during page cache allocation where possible aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-04 16:54:10 -07:00
Linus Torvalds	776edb5931	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__() arch,doc: Convert smp_mb__() arch,xtensa: Convert smp_mb__() arch,x86: Convert smp_mb__() arch,tile: Convert smp_mb__() arch,sparc: Convert smp_mb__() arch,sh: Convert smp_mb__() arch,score: Convert smp_mb__() arch,s390: Convert smp_mb__() arch,powerpc: Convert smp_mb__() arch,parisc: Convert smp_mb__() arch,openrisc: Convert smp_mb__() arch,mn10300: Convert smp_mb__() arch,mips: Convert smp_mb__() arch,metag: Convert smp_mb__() arch,m68k: Convert smp_mb__() arch,m32r: Convert smp_mb__() arch,ia64: Convert smp_mb__() ...	2014-06-03 12:57:53 -07:00
Linus Torvalds	11da37b263	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull two btrfs fixes from Chris Mason: "This has two fixes that we've been testing for 3.16, but since both are safe and fix real bugs, it makes sense to send for 3.15 instead" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: send, fix incorrect ref access when using extrefs Btrfs: fix EIO on reading file after ioctl clone works on it	2014-05-22 05:40:13 +09:00
Filipe Manana	51a60253a5	Btrfs: send, fix incorrect ref access when using extrefs When running send, if an inode only has extended reference items associated to it and no regular references, send.c:get_first_ref() was incorrectly assuming the reference it found was of type BTRFS_INODE_REF_KEY due to use of the wrong key variable. This caused weird behaviour when using the found item has a regular reference, such as weird path string, and occasionally (when lucky) a crash: [ 190.600652] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC [ 190.600994] Modules linked in: btrfs xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc psmouse serio_raw evbug pcspkr i2c_piix4 e1000 floppy [ 190.602565] CPU: 2 PID: 14520 Comm: btrfs Not tainted 3.13.0-fdm-btrfs-next-26+ #1 [ 190.602728] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 190.602868] task: ffff8800d447c920 ti: ffff8801fa79e000 task.ti: ffff8801fa79e000 [ 190.603030] RIP: 0010:[<ffffffff813266b4>] [<ffffffff813266b4>] memcpy+0x54/0x110 [ 190.603262] RSP: 0018:ffff8801fa79f880 EFLAGS: 00010202 [ 190.603395] RAX: ffff8800d4326e3f RBX: 000000000000036a RCX: ffff880000000000 [ 190.603553] RDX: 000000000000032a RSI: ffe708844042936a RDI: ffff8800d43271a9 [ 190.603710] RBP: ffff8801fa79f8c8 R08: 00000000003a4ef0 R09: 0000000000000000 [ 190.603867] R10: 793a4ef09f000000 R11: 9f0000000053726f R12: ffff8800d43271a9 [ 190.604020] R13: 0000160000000000 R14: ffff8802110134f0 R15: 000000000000036a [ 190.604020] FS: 00007fb423d09b80(0000) GS:ffff880216200000(0000) knlGS:0000000000000000 [ 190.604020] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 190.604020] CR2: 00007fb4229d4b78 CR3: 00000001f5d76000 CR4: 00000000000006e0 [ 190.604020] Stack: [ 190.604020] ffffffffa01f4d49 ffff8801fa79f8f0 00000000000009f9 ffff8801fa79f8c8 [ 190.604020] 00000000000009f9 ffff880211013260 000000000000f971 ffff88021147dba8 [ 190.604020] 00000000000009f9 ffff8801fa79f918 ffffffffa02367f5 ffff8801fa79f928 [ 190.604020] Call Trace: [ 190.604020] [<ffffffffa01f4d49>] ? read_extent_buffer+0xb9/0x120 [btrfs] [ 190.604020] [<ffffffffa02367f5>] fs_path_add_from_extent_buffer+0x45/0x60 [btrfs] [ 190.604020] [<ffffffffa0238806>] get_first_ref+0x1f6/0x210 [btrfs] [ 190.604020] [<ffffffffa0238994>] __get_cur_name_and_parent+0x174/0x3a0 [btrfs] [ 190.604020] [<ffffffff8118df3d>] ? kmem_cache_alloc_trace+0x11d/0x1e0 [ 190.604020] [<ffffffffa0236674>] ? fs_path_alloc+0x24/0x60 [btrfs] [ 190.604020] [<ffffffffa0238c91>] get_cur_path+0xd1/0x240 [btrfs] (...) Steps to reproduce (either crash or some weirdness like an odd path string): mkfs.btrfs -f -O extref /dev/sdd mount /dev/sdd /mnt mkdir /mnt/testdir touch /mnt/testdir/foobar for i in `seq 1 2550`; do ln /mnt/testdir/foobar /mnt/testdir/foobar_link_`printf "%04d" $i` done ln /mnt/testdir/foobar /mnt/testdir/final_foobar_name rm -f /mnt/testdir/foobar for i in `seq 1 2550`; do rm -f /mnt/testdir/foobar_link_`printf "%04d" $i` done btrfs subvolume snapshot -r /mnt /mnt/mysnap btrfs send /mnt/mysnap -f /tmp/mysnap.send Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>	2014-05-20 10:18:26 -07:00
Liu Bo	d3ecfcdf91	Btrfs: fix EIO on reading file after ioctl clone works on it For inline data extent, we need to make its length aligned, otherwise, we can get a phantom extent map which confuses readpages() to return -EIO. This can be detected by xfstests/btrfs/035. Reported-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-05-20 10:17:48 -07:00
Al Viro	b30ac0fc41	btrfs: switch to ->write_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:39:41 -04:00
Al Viro	aad4f8bb42	switch simple generic_file_aio_read() users to ->read_iter() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:37:55 -04:00
Al Viro	0c949334a9	iov_iter_truncate() Now It Can Be Done(tm) - we don't need to do iov_shorten() in generic_file_direct_write() anymore, now that all ->direct_IO() instances are converted to proper iov_iter methods and honour iter->count and iter->iov_offset properly. Get rid of count/ocount arguments of generic_file_direct_write(), while we are at it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:54 -04:00
Al Viro	28060d5d9b	btrfs: switch check_direct_IO() to iov_iter ... and don't open-code iov_iter_alignment() there Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:53 -04:00
Al Viro	71d8e532b1	start adding the tag to iov_iter For now, just use the same thing we pass to ->direct_IO() - it's all iovec-based at the moment. Pass it explicitly to iov_iter_init() and account for kvec vs. iovec in there, by the same kludge NFS ->direct_IO() uses. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:49 -04:00
Al Viro	31b140398c	switch {__,}blockdev_direct_IO() to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:46 -04:00
Al Viro	a6cbcd4a4a	get rid of pointless iov_length() in ->direct_IO() all callers have iov_length(iter->iov, iter->nr_segs) == iov_iter_count(iter) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:45 -04:00
Al Viro	d8d3d94b80	pass iov_iter to ->direct_IO() unmodified, for now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:44 -04:00
Al Viro	cb66a7a1f1	kill generic_segment_checks() all callers of ->aio_read() and ->aio_write() have iov/nr_segs already checked - generic_segment_checks() done after that is just an odd way to spell iov_length(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:43 -04:00
Al Viro	0ae5e4d370	__btrfs_direct_write(): switch to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:43 -04:00
Al Viro	f8579f8673	generic_file_direct_write(): switch to iov_iter Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-05-06 17:32:42 -04:00
Linus Torvalds	33c0022f0e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: limit the path size in send to PATH_MAX Btrfs: correctly set profile flags on seqlock retry Btrfs: use correct key when repeating search for extent item Btrfs: fix inode caching vs tree log Btrfs: fix possible memory leaks in open_ctree() Btrfs: avoid triggering bug_on() when we fail to start inode caching task Btrfs: move btrfs_{set,clear}_and_info() to ctree.h btrfs: replace error code from btrfs_drop_extents btrfs: Change the hole range to a more accurate value. btrfs: fix use-after-free in mount_subvol()	2014-04-27 13:26:28 -07:00
Chris Mason	cfd4a535b6	Btrfs: limit the path size in send to PATH_MAX fs_path_ensure_buf is used to make sure our path buffers for send are big enough for the path names as we construct them. The buffer size is limited to 32K by the length field in the struct. But bugs in the path construction can end up trying to build a huge buffer, and we'll do invalid memmmoves when the buffer length field wraps. This patch is step one, preventing the overflows. Signed-off-by: Chris Mason <clm@fb.com>	2014-04-26 05:02:03 -07:00
Filipe Manana	f8213bdc89	Btrfs: correctly set profile flags on seqlock retry If we had to retry on the profiles seqlock (due to a concurrent write), we would set bits on the input flags that corresponded both to the current profile and to previous values of the profile. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-24 16:43:33 -07:00
Filipe Manana	9ce49a0b4f	Btrfs: use correct key when repeating search for extent item If skinny metadata is enabled and our first tree search fails to find a skinny extent item, we may repeat a tree search for a "fat" extent item (if the previous item in the leaf is not the "fat" extent we're looking for). However we were not setting the new key's objectid to the right value, as we previously used the same key variable to peek at the previous item in the leaf, which has a different objectid. So just set the right objectid to avoid modifying/deleting a wrong item if we repeat the tree search. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-24 16:43:33 -07:00
Miao Xie	1c70d8fb4d	Btrfs: fix inode caching vs tree log Currently, with inode cache enabled, we will reuse its inode id immediately after unlinking file, we may hit something like following: \|->iput inode \|->return inode id into inode cache \|->create dir,fsync \|->power off An easy way to reproduce this problem is: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt -o inode_cache,commit=100 dd if=/dev/zero of=/mnt/data bs=1M count=10 oflag=sync inode_id=`ls -i /mnt/data \| awk '{print $1}'` rm -f /mnt/data i=1 while [ 1 ] do mkdir /mnt/dir_$i test1=`stat /mnt/dir_$i \| grep Inode: \| awk '{print $4}'` if [ $test1 -eq $inode_id ] then dd if=/dev/zero of=/mnt/dir_$i/data bs=1M count=1 oflag=sync echo b > /proc/sysrq-trigger fi sleep 1 i=$(($i+1)) done mount /dev/sdb /mnt umount /dev/sdb btrfs check /dev/sdb We fix this problem by adding unlinked inode's id into pinned tree, and we can not reuse them until committing transaction. Cc: stable@vger.kernel.org Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-24 16:43:33 -07:00
Wang Shilong	28c16cbbc3	Btrfs: fix possible memory leaks in open_ctree() Fix possible memory leaks in the following error handling paths: read_tree_block() btrfs_recover_log_trees btrfs_commit_super() btrfs_find_orphan_roots() btrfs_cleanup_fs_roots() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-24 16:43:32 -07:00
Wang Shilong	e60efa8425	Btrfs: avoid triggering bug_on() when we fail to start inode caching task When running stress test(including snapshots,balance,fstress), we trigger the following BUG_ON() which is because we fail to start inode caching task. [ 181.131945] kernel BUG at fs/btrfs/inode-map.c:179! [ 181.137963] invalid opcode: 0000 [#1] SMP [ 181.217096] CPU: 11 PID: 2532 Comm: btrfs Not tainted 3.14.0 #1 [ 181.240521] task: ffff88013b621b30 ti: ffff8800b6ada000 task.ti: ffff8800b6ada000 [ 181.367506] Call Trace: [ 181.371107] [<ffffffffa036c1be>] btrfs_return_ino+0x9e/0x110 [btrfs] [ 181.379191] [<ffffffffa038082b>] btrfs_evict_inode+0x46b/0x4c0 [btrfs] [ 181.387464] [<ffffffff810b5a70>] ? autoremove_wake_function+0x40/0x40 [ 181.395642] [<ffffffff811dc5fe>] evict+0x9e/0x190 [ 181.401882] [<ffffffff811dcde3>] iput+0xf3/0x180 [ 181.408025] [<ffffffffa03812de>] btrfs_orphan_cleanup+0x1ee/0x430 [btrfs] [ 181.416614] [<ffffffffa03a6abd>] btrfs_mksubvol.isra.29+0x3bd/0x450 [btrfs] [ 181.425399] [<ffffffffa03a6cd6>] btrfs_ioctl_snap_create_transid+0x186/0x190 [btrfs] [ 181.435059] [<ffffffffa03a6e3b>] btrfs_ioctl_snap_create_v2+0xeb/0x130 [btrfs] [ 181.444148] [<ffffffffa03a9656>] btrfs_ioctl+0xf76/0x2b90 [btrfs] [ 181.451971] [<ffffffff8117e565>] ? handle_mm_fault+0x475/0xe80 [ 181.459509] [<ffffffff8167ba0c>] ? __do_page_fault+0x1ec/0x520 [ 181.467046] [<ffffffff81185b35>] ? do_mmap_pgoff+0x2f5/0x3c0 [ 181.474393] [<ffffffff811d4da8>] do_vfs_ioctl+0x2d8/0x4b0 [ 181.481450] [<ffffffff811d5001>] SyS_ioctl+0x81/0xa0 [ 181.488021] [<ffffffff81680b69>] system_call_fastpath+0x16/0x1b We should avoid triggering BUG_ON() here, instead, we output warning messages and clear inode_cache option. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-24 16:43:32 -07:00
Wang Shilong	9d89ce6587	Btrfs: move btrfs_{set,clear}_and_info() to ctree.h Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-24 16:43:32 -07:00
David Sterba	3f9e3df8da	btrfs: replace error code from btrfs_drop_extents There's a case which clone does not handle and used to BUG_ON instead, (testcase xfstests/btrfs/035), now returns EINVAL. This error code is confusing to the ioctl caller, as it normally signifies errorneous arguments. Change it to ENOPNOTSUPP which allows a fall back to copy instead of clone. This does not affect the common reflink operation. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-24 16:43:32 -07:00
Qu Wenruo	c5f7d0bb29	btrfs: Change the hole range to a more accurate value. Commit `3ac0d7b96a` fixed the btrfs expanding write problem but the hole punched is sometimes too large for some iovec, which has unmapped data ranges. This patch will change to hole range to a more accurate value using the counts checked by the write check routines. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-24 16:43:32 -07:00
Peter Zijlstra	4e857c58ef	arch: Mass conversion of smp_mb__() Mostly scripted conversion of the smp_mb__ barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2014-04-18 14:20:48 +02:00
Christoph Jaeger	0040e606e3	btrfs: fix use-after-free in mount_subvol() Pointer 'newargs' is used after the memory that it points to has already been freed. Picked up by Coverity - CID 1201425. Fixes: `0723a0473f` ("btrfs: allow mounting btrfs subvolumes with different ro/rw options") Signed-off-by: Christoph Jaeger <christophjaeger@linux.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-14 11:31:08 -07:00
Linus Torvalds	5166701b36	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it does shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...	2014-04-12 14:49:50 -07:00
Linus Torvalds	3123bca719	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull second set of btrfs updates from Chris Mason: "The most important changes here are from Josef, fixing a btrfs regression in 3.14 that can cause corruptions in the extent allocation tree when snapshots are in use. Josef also fixed some deadlocks in send/recv and other assorted races when balance is running" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits) Btrfs: fix compile warnings on on avr32 platform btrfs: allow mounting btrfs subvolumes with different ro/rw options btrfs: export global block reserve size as space_info btrfs: fix crash in remount(thread_pool=) case Btrfs: abort the transaction when we don't find our extent ref Btrfs: fix EINVAL checks in btrfs_clone Btrfs: fix unlock in __start_delalloc_inodes() Btrfs: scrub raid56 stripes in the right way Btrfs: don't compress for a small write Btrfs: more efficient io tree navigation on wait_extent_bit Btrfs: send, build path string only once in send_hole btrfs: filter invalid arg for btrfs resize Btrfs: send, fix data corruption due to incorrect hole detection Btrfs: kmalloc() doesn't return an ERR_PTR Btrfs: fix snapshot vs nocow writting btrfs: Change the expanding write sequence to fix snapshot related bug. btrfs: make device scan less noisy btrfs: fix lockdep warning with reclaim lock inversion Btrfs: hold the commit_root_sem when getting the commit root during send Btrfs: remove transaction from send ...	2014-04-11 14:16:53 -07:00
Wang Shilong	e4fbaee292	Btrfs: fix compile warnings on on avr32 platform fs/btrfs/scrub.c: In function 'get_raid56_logic_offset': fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast fs/btrfs/scrub.c:2269: warning: right shift count >= width of type fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type Since @rot is an int type, we should not use do_div(), fix it. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-11 06:35:50 -07:00
Harald Hoyer	0723a0473f	btrfs: allow mounting btrfs subvolumes with different ro/rw options Given the following /etc/fstab entries: /dev/sda3 /mnt/foo btrfs subvol=foo,ro 0 0 /dev/sda3 /mnt/bar btrfs subvol=bar,rw 0 0 you can't issue: $ mount /mnt/foo $ mount /mnt/bar You would have to do: $ mount /mnt/foo $ mount -o remount,rw /mnt/foo $ mount --bind -o remount,ro /mnt/foo $ mount /mnt/bar or $ mount /mnt/bar $ mount --rw /mnt/foo $ mount --bind -o remount,ro /mnt/foo With this patch you can do $ mount /mnt/foo $ mount /mnt/bar $ cat /proc/self/mountinfo 49 33 0:41 /foo /mnt/foo ro,relatime shared:36 - btrfs /dev/sda3 rw,ssd,space_cache 87 33 0:41 /bar /mnt/bar rw,relatime shared:74 - btrfs /dev/sda3 rw,ssd,space_cache Signed-off-by: Chris Mason <clm@fb.com>	2014-04-10 13:32:50 -07:00
Kirill A. Shutemov	f1820361f8	mm: implement ->map_pages for page cache filemap_map_pages() is generic implementation of ->map_pages() for filesystems who uses page cache. It should be safe to use filemap_map_pages() for ->map_pages() if filesystem use filemap_fault() for ->fault(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-04-07 16:35:53 -07:00
David Sterba	36523e9512	btrfs: export global block reserve size as space_info Introduce a block group type bit for a global reserve and fill the space info for SPACE_INFO ioctl. This should replace the newly added ioctl (`01e219e806`) to get just the 'size' part of the global reserve, while the actual usage can be now visible in the 'btrfs fi df' output during ENOSPC stress. The unpatched userspace tools will show the blockgroup as 'unknown'. CC: Jeff Mahoney <jeffm@suse.com> CC: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 10:41:53 -07:00
Sergei Trofimovich	800ee2247f	btrfs: fix crash in remount(thread_pool=) case Reproducer: mount /dev/ubda /mnt mount -oremount,thread_pool=42 /mnt Gives a crash: ? btrfs_workqueue_set_max+0x0/0x70 btrfs_resize_thread_pool+0xe3/0xf0 ? sync_filesystem+0x0/0xc0 ? btrfs_resize_thread_pool+0x0/0xf0 btrfs_remount+0x1d2/0x570 ? kern_path+0x0/0x80 do_remount_sb+0xd9/0x1c0 do_mount+0x26a/0xbf0 ? kfree+0x0/0x1b0 SyS_mount+0xc4/0x110 It's a call btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size); with fs_info->scrub_wr_completion_workers = NULL; as scrub wqs get created only on user's demand. Patch skips not-created-yet workqueues. Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> CC: Qu Wenruo <quwenruo@cn.fujitsu.com> CC: Chris Mason <clm@fb.com> CC: Josef Bacik <jbacik@fb.com> CC: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 10:41:52 -07:00
Josef Bacik	c4a050bbbb	Btrfs: abort the transaction when we don't find our extent ref I'm not sure why we weren't aborting here in the first place, it is obviously a bad time from the fact that we print the leaf and yell loudly about it. Fix this up, otherwise we panic because our path could be pointing into oblivion. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:51 -07:00
Chris Mason	3a29bc0928	Btrfs: fix EINVAL checks in btrfs_clone btrfs_drop_extents can now return -EINVAL, but only one caller in btrfs_clone was checking for it. This adds it to the caller for inline extents, which is where we really need it. Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:50 -07:00
Wang Shilong	a1ecaabbf9	Btrfs: fix unlock in __start_delalloc_inodes() This patch fix a regression caused by the following patch: Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock break while loop will make us call @spin_unlock() without calling @spin_lock() before, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:50 -07:00
Wang Shilong	3b080b2564	Btrfs: scrub raid56 stripes in the right way Steps to reproduce: # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5 # mount /dev/sda8 /mnt # btrfs scrub start -BR /mnt # echo $? <--unverified errors make return value be 3 This is because we don't setup right mapping between physical and logical address for raid56, which makes checksum mismatch. But we will find everthing is fine later when rechecking using btrfs_map_block(). This patch fixed the problem by settuping right mappings and we only verify data stripes' checksums. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:49 -07:00
Wang Shilong	68bb462d42	Btrfs: don't compress for a small write To compress a small file range(<=blocksize) that is not an inline extent can not save disk space at all. skip it can save us some cpu time. This patch can also fix wrong setting nocompression flag for inode, say a case when @total_in is 4096, and then we get @total_compressed 52,because we do aligment to page cache size firstly, and then we get into conclusion @total_in=@total_compressed thus we will clear this inode's compression flag. An exception comes from inserting inline extent failure but we still have @total_compressed < @total_in,so we will still reset inode's flag, this is ok, because we don't have good compression effect. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:48 -07:00
Filipe Manana	c50d3e71c3	Btrfs: more efficient io tree navigation on wait_extent_bit If we don't reschedule use rb_next to find the next extent state instead of a full tree search, which is more efficient and safe since we didn't release the io tree's lock. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:47 -07:00
Filipe Manana	c715e155c9	Btrfs: send, build path string only once in send_hole There's no point building the path string in each iteration of the send_hole loop, as it produces always the same string. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:46 -07:00
Gui Hecheng	9a40f1222a	btrfs: filter invalid arg for btrfs resize Originally following cmds will work: # btrfs fi resize -10A <mnt> # btrfs fi resize -10Gaha <mnt> Filter the arg by checking the return pointer of memparse. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:45 -07:00
Filipe Manana	766b5e5ae7	Btrfs: send, fix data corruption due to incorrect hole detection During an incremental send, when we finish processing an inode (corresponding to a regular file) we would assume the gap between the end of the last processed file extent and the file's size corresponded to a file hole, and therefore incorrectly send a bunch of zero bytes to overwrite that region in the file. This affects only kernel 3.14. Reproducer: mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt xfs_io -f -c "falloc -k 0 268435456" /mnt/foo btrfs subvolume snapshot -r /mnt /mnt/mysnap0 xfs_io -c "pwrite -S 0x01 -b 9216 16190218 9216" /mnt/foo xfs_io -c "pwrite -S 0x02 -b 1121 198720104 1121" /mnt/foo xfs_io -c "pwrite -S 0x05 -b 9216 107887439 9216" /mnt/foo xfs_io -c "pwrite -S 0x06 -b 9216 225520207 9216" /mnt/foo xfs_io -c "pwrite -S 0x07 -b 67584 102138300 67584" /mnt/foo xfs_io -c "pwrite -S 0x08 -b 7000 94897484 7000" /mnt/foo xfs_io -c "pwrite -S 0x09 -b 113664 245083212 113664" /mnt/foo xfs_io -c "pwrite -S 0x10 -b 123 17937788 123" /mnt/foo xfs_io -c "pwrite -S 0x11 -b 39936 229573311 39936" /mnt/foo xfs_io -c "pwrite -S 0x12 -b 67584 174792222 67584" /mnt/foo xfs_io -c "pwrite -S 0x13 -b 9216 249253213 9216" /mnt/foo xfs_io -c "pwrite -S 0x16 -b 67584 150046083 67584" /mnt/foo xfs_io -c "pwrite -S 0x17 -b 39936 118246040 39936" /mnt/foo xfs_io -c "pwrite -S 0x18 -b 67584 215965442 67584" /mnt/foo xfs_io -c "pwrite -S 0x19 -b 33792 97096725 33792" /mnt/foo xfs_io -c "pwrite -S 0x20 -b 125952 166300596 125952" /mnt/foo xfs_io -c "pwrite -S 0x21 -b 123 1078957 123" /mnt/foo xfs_io -c "pwrite -S 0x25 -b 9216 212044492 9216" /mnt/foo xfs_io -c "pwrite -S 0x26 -b 7000 265037146 7000" /mnt/foo xfs_io -c "pwrite -S 0x27 -b 42757 215922685 42757" /mnt/foo xfs_io -c "pwrite -S 0x28 -b 7000 69865411 7000" /mnt/foo xfs_io -c "pwrite -S 0x29 -b 67584 67948958 67584" /mnt/foo xfs_io -c "pwrite -S 0x30 -b 39936 266967019 39936" /mnt/foo xfs_io -c "pwrite -S 0x31 -b 1121 19582453 1121" /mnt/foo xfs_io -c "pwrite -S 0x32 -b 17408 257710255 17408" /mnt/foo xfs_io -c "pwrite -S 0x33 -b 39936 3895518 39936" /mnt/foo xfs_io -c "pwrite -S 0x34 -b 125952 12045847 125952" /mnt/foo xfs_io -c "pwrite -S 0x35 -b 17408 19156379 17408" /mnt/foo xfs_io -c "pwrite -S 0x36 -b 39936 50160066 39936" /mnt/foo xfs_io -c "pwrite -S 0x37 -b 113664 9549793 113664" /mnt/foo xfs_io -c "pwrite -S 0x38 -b 105472 94391506 105472" /mnt/foo xfs_io -c "pwrite -S 0x39 -b 23552 143632863 23552" /mnt/foo xfs_io -c "pwrite -S 0x40 -b 39936 241283845 39936" /mnt/foo xfs_io -c "pwrite -S 0x41 -b 113664 199937606 113664" /mnt/foo xfs_io -c "pwrite -S 0x42 -b 67584 67380093 67584" /mnt/foo xfs_io -c "pwrite -S 0x43 -b 67584 26793129 67584" /mnt/foo xfs_io -c "pwrite -S 0x44 -b 39936 14421913 39936" /mnt/foo xfs_io -c "pwrite -S 0x45 -b 123 253097405 123" /mnt/foo xfs_io -c "pwrite -S 0x46 -b 1121 128233424 1121" /mnt/foo xfs_io -c "pwrite -S 0x47 -b 105472 91577959 105472" /mnt/foo xfs_io -c "pwrite -S 0x48 -b 1121 7245381 1121" /mnt/foo xfs_io -c "pwrite -S 0x49 -b 113664 182414694 113664" /mnt/foo xfs_io -c "pwrite -S 0x50 -b 9216 32750608 9216" /mnt/foo xfs_io -c "pwrite -S 0x51 -b 67584 266546049 67584" /mnt/foo xfs_io -c "pwrite -S 0x52 -b 67584 87969398 67584" /mnt/foo xfs_io -c "pwrite -S 0x53 -b 9216 260848797 9216" /mnt/foo xfs_io -c "pwrite -S 0x54 -b 39936 119461243 39936" /mnt/foo xfs_io -c "pwrite -S 0x55 -b 7000 200178693 7000" /mnt/foo xfs_io -c "pwrite -S 0x56 -b 9216 243316029 9216" /mnt/foo xfs_io -c "pwrite -S 0x57 -b 7000 209658229 7000" /mnt/foo xfs_io -c "pwrite -S 0x58 -b 101376 179745192 101376" /mnt/foo xfs_io -c "pwrite -S 0x59 -b 9216 64012300 9216" /mnt/foo xfs_io -c "pwrite -S 0x60 -b 125952 181705139 125952" /mnt/foo xfs_io -c "pwrite -S 0x61 -b 23552 235737348 23552" /mnt/foo xfs_io -c "pwrite -S 0x62 -b 113664 106021355 113664" /mnt/foo xfs_io -c "pwrite -S 0x63 -b 67584 135753552 67584" /mnt/foo xfs_io -c "pwrite -S 0x64 -b 23552 95730888 23552" /mnt/foo xfs_io -c "pwrite -S 0x65 -b 11 17311415 11" /mnt/foo xfs_io -c "pwrite -S 0x66 -b 33792 120695553 33792" /mnt/foo xfs_io -c "pwrite -S 0x67 -b 9216 17164631 9216" /mnt/foo xfs_io -c "pwrite -S 0x68 -b 9216 136065853 9216" /mnt/foo xfs_io -c "pwrite -S 0x69 -b 67584 37752198 67584" /mnt/foo xfs_io -c "pwrite -S 0x70 -b 101376 189717473 101376" /mnt/foo xfs_io -c "pwrite -S 0x71 -b 7000 227463698 7000" /mnt/foo xfs_io -c "pwrite -S 0x72 -b 9216 12655137 9216" /mnt/foo xfs_io -c "pwrite -S 0x73 -b 7000 7488866 7000" /mnt/foo xfs_io -c "pwrite -S 0x74 -b 113664 87813649 113664" /mnt/foo xfs_io -c "pwrite -S 0x75 -b 33792 25802183 33792" /mnt/foo xfs_io -c "pwrite -S 0x76 -b 39936 93524024 39936" /mnt/foo xfs_io -c "pwrite -S 0x77 -b 33792 113336388 33792" /mnt/foo xfs_io -c "pwrite -S 0x78 -b 105472 184955320 105472" /mnt/foo xfs_io -c "pwrite -S 0x79 -b 101376 225691598 101376" /mnt/foo xfs_io -c "pwrite -S 0x80 -b 23552 77023155 23552" /mnt/foo xfs_io -c "pwrite -S 0x81 -b 11 201888192 11" /mnt/foo xfs_io -c "pwrite -S 0x82 -b 11 115332492 11" /mnt/foo xfs_io -c "pwrite -S 0x83 -b 67584 230278015 67584" /mnt/foo xfs_io -c "pwrite -S 0x84 -b 11 120589073 11" /mnt/foo xfs_io -c "pwrite -S 0x85 -b 125952 202207819 125952" /mnt/foo xfs_io -c "pwrite -S 0x86 -b 113664 86672080 113664" /mnt/foo xfs_io -c "pwrite -S 0x87 -b 17408 208459603 17408" /mnt/foo xfs_io -c "pwrite -S 0x88 -b 7000 73372211 7000" /mnt/foo xfs_io -c "pwrite -S 0x89 -b 7000 42252122 7000" /mnt/foo xfs_io -c "pwrite -S 0x90 -b 23552 46784881 23552" /mnt/foo xfs_io -c "pwrite -S 0x91 -b 101376 63172351 101376" /mnt/foo xfs_io -c "pwrite -S 0x92 -b 23552 59341931 23552" /mnt/foo xfs_io -c "pwrite -S 0x93 -b 39936 239599283 39936" /mnt/foo xfs_io -c "pwrite -S 0x94 -b 67584 175643105 67584" /mnt/foo xfs_io -c "pwrite -S 0x97 -b 23552 105534880 23552" /mnt/foo xfs_io -c "pwrite -S 0x98 -b 113664 8236844 113664" /mnt/foo xfs_io -c "pwrite -S 0x99 -b 125952 144489686 125952" /mnt/foo xfs_io -c "pwrite -S 0xa0 -b 7000 73273112 7000" /mnt/foo xfs_io -c "pwrite -S 0xa1 -b 125952 194580243 125952" /mnt/foo xfs_io -c "pwrite -S 0xa2 -b 123 56296779 123" /mnt/foo xfs_io -c "pwrite -S 0xa3 -b 11 233066845 11" /mnt/foo xfs_io -c "pwrite -S 0xa4 -b 39936 197727090 39936" /mnt/foo xfs_io -c "pwrite -S 0xa5 -b 101376 53579812 101376" /mnt/foo xfs_io -c "pwrite -S 0xa6 -b 9216 85669738 9216" /mnt/foo xfs_io -c "pwrite -S 0xa7 -b 125952 21266322 125952" /mnt/foo xfs_io -c "pwrite -S 0xa8 -b 23552 125726568 23552" /mnt/foo xfs_io -c "pwrite -S 0xa9 -b 9216 18423680 9216" /mnt/foo xfs_io -c "pwrite -S 0xb0 -b 1121 165901483 1121" /mnt/foo btrfs subvolume snapshot -r /mnt /mnt/mysnap1 xfs_io -c "pwrite -S 0xff -b 10 16190218 10" /mnt/foo btrfs subvolume snapshot -r /mnt /mnt/mysnap2 md5sum /mnt/foo # returns 79e53f1466bfc09fd82b450689e6119e md5sum /mnt/mysnap2/foo # returns 79e53f1466bfc09fd82b450689e6119e too btrfs send /mnt/mysnap1 -f /tmp/1.snap btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt btrfs receive /mnt -f /tmp/1.snap btrfs receive /mnt -f /tmp/2.snap md5sum /mnt/mysnap2/foo # returns 2bb414c5155767cedccd7063e51beabd !! A testcase for xfstests follows soon too. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:45 -07:00
Dan Carpenter	84dbeb87d1	Btrfs: kmalloc() doesn't return an ERR_PTR The error handling was copy and pasted from memdup_user(). It should be checking for NULL obviously. Fixes: `abccd00f8a` ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:44 -07:00
Wang Shilong	e9894fd3e3	Btrfs: fix snapshot vs nocow writting While running fsstress and snapshots concurrently, we will hit something like followings: Thread 1 Thread 2 \|->fallocate \|->write pages \|->join transaction \|->add ordered extent \|->end transaction \|->flushing data \|->creating pending snapshots \|->write data into src root's fallocated space After above work flows finished, we will get a state that source and snapshot root share same space, but source root have written data into fallocated space, this will make fsck fail to verify checksums for snapshot root's preallocating file extent data.Nocow writting also has this same problem. Fix this problem by syncing snapshots with nocow writting: 1.for nocow writting,if there are pending snapshots, we will fall into COW way. 2.if there are pending nocow writes, snapshots for this root will be blocked until nocow writting finish. Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:43 -07:00
Qu Wenruo	3ac0d7b96a	btrfs: Change the expanding write sequence to fix snapshot related bug. When testing fsstress with snapshot making background, some snapshot following problem. Snapshot 270: inode 323: size 0 Snapshot 271: inode 323: size 349145 \|-------Hole---\|---------Empty gap-------\|-------Hole-----\| 0 122880 172032 349145 Snapshot 272: inode 323: size 349145 \|-------Hole---\|------------Data---------\|-------Hole-----\| 0 122880 172032 349145 The fsstress operation on inode 323 is the following: write: offset 126832 len 43124 truncate: size 349145 Since the write with offset is consist of 2 operations: 1. punch hole 2. write data Hole punching is faster than data write, so hole punching in write and truncate is done first and then buffered write, so the snapshot 271 got empty gap, which will not pass btrfsck. To fix the bug, this patch will change the write sequence which will first punch a hole covering the write end if a hole is needed. Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:42 -07:00
David Sterba	60999ca4b4	btrfs: make device scan less noisy Print the message only when the device is seen for the first time. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:41 -07:00
Jeff Mahoney	ed55b6ac07	btrfs: fix lockdep warning with reclaim lock inversion When encountering memory pressure, testers have run into the following lockdep warning. It was caused by __link_block_group calling kobject_add with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL, which gets us into reclaim context. The kobject doesn't actually need to be added under the lock -- it just needs to ensure that it's only added for the first block group to be linked. ========================================================= [ INFO: possible irq lock inversion dependency detected ] 3.14.0-rc8-default #1 Not tainted --------------------------------------------------------- kswapd0/169 just changed the state of lock: (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs] but this lock took another, RECLAIM_FS-unsafe lock in the past: (&found->groups_sem){+++++.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&found->groups_sem); local_irq_disable(); lock(&delayed_node->mutex); lock(&found->groups_sem); <Interrupt> lock(&delayed_node->mutex); * DEADLOCK * 2 locks held by kswapd0/169: #0: (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160 #1: (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:40 -07:00
Josef Bacik	3f8a18cc53	Btrfs: hold the commit_root_sem when getting the commit root during send We currently rely too heavily on roots being read-only to save us from just accessing root->commit_root. We can easily balance blocks out from underneath a read only root, so to save us from getting screwed make sure we only access root->commit_root under the commit root sem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-07 09:08:39 -07:00
Josef Bacik	9e351cc862	Btrfs: remove transaction from send Lets try this again. We can deadlock the box if we send on a box and try to write onto the same fs with the app that is trying to listen to the send pipe. This is because the writer could get stuck waiting for a transaction commit which is being blocked by the send. So fix this by making sure looking at the commit roots is always going to be consistent. We do this by keeping track of which roots need to have their commit roots swapped during commit, and then taking the commit_root_sem and swapping them all at once. Then make sure we take a read lock on the commit_root_sem in cases where we search the commit root to make sure we're always looking at a consistent view of the commit roots. Previously we had problems with this because we would swap a fs tree commit root and then swap the extent tree commit root independently which would cause the backref walking code to screw up sometimes. With this patch we no longer deadlock and pass all the weird send/receive corner cases. Thanks, Reportedy-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-06 17:39:30 -07:00
Josef Bacik	a26e8c9f75	Btrfs: don't clear uptodate if the eb is under IO So I have an awful exercise script that will run snapshot, balance and send/receive in parallel. This sometimes would crash spectacularly and when it came back up the fs would be completely hosed. Turns out this is because of a bad interaction of balance and send/receive. Send will hold onto its entire path for the whole send, but its blocks could get relocated out from underneath it, and because it doesn't old tree locks theres nothing to keep this from happening. So it will go to read in a slot with an old transid, and we could have re-allocated this block for something else and it could have a completely different transid. But because we think it is invalid we clear uptodate and re-read in the block. If we do this before we actually write out the new block we could write back stale data to the fs, and boom we're screwed. Now we definitely need to fix this disconnect between send and balance, but we really really need to not allow ourselves to accidently read in stale data over new data. So make sure we check if the extent buffer is not under io before clearing uptodate, this will kick back EIO to the caller instead of reading in stale data and keep us from corrupting the fs. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-06 17:34:37 -07:00
Josef Bacik	573a075567	Btrfs: check for an extent_op on the locked ref We could have possibly added an extent_op to the locked_ref while we dropped locked_ref->lock, so check for this case as well and loop around. Otherwise we could lose flag updates which would lead to extent tree corruption. Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-06 17:34:36 -07:00
Josef Bacik	ba8b028933	Btrfs: do not reset last_snapshot after relocation This was done to allow NO_COW to continue to be NO_COW after relocation but it is not right. When relocating we will convert blocks to FULL_BACKREF that we relocate. We can leave some of these full backref blocks behind if they are not cow'ed out during the relocation, like if we fail the relocation with ENOSPC and then just drop the reloc tree. Then when we go to cow the block again we won't lookup the extent flags because we won't think there has been a snapshot recently which means we will do our normal ref drop thing instead of adding back a tree ref and dropping the shared ref. This will cause btrfs_free_extent to blow up because it can't find the ref we are trying to free. This was found with my ref verifying tool. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-04-06 17:34:35 -07:00
Linus Torvalds	24e7ea3bea	Major changes for 3.14 include support for the newly added ZERO_RANGE and COLLAPSE_RANGE fallocate operations, and scalability improvements in the jbd2 layer and in xattr handling when the extended attributes spill over into an external block. Other than that, the usual clean ups and minor bug fixes. -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABCAAGBQJTPbD2AAoJENNvdpvBGATwDmUQANSfGYIQazB8XKKgtNTMiG/Y Ky7n1JzN9lTX/6nMsqQnbfCweLRmxqpWUBuyKDRHUi8IG0/voXSTFsAOOgz0R15A ERRRWkVvHixLpohuL/iBdEMFHwNZYPGr3jkm0EIgzhtXNgk5DNmiuMwvHmCY27kI kdNZIw9fip/WRNoFLDBGnLGC37aanoHhCIbVlySy5o9LN1pkC8BgXAYV0Rk19SVd bWCudSJEirFEqWS5H8vsBAEm/ioxTjwnNL8tX8qms6orZ6h8yMLFkHoIGWPw3Q15 a0TSUoMyav50Yr59QaDeWx9uaPQVeK41wiYFI2rZOnyG2ts0u0YXs/nLwJqTovgs rzvbdl6cd3Nj++rPi97MTA7iXK96WQPjsDJoeeEgnB0d/qPyTk6mLKgftzLTNgSa ZmWjrB19kr6CMbebMC4L6eqJ8Fr66pCT8c/iue8wc4MUHi7FwHKH64fqWvzp2YT/ +165dqqo2JnUv7tIp6sUi1geun+bmDHLZFXgFa7fNYFtcU3I+uY1mRr3eMVAJndA 2d6ASe/KhQbpVnjKJdQ8/b833ZS3p+zkgVPrd68bBr3t7gUmX91wk+p1ct6rUPLr 700F+q/pQWL8ap0pU9Ht/h3gEJIfmRzTwxlOeYyOwDseqKuS87PSB3BzV3dDunSU DrPKlXwIgva7zq5/S0Vr =4s1Z -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major changes for 3.14 include support for the newly added ZERO_RANGE and COLLAPSE_RANGE fallocate operations, and scalability improvements in the jbd2 layer and in xattr handling when the extended attributes spill over into an external block. Other than that, the usual clean ups and minor bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits) ext4: fix premature freeing of partial clusters split across leaf blocks ext4: remove unneeded test of ret variable ext4: fix comment typo ext4: make ext4_block_zero_page_range static ext4: atomically set inode->i_flags in ext4_set_inode_flags() ext4: optimize Hurd tests when reading/writing inodes ext4: kill i_version support for Hurd-castrated file systems ext4: each filesystem creates and uses its own mb_cache fs/mbcache.c: doucple the locking of local from global data fs/mbcache.c: change block and index hash chain to hlist_bl_node ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate ext4: refactor ext4_fallocate code ext4: Update inode i_size after the preallocation ext4: fix partial cluster handling for bigalloc file systems ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents ext4: only call sync_filesystm() when remounting read-only fs: push sync_filesystem() down to the file system's remount_fs() jbd2: improve error messages for inconsistent journal heads jbd2: minimize region locked by j_list_lock in jbd2_journal_forget() jbd2: minimize region locked by j_list_lock in journal_get_create_access() ...	2014-04-04 15:39:39 -07:00
Linus Torvalds	53c566625f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs changes from Chris Mason: "This is a pretty long stream of bug fixes and performance fixes. Qu Wenruo has replaced the btrfs async threads with regular kernel workqueues. We'll keep an eye out for performance differences, but it's nice to be using more generic code for this. We still have some corruption fixes and other patches coming in for the merge window, but this batch is tested and ready to go" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits) Btrfs: fix a crash of clone with inline extents's split btrfs: fix uninit variable warning Btrfs: take into account total references when doing backref lookup Btrfs: part 2, fix incremental send's decision to delay a dir move/rename Btrfs: fix incremental send's decision to delay a dir move/rename Btrfs: remove unnecessary inode generation lookup in send Btrfs: fix race when updating existing ref head btrfs: Add trace for btrfs_workqueue alloc/destroy Btrfs: less fs tree lock contention when using autodefrag Btrfs: return EPERM when deleting a default subvolume Btrfs: add missing kfree in btrfs_destroy_workqueue Btrfs: cache extent states in defrag code path Btrfs: fix deadlock with nested trans handles Btrfs: fix possible empty list access when flushing the delalloc inodes Btrfs: split the global ordered extents mutex Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock Btrfs: reclaim delalloc metadata more aggressively Btrfs: remove unnecessary lock in may_commit_transaction() Btrfs: remove the unnecessary flush when preparing the pages Btrfs: just do dirty page flush for the inode with compression before direct IO ...	2014-04-04 15:31:36 -07:00
Johannes Weiner	91b0abe36a	mm + fs: store shadow entries in page cache Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-04-03 16:21:01 -07:00
Johannes Weiner	0cd6144aad	mm + fs: prepare for non-page entries in page cache radix trees shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-04-03 16:21:00 -07:00
Dan Carpenter	45d4f85504	fs/direct-io.c: remove some left over checks We know that "ret > 0" is true here. These tests were left over from commit `02afc27fae` ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-04-03 16:20:57 -07:00
Al Viro	5cb6c6c7eb	generic_file_direct_write(): get rid of ppos argument always equal to &iocb->ki_pos. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-04-01 23:19:35 -04:00
Al Viro	867c4f9329	btrfs_file_aio_write(): get rid of ppos Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-04-01 23:19:35 -04:00
Al Viro	9e8c2af96e	callers of iov_copy_from_user_atomic() don't need pagecache_disable() ... it does that itself (via kmap_atomic()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-04-01 23:19:20 -04:00
Liu Bo	00fdf13a2e	Btrfs: fix a crash of clone with inline extents's split xfstests's btrfs/035 triggers a BUG_ON, which we use to detect the split of inline extents in __btrfs_drop_extents(). For inline extents, we cannot duplicate another EXTENT_DATA item, because it breaks the rule of inline extents, that is, 'start offset' needs to be 0. We have set limitations for the source inode's compressed inline extents, because it needs to decompress and recompress. Now the destination inode's inline extents also need similar limitations. With this, xfstests btrfs/035 doesn't run into panic. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-21 17:35:18 -07:00
Chris Mason	73b802f447	btrfs: fix uninit variable warning fs/btrfs/send.c:2926: warning: ‘entry’ may be used uninitialized in this function Signed-off-by: Chris Mason <clm@fb.com>	2014-03-21 15:30:44 -07:00
Josef Bacik	4485386853	Btrfs: take into account total references when doing backref lookup I added an optimization for large files where we would stop searching for backrefs once we had looked at the number of references we currently had for this extent. This works great most of the time, but for snapshots that point to this extent and has changes in the original root this assumption falls on it face. So keep track of any delayed ref mods made and add in the actual ref count as reported by the extent item and use that to limit how far down an inode we'll search for extents. Thanks, Reportedy-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com> Reported-by: Hugo Mills <hugo@carfax.org.uk> Tested-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-21 15:28:09 -07:00
Filipe Manana	bfa7e1f8be	Btrfs: part 2, fix incremental send's decision to delay a dir move/rename For an incremental send, fix the process of determining whether the directory inode we're currently processing needs to have its move/rename operation delayed. We were ignoring the fact that if the inode's new immediate ancestor has a higher inode number than ours but wasn't renamed/moved, we might still need to delay our move/rename, because some other ancestor directory higher in the hierarchy might have an inode number higher than ours and was renamed/moved too - in this case we have to wait for rename/move of that ancestor to happen before our current directory's rename/move operation. Simple steps to reproduce this issue: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/x1/x2 $ mkdir /mnt/a/Z $ mkdir -p /mnt/a/x1/x2/x3/x4/x5 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/a/x1/x2/x3 /mnt/a/Z/X33 $ mv /mnt/a/x1/x2 /mnt/a/Z/X33/x4/x5/X22 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send The incremental send caused the kernel code to enter an infinite loop when building the path string for directory Z after its references are processed. A more complex scenario: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir -p /mnt/a/b/c/d $ mkdir /mnt/a/b/c/d/e $ mkdir /mnt/a/b/c/d/f $ mv /mnt/a/b/c/d/e /mnt/a/b/c/d/f/E2 $ mkdir /mmt/a/b/c/g $ mv /mnt/a/b/c/d /mnt/a/b/D2 $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mkdir /mnt/a/o $ mv /mnt/a/b/c/g /mnt/a/b/D2/f/G2 $ mv /mnt/a/b/D2 /mnt/a/b/dd $ mv /mnt/a/b/c /mnt/a/C2 $ mv /mnt/a/b/dd/f /mnt/a/o/FF $ mv /mnt/a/b /mnt/a/o/FF/E2/BB $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-21 15:25:48 -07:00
Filipe Manana	7b119a8b89	Btrfs: fix incremental send's decision to delay a dir move/rename It's possible to change the parent/child relationship between directories in such a way that if a child directory has a higher inode number than its parent, it doesn't necessarily means the child rename/move operation can be performed immediately. The parent migth have its own rename/move operation delayed, therefore in this case the child needs to have its rename/move operation delayed too, and be performed after its new parent's rename/move. Steps to reproduce the issue: $ umount /mnt $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt $ mkdir /mnt/A $ mkdir /mnt/B $ mkdir /mnt/C $ mv /mnt/C /mnt/A $ mv /mnt/B /mnt/A/C $ mkdir /mnt/A/C/D $ btrfs subvolume snapshot -r /mnt /mnt/snap1 $ btrfs send /mnt/snap1 -f /tmp/base.send $ mv /mnt/A/C/D /mnt/A/D2 $ mv /mnt/A/C/B /mnt/A/D2/B2 $ mv /mnt/A/C /mnt/A/D2/B2/C2 $ btrfs subvolume snapshot -r /mnt /mnt/snap2 $ btrfs send -p /mnt/snap1 /mnt/snap2 -f /tmp/incremental.send The incremental send caused the kernel code to enter an infinite loop when building the path string for directory C after its references are processed. The necessary conditions here are that C has an inode number higher than both A and B, and B as an higher inode number higher than A, and D has the highest inode number, that is: inode_number(A) < inode_number(B) < inode_number(C) < inode_number(D) The same issue could happen if after the first snapshot there's any number of intermediary parent directories between A2 and B2, and between B2 and C2. A test case for xfstests follows, covering this simple case and more advanced ones, with files and hard links created inside the directories. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-21 15:24:27 -07:00
Filipe Manana	425b5dafc8	Btrfs: remove unnecessary inode generation lookup in send No need to search in the send tree for the generation number of the inode, we already have it in the recorded_ref structure passed to us. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-20 17:15:28 -07:00
Filipe Manana	21543baddc	Btrfs: fix race when updating existing ref head While we update an existing ref head's extent_op, we're not holding its spinlock, so while we're updating its extent_op contents (key, flags) we can have a task running __btrfs_run_delayed_refs() that holds the ref head's lock and sets its extent_op to NULL right after the task updating the ref head just checked its extent_op was not NULL. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-20 17:15:28 -07:00
Qu Wenruo	c3a468915a	btrfs: Add trace for btrfs_workqueue alloc/destroy Since most of the btrfs_workqueue is printed as pointer address, for easier analysis, add trace for btrfs_workqueue alloc/destroy. So it is possible to determine the workqueue that a given work belongs to(by comparing the wq pointer address with alloc trace event). Signed-off-by: Qu Wenruo <quenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-20 17:15:28 -07:00
Filipe Manana	f094c9bd3e	Btrfs: less fs tree lock contention when using autodefrag When finding new extents during an autodefrag, don't do so many fs tree lookups to find an extent with a size smaller then the target treshold. Instead, after each fs tree forward search immediately unlock upper levels and process the entire leaf while holding a read lock on the leaf, since our leaf processing is very fast. This reduces lock contention, allowing for higher concurrency when other tasks want to write/update items related to other inodes in the fs tree, as we're not holding read locks on upper tree levels while processing the leaf and we do less tree searches. Test: sysbench --test=fileio --file-num=512 --file-total-size=16G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --file-rw-ratio=3 --file-io-mode=sync --max-time=1800 \ --max-requests=10000000000 [prepare\|run] (fileystem mounted with -o autodefrag, averages of 5 runs) Before this change: 58.852Mb/sec throughtput, read 77.589Gb, written 25.863Gb After this change: 63.034Mb/sec throughtput, read 83.102Gb, written 27.701Gb Test machine: quad core intel i5-3570K, 32Gb of RAM, SSD. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-20 17:15:27 -07:00
Guangyu Sun	72de6b5393	Btrfs: return EPERM when deleting a default subvolume The error message is confusing: # btrfs sub delete /mnt/mysub/ Delete subvolume '/mnt/mysub' ERROR: cannot delete '/mnt/mysub' - Directory not empty The error message does not make sense to me: It's not about deleting a directory but it's a subvolume, and it doesn't matter if the subvolume is empty or not. Maybe EPERM or is more appropriate in this case, combined with an explanatory kernel log message. (e.g. "subvolume with ID 123 cannot be deleted because it is configured as default subvolume.") Reported-by: Koen De Wit <koen.de.wit@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-20 17:15:27 -07:00
Filipe Manana	ef66af101a	Btrfs: add missing kfree in btrfs_destroy_workqueue Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-20 17:15:27 -07:00
Filipe Manana	308d9800b2	Btrfs: cache extent states in defrag code path When locking file ranges in the inode's io_tree, cache the first extent state that belongs to the target range, so that when unlocking the range we don't need to search in the io_tree again, reducing cpu time and making and therefore holding the io_tree's lock for a shorter period. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-20 17:15:27 -07:00
Josef Bacik	3bbb24b20a	Btrfs: fix deadlock with nested trans handles Zach found this deadlock that would happen like this btrfs_end_transaction <- reduce trans->use_count to 0 btrfs_run_delayed_refs btrfs_cow_block find_free_extent btrfs_start_transaction <- increase trans->use_count to 1 allocate chunk btrfs_end_transaction <- decrease trans->use_count to 0 btrfs_run_delayed_refs lock tree block we are cowing above ^^ We need to only decrease trans->use_count if it is above 1, otherwise leave it alone. This will make nested trans be the only ones who decrease their added ref, and will let us get rid of the trans->use_count++ hack if we have to commit the transaction. Thanks, cc: stable@vger.kernel.org Reported-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Tested-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-03-20 17:15:27 -07:00
Theodore Ts'o	02b9984d64	fs: push sync_filesystem() down to the file system's remount_fs() Previously, the no-op "mount -o mount /dev/xxx" operation when the file system is already mounted read-write causes an implied, unconditional syncfs(). This seems pretty stupid, and it's certainly documented or guaraunteed to do this, nor is it particularly useful, except in the case where the file system was mounted rw and is getting remounted read-only. However, it's possible that there might be some file systems that are actually depending on this behavior. In most file systems, it's probably fine to only call sync_filesystem() when transitioning from read-write to read-only, and there are some file systems where this is not needed at all (for example, for a pseudo-filesystem or something like romfs). Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: linux-fsdevel@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: Artem Bityutskiy <dedekind1@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara <jack@suse.cz> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Anders Larsen <al@alarsen.net> Cc: Phillip Lougher <phillip@squashfs.org.uk> Cc: Kees Cook <keescook@chromium.org> Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz> Cc: Petr Vandrovec <petr@vandrovec.name> Cc: xfs@oss.sgi.com Cc: linux-btrfs@vger.kernel.org Cc: linux-cifs@vger.kernel.org Cc: samba-technical@lists.samba.org Cc: codalist@coda.cs.cmu.edu Cc: linux-ext4@vger.kernel.org Cc: linux-f2fs-devel@lists.sourceforge.net Cc: fuse-devel@lists.sourceforge.net Cc: cluster-devel@redhat.com Cc: linux-mtd@lists.infradead.org Cc: jfs-discussion@lists.sourceforge.net Cc: linux-nfs@vger.kernel.org Cc: linux-nilfs@vger.kernel.org Cc: linux-ntfs-dev@lists.sourceforge.net Cc: ocfs2-devel@oss.oracle.com Cc: reiserfs-devel@vger.kernel.org	2014-03-13 10:14:33 -04:00
Miao Xie	573bfb72f7	Btrfs: fix possible empty list access when flushing the delalloc inodes We didn't have a lock to protect the access to the delalloc inodes list, that is we might access a empty delalloc inodes list if someone start flushing delalloc inodes because the delalloc inodes were moved into a other list temporarily. Fix it by wrapping the access with a lock. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:29 -04:00
Miao Xie	31f3d255c6	Btrfs: split the global ordered extents mutex When we create a snapshot, we just need wait the ordered extents in the source fs/file root, but because we use the global mutex to protect this ordered extents list of the source fs/file root to avoid accessing a empty list, if someone got the mutex to access the ordered extents list of the other fs/file root, we had to wait. This patch splits the above global mutex, now every fs/file root has its own mutex to protect its own list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:28 -04:00
Miao Xie	6c255e67ce	Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock We needn't flush all delalloc inodes when we doesn't get s_umount lock, or we would make the tasks wait for a long time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:27 -04:00
Miao Xie	24af7dd188	Btrfs: reclaim delalloc metadata more aggressively generic/074 in xfstests failed sometimes because of the enospc error, the reason of this problem is that we just reclaimed the space we need from the reserved space for delalloc, and then tried to reserve the space, but if some task did no-flush reservation between the above reclamation and reservation, Task1 Task2 shrink_delalloc() reclaim 1 block (The space that can be reserved now is 1 block) do no-flush reservation reserve 1 block (The space that can be reserved now is 0 block) reserving 1 block failed the reservation of Task1 failed, but in fact, there was enough space to reserve if we could reclaim more space before. Fix this problem by the aggressive reclamation of the reserved delalloc metadata space. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:26 -04:00
Miao Xie	0424c54897	Btrfs: remove unnecessary lock in may_commit_transaction() The reason is: - The per-cpu counter has its own lock to protect itself. - Here we needn't get a exact value. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:25 -04:00
Miao Xie	b88935bf98	Btrfs: remove the unnecessary flush when preparing the pages Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:25 -04:00
Miao Xie	41bd9ca459	Btrfs: just do dirty page flush for the inode with compression before direct IO As the comment in the btrfs_direct_IO says, only the compressed pages need be flush again to make sure they are on the disk, but the common pages needn't, so we add a if statement to check if the inode has compressed pages or not, if no, skip the flush. And in order to prevent the write ranges from intersecting, we need wait for the running ordered extents. But the current code waits for them twice, one is done before the direct IO starts (in btrfs_wait_ordered_range()), the other is before we get the blocks, it is unnecessary. because we can do the direct IO without holding i_mutex, it means that the intersected ordered extents may happen during the direct IO, the first wait can not avoid this problem. So we use filemap_fdatawrite_range() instead of btrfs_wait_ordered_range() to remove the first wait. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:24 -04:00
Miao Xie	af7a65097b	Btrfs: wake up the tasks that wait for the io earlier The tasks that wait for the IO_DONE flag just care about the io of the dirty pages, so it is better to wake up them immediately after all the pages are written, not the whole process of the io completes. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:23 -04:00
Miao Xie	8b9d83cd6b	Btrfs: fix early enospc due to the race of the two ordered extent wait btrfs_wait_ordered_roots() moves all the list entries to a new list, and then deals with them one by one. But if the other task invokes this function at that time, it would get a empty list. It makes the enospc error happens more early. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:22 -04:00
Miao Xie	8257b2dc3c	Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume If the snapshot creation happened after the nocow write but before the dirty data flush, we would fail to flush the dirty data because of no space. So we must keep track of when those nocow write operations start and when they end, if there are nocow writers, the snapshot creators must wait. In order to implement this function, I introduce btrfs_{start, end}_nocow_write(), which is similar to mnt_{want,drop}_write(). These two functions are only used for nocow file write operations. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:22 -04:00
Qu Wenruo	52483bc26f	btrfs: Add ftrace for btrfs_workqueue Add ftrace for btrfs_workqueue for further workqueue tunning. This patch needs to applied after the workqueue replace patchset. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:21 -04:00
Qu Wenruo	6db8914f97	btrfs: Cleanup the btrfs_workqueue related function type The new btrfs_workqueue still use open-coded function defition, this patch will change them into btrfs_func_t type which is much the same as kernel workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:20 -04:00
Liu Bo	2131bcd38b	Btrfs: add readahead for send_write Btrfs send reads data from disk and then writes to a stream via pipe or a file via flush. Currently we're going to read each page a time, so every page results in a disk read, which is not friendly to disks, esp. HDD. Given that, the performance can be gained by adding readahead for those pages. Here is a quick test: $ btrfs subvolume create send $ xfs_io -f -c "pwrite 0 1G" send/foobar $ btrfs subvolume snap -r send ro $ time "btrfs send ro -f /dev/null" w/o w real 1m37.527s 0m9.097s user 0m0.122s 0m0.086s sys 0m53.191s 0m12.857s Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:19 -04:00
Liu Bo	a4d96d6254	Btrfs: share the same code for __record_{new,deleted}_ref This has no functional change, only picks out the same part of two functions, and makes it shared. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:19 -04:00
Filipe Manana	fcbd2154d1	Btrfs: avoid unnecessary utimes update in incremental send When we're finishing processing of an inode, if we're dealing with a directory inode that has a pending move/rename operation, we don't need to send a utimes update instruction to the send stream, as we'll do it later after doing the move/rename operation. Therefore we save some time here building paths and doing btree lookups. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:18 -04:00
Filipe Manana	e2127cf008	Btrfs: make defrag not fragment files when using prealloc extents When using prealloc extents, a file defragment operation may actually fragment the file and increase the amount of data space used by the file. This change fixes that behaviour. Example: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt $ cd /mnt $ xfs_io -f -c 'falloc 0 1048576' foobar && sync $ xfs_io -c 'pwrite -S 0xff -b 100000 5000 100000' foobar $ xfs_io -c 'pwrite -S 0xac -b 100000 200000 100000' foobar $ xfs_io -c 'pwrite -S 0xe1 -b 100000 900000 100000' foobar && sync Before defragmenting the file: $ btrfs filesystem df /mnt Data, single: total=8.00MiB, used=1.25MiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00 $ btrfs-debug-tree /dev/sdb3 (...) item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 0 nr 4096 item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 4096 nr 102400 ram 1048576 extent compression 0 item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 106496 nr 90112 item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 196608 nr 106496 ram 1048576 extent compression 0 item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 303104 nr 593920 item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 897024 nr 106496 ram 1048576 extent compression 0 item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 1003520 nr 45056 (...) Now defragmenting the file results in more data space used than before: $ btrfs filesystem defragment -f foobar && sync $ btrfs filesystem df /mnt Data, single: total=8.00MiB, used=1.55MiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.00GiB, used=112.00KiB Metadata, single: total=8.00MiB, used=0.00 And the corresponding file extent items are now no longer perfectly sequential as before, and we're now needlessly using more space from data block groups: $ btrfs-debug-tree /dev/sdb3 (...) item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 0 nr 4096 ram 1048576 extent compression 0 item 7 key (257 EXTENT_DATA 4096) itemoff 15757 itemsize 53 extent data disk byte 13893632 nr 102400 extent data offset 0 nr 102400 ram 102400 extent compression 0 item 8 key (257 EXTENT_DATA 106496) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 106496 nr 90112 ram 1048576 extent compression 0 item 9 key (257 EXTENT_DATA 196608) itemoff 15651 itemsize 53 extent data disk byte 13996032 nr 106496 extent data offset 0 nr 106496 ram 106496 extent compression 0 item 10 key (257 EXTENT_DATA 303104) itemoff 15598 itemsize 53 prealloc data disk byte 12845056 nr 1048576 prealloc data offset 303104 nr 593920 item 11 key (257 EXTENT_DATA 897024) itemoff 15545 itemsize 53 extent data disk byte 14102528 nr 106496 extent data offset 0 nr 106496 ram 106496 extent compression 0 item 12 key (257 EXTENT_DATA 1003520) itemoff 15492 itemsize 53 extent data disk byte 12845056 nr 1048576 extent data offset 1003520 nr 45056 ram 1048576 extent compression 0 (...) With this change, the above example will no longer cause allocation of new data space nor change the sequentiality of the file extents, that is, defragment will be effectless, leaving all extent items pointing to the extent starting at disk byte 12845056. In a 20Gb filesystem I had, mounted with the autodefrag option and 20 files of 400Mb each, initially consisting of a single prealloc extent of 400Mb, having random writes happening at a low rate, lead to a total of over ~17Gb of data space used, not far from eventually reaching an ENOSPC state. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:17 -04:00
Filipe Manana	dec8ef9055	Btrfs: correctly flush data on defrag when compression is enabled When the defrag flag BTRFS_DEFRAG_RANGE_START_IO is set and compression enabled, we weren't flushing completely, as writing compressed extents is a 2 steps process, one to compress the data and another one to write the compressed data to disk. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:16 -04:00
Qu Wenruo	d458b0540e	btrfs: Cleanup the "_struct" suffix in btrfs_workequeue Since the "_struct" suffix is mainly used for distinguish the differnt btrfs_work between the original and the newly created one, there is no need using the suffix since all btrfs_workers are changed into btrfs_workqueue. Also this patch fixed some codes whose code style is changed due to the too long "_struct" suffix. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:16 -04:00
Qu Wenruo	a046e9c88b	btrfs: Cleanup the old btrfs_worker. Since all the btrfs_worker is replaced with the newly created btrfs_workqueue, the old codes can be easily remove. Signed-off-by: Quwenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:15 -04:00
Qu Wenruo	0339ef2f42	btrfs: Replace fs_info->scrub_* workqueue with btrfs_workqueue. Replace the fs_info->scrub_* with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:14 -04:00
Qu Wenruo	fc97fab0ea	btrfs: Replace fs_info->qgroup_rescan_worker workqueue with btrfs_workqueue. Replace the fs_info->qgroup_rescan_worker with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:13 -04:00
Qu Wenruo	5b3bc44e2e	btrfs: Replace fs_info->delayed_workers workqueue with btrfs_workqueue. Replace the fs_info->delayed_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:12 -04:00
Qu Wenruo	dc6e320998	btrfs: Replace fs_info->fixup_workers workqueue with btrfs_workqueue. Replace the fs_info->fixup_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:12 -04:00
Qu Wenruo	736cfa15e8	btrfs: Replace fs_info->readahead_workers workqueue with btrfs_workqueue. Replace the fs_info->readahead_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:11 -04:00
Qu Wenruo	e66f0bb144	btrfs: Replace fs_info->cache_workers workqueue with btrfs_workqueue. Replace the fs_info->cache_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:10 -04:00
Qu Wenruo	d05a33ac26	btrfs: Replace fs_info->rmw_workers workqueue with btrfs_workqueue. Replace the fs_info->rmw_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:09 -04:00
Qu Wenruo	fccb5d86d8	btrfs: Replace fs_info->endio_* workqueue with btrfs_workqueue. Replace the fs_info->endio_* workqueues with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:08 -04:00
Qu Wenruo	a44903abe9	btrfs: Replace fs_info->flush_workers with btrfs_workqueue. Replace the fs_info->submit_workers with the newly created btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:07 -04:00
Qu Wenruo	a8c93d4ef6	btrfs: Replace fs_info->submit_workers with btrfs_workqueue. Much like the fs_info->workers, replace the fs_info->submit_workers use the same btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:07 -04:00
Qu Wenruo	afe3d24267	btrfs: Replace fs_info->delalloc_workers with btrfs_workqueue Much like the fs_info->workers, replace the fs_info->delalloc_workers use the same btrfs_workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:06 -04:00
Qu Wenruo	5cdc7ad337	btrfs: Replace fs_info->workers with btrfs_workqueue. Use the newly created btrfs_workqueue_struct to replace the original fs_info->workers Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:05 -04:00
Qu Wenruo	0bd9289c28	btrfs: Add threshold workqueue based on kernel workqueue The original btrfs_workers has thresholding functions to dynamically create or destroy kthreads. Though there is no such function in kernel workqueue because the worker is not created manually, we can still use the workqueue_set_max_active to simulated the behavior, mainly to achieve a better HDD performance by setting a high threshold on submit_workers. (Sadly, no resource can be saved) So in this patch, extra workqueue pending counters are introduced to dynamically change the max active of each btrfs_workqueue_struct, hoping to restore the behavior of the original thresholding function. Also, workqueue_set_max_active use a mutex to protect workqueue_struct, which is not meant to be called too frequently, so a new interval mechanism is applied, that will only call workqueue_set_max_active after a count of work is queued. Hoping to balance both the random and sequence performance on HDD. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:04 -04:00
Qu Wenruo	1ca08976ae	btrfs: Add high priority workqueue support for btrfs_workqueue_struct Add high priority function to btrfs_workqueue. This is implemented by embedding a btrfs_workqueue into a btrfs_workqueue and use some helper functions to differ the normal priority wq and high priority wq. So the high priority wq is completely independent from the normal workqueue. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:03 -04:00
Qu Wenruo	08a9ff3264	btrfs: Added btrfs_workqueue_struct implemented ordered execution based on kernel workqueue Use kernel workqueue to implement a new btrfs_workqueue_struct, which has the ordering execution feature like the btrfs_worker. The func is executed in a concurrency way, and the ordred_func/ordered_free is executed in the sequence them are queued after the corresponding func is done. The new btrfs_workqueue works much like the original one, one workqueue for normal work and a list for ordered work. When a work is queued, ordered work will be added to the list and helper function will be queued into the workqueue. The helper function will execute a normal work and then check and execute as many ordered work as possible in the sequence they were queued. At this patch, high priority work queue or thresholding is not added yet. The high priority feature and thresholding will be added in the following patches. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:03 -04:00
Qu Wenruo	f5961d41d7	btrfs: Cleanup the unused struct async_sched. The struct async_sched is not used by any codes and can be removed. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fusionio.com> Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:02 -04:00
Liu Bo	644d1940ab	Btrfs: skip search tree for REG files It is really unnecessary to search tree again for @gen, @mode and @rdev in the case of REG inodes' creation, as we've got btrfs_inode_item in sctx, and @gen, @mode and @rdev can easily be fetched. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:01 -04:00
Miao Xie	7b2b70851f	Btrfs: fix preallocate vs double nocow write We can not release the reserved metadata space for the first write if we find the write position is pre-allocated. Because the kernel might write the data on the disk before we do the second write but after the can-nocow check, if we release the space for the first write, we might fail to update the metadata because of no space. Fix this problem by end nocow write if there is dirty data in the range whose space is pre-allocated. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:00 -04:00
Miao Xie	c933956ddf	Btrfs: fix wrong lock range and write size in check_can_nocow() The write range may not be sector-aligned, for example: \|--------\|--------\| <- write range, sector-unaligned, size: 2blocks \|--------\|--------\|--------\| <- correct lock range, size: 3blocks But according to the old code, we used the size of write range to calculate the lock range directly, not considered the offset, we would get a wrong lock range: \|--------\|--------\| <- write range, sector-unaligned, size: 2blocks \|--------\|--------\| <- wrong lock range, size: 2blocks And besides that, the old code also had the same problem when calculating the real write size. Correct them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:17:00 -04:00
David Sterba	9c9ca00bd3	btrfs: send: simplify allocation code in fs_path_ensure_buf Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:59 -04:00
David Sterba	1b2782c8ed	btrfs: send: fix old buffer length in fs_path_ensure_buf In "btrfs: send: lower memory requirements in common case" the code to save the old_buf_len was incorrectly moved to a wrong place and broke the original logic. Reported-by: Filipe David Manana <fdmanana@gmail.com> Signed-off-by: David Sterba <dsterba@suse.cz> Reviewed-by: Filipe David Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:58 -04:00
Filipe Manana	176840b3aa	Btrfs: more efficient btrfs_drop_extent_cache While droping extent map structures from the extent cache that cover our target range, we would remove each extent map structure from the red black tree and then add either 1 or 2 new extent map structures if the former extent map covered sections outside our target range. This change simply attempts to replace the existing extent map structure with a new one that covers the subsection we're not interested in, instead of doing a red black remove operation followed by an insertion operation. The number of elements in an inode's extent map tree can get very high for large files under random writes. For example, while running the following test: sysbench --test=fileio --file-num=1 --file-total-size=10G \ --file-test-mode=rndrw --num-threads=32 --file-block-size=32768 \ --max-requests=500000 --file-rw-ratio=2 [prepare\|run] I captured the following histogram capturing the number of extent_map items in the red black tree while that test was running: Count: 122462 Range: 1.000 - 172231.000; Mean: 96415.831; Median: 101855.000; Stddev: 49700.981 Percentiles: 90th: 160120.000; 95th: 166335.000; 99th: 171070.000 1.000 - 5.231: 452 \| 5.231 - 187.392: 87 \| 187.392 - 585.911: 206 \| 585.911 - 1827.438: 623 \| 1827.438 - 5695.245: 1962 # 5695.245 - 17744.861: 6204 #### 17744.861 - 55283.764: 21115 ############ 55283.764 - 172231.000: 91813 ##################################################### Benchmark: sysbench --test=fileio --file-num=1 --file-total-size=10G --file-test-mode=rndwr \ --num-threads=64 --file-block-size=32768 --max-requests=0 --max-time=60 \ --file-io-mode=sync --file-fsync-freq=0 [prepare\|run] Before this change: 122.1Mb/sec After this change: 125.07Mb/sec (averages of 5 test runs) Test machine: quad core intel i5-3570K, 32Gb of ram, SSD Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:57 -04:00
Filipe Manana	f2071b2155	Btrfs: more efficient split extent state insertion When we split an extent state there's no need to start the rbtree search from the root node - we can start it from the original extent state node, since we would end up in its subtree if we do the search starting at the root node anyway. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:57 -04:00
Filipe Manana	cbc0e9287d	Btrfs: remove unneeded field / smaller extent_map structure We don't need to have an unsigned int field in the extent_map struct to tell us whether the extent map is in the inode's extent_map tree or not. We can use the rb_node struct field and the RB_CLEAR_NODE and RB_EMPTY_NODE macros to achieve the same task. This reduces sizeof(struct extent_map) from 152 bytes to 144 bytes (on a 64 bits system). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:56 -04:00
Wang Shilong	e84752d434	Btrfs: skip locking when searching commit root We won't change commit root, skip locking dance with commit root when walking backrefs, this can speed up btrfs send operations. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:55 -04:00
Wang Shilong	32a447896c	Btrfs: wake up @scrub_pause_wait as much as we can check if @scrubs_running=@scrubs_paused condition inside wait_event() is not an atomic operation which means we may inc/dec @scrub_running/ paused at any time. Let's wake up @scrub_pause_wait as much as we can to let commit transaction blocked less. An example below: Thread1 Thread2 \|->scrub_blocked_if_needed() \|->scrub_pending_trans_workers_inc \|->increase @scrub_paused \|->increase @scrub_running \|->wake up scrub_pause_wait list \|->scrub blocked \|->increase @scrub_paused Thread3 is commiting transaction which is blocked at btrfs_scrub_pause(). So after Thread2 increase @scrub_paused, we meet the condition @scrub_paused=@scrub_running, but transaction will be still blocked until another calling to wake up @scrub_pause_wait. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:54 -04:00
Wang Shilong	c0af8f0b1c	Btrfs: cancel scrub on transaction abortion If we fail to commit transaction, we'd better cancel scrub operations. Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:54 -04:00
Wang Shilong	12cf93728d	Btrfs: device_replace: fix deadlock for nocow case commit `cb7ab02156` cause a following deadlock found by xfstests,btrfs/011: Thread1 is commiting transaction which is blocked at btrfs_scrub_pause(). Thread2 is calling btrfs_file_aio_write() which has held inode's @i_mutex and commit transaction(blocked because Thread1 is committing transaction). Thread3 is copy_nocow_page worker which will also try to hold inode @i_mutex, so thread3 will wait Thread1 finished. Thread4 is waiting pending workers finished which will wait Thread3 finished. So the problem is like this: Thread1--->Thread4--->Thread3--->Thread2---->Thread1 Deadlock happens! we fix it by letting Thread1 go firstly, which means we won't block transaction commit while we are waiting pending workers finished. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:53 -04:00
Wang Shilong	6cf7f77e6b	Btrfs: fix a possible deadlock between scrub and transaction committing btrfs_scrub_continue() will be called when cleaning up transaction.However, this can only be called if btrfs_scrub_pause() is called before. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:52 -04:00
Sachin Kamat	886322e8e7	btrfs: Use PTR_ERR_OR_ZERO PTR_RET is deprecated. Use PTR_ERR_OR_ZERO instead. While at it also include missing err.h header. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:51 -04:00
Filipe Manana	bf0d1f441d	Btrfs: fix send issuing outdated paths for utimes, chown and chmod When doing an incremental send, if we had a directory pending a move/rename operation and none of its parents, except for the immediate parent, were pending a move/rename, after processing the directory's references, we would be issuing utimes, chown and chmod intructions against am outdated path - a path which matched the one in the parent root. This change also simplifies a bit the code that deals with building a path for a directory which has a move/rename operation delayed. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d/e $ mkdir /mnt/btrfs/a/b/c/f $ chmod 0777 /mnt/btrfs/a/b/c/d/e $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/c/f /mnt/btrfs/a/b/f2 $ mv /mnt/btrfs/a/b/c/d/e /mnt/btrfs/a/b/f2/e2 $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2 $ mv /mnt/btrfs/a/b/c2/d /mnt/btrfs/a/b/c2/d2 $ chmod 0700 /mnt/btrfs/a/b/f2/e2 $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: chmod a/b/c/d/e failed. No such file or directory A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:51 -04:00
Filipe Manana	6baa4293af	Btrfs: correctly determine if blocks are shared in btrfs_compare_trees Just comparing the pointers (logical disk addresses) of the btree nodes is not completely bullet proof, we have to check if their generation numbers match too. It is guaranteed that a COW operation will result in a block with a different logical disk address than the original block's address, but over time we can reuse that former logical disk address. For example, creating a 2Gb filesystem on a loop device, and having a script running in a loop always updating the access timestamp of a file, resulted in the same logical disk address being reused for the same fs btree block in about only 4 minutes. This could make us skip entire subtrees when doing an incremental send (which is currently the only user of btrfs_compare_trees). However the odds of getting 2 blocks at the same tree level, with the same logical disk address, equal first slot keys and different generations, should hopefully be very low. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:50 -04:00
Filipe Manana	9dc442143b	Btrfs: fix send attempting to rmdir non-empty directories The incremental send algorithm assumed that it was possible to issue a directory remove (rmdir) if the the inode number it was currently processing was greater than (or equal) to any inode that referenced the directory's inode. This wasn't a valid assumption because any such inode might be a child directory that is pending a move/rename operation, because it was moved into a directory that has a higher inode number and was moved/renamed too - in other words, the case the following commit addressed: `9f03740a95` (Btrfs: fix infinite path build loops in incremental send) This made an incremental send issue an rmdir operation before the target directory was actually empty, which made btrfs receive fail. Therefore it needs to wait for all pending child directory inodes to be moved/renamed before sending an rmdir operation. Simple steps to reproduce this issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/x $ mkdir /mnt/btrfs/a/b/y $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/y /mnt/btrfs/a/b/YY $ mv /mnt/btrfs/a/b/c/x /mnt/btrfs/a/b/YY $ rmdir /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: rmdir o259-6-0 failed. Directory not empty A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:49 -04:00
Filipe Manana	29d6d30f5c	Btrfs: send, don't send rmdir for same target multiple times When doing an incremental send, if we delete a directory that has N > 1 hardlinks for the same file and that file has the highest inode number inside the directory contents, an incremental send would send N times an rmdir operation against the directory. This made the btrfs receive command fail on the second rmdir instruction, as the target directory didn't exist anymore. Steps to reproduce the issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c $ echo 'ola mundo' > /mnt/btrfs/a/b/c/foo.txt $ ln /mnt/btrfs/a/b/c/foo.txt /mnt/btrfs/a/b/c/bar.txt $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ rm -f /mnt/btrfs/a/b/c/foo.txt $ rm -f /mnt/btrfs/a/b/c/bar.txt $ rmdir /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umount /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: ERROR: rmdir o259-6-0 failed. No such file or directory A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:48 -04:00
Filipe Manana	2b863a135f	Btrfs: incremental send, fix invalid path after dir rename This fixes yet one more case not caught by the commit titled: Btrfs: fix infinite path build loops in incremental send In this case, even before the initial full send, we have a directory which is a child of a directory with a higher inode number. Then we perform the initial send, and after we rename both the child and the parent, without moving them around. After doing these 2 renames, an incremental send sent a rename instruction for the child directory which contained an invalid "from" path (referenced the parent's old name, not the new one), which made the btrfs receive command fail. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b $ mkdir /mnt/btrfs/d $ mkdir /mnt/btrfs/a/b/c $ mv /mnt/btrfs/d /mnt/btrfs/a/b/c $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ btrfs send /mnt/btrfs/snap1 -f /tmp/base.send $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/x $ mv /mnt/btrfs/a/b/x/d /mnt/btrfs/a/b/x/y $ btrfs subvolume snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 -f /tmp/incremental.send $ umout /mnt/btrfs $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ btrfs receive /mnt/btrfs -f /tmp/base.send $ btrfs receive /mnt/btrfs -f /tmp/incremental.send The second btrfs receive command failed with: "ERROR: rename a/b/c/d -> a/b/x/y failed. No such file or directory" A test case for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:48 -04:00
Filipe Manana	12870f1c9b	Btrfs: don't insert useless holes when punching beyond the inode's size If we punch beyond the size of an inode, we'll correctly remove any prealloc extents, but we'll also insert file extent items representing holes (disk bytenr == 0) that start with a key offset that lies beyond the inode's size and are not contiguous with the last file extent item. Example: $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo $XFS_IO_PROG -c "fpunch 582007 864596" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo btrfs-debug-tree output: item 4 key (257 INODE_ITEM 0) itemoff 15885 itemsize 160 inode generation 6 transid 6 size 132254 block group 0 mode 100600 links 1 item 5 key (257 INODE_REF 256) itemoff 15872 itemsize 13 inode ref index 2 namelen 3 name: foo item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 90112 ram 122880 extent compression 0 item 7 key (257 EXTENT_DATA 90112) itemoff 15766 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 45056 ram 45056 extent compression 2 item 8 key (257 EXTENT_DATA 585728) itemoff 15713 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 860160 ram 860160 extent compression 0 The last extent item, which represents a hole, is useless as it lies beyond the inode's size. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:47 -04:00
Filipe Manana	85fdfdf611	Btrfs: cleanup delayed-ref.c:find_ref_head() The argument last wasn't used, all callers supplied a NULL value for it. Also removed unnecessary intermediate storage of the result of key comparisons. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:46 -04:00
Filipe Manana	6103fb43fb	Btrfs: remove unnecessary ref heads rb tree search When we didn't find the exact ref head we were looking for, if return_bigger != 0 we set a new search key to match either the next node after the last one we found or the first one in the ref heads rb tree, and then did another full tree search. For both cases this ended up being pointless as we would end up returning an entry we already had before repeating the search. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:46 -04:00
Justin Maggard	2c6a92b009	btrfs: wake up transaction thread upon remount Now that we can adjust the commit interval with a remount, we need to wake up the transaction thread or else he will continue to sleep until the previous transaction interval has elapsed before waking up. So, if we go from a large commit interval to something smaller, the transaction thread will not wake up until the large interval has expired. This also causes the cleaner thread to stay sleeping, since it gets woken up by the transaction thread. Fix it by simply waking up the transaction thread during a remount. Signed-off-by: Justin Maggard <jmaggard10@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:45 -04:00
Miao Xie	50471a388c	Btrfs: stop joining the log transaction if sync log fails If the log sync fails, there is something wrong in the log tree, we should not continue to join the log transaction and log the metadata. What we should do is to do a full commit. This patch fixes this problem by setting ->last_trans_log_full_commit to the current transaction id, it will tell the tasks not to join the log transaction, and do a full commit. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:44 -04:00
Miao Xie	d1433debe7	Btrfs: just wait or commit our own log sub-transaction We might commit the log sub-transaction which didn't contain the metadata we logged. It was because we didn't record the log transid and just select the current log sub-transaction to commit, but the right one might be committed by the other task already. Actually, we needn't do anything and it is safe that we go back directly in this case. This patch improves the log sync by the above idea. We record the transid of the log sub-transaction in which we log the metadata, and the transid of the log sub-transaction we have committed. If the committed transid is >= the transid we record when logging the metadata, we just go back. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:43 -04:00
Miao Xie	8b050d350c	Btrfs: fix skipped error handle when log sync failed It is possible that many tasks sync the log tree at the same time, but only one task can do the sync work, the others will wait for it. But those wait tasks didn't get the result of the log sync, and returned 0 when they ended the wait. It caused those tasks skipped the error handle, and the serious problem was they told the users the file sync succeeded but in fact they failed. This patch fixes this problem by introducing a log context structure, we insert it into the a global list. When the sync fails, we will set the error number of every log context in the list, then the waiting tasks get the error number of the log context and handle the error if need. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:43 -04:00
Miao Xie	bb14a59b61	Btrfs: use signed integer instead of unsigned long integer for log transid The log trans id is initialized to be 0 every time we create a log tree, and the log tree need be re-created after a new transaction is started, it means the log trans id is unlikely to be a huge number, so we can use signed integer instead of unsigned long integer to save a bit space. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:42 -04:00
Miao Xie	7483e1a446	Btrfs: remove unnecessary memory barrier in btrfs_sync_log() Mutex unlock implies certain memory barriers to make sure all the memory operation completes before the unlock, and the next mutex lock implies memory barriers to make sure the all the memory happens after the lock. So it is a full memory barrier(smp_mb), we needn't add memory barriers. Remove them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:41 -04:00
Miao Xie	e87ac13687	Btrfs: don't start the log transaction if the log tree init fails The old code would start the log transaction even the log tree init failed, it was unnecessary. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:40 -04:00
Miao Xie	48cab2e071	Btrfs: fix the skipped transaction commit during the file sync We may abort the wait earlier if ->last_trans_log_full_commit was set to the current transaction id, at this case, we need commit the current transaction instead of the log sub-transaction. But the current code didn't tell the caller to do it (return 0, not -EAGAIN). Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:40 -04:00
Miao Xie	5c902ba622	Btrfs: use ACCESS_ONCE to prevent the optimize accesses to ->last_trans_log_full_commit ->last_trans_log_full_commit may be changed by the other tasks without lock, so we need prevent the compiler from the optimize access just like tmp = fs_info->last_trans_log_full_commit if (tmp == ...) ... <do something> if (tmp == ...) ... In fact, we need get the new value of ->last_trans_log_full_commit during the second access. Fix it by ACCESS_ONCE(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:39 -04:00
Liu Bo	7813b3db0a	Btrfs: avoid warning bomb of btrfs_invalidate_inodes So after transaction is aborted, we need to cleanup inode resources by calling btrfs_invalidate_inodes(), and btrfs_invalidate_inodes() hopes roots' refs to be zero in old times and sets a WARN_ON(), however, this is not always true within cleaning up transaction, so we get to detect transaction abortion and not warn at all. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:38 -04:00
Liu Bo	2a85d9cac1	Btrfs: fix possible deadlock in btrfs_cleanup_transaction [13654.480669] ====================================================== [13654.480905] [ INFO: possible circular locking dependency detected ] [13654.481003] 3.12.0+ #4 Tainted: G W O [13654.481060] ------------------------------------------------------- [13654.481060] btrfs-transacti/9347 is trying to acquire lock: [13654.481060] (&(&root->ordered_extent_lock)->rlock){+.+...}, at: [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs] [13654.481060] but task is already holding lock: [13654.481060] (&(&fs_info->ordered_root_lock)->rlock){+.+...}, at: [<ffffffffa02d3015>] btrfs_cleanup_transaction+0x1e5/0x570 [btrfs] [13654.481060] which lock already depends on the new lock. [13654.481060] the existing dependency chain (in reverse order) is: [13654.481060] -> #1 (&(&fs_info->ordered_root_lock)->rlock){+.+...}: [13654.481060] [<ffffffff810c4103>] lock_acquire+0x93/0x130 [13654.481060] [<ffffffff81689991>] _raw_spin_lock+0x41/0x50 [13654.481060] [<ffffffffa02f011b>] __btrfs_add_ordered_extent+0x39b/0x450 [btrfs] [13654.481060] [<ffffffffa02f0202>] btrfs_add_ordered_extent+0x32/0x40 [btrfs] [13654.481060] [<ffffffffa02df6aa>] run_delalloc_nocow+0x78a/0x9d0 [btrfs] [13654.481060] [<ffffffffa02dfc0d>] run_delalloc_range+0x31d/0x390 [btrfs] [13654.481060] [<ffffffffa02f7c00>] __extent_writepage+0x310/0x780 [btrfs] [13654.481060] [<ffffffffa02f830a>] extent_write_cache_pages.isra.29.constprop.48+0x29a/0x410 [btrfs] [13654.481060] [<ffffffffa02f879d>] extent_writepages+0x4d/0x70 [btrfs] [13654.481060] [<ffffffffa02d9f68>] btrfs_writepages+0x28/0x30 [btrfs] [13654.481060] [<ffffffff8114be91>] do_writepages+0x21/0x50 [13654.481060] [<ffffffff81140d49>] __filemap_fdatawrite_range+0x59/0x60 [13654.481060] [<ffffffff81140e13>] filemap_fdatawrite_range+0x13/0x20 [13654.481060] [<ffffffffa02f1db9>] btrfs_wait_ordered_range+0x49/0x140 [btrfs] [13654.481060] [<ffffffffa0318fe2>] __btrfs_write_out_cache+0x682/0x8b0 [btrfs] [13654.481060] [<ffffffffa031952d>] btrfs_write_out_cache+0x8d/0xe0 [btrfs] [13654.481060] [<ffffffffa02c7083>] btrfs_write_dirty_block_groups+0x593/0x680 [btrfs] [13654.481060] [<ffffffffa0345307>] commit_cowonly_roots+0x14b/0x20d [btrfs] [13654.481060] [<ffffffffa02d7c1a>] btrfs_commit_transaction+0x43a/0x9d0 [btrfs] [13654.481060] [<ffffffffa030061a>] btrfs_create_uuid_tree+0x5a/0x100 [btrfs] [13654.481060] [<ffffffffa02d5a8a>] open_ctree+0x21da/0x2210 [btrfs] [13654.481060] [<ffffffffa02ab6fe>] btrfs_mount+0x68e/0x870 [btrfs] [13654.481060] [<ffffffff811b2409>] mount_fs+0x39/0x1b0 [13654.481060] [<ffffffff811cd653>] vfs_kern_mount+0x63/0xf0 [13654.481060] [<ffffffff811cfcce>] do_mount+0x23e/0xa90 [13654.481060] [<ffffffff811d05a3>] SyS_mount+0x83/0xc0 [13654.481060] [<ffffffff81692b52>] system_call_fastpath+0x16/0x1b [13654.481060] -> #0 (&(&root->ordered_extent_lock)->rlock){+.+...}: [13654.481060] [<ffffffff810c340a>] __lock_acquire+0x150a/0x1a70 [13654.481060] [<ffffffff810c4103>] lock_acquire+0x93/0x130 [13654.481060] [<ffffffff81689991>] _raw_spin_lock+0x41/0x50 [13654.481060] [<ffffffffa02d30a1>] btrfs_cleanup_transaction+0x271/0x570 [btrfs] [13654.481060] [<ffffffffa02d35ce>] transaction_kthread+0x22e/0x270 [btrfs] [13654.481060] [<ffffffff81079efa>] kthread+0xea/0xf0 [13654.481060] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0 [13654.481060] other info that might help us debug this: [13654.481060] Possible unsafe locking scenario: [13654.481060] CPU0 CPU1 [13654.481060] ---- ---- [13654.481060] lock(&(&fs_info->ordered_root_lock)->rlock); [13654.481060] lock(&(&root->ordered_extent_lock)->rlock); [13654.481060] lock(&(&fs_info->ordered_root_lock)->rlock); [13654.481060] lock(&(&root->ordered_extent_lock)->rlock); [13654.481060] * DEADLOCK * [...] ====================================================== btrfs_destroy_all_ordered_extents() gets &fs_info->ordered_root_lock __BEFORE__ acquiring &root->ordered_extent_lock, while btrfs_[add,remove]_ordered_extent() acquires &fs_info->ordered_root_lock __AFTER__ getting &root->ordered_extent_lock. This patch fixes the above problem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:37 -04:00
Filipe David Borba Manana	d5f375270a	Btrfs: faster/more efficient insertion of file extent items This is an extension to my previous commit titled: "Btrfs: faster file extent item replace operations" (hash `1acae57b16`) Instead of inserting the new file extent item if we deleted existing file extent items covering our target file range, also allow to insert the new file extent item if we didn't find any existing items to delete and replace_extent != 0, since in this case our caller would do another tree search to insert the new file extent item anyway, therefore just combine the two tree searches into a single one, saving cpu time, reducing lock contention and reducing btree node/leaf COW operations. This covers the case where applications keep doing tail append writes to files, which for example is the case of Apache CouchDB (its database and view index files are always open with O_APPEND). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:37 -04:00
Stanislaw Gruszka	51b98effa4	btrfs: always choose work from prio_head first In case we do not refill, we can overwrite cur pointer from prio_head by one from not prioritized head, what looks as something that was not intended. This change make we always take works from prio_head first until it's not empty. Signed-off-by: Stanislaw Gruszka <stf_xl@wp.pl> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:36 -04:00
Wang Shilong	dcfd5ad2fc	Revert "Btrfs: remove transaction from btrfs send" This reverts commit `41ce9970a8`. Previously i was thinking we can use readonly root's commit root safely while it is not true, readonly root may be cowed with the following cases. 1.snapshot send root will cow source root. 2.balance,device operations will also cow readonly send root to relocate. So i have two ideas to make us safe to use commit root. -->approach 1: make it protected by transaction and end transaction properly and we research next item from root node(see btrfs_search_slot_for_read()). -->approach 2: add another counter to local root structure to sync snapshot with send. and add a global counter to sync send with exclusive device operations. So with approach 2, send can use commit root safely, because we make sure send root can not be cowed during send. Unfortunately, it make codes ugly and more complex to maintain. To make snapshot and send exclusively, device operations and send operation exclusively with each other is a little confusing for common users. So why not drop into previous way. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:35 -04:00
Wang Shilong	bcbba5e659	Btrfs: skip readonly root for snapshot-aware defragment Btrfs send is assuming readonly root won't change, let's skip readonly root. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:16:34 -04:00
Wang Shilong	850a8cdffe	Btrfs: switch to btrfs_previous_extent_item() Since we have introduced btrfs_previous_extent_item() to search previous extent item, just switch into it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:54 -04:00
Hidetoshi Seto	f88ba6a2a4	Btrfs: skip submitting barrier for missing device I got an error on v3.13: BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.) how to reproduce: > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2 > wipefs -a /dev/sdf2 > mount -o degraded /dev/sdf1 /mnt > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt The reason of the error is that barrier_all_devices() failed to submit barrier to the missing device. However it is clear that we cannot do anything on missing device, and also it is not necessary to care chunks on the missing device. This patch stops sending/waiting barrier if device is missing. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:53 -04:00
Josef Bacik	29bce2f399	Btrfs: unlock extent and pages on error in cow_file_range When I converted the BUG_ON() for the free_space_cache_inode in cow_file_range I made it so we just return an error instead of unlocking all of our various stuff. This is a mistake and causes us to hang when we run into this. This patch fixes this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:53 -04:00
Josef Bacik	c581afc8db	Btrfs: balance delayed inode updates While trying to reproduce a delayed ref problem I noticed the box kept falling over using all 80gb of my ram with btrfs_inode's and btrfs_delayed_node's. Turns out this is because we only throttle delayed inode updates in btrfs_dirty_inode, which doesn't actually get called that often, especially when all you are doing is creating a bunch of files. So balance delayed inode updates everytime we create a new inode. With this patch we no longer use up all of our ram with delayed inode updates. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:52 -04:00
David Sterba	1bae30982b	btrfs: add simple debugfs interface Help during debugging to export various interesting infromation and tunables without the need of extra mount options or ioctls. Usage: * declare your variable in sysfs.h, and include where you need it * define the variable in sysfs.c and make it visible via debugfs_create_TYPE Depends on CONFIG_DEBUG_FS. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:51 -04:00
David Sterba	ace0105076	btrfs: send: lower memory requirements in common case The fs_path structure uses an inline buffer and falls back to a chain of allocations, but vmalloc is not necessary because PATH_MAX fits into PAGE_SIZE. The size of fs_path has been reduced to 256 bytes from PAGE_SIZE, usually 4k. Experimental measurements show that most paths on a single filesystem do not exceed 200 bytes, and these get stored into the inline buffer directly, which is now 230 bytes. Longer paths are kmalloced when needed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:50 -04:00
Filipe David Borba Manana	dff6d0adbe	Btrfs: make some tree searches in send.c more efficient We have this pattern where we do search for a contiguous group of items in a tree and everytime we find an item, we process it, then we release our path, increment the offset of the search key, do another full tree search and repeat these steps until a tree search can't find more items we're interested in. Instead of doing these full tree searches after processing each item, just process the next item/slot in our leaf and don't release the path. Since all these trees are read only and we always use the commit root for a search and skip node/leaf locks, we're not affecting concurrency on the trees. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:49 -04:00
Filipe David Borba Manana	a0859c0998	Btrfs: use right extent item position in send when finding extent clones This was a leftover from the commit: `74dd17fbe3` (Btrfs: fix btrfs send for inline items and compression) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:48 -04:00
David Sterba	57fb8910c2	btrfs: send: remove BUG_ON from name_cache_delete If cleaning the name cache fails, we could try to proceed at the cost of some memory leak. This is not expected to happen often. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:48 -04:00
David Sterba	4d1a63b21b	btrfs: send: remove BUG from process_all_refs There are only 2 static callers, the BUG would normally be never reached, but let's be nice. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:47 -04:00
David Sterba	1f5a7ff999	btrfs: send: squeeze bitfilelds in fs_path We know that buf_len is at most PATH_MAX, 4k, and can merge it with the reversed member. This saves 3 bytes in favor of inline_buf. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:46 -04:00
David Sterba	e25a812206	btrfs: send: remove virtual_mem member from fs_path We don't need to keep track of that, it's available via is_vmalloc_addr. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:45 -04:00
David Sterba	b23ab57d48	btrfs: send: remove prepared member from fs_path The member is used only to return value back from fs_path_prepare_for_add, we can do it locally and save 8 bytes for the inline_buf path. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:44 -04:00
David Sterba	64792f2535	btrfs: send: replace check with an assert in gen_unique_name The buffer passed to snprintf can hold the fully expanded format string, 64 = 3x largest ULL + 3x char + trailing null. I don't think that removing the check entirely is a good idea, hence the ASSERT. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:44 -04:00
Filipe David Borba Manana	5ed7f9ff15	Btrfs: more send support for parent/child dir relationship inversion The commit titled "Btrfs: fix infinite path build loops in incremental send" didn't cover a particular case where the parent-child relationship inversion of directories doesn't imply a rename of the new parent directory. This was due to a simple logic mistake, a logical and instead of a logical or. Steps to reproduce: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/bar1/bar2/bar3/bar4 /mnt/btrfs/a/b/k44 $ mv /mnt/btrfs/a/b/bar1/bar2/bar3 /mnt/btrfs/a/b/k44 $ mv /mnt/btrfs/a/b/bar1/bar2 /mnt/btrfs/a/b/k44/bar3 $ mv /mnt/btrfs/a/b/bar1 /mnt/btrfs/a/b/k44/bar3/bar2/k11 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send A patch to update the test btrfs/030 from xfstests, so that it covers this case, will be submitted soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:43 -04:00
Filipe David Borba Manana	03cb4fb9d8	Btrfs: fix send dealing with file renames and directory moves This fixes a case that the commit titled: Btrfs: fix infinite path build loops in incremental send didn't cover. If the parent-child relationship between 2 directories is inverted, both get renamed, and the former parent has a file that got renamed too (but remains a child of that directory), the incremental send operation would use the file's old path after sending an unlink operation for that old path, causing receive to fail on future operations like changing owner, permissions or utimes of the corresponding inode. This is not a regression from the commit mentioned before, as without that commit we would fall into the issues that commit fixed, so it's just one case that wasn't covered before. Simple steps to reproduce this issue are: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d $ touch /mnt/btrfs/a/b/c/d/file $ mkdir -p /mnt/btrfs/a/b/x $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/x /mnt/btrfs/a/b/c/x2 $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c/x2/d2 $ mv /mnt/btrfs/a/b/c/x2/d2/file /mnt/btrfs/a/b/c/x2/d2/file2 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send A patch to update the test btrfs/030 from xfstests, so that it covers this case, will be submitted soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:42 -04:00
Wang Shilong	98cfee2143	Btrfs: only add roots if necessary in find_parent_nodes() find_all_leafs() dosen't need add all roots actually, add roots only if we need, this can avoid unnecessary ulist dance. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:41 -04:00
Hugo Mills	abccd00f8a	btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit and 64-bit systems. This means that it is impossible to use btrfs receive on a system with a 64-bit kernel and 32-bit userspace, because the structure size (and hence the ioctl number) is different. This patch adds a compatibility structure and ioctl to deal with the above case. Signed-off-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:40 -04:00
Filipe David Borba Manana	d86477b303	Btrfs: add missing error check in incremental send Function wait_for_parent_move() returns negative value if an error happened, 0 if we don't need to wait for the parent's move, and 1 if the wait is needed. Before this change an error return value was being treated like the return value 1, which was not correct. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:40 -04:00
Miao Xie	c404e0dc2c	Btrfs: fix use-after-free in the finishing procedure of the device replace During device replace test, we hit a null pointer deference (It was very easy to reproduce it by running xfstests' btrfs/011 on the devices with the virtio scsi driver). There were two bugs that caused this problem: - We might allocate new chunks on the replaced device after we updated the mapping tree. And we forgot to replace the source device in those mapping of the new chunks. - We might get the mapping information which including the source device before the mapping information update. And then submit the bio which was based on that mapping information after we freed the source device. For the first bug, we can fix it by doing mapping tree update and source device remove in the same context of the chunk mutex. The chunk mutex is used to protect the allocable device list, the above method can avoid the new chunk allocation, and after we remove the source device, all the new chunks will be allocated on the new device. So it can fix the first bug. For the second bug, we need make sure all flighting bios are finished and no new bios are produced during we are removing the source device. To fix this problem, we introduced a global @bio_counter, we not only inc/dec @bio_counter outsize of map_blocks, but also inc it before submitting bio and dec @bio_counter when ending bios. Since Raid56 is a little different and device replace dosen't support raid56 yet, it is not addressed in the patch and I add comments to make sure we will fix it in the future. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:39 -04:00
Miao Xie	391cd9df81	Btrfs: fix unprotected alloc list insertion during the finishing procedure of replace the alloc list of the filesystem is protected by ->chunk_mutex, we need get that mutex when we insert the new device into the list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:38 -04:00
Kusanagi Kouichi	23ad5b17dc	btrfs: Return EXDEV for cross file system snapshot EXDEV seems an appropriate error if an operation fails bacause it crosses file system boundaries. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:37 -04:00
Miao Xie	827463c49f	Btrfs: don't mix the ordered extents of all files together during logging the inodes There was a problem in the old code: If we failed to log the csum, we would free all the ordered extents in the log list including those ordered extents that were logged successfully, it would make the log committer not to wait for the completion of the ordered extents. This patch doesn't insert the ordered extents that is about to be logged into a global list, instead, we insert them into a local list. If we log the ordered extents successfully, we splice them with the global list, or we will throw them away, then do full sync. It can also reduce the lock contention and the traverse time of list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com>	2014-03-10 15:15:36 -04:00
Linus Torvalds	3962dfbe22	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We have a small collection of fixes in my for-linus branch. The big thing that stands out is a revert of a new ioctl. Users haven't shipped yet in btrfs-progs, and Dave Sterba found a better way to export the information" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use right clone root offset for compressed extents btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105 Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol Btrfs: fix max_inline mount option Btrfs: fix a lockdep warning when cleaning up aborted transaction Revert "btrfs: add ioctl to export size of global metadata reservation"	2014-02-16 11:05:27 -08:00
Filipe David Borba Manana	93de4ba864	Btrfs: use right clone root offset for compressed extents For non compressed extents, iterate_extent_inodes() gives us offsets that take into account the data offset from the file extent items, while for compressed extents it doesn't. Therefore we have to adjust them before placing them in a send clone instruction. Not doing this adjustment leads to the receiving end requesting for a wrong a file range to the clone ioctl, which results in different file content from the one in the original send root. Issue reproducible with the following excerpt from the test I made for xfstests: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "truncate 118811" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0x0d -b 39987 92267 39987" $SCRATCH_MNT/foo $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 $XFS_IO_PROG -c "pwrite -S 0x3e -b 80000 200000 80000" $SCRATCH_MNT/foo $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT $XFS_IO_PROG -c "pwrite -S 0xdc -b 10000 250000 10000" $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0xff -b 10000 300000 10000" $SCRATCH_MNT/foo # will be used for incremental send to be able to issue clone operations $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/clones_snap $BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 $FSSUM_PROG -A -f -w $tmp/1.fssum $SCRATCH_MNT/mysnap1 $FSSUM_PROG -A -f -w $tmp/2.fssum -x $SCRATCH_MNT/mysnap2/mysnap1 \ -x $SCRATCH_MNT/mysnap2/clones_snap $SCRATCH_MNT/mysnap2 $FSSUM_PROG -A -f -w $tmp/clones.fssum $SCRATCH_MNT/clones_snap \ -x $SCRATCH_MNT/clones_snap/mysnap1 -x $SCRATCH_MNT/clones_snap/mysnap2 $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap $BTRFS_UTIL_PROG send $SCRATCH_MNT/clones_snap -f $tmp/clones.snap $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 \ -c $SCRATCH_MNT/clones_snap $SCRATCH_MNT/mysnap2 -f $tmp/2.snap _scratch_unmount _scratch_mkfs _scratch_mount $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap $FSSUM_PROG -r $tmp/1.fssum $SCRATCH_MNT/mysnap1 2>> $seqres.full $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/clones.snap $FSSUM_PROG -r $tmp/clones.fssum $SCRATCH_MNT/clones_snap 2>> $seqres.full $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap $FSSUM_PROG -r $tmp/2.fssum $SCRATCH_MNT/mysnap2 2>> $seqres.full Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-15 08:04:27 -08:00
Anand Jain	f085381e6d	btrfs: fix null pointer deference at btrfs_sysfs_add_one+0x105 bdev is null when disk has disappeared and mounted with the degrade option stack trace --------- btrfs_sysfs_add_one+0x105/0x1c0 [btrfs] open_ctree+0x15f3/0x1fe0 [btrfs] btrfs_mount+0x5db/0x790 [btrfs] ? alloc_pages_current+0xa4/0x160 mount_fs+0x34/0x1b0 vfs_kern_mount+0x62/0xf0 do_mount+0x22e/0xa80 ? __get_free_pages+0x9/0x40 ? copy_mount_options+0x31/0x170 SyS_mount+0x7e/0xc0 system_call_fastpath+0x16/0x1b --------- reproducer: ------- mkfs.btrfs -draid1 -mraid1 /dev/sdc /dev/sdd (detach a disk) devmgt detach /dev/sdc [1] mount -o degrade /dev/sdd /btrfs ------- [1] github.com/anajain/devmgt.git Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Tested-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-15 08:03:09 -08:00
Josef Bacik	3a0dfa6a12	Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol A user was running into errors from an NFS export of a subvolume that had a default subvol set. When we mount a default subvol we will use d_obtain_alias() to find an existing dentry for the subvolume in the case that the root subvol has already been mounted, or a dummy one is allocated in the case that the root subvol has not already been mounted. This allows us to connect the dentry later on if we wander into the path. However if we don't ever wander into the path we will keep DCACHE_DISCONNECTED set for a long time, which angers NFS. It doesn't appear to cause any problems but it is annoying nonetheless, so simply unset DCACHE_DISCONNECTED in the get_default_root case and switch btrfs_lookup() to use d_materialise_unique() instead which will make everything play nicely together and reconnect stuff if we wander into the defaul subvol path from a different way. With this patch I'm no longer getting the NFS errors when exporting a volume that has been mounted with a default subvol set. Thanks, cc: bfields@fieldses.org cc: ebiederm@xmission.com Signed-off-by: Josef Bacik <jbacik@fb.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-14 13:44:32 -08:00
Mitch Harder	feb5f96589	Btrfs: fix max_inline mount option Currently, the only mount option for max_inline that has any effect is max_inline=0. Any other value that is supplied to max_inline will be adjusted to a minimum of 4k. Since max_inline has an effective maximum of ~3900 bytes due to page size limitations, the current behaviour only has meaning for max_inline=0. This patch will allow the the max_inline mount option to accept non-zero values as indicated in the documentation. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-14 13:44:32 -08:00
Liu Bo	a9d2d4adb6	Btrfs: fix a lockdep warning when cleaning up aborted transaction Given now we have 2 spinlock for management of delayed refs, CONFIG_DEBUG_SPINLOCK=y helped me find this, [ 4723.413809] BUG: spinlock wrong CPU on CPU#1, btrfs-transacti/2258 [ 4723.414882] lock: 0xffff880048377670, .magic: dead4ead, .owner: btrfs-transacti/2258, .owner_cpu: 2 [ 4723.417146] CPU: 1 PID: 2258 Comm: btrfs-transacti Tainted: G W O 3.12.0+ #4 [ 4723.421321] Call Trace: [ 4723.421872] [<ffffffff81680fe7>] dump_stack+0x54/0x74 [ 4723.422753] [<ffffffff81681093>] spin_dump+0x8c/0x91 [ 4723.424979] [<ffffffff816810b9>] spin_bug+0x21/0x26 [ 4723.425846] [<ffffffff81323956>] do_raw_spin_unlock+0x66/0x90 [ 4723.434424] [<ffffffff81689bf7>] _raw_spin_unlock+0x27/0x40 [ 4723.438747] [<ffffffffa015da9e>] btrfs_cleanup_one_transaction+0x35e/0x710 [btrfs] [ 4723.443321] [<ffffffffa015df54>] btrfs_cleanup_transaction+0x104/0x570 [btrfs] [ 4723.444692] [<ffffffff810c1b5d>] ? trace_hardirqs_on_caller+0xfd/0x1c0 [ 4723.450336] [<ffffffff810c1c2d>] ? trace_hardirqs_on+0xd/0x10 [ 4723.451332] [<ffffffffa015e5ee>] transaction_kthread+0x22e/0x270 [btrfs] [ 4723.452543] [<ffffffffa015e3c0>] ? btrfs_cleanup_transaction+0x570/0x570 [btrfs] [ 4723.457833] [<ffffffff81079efa>] kthread+0xea/0xf0 [ 4723.458990] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140 [ 4723.460133] [<ffffffff81692aac>] ret_from_fork+0x7c/0xb0 [ 4723.460865] [<ffffffff81079e10>] ? kthread_create_on_node+0x140/0x140 [ 4723.496521] ------------[ cut here ]------------ ---------------------------------------------------------------------- The reason is that we get to call cond_resched_lock(&head_ref->lock) while still holding @delayed_refs->lock. So it's different with __btrfs_run_delayed_refs(), where we do drop-acquire dance before and after actually processing delayed refs. Here we don't drop the lock, others are not able to add new delayed refs to head_ref, so cond_resched_lock(&head_ref->lock) is not necessary here. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-14 13:44:32 -08:00
Chris Mason	11bcac89c0	Revert "btrfs: add ioctl to export size of global metadata reservation" This reverts commit `01e219e806`. David Sterba found a different way to provide these features without adding a new ioctl. We haven't released any progs with this ioctl yet, so I'm taking this out for now until we finalize things. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> CC: Jeff Mahoney <jeffm@suse.com>	2014-02-14 13:42:13 -08:00
Linus Torvalds	9c1db77981	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small collection of fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix data corruption when reading/updating compressed extents Btrfs: don't loop forever if we can't run because of the tree mod log btrfs: reserve no transaction units in btrfs_ioctl_set_features btrfs: commit transaction after setting label and features Btrfs: fix assert screwup for the pending move stuff	2014-02-09 11:12:26 -08:00
Filipe David Borba Manana	a2aa75e18a	Btrfs: fix data corruption when reading/updating compressed extents When using a mix of compressed file extents and prealloc extents, it is possible to fill a page of a file with random, garbage data from some unrelated previous use of the page, instead of a sequence of zeroes. A simple sequence of steps to get into such case, taken from the test case I made for xfstests, is: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar This results in the following file items in the fs tree: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 inode generation 6 transid 6 size 542872 block group 0 mode 100600 item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 24576 ram 266240 extent compression 0 item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53 prealloc data disk byte 12849152 nr 241664 gen 6 prealloc data offset 0 nr 241664 item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 20480 ram 20480 extent compression 2 item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53 prealloc data disk byte 13090816 nr 405504 gen 6 prealloc data offset 0 nr 258048 The on disk extent at offset 266240 (which corresponds to 1 single disk block), contains 5 compressed chunks of file data. Each of the first 4 compress 4096 bytes of file data, while the last one only compresses 3024 bytes of file data. Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always return zeroes (our next extent is a prealloc one). The solution here is the compression code path to zero the remaining (untouched) bytes of the last page it uncompressed data into, as the information about how much space the file data consumes in the last page is not known in the upper layer fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing the remainder of the page but only if it corresponds to the last page of the inode and if the inode's size is not a multiple of the page size. This would cause not only returning random data on reads, but also permanently storing random data when updating parts of the region that should be zeroed. For the example above, it means updating a single byte in the region [285648 ; 286720[ would store that byte correctly but also store random data on disk. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
Josef Bacik	27a377db74	Btrfs: don't loop forever if we can't run because of the tree mod log A user reported a 100% cpu hang with my new delayed ref code. Turns out I forgot to increase the count check when we can't run a delayed ref because of the tree mod log. If we can't run any delayed refs during this there is no point in continuing to look, and we need to break out. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
David Sterba	8051aa1a3d	btrfs: reserve no transaction units in btrfs_ioctl_set_features Added in patch "btrfs: add ioctls to query/change feature bits online" modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
Jeff Mahoney	d0270aca88	btrfs: commit transaction after setting label and features The set_fslabel ioctl uses btrfs_end_transaction, which means it's possible that the change will be lost if the system crashes, same for the newly set features. Let's use btrfs_commit_transaction instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
Josef Bacik	6cc98d90f8	Btrfs: fix assert screwup for the pending move stuff Wang noticed that he was failing btrfs/030 even though me and Filipe couldn't reproduce. Turns out this is because Wang didn't have CONFIG_BTRFS_ASSERT set, which meant that a key part of Filipe's original patch was not being built in. This appears to be a mess up with merging Filipe's patch as it does not exist in his original patch. Fix this by changing how we make sure del_waiting_dir_move asserts that it did not error and take the function out of the ifdef check. This makes btrfs/030 pass with the assert on or off. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Filipe Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-08 17:57:15 -08:00
Linus Torvalds	878a876b2e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe is fixing compile and boot problems with our crc32c rework, and Josef has disabled snapshot aware defrag for now. As the number of snapshots increases, we're hitting OOM. For the short term we're disabling things until a bigger fix is ready" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use late_initcall instead of module_init Btrfs: use btrfs_crc32c everywhere instead of libcrc32c Btrfs: disable snapshot aware defrag for now	2014-02-04 12:26:56 -08:00
Filipe David Borba Manana	60efa5eb2e	Btrfs: use late_initcall instead of module_init It seems that when init_btrfs_fs() is called, crc32c/crc32c-intel might not always be already initialized, which results in the call to crypto_alloc_shash() returning -ENOENT, as experienced by Ahmet who reported this. Therefore make sure init_btrfs_fs() is called after crc32c is initialized (which is at initialization level 6, module_init), by using late_initcall (which is at initialization level 7) instead of module_init for btrfs. Reported-and-Tested-by: Ahmet Inan <ainan@mathematik.uni-freiburg.de> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-03 09:01:28 -08:00
Filipe David Borba Manana	0b947aff15	Btrfs: use btrfs_crc32c everywhere instead of libcrc32c After the commit titled "Btrfs: fix btrfs boot when compiled as built-in", LIBCRC32C requirement was removed from btrfs' Kconfig. This made it not possible to build a kernel with btrfs enabled (either as module or built-in) if libcrc32c is not enabled as well. So just replace all uses of libcrc32c with the equivalent function in btrfs hash.h - btrfs_crc32c. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-03 09:01:27 -08:00
Josef Bacik	8101c8dbf6	Btrfs: disable snapshot aware defrag for now It's just broken and it's taking a lot of effort to fix it, so for now just disable it so people can defrag in peace. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-02-03 09:01:27 -08:00
Linus Torvalds	e7651b819e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is a pretty big pull, and most of these changes have been floating in btrfs-next for a long time. Filipe's properties work is a cool building block for inheriting attributes like compression down on a per inode basis. Jeff Mahoney kicked in code to export filesystem info into sysfs. Otherwise, lots of performance improvements, cleanups and bug fixes. Looks like there are still a few other small pending incrementals, but I wanted to get the bulk of this in first" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (149 commits) Btrfs: fix spin_unlock in check_ref_cleanup Btrfs: setup inode location during btrfs_init_inode_locked Btrfs: don't use ram_bytes for uncompressed inline items Btrfs: fix btrfs_search_slot_for_read backwards iteration Btrfs: do not export ulist functions Btrfs: rework ulist with list+rb_tree Btrfs: fix memory leaks on walking backrefs failure Btrfs: fix send file hole detection leading to data corruption Btrfs: add a reschedule point in btrfs_find_all_roots() Btrfs: make send's file extent item search more efficient Btrfs: fix to catch all errors when resolving indirect ref Btrfs: fix protection between walking backrefs and root deletion btrfs: fix warning while merging two adjacent extents Btrfs: fix infinite path build loops in incremental send btrfs: undo sysfs when open_ctree() fails Btrfs: fix snprintf usage by send's gen_unique_name btrfs: fix defrag 32-bit integer overflow btrfs: sysfs: list the NO_HOLES feature btrfs: sysfs: don't show reserved incompat feature btrfs: call permission checks earlier in ioctls and return EPERM ...	2014-01-30 20:08:20 -08:00
Linus Torvalds	f568849eda	Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-block Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...	2014-01-30 11:19:05 -08:00
Chris Mason	cf93da7bcf	Btrfs: fix spin_unlock in check_ref_cleanup Our goto out should have gone a little farther. Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:31 -08:00
Chris Mason	90d3e592e9	Btrfs: setup inode location during btrfs_init_inode_locked We have a race during inode init because the BTRFS_I(inode)->location is setup after the inode hash table lock is dropped. btrfs_find_actor uses the location field, so our search might not find an existing inode in the hash table if we race with the inode init code. This commit changes things to setup the location field sooner. Also the find actor now uses only the location objectid to match inodes. For inode hashing, we just need a unique and stable test, it doesn't have to reflect the inode numbers we show to userland. Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org	2014-01-29 07:06:30 -08:00
Chris Mason	514ac8ad87	Btrfs: don't use ram_bytes for uncompressed inline items If we truncate an uncompressed inline item, ram_bytes isn't updated to reflect the new size. The fixe uses the size directly from the item header when reading uncompressed inlines, and also fixes truncate to update the size as it goes. Reported-by: Jens Axboe <axboe@fb.com> Signed-off-by: Chris Mason <clm@fb.com> CC: stable@vger.kernel.org	2014-01-29 07:06:29 -08:00
Filipe David Borba Manana	23c6bf6a91	Btrfs: fix btrfs_search_slot_for_read backwards iteration If the current path's leaf slot is 0, we do search for the previous leaf (via btrfs_prev_leaf) and set the new path's leaf slot to a value corresponding to the number of items - 1 of the former leaf. Fix this by using the slot set by btrfs_prev_leaf, decrementing it by 1 if it's equal to the leaf's number of items. Use of btrfs_search_slot_for_read() for backward iteration is used in particular by the send feature, which could miss items when the input leaf has less items than its previous leaf. This could be reproduced by running btrfs/007 from xfstests in a loop. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:28 -08:00
Wang Shilong	49fc647a2c	Btrfs: do not export ulist functions There are not any users that use ulist except Btrfs,don't export them. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:27 -08:00
Wang Shilong	4c7a6f74ce	Btrfs: rework ulist with list+rb_tree We are really suffering from now ulist's implementation, some developers gave their try, and i just gave some of my ideas for things: 1. use list+rb_tree instead of arrary+rb_tree 2. add cur_list to iterator rather than ulist structure. 3. add seqnum into every node when they are added, this is used to do selfcheck when iterating node. I noticed Zach Brown's comments before, long term is to kick off ulist implementation, however, for now, we need at least avoid arrary from ulist. Cc: Liu Bo <bo.li.liu@oracle.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Zach Brown <zab@redhat.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:27 -08:00
Wang Shilong	f05c474688	Btrfs: fix memory leaks on walking backrefs failure When walking backrefs, we may iterate every inode's extent and add/merge them into ulist, and the caller will free memory from ulist. However, if we fail to allocate inode's extents element memory or ulist_add() fail to allocate memory, we won't add allocated memory into ulist, and the caller won't free some allocated memory thus memory leaks happen. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:26 -08:00
Filipe David Borba Manana	bf54f412f0	Btrfs: fix send file hole detection leading to data corruption There was a case where file hole detection was incorrect and it would cause an incremental send to override a section of a file with zeroes. This happened in the case where between the last leaf we processed which contained a file extent item for our current inode and the leaf we're currently are at (and has a file extent item for our current inode) there are only leafs containing exclusively file extent items for our current inode, and none of them was updated since the previous send operation. The file hole detection code would incorrectly consider the file range covered by these leafs as a hole. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:25 -08:00
Wang Shilong	bca1a29003	Btrfs: add a reschedule point in btrfs_find_all_roots() I can easily trigger the following warnings when enabling quota in my virtual machine(running Opensuse), Steps are firstly creating a subvolume full of fragment extents, and then create many snapshots (500 in my test case). [ 2362.808459] BUG: soft lockup - CPU#0 stuck for 22s! [btrfs-qgroup-re:1970] [ 2362.809023] task: e4af8450 ti: e371c000 task.ti: e371c000 [ 2362.809026] EIP: 0060:[<fa38f4ae>] EFLAGS: 00000246 CPU: 0 [ 2362.809049] EIP is at __merge_refs+0x5e/0x100 [btrfs] [ 2362.809051] EAX: 00000000 EBX: cfadbcf0 ECX: 00000000 EDX: cfadbcb0 [ 2362.809052] ESI: dd8d3370 EDI: e371dde0 EBP: e371dd6c ESP: e371dd5c [ 2362.809054] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 [ 2362.809055] CR0: 80050033 CR2: ac454d50 CR3: 009a9000 CR4: 001407d0 [ 2362.809099] Stack: [ 2362.809100] 00000001 e371dde0 dfcc6890 f29f8000 e371de28 fa39016d 00000011 00000001 [ 2362.809105] 99bfc000 00000000 93928000 00000000 00000001 00000050 e371dda8 00000001 [ 2362.809109] f3a31000 f3413000 00000001 e371ddb8 000040a8 00000202 00000000 00000023 [ 2362.809113] Call Trace: [ 2362.809136] [<fa39016d>] find_parent_nodes+0x34d/0x1280 [btrfs] [ 2362.809156] [<fa391172>] btrfs_find_all_roots+0xb2/0x110 [btrfs] [ 2362.809174] [<fa3934a8>] btrfs_qgroup_rescan_worker+0x358/0x7a0 [btrfs] [ 2362.809180] [<c024d0ce>] ? lock_timer_base.isra.39+0x1e/0x40 [ 2362.809199] [<fa3648df>] worker_loop+0xff/0x470 [btrfs] [ 2362.809204] [<c027a88a>] ? __wake_up_locked+0x1a/0x20 [ 2362.809221] [<fa3647e0>] ? btrfs_queue_worker+0x2b0/0x2b0 [btrfs] [ 2362.809225] [<c025ebbc>] kthread+0x9c/0xb0 [ 2362.809229] [<c06b487b>] ret_from_kernel_thread+0x1b/0x30 [ 2362.809233] [<c025eb20>] ? kthread_create_on_node+0x110/0x110 By adding a reschedule point at the end of btrfs_find_all_roots(), i no longer hit these warnings. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:25 -08:00
Filipe David Borba Manana	7fdd29d02e	Btrfs: make send's file extent item search more efficient Instead of looking for a file extent item, process it, release the path and do a btree search for the next file extent item, just process all file extent items in a leaf without intermediate btree searches. This way we save cpu and we're not blocking other tasks or affecting concurrency on the btree, because send's paths use the commit root and skip btree node/leaf locking. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:24 -08:00
Wang Shilong	95def2ede1	Btrfs: fix to catch all errors when resolving indirect ref We can only tolerate ENOENT here, for other errors, we should return directly. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:23 -08:00
Wang Shilong	538f72cdf0	Btrfs: fix protection between walking backrefs and root deletion There is a race condition between resolving indirect ref and root deletion, and we should gurantee that root can not be destroyed to avoid accessing broken tree here. Here we fix it by holding @subvol_srcu, and we will release it as soon as we have held root node lock. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:23 -08:00
Gui Hecheng	3c9665df0c	btrfs: fix warning while merging two adjacent extents When we have two adjacent extents in relink_extent_backref, we try to merge them. When we use btrfs_search_slot to locate the slot for the current extent, we shouldn't set "ins_len = 1", because we will merge it into the previous extent rather than insert a new item. Otherwise, we may happen to create a new leaf in btrfs_search_slot and path->slot[0] will be 0. Then we try to fetch the previous item using "path->slots[0]--", and it will cause a warning as follows: [ 145.713385] WARNING: CPU: 3 PID: 1796 at fs/btrfs/extent_io.c:5043 map_private_extent_buffer+0xd4/0xe0 [ 145.713387] btrfs bad mapping eb start `5337088` len 4096, wanted 167772306 8 ... [ 145.713462] [<ffffffffa034b1f4>] map_private_extent_buffer+0xd4/0xe0 [ 145.713476] [<ffffffffa030097a>] ? btrfs_free_path+0x2a/0x40 [ 145.713485] [<ffffffffa0340864>] btrfs_get_token_64+0x64/0xf0 [ 145.713498] [<ffffffffa033472c>] relink_extent_backref+0x41c/0x820 [ 145.713508] [<ffffffffa0334d69>] btrfs_finish_ordered_io+0x239/0xa80 I encounter this warning when running defrag having mkfs.btrfs with option -M. At the same time there are read/writes & snapshots running at background. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:22 -08:00
Filipe David Borba Manana	9f03740a95	Btrfs: fix infinite path build loops in incremental send The send operation processes inodes by their ascending number, and assumes that any rename/move operation can be successfully performed (sent to the caller) once all previous inodes (those with a smaller inode number than the one we're currently processing) were processed. This is not true when an incremental send had to process an hierarchical change between 2 snapshots where the parent-children relationship between directory inodes was reversed - that is, parents became children and children became parents. This situation made the path building code go into an infinite loop, which kept allocating more and more memory that eventually lead to a krealloc warning being displayed in dmesg: WARNING: CPU: 1 PID: 5705 at mm/page_alloc.c:2477 __alloc_pages_nodemask+0x365/0xad0() Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) snd_hda_codec_hdmi snd_hda_codec_realtek joydev radeon snd_hda_intel snd_hda_codec snd_hwdep snd_seq_midi snd_pcm psmouse i915 snd_rawmidi serio_raw snd_seq_midi_event lpc_ich snd_seq snd_timer ttm snd_seq_device rfcomm drm_kms_helper parport_pc bnep bluetooth drm ppdev snd soundcore i2c_algo_bit snd_page_alloc binfmt_misc video lp parport r8169 mii hid_generic usbhid hid CPU: 1 PID: 5705 Comm: btrfs Tainted: G O 3.13.0-rc7-fdm-btrfs-next-18+ #3 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012 [ 5381.660441] 00000000000009ad ffff8806f6f2f4e8 ffffffff81777434 0000000000000007 [ 5381.660447] 0000000000000000 ffff8806f6f2f528 ffffffff8104a9ec ffff8807038f36f0 [ 5381.660452] 0000000000000000 0000000000000206 ffff8807038f2490 ffff8807038f36f0 [ 5381.660457] Call Trace: [ 5381.660464] [<ffffffff81777434>] dump_stack+0x4e/0x68 [ 5381.660471] [<ffffffff8104a9ec>] warn_slowpath_common+0x8c/0xc0 [ 5381.660476] [<ffffffff8104aa3a>] warn_slowpath_null+0x1a/0x20 [ 5381.660480] [<ffffffff81144995>] __alloc_pages_nodemask+0x365/0xad0 [ 5381.660487] [<ffffffff8108313f>] ? local_clock+0x4f/0x60 [ 5381.660491] [<ffffffff811430e8>] ? free_one_page+0x98/0x440 [ 5381.660495] [<ffffffff8108313f>] ? local_clock+0x4f/0x60 [ 5381.660502] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50 [ 5381.660508] [<ffffffff81095fb8>] ? trace_hardirqs_off_caller+0x28/0xd0 [ 5381.660515] [<ffffffff81183caf>] alloc_pages_current+0x10f/0x1f0 [ 5381.660520] [<ffffffff8113fae4>] ? __get_free_pages+0x14/0x50 [ 5381.660524] [<ffffffff8113fae4>] __get_free_pages+0x14/0x50 [ 5381.660530] [<ffffffff8115dace>] kmalloc_order_trace+0x3e/0x100 [ 5381.660536] [<ffffffff81191ea0>] __kmalloc_track_caller+0x220/0x230 [ 5381.660560] [<ffffffffa0729fdb>] ? fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs] [ 5381.660564] [<ffffffff8178085c>] ? retint_restore_args+0xe/0xe [ 5381.660569] [<ffffffff811580ef>] krealloc+0x6f/0xb0 [ 5381.660586] [<ffffffffa0729fdb>] fs_path_ensure_buf.part.12+0x6b/0x200 [btrfs] [ 5381.660601] [<ffffffffa072a208>] fs_path_prepare_for_add+0x98/0xb0 [btrfs] [ 5381.660615] [<ffffffffa072a2bc>] fs_path_add_path+0x2c/0x60 [btrfs] [ 5381.660628] [<ffffffffa072c55c>] get_cur_path+0x7c/0x1c0 [btrfs] Even without this loop, the incremental send couldn't succeed, because it would attempt to send a rename/move operation for the lower inode before the highest inode number was renamed/move. This issue is easy to trigger with the following steps: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ mkdir -p /mnt/btrfs/a/b/c/d $ mkdir /mnt/btrfs/a/b/c2 $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap1 $ mv /mnt/btrfs/a/b/c/d /mnt/btrfs/a/b/c2/d2 $ mv /mnt/btrfs/a/b/c /mnt/btrfs/a/b/c2/d2/cc $ btrfs subvol snapshot -r /mnt/btrfs /mnt/btrfs/snap2 $ btrfs send -p /mnt/btrfs/snap1 /mnt/btrfs/snap2 > /tmp/incremental.send The structure of the filesystem when the first snapshot is taken is: . (ino 256) \|-- a (ino 257) \|-- b (ino 258) \|-- c (ino 259) \| \|-- d (ino 260) \| \|-- c2 (ino 261) And its structure when the second snapshot is taken is: . (ino 256) \|-- a (ino 257) \|-- b (ino 258) \|-- c2 (ino 261) \|-- d2 (ino 260) \|-- cc (ino 259) Before the move/rename operation is performed for the inode 259, the move/rename for inode 260 must be performed, since 259 is now a child of 260. A test case for xfstests, with a more complex scenario, will follow soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-29 07:06:22 -08:00
Anand Jain	2365dd3ca0	btrfs: undo sysfs when open_ctree() fails reproducer: mkfs.btrfs -f /dev/sdb &&\ mount /dev/sdb /btrfs &&\ btrfs dev add -f /dev/sdc /btrfs &&\ umount /btrfs &&\ wipefs -a /dev/sdc &&\ mount -o degraded /dev/sdb /btrfs //above mount fails so try with RO mount -o degraded,ro /dev/sdb /btrfs ------ sysfs: cannot create duplicate filename '/fs/btrfs/3f48c79e-5ed0-4e87-b189-86e749e503f4' :: dump_stack+0x49/0x5e warn_slowpath_common+0x87/0xb0 warn_slowpath_fmt+0x41/0x50 strlcat+0x69/0x80 sysfs_warn_dup+0x87/0xa0 sysfs_add_one+0x40/0x50 create_dir+0x76/0xc0 sysfs_create_dir_ns+0x7a/0xc0 kobject_add_internal+0xad/0x220 kobject_add_varg+0x38/0x60 kobject_init_and_add+0x53/0x70 mutex_lock+0x11/0x40 __free_pages+0x25/0x30 free_pages+0x41/0x50 selinux_sb_copy_data+0x14e/0x1e0 mount_fs+0x3e/0x1a0 vfs_kern_mount+0x71/0x120 do_mount+0x3f7/0x980 SyS_mount+0x8b/0xe0 system_call_fastpath+0x16/0x1b ------ further 'modprobe -r btrfs' fails as well Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:44 -08:00
Filipe David Borba Manana	f74b86d855	Btrfs: fix snprintf usage by send's gen_unique_name The buffer size argument passed to snprintf must account for the trailing null byte added by snprintf, and it returns a value >= then sizeof(buffer) when the string can't fit in the buffer. Since our buffer has a size of 64 characters, and the maximum orphan name we can generate is 63 characters wide, we must pass 64 as the buffer size to snprintf, and not 63. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:43 -08:00
Justin Maggard	c41570c9d2	btrfs: fix defrag 32-bit integer overflow When defragging a very large file, the cluster variable can wrap its 32-bit signed int type and become negative, which eventually gets passed to btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms, this eventually results in an Oops from the SLAB allocator. Change the cluster and max_cluster signed int variables to unsigned long to match the readahead functions. This also allows the min() comparison in btrfs_defrag_file() to work as intended. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:43 -08:00
David Sterba	c736c095de	btrfs: sysfs: list the NO_HOLES feature Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:42 -08:00
David Sterba	66b4bbd4f5	btrfs: sysfs: don't show reserved incompat feature The COMPRESS_LZOv2 incompat featue is currently not implemented, the bit is only reserved, no point to list it in sysfs. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:41 -08:00
David Sterba	bd60ea0fe9	btrfs: call permission checks earlier in ioctls and return EPERM The owner and capability checks in IOC_SUBVOL_SETFLAGS and SET_RECEIVED_SUBVOL should be called before any other checks are done. Also unify the error code to EPERM. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:41 -08:00
David Sterba	d024206133	btrfs: restrict snapshotting to own subvolumes Currently, any user can snapshot any subvolume if the path is accessible and thus indirectly create and keep files he does not own under his direcotries. This is not possible with traditional directories. In security context, a user can snapshot root filesystem and pin any potentially buggy binaries, even if the updates are applied. All the snapshots are visible to the administrator, so it's possible to verify if there are suspicious snapshots. Another more practical problem is that any user can pin the space used by eg. root and cause ENOSPC. Original report: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786 CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:40 -08:00
Miao Xie	89d4346a36	Btrfs: fix wrong block group in trace during the free space allocation We allocate the free space from the former block group, not the current one, so should use the former one to output the trace information. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:40 -08:00
Miao Xie	215a63d139	Btrfs: cleanup the code of used_block_group in find_free_extent() used_block_group is just used for the space cluster which doesn't belong to the current block group, the other place needn't use it. Or the logic of code seems unclear. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:39 -08:00
Miao Xie	920e4a58d2	Btrfs: cleanup the redundant code for the block group allocation and init Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:38 -08:00
Miao Xie	26b47ff65b	Btrfs: change the members' order of btrfs_space_info structure to reduce the cache miss It is better that the position of the lock is close to the data which is protected by it, because they may be in the same cache line, we will load less cache lines when we access them. So we rearrange the members' position of btrfs_space_info structure to make the lock be closer to the its data. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:38 -08:00
Wang Shilong	ffcfaf8179	Btrfs: fix wrong search path initialization before searching tree root To search tree root without transaction protection, we should neither search commit root nor skip locking here, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:37 -08:00
Miao Xie	23c671a588	Btrfs: flush the dirty pages of the ordered extent aggressively during logging csum The performance of fsync dropped down suddenly sometimes, the main reason of this problem was that we might only flush part dirty pages in a ordered extent, then got that ordered extent, wait for the csum calcucation. But if no task flushed the left part, we would wait until the flusher flushed them, sometimes we need wait for several seconds, it made the performance drop down suddenly. (On my box, it drop down from 56MB/s to 4-10MB/s) This patch improves the above problem by flushing left dirty pages aggressively. Test Environment: CPU: 2CPU * 2Cores Memory: 4GB Partition: 20GB(HDD) Test Command: # sysbench --num-threads=8 --test=fileio --file-num=1 \ > --file-total-size=8G --file-block-size=32768 \ > --file-io-mode=sync --file-fsync-freq=100 \ > --file-fsync-end=no --max-requests=10000 \ > --file-test-mode=rndwr run Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:37 -08:00
Wang Shilong	2c21b4d733	Btrfs: fix transaction abortion when remounting btrfs from RW to RO Steps to reproduce: # mkfs.btrfs -f /dev/sda8 # mount /dev/sda8 /mnt -o flushoncommit # dd if=/dev/zero of=/mnt/data bs=4k count=102400 & # mount /dev/sda8 /mnt -o remount, ro When remounting RW to RO, the logic is to firstly set flag to RO and then commit transaction, however with option flushoncommit enabled,we will do RO check within committing transaction, so we get a transaction abortion here. Actually,here check is wrong, we should check if FS_STATE_ERROR is set, fix it. Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:36 -08:00
Filipe David Borba Manana	e4355f34ef	Btrfs: faster file extent item search in clone ioctl When we are looking for file extent items that intersect the cloning range, for each one that falls completely outside the range, don't release the path and do another full tree search - just move on to the next slot and copy the file extent item into our buffer only if the item intersects the cloning range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:35 -08:00
Liu Bo	1a4319cc3c	Btrfs: fix extent state leak on transaction abortion When transaction is aborted, we fail to commit transaction, instead we do cleanup work. After that when we umount btrfs, we get to free fs roots' log trees respectively, but that happens after we unpin extents, so those extents pinned by freeing log trees will remain in memory and lead to the leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:35 -08:00
Qu Wenruo	078025347c	btrfs: Cleanup the btrfs_parse_options for remount. Since remount will pending the new mount options to the original mount options, which will make btrfs_parse_options check the old options then new options, causing some stupid output like "enabling XXX" following by "disable XXX". This patch will add extra check before every btrfs_info to skip the output from old options checking. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:34 -08:00
Qu Wenruo	3818aea275	btrfs: Add noinode_cache mount option Add noinode_cache mount option for btrfs. Since inode map cache involves all the btrfs_find_free_ino/return_ino things and if just trigger the mount_opt, an inode number get from inode map cache will not returned to inode map cache. To keep the find and return inode both in the same behavior, a new bit in mount_opt, CHANGE_INODE_CACHE, is introduced for this idea. CHANGE_INODE_CACHE is set/cleared in remounting, and the original INODE_MAP_CACHE is set/cleared according to CHANGE_INODE_CACHE after a success transaction. Since find/return inode is all done between btrfs_start_transaction and btrfs_commit_transaction, this will keep consistent behavior. Also noinode_cache mount option will not stop the caching_kthread. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:33 -08:00
Wang Shilong	ade2e0b3ee	Btrfs: fix to search previous metadata extent item since skinny metadata There is a bug that using btrfs_previous_item() to search metadata extent item. This is because in btrfs_previous_item(), we need type match, however, since skinny metada was introduced by josef, we may mix this two types. So just use btrfs_previous_item() is not working right. To keep btrfs_previous_item() like normal tree search, i introduce another function btrfs_previous_extent_item(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:33 -08:00
Wang Shilong	7c76edb77c	Btrfs: fix missing skinny metadata check in scrub_stripe() Check if we support skinny metadata firstly and fix to use right type to search. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:32 -08:00
Filipe David Borba Manana	28e5dd8f35	Btrfs: fix send to not send non-aligned clone operations It is possible for the send feature to send clone operations that request a cloning range (offset + length) that is not aligned with the block size. This makes the btrfs receive command send issue a clone ioctl call that will fail, as the ioctl will return an -EINVAL error because of the unaligned range. Fix this by not sending clone operations for non block aligned ranges, and instead send regular write operation for these (less common) cases. The following xfstest reproduces this issue, which fails on the second btrfs receive command without this change: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=`mktemp -d` status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -fr $tmp } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _need_to_be_root rm -f $seqres.full _scratch_mkfs >/dev/null 2>&1 _scratch_mount $XFS_IO_PROG -f -c "truncate 819200" $SCRATCH_MNT/foo \| _filter_xfs_io $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT \| _filter_scratch $XFS_IO_PROG -c "falloc -k 819200 667648" $SCRATCH_MNT/foo \| _filter_xfs_io $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT \| _filter_scratch $XFS_IO_PROG -f -c "pwrite 1482752 2978" $SCRATCH_MNT/foo \| _filter_xfs_io $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT \| _filter_scratch $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap1 \| \ _filter_scratch $XFS_IO_PROG -f -c "truncate 883305" $SCRATCH_MNT/foo \| _filter_xfs_io $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT \| _filter_scratch $BTRFS_UTIL_PROG subvol snapshot -r $SCRATCH_MNT $SCRATCH_MNT/mysnap2 \| \ _filter_scratch $BTRFS_UTIL_PROG send $SCRATCH_MNT/mysnap1 -f $tmp/1.snap 2>&1 \| _filter_scratch $BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 $SCRATCH_MNT/mysnap2 \ -f $tmp/2.snap 2>&1 \| _filter_scratch md5sum $SCRATCH_MNT/foo \| _filter_scratch md5sum $SCRATCH_MNT/mysnap1/foo \| _filter_scratch md5sum $SCRATCH_MNT/mysnap2/foo \| _filter_scratch _scratch_unmount _check_btrfs_filesystem $SCRATCH_DEV _scratch_mkfs >/dev/null 2>&1 _scratch_mount $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/1.snap md5sum $SCRATCH_MNT/mysnap1/foo \| _filter_scratch $BTRFS_UTIL_PROG receive $SCRATCH_MNT -f $tmp/2.snap md5sum $SCRATCH_MNT/mysnap2/foo \| _filter_scratch _scratch_unmount _check_btrfs_filesystem $SCRATCH_DEV status=0 exit The tests expected output is: QA output created by 025 FSSync 'SCRATCH_MNT' FSSync 'SCRATCH_MNT' wrote 2978/2978 bytes at offset 1482752 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) FSSync 'SCRATCH_MNT' Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap1' FSSync 'SCRATCH_MNT' Create a readonly snapshot of 'SCRATCH_MNT' in 'SCRATCH_MNT/mysnap2' At subvol SCRATCH_MNT/mysnap1 At subvol SCRATCH_MNT/mysnap2 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/foo 42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo At subvol mysnap1 42b6369eae2a8725c1aacc0440e597aa SCRATCH_MNT/mysnap1/foo At snapshot mysnap2 129b8eaee8d3c2bcad49bec596591cb3 SCRATCH_MNT/mysnap2/foo Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:32 -08:00
Filipe David Borba Manana	14a958e678	Btrfs: fix btrfs boot when compiled as built-in After the change titled "Btrfs: add support for inode properties", if btrfs was built-in the kernel (i.e. not as a module), it would cause a kernel panic, as reported recently by Fengguang: [ 2.024722] BUG: unable to handle kernel NULL pointer dereference at (null) [ 2.027814] IP: [<ffffffff81501594>] crc32c+0xc/0x6b [ 2.028684] PGD 0 [ 2.028684] Oops: 0000 [#1] SMP [ 2.028684] Modules linked in: [ 2.028684] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.13.0-rc7-04795-ga7b57c2 #1 [ 2.028684] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 2.028684] task: ffff88000edba100 ti: ffff88000edd6000 task.ti: ffff88000edd6000 [ 2.028684] RIP: 0010:[<ffffffff81501594>] [<ffffffff81501594>] crc32c+0xc/0x6b [ 2.028684] RSP: 0000:ffff88000edd7e58 EFLAGS: 00010246 [ 2.028684] RAX: 0000000000000000 RBX: ffffffff82295550 RCX: 0000000000000000 [ 2.028684] RDX: 0000000000000011 RSI: ffffffff81efe393 RDI: 00000000fffffffe [ 2.028684] RBP: ffff88000edd7e60 R08: 0000000000000003 R09: 0000000000015d20 [ 2.028684] R10: ffffffff81ef225e R11: ffffffff811b0222 R12: ffffffffffffffff [ 2.028684] R13: 0000000000000239 R14: 0000000000000000 R15: 0000000000000000 [ 2.028684] FS: 0000000000000000(0000) GS:ffff88000fa00000(0000) knlGS:0000000000000000 [ 2.028684] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 2.028684] CR2: 0000000000000000 CR3: 000000000220c000 CR4: 00000000000006f0 [ 2.028684] Stack: [ 2.028684] ffffffff82295550 ffff88000edd7e80 ffffffff8238af62 ffffffff8238ac05 [ 2.028684] 0000000000000000 ffff88000edd7e98 ffffffff8238ac0f ffffffff8238ac05 [ 2.028684] ffff88000edd7f08 ffffffff810002ba ffff88000edd7f00 ffffffff810e2404 [ 2.028684] Call Trace: [ 2.028684] [<ffffffff8238af62>] btrfs_props_init+0x4f/0x96 [ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145 [ 2.028684] [<ffffffff8238ac0f>] init_btrfs_fs+0xa/0xf0 [ 2.028684] [<ffffffff8238ac05>] ? ftrace_define_fields_btrfs_space_reservation+0x145/0x145 [ 2.028684] [<ffffffff810002ba>] do_one_initcall+0xa4/0x13a [ 2.028684] [<ffffffff810e2404>] ? parse_args+0x25f/0x33d [ 2.028684] [<ffffffff8234cf75>] kernel_init_freeable+0x1aa/0x230 [ 2.028684] [<ffffffff8234c785>] ? do_early_param+0x88/0x88 [ 2.028684] [<ffffffff819f61b5>] ? rest_init+0x89/0x89 [ 2.028684] [<ffffffff819f61c3>] kernel_init+0xe/0x109 The issue here is that the initialization function of btrfs (super.c:init_btrfs_fs) started using crc32c (from lib/libcrc32c.c). But when it needs to call crc32c (as part of the properties initialization routine), the libcrc32c is not yet initialized, so crc32c derreferenced a NULL pointer (lib/libcrc32c.c:tfm), causing the kernel panic on boot. The approach to fix this is to use crypto component directly to use its crc32c (which is basically what lib/libcrc32c.c is, a wrapper around crypto). This is what ext4 is doing as well, it uses crypto directly to get crc32c functionality. Verified this works both when btrfs is built-in and when it's loadable kernel module. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:31 -08:00
Filipe David Borba Manana	c57c2b3ed2	Btrfs: unlock inodes in correct order in clone ioctl In the clone ioctl, when the source and target inodes are different, we can acquire their mutexes in 2 possible different orders. After we're done cloning, we were releasing the mutexes always in the same order - the most correct way of doing it is to release them by the reverse order they were acquired. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:30 -08:00
Wang Shilong	f499e40fd9	Btrfs: optimize to remove unnecessary removal with ulist reallocation Here we are not going to free memory, no need to remove every node one by one, just init root node here is ok. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:30 -08:00
Liu Bo	de6e820066	Btrfs: release subvolume's block_rsv before transaction commit We don't have to keep subvolume's block_rsv during transaction commit, and within transaction commit, we may also need the free space reclaimed from this block_rsv to process delayed refs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:29 -08:00
Miao Xie	f1de968376	Btrfs: fix the race between write back and nocow buffered write When we ran the 274th case of xfstests with nodatacow mount option, We met the following warning message: WARNING: CPU: 1 PID: 14185 at fs/btrfs/extent-tree.c:3734 btrfs_free_reserved_data_space+0xa6/0xd0 It is caused by the race between the write back and nocow buffered write: Task1 Task2 __btrfs_buffered_write() skip data reservation reserve the metadata space copy the data dirty the pages unlock the pages write back the pages release the data space becasue there is no noreserve flag set the noreserve flag This patch fixes this problem by unlocking the pages after the noreserve flag is set. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:28 -08:00
Josef Bacik	7ef81ac86c	Btrfs: only process as many file extents as there are refs The backref walking code will search down to the key it is looking for and then proceed to walk _all_ of the extents on the file until it hits the end. This is suboptimal with large files, we only need to look for as many extents as we have references for that inode. I have a testcase that creates a randomly written 4 gig file and before this patch it took 6min 30sec to do the initial send, with this patch it takes 2min 30sec to do the intial send. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:28 -08:00
Josef Bacik	3a6d75e846	Btrfs: fix qgroup rescan to work with skinny metadata Could have sworn I fixed this before but apparently not. This makes us pass btrfs/022 with skinny metadata enabled. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:27 -08:00
Josef Bacik	580f0a678e	Btrfs: fix extent_from_logical to deal with skinny metadata I don't think this is an issue and I've not seen it in practice but extent_from_logical will fail to find a skinny extent because it uses btrfs_previous_item and gives it the normal extent item type. This is just not a place to use btrfs_previous_item since we care about either normal extents or skinny extents, so open code btrfs_previous_item to properly check. This would only affect metadata and the only place this is used for metadata is scrub and I'm pretty sure it's just for printing stuff out, not actually doing any work so hopefully it was never a problem other than a cosmetic one. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:27 -08:00
Josef Bacik	0a2b2a844a	Btrfs: throttle delayed refs better On one of our gluster clusters we noticed some pretty big lag spikes. This turned out to be because our transaction commit was taking like 3 minutes to complete. This is because we have like 30 gigs of metadata, so our global reserve would end up being the max which is like 512 mb. So our throttling code would allow a ridiculous amount of delayed refs to build up and then they'd all get run at transaction commit time, and for a cold mounted file system that could take up to 3 minutes to run. So fix the throttling to be based on both the size of the global reserve and how long it takes us to run delayed refs. This patch tracks the time it takes to run delayed refs and then only allows 1 seconds worth of outstanding delayed refs at a time. This way it will auto-tune itself from cold cache up to when everything is in memory and it no longer has to go to disk. This makes our transaction commits take much less time to run. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:26 -08:00
Josef Bacik	d7df2c796d	Btrfs: attach delayed ref updates to delayed ref heads Currently we have two rb-trees, one for delayed ref heads and one for all of the delayed refs, including the delayed ref heads. When we process the delayed refs we have to hold onto the delayed ref lock for all of the selecting and merging and such, which results in quite a bit of lock contention. This was solved by having a waitqueue and only one flusher at a time, however this hurts if we get a lot of delayed refs queued up. So instead just have an rb tree for the delayed ref heads, and then attach the delayed ref updates to an rb tree that is per delayed ref head. Then we only need to take the delayed ref lock when adding new delayed refs and when selecting a delayed ref head to process, all the rest of the time we deal with a per delayed ref head lock which will be much less contentious. The locking rules for this get a little more complicated since we have to lock up to 3 things to properly process delayed refs, but I will address that problem later. For now this passes all of xfstests and my overnight stress tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:25 -08:00
Josef Bacik	5039eddc19	Btrfs: make fsync latency less sucky Looking into some performance related issues with large amounts of metadata revealed that we can have some pretty huge swings in fsync() performance. If we have a lot of delayed refs backed up (as you will tend to do with lots of metadata) fsync() will wander off and try to run some of those delayed refs which can result in reading from disk and such. Since the actual act of fsync() doesn't create any delayed refs there is no need to make it throttle on delayed ref stuff, that will be handled by other people. With this patch we get much smoother fsync performance with large amounts of metadata. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:25 -08:00
Filipe David Borba Manana	63541927c8	Btrfs: add support for inode properties This change adds infrastructure to allow for generic properties for inodes. Properties are name/value pairs that can be associated with inodes for different purposes. They are stored as xattrs with the prefix "btrfs." Properties can be inherited - this means when a directory inode has inheritable properties set, these are added to new inodes created under that directory. Further, subvolumes can also have properties associated with them, and they can be inherited from their parent subvolume. Naturally, directory properties have priority over subvolume properties (in practice a subvolume property is just a regular property associated with the root inode, objectid 256, of the subvolume's fs tree). This change also adds one specific property implementation, named "compression", whose values can be "lzo" or "zlib" and it's an inheritable property. The corresponding changes to btrfs-progs were also implemented. A patch with xfstests for this feature will follow once there's agreement on this change/feature. Further, the script at the bottom of this commit message was used to do some benchmarks to measure any performance penalties of this feature. Basically the tests correspond to: Test 1 - create a filesystem and mount it with compress-force=lzo, then sequentially create N files of 64Kb each, measure how long it took to create the files, unmount the filesystem, mount the filesystem and perform an 'ls -lha' against the test directory holding the N files, and report the time the command took. Test 2 - create a filesystem and don't use any compression option when mounting it - instead set the compression property of the subvolume's root to 'lzo'. Then create N files of 64Kb, and report the time it took. The unmount the filesystem, mount it again and perform an 'ls -lha' like in the former test. This means every single file ends up with a property (xattr) associated to it. Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the compression property, have no real effect other than adding more work when inheriting properties and taking more btree leaf space. Test 4 - same as test 3 but with 10 properties per file. Results (in seconds, and averages of 5 runs each), for different N numbers of files follow. * Without properties (test 1) file creation time ls -lha time 10 000 files 3.49 0.76 100 000 files 47.19 8.37 1 000 000 files 518.51 107.06 * With 1 property (compression property set to lzo - test 2) file creation time ls -lha time 10 000 files 3.63 0.93 100 000 files 48.56 9.74 1 000 000 files 537.72 125.11 * With 4 properties (test 3) file creation time ls -lha time 10 000 files 3.94 1.20 100 000 files 52.14 11.48 1 000 000 files 572.70 142.13 * With 10 properties (test 4) file creation time ls -lha time 10 000 files 4.61 1.35 100 000 files 58.86 13.83 1 000 000 files 656.01 177.61 The increased latencies with properties are essencialy because of: ) When creating an inode, we now synchronously write 1 more item (an xattr item) for each property inherited from the parent dir (or subvolume). This could be done in an asynchronous way such as we do for dir intex items (delayed-inode.c), which could help reduce the file creation latency; ) With properties, we now have larger fs trees. For this particular test each xattr item uses 75 bytes of leaf space in the fs tree. This could be less by using a new item for xattr items, instead of the current btrfs_dir_item, since we could cut the 'location' and 'type' fields (saving 18 bytes) and maybe 'transid' too (saving a total of 26 bytes per xattr item) from the btrfs_dir_item type. Also tried batching the xattr insertions (ignoring proper hash collision handling, since it didn't exist) when creating files that inherit properties from their parent inode/subvolume, but the end results were (surprisingly) essentially the same. Test script: $ cat test.pl #!/usr/bin/perl -w use strict; use Time::HiRes qw(time); use constant NUM_FILES => 10_000; use constant FILE_SIZES => (64 * 1024); use constant DEV => '/dev/sdb4'; use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev'; use constant TEST_DIR => (MNT_POINT . '/testdir'); system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!"; # following line for testing without properties #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!"; # following 2 lines for testing with properties system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!"; system("mkdir", TEST_DIR) == 0 or die "mkdir failed!"; my ($t1, $t2); $t1 = time(); for (my $i = 1; $i <= NUM_FILES; $i++) { my $p = TEST_DIR . '/file_' . $i; open(my $f, '>', $p) or die "Error opening file!"; $f->autoflush(1); for (my $j = 0; $j < FILE_SIZES; $j += 4096) { print $f ('A' x 4096) or die "Error writing to file!"; } close($f); } $t2 = time(); print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; system("mount", DEV, MNT_POINT) == 0 or die "mount failed!"; $t1 = time(); system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!"; $t2 = time(); print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n"; system("umount", DEV) == 0 or die "umount failed!"; Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:24 -08:00
Filipe David Borba Manana	1acae57b16	Btrfs: faster file extent item replace operations When writing to a file we drop existing file extent items that cover the write range and then add a new file extent item that represents that write range. Before this change we were doing a tree lookup to remove the file extent items, and then after we did another tree lookup to insert the new file extent item. Most of the time all the file extent items we need to drop are located within a single leaf - this is the leaf where our new file extent item ends up at. Therefore, in this common case just combine these 2 operations into a single one. By avoiding the second btree navigation for insertion of the new file extent item, we reduce btree node/leaf lock acquisitions/releases, btree block/leaf COW operations, CPU time on btree node/leaf key binary searches, etc. Besides for file writes, this is an operation that happens for file fsync's as well. However log btrees are much less likely to big as big as regular fs btrees, therefore the impact of this change is smaller. The following benchmark was performed against an SSD drive and a HDD drive, both for random and sequential writes: sysbench --test=fileio --file-num=4096 --file-total-size=8G \ --file-test-mode=[rndwr\|seqwr] --num-threads=512 \ --file-block-size=8192 \ --max-requests=1000000 \ --file-fsync-freq=0 --file-io-mode=sync [prepare\|run] All results below are averages of 10 runs of the respective test. SSD sequential writes Before this change: 225.88 Mb/sec After this change: 277.26 Mb/sec SSD random writes Before this change: 49.91 Mb/sec After this change: 56.39 Mb/sec HDD sequential writes Before this change: 68.53 Mb/sec After this change: 69.87 Mb/sec HDD random writes Before this change: 13.04 Mb/sec After this change: 14.39 Mb/sec Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:23 -08:00
Wang Shilong	90515e7f5d	Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot() We may return early in btrfs_drop_snapshot(), we shouldn't call btrfs_std_err() for this case, fix it. Cc: stable@vger.kernel.org Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:23 -08:00
Wang Shilong	8e56338d7d	Btrfs: remove unnecessary transaction commit before send We will finish orphan cleanups during snapshot, so we don't have to commit transaction here. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:22 -08:00
Wang Shilong	18f687d538	Btrfs: fix protection between send and root deletion We should gurantee that parent and clone roots can not be destroyed during send, for this we have two ideas. 1.by holding @subvol_sem, this might be a nightmare, because it will block all subvolumes deletion for a long time. 2.Miao pointed out we can reuse @send_in_progress, that mean we will skip snapshot deletion if root sending is in progress. Here we adopt the second approach since it won't block other subvolumes deletion for a long time. Besides in btrfs_clean_one_deleted_snapshot(), we only check first root , if this root is involved in send, we return directly rather than continue to check.There are several reasons about it: 1.this case happen seldomly. 2.after sending,cleaner thread can continue to drop that root. 3.make code simple Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:21 -08:00
Wang Shilong	896c14f97f	Btrfs: fix wrong send_in_progress accounting Steps to reproduce: # mkfs.btrfs -f /dev/sda8 # mount /dev/sda8 /mnt # btrfs sub snapshot -r /mnt /mnt/snap1 # btrfs sub snapshot -r /mnt /mnt/snap2 # btrfs send /mnt/snap1 -p /mnt/snap2 -f /mnt/1 # dmesg The problem is that we will sort clone roots(include @send_root), it might push @send_root before thus @send_root's @send_in_progress will be decreased twice. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:21 -08:00
Qu Wenruo	a88998f291	btrfs: Add treelog mount option. Add treelog mount option to enable tree log with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:20 -08:00
Qu Wenruo	d399167d88	btrfs: Add datasum mount option. Add datasum mount option to enable checksum with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:20 -08:00
Qu Wenruo	a258af7a3e	btrfs: Add datacow mount option. Add datacow mount option to enable copy-on-write with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:19 -08:00
Qu Wenruo	bd0330ad21	btrfs: Add acl mount option. Add acl mount option to enable acl with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:19 -08:00
Qu Wenruo	2c9ee85671	btrfs: Add noflushoncommit mount option. Add noflushoncommit mount option to disable flush on commit with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:18 -08:00
Qu Wenruo	5303629343	btrfs: Add noenospc_debug mount option. Add noenospc_debug mount option to disable ENOSPC debug with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:17 -08:00
Qu Wenruo	e07a2ade44	btrfs: Add nodiscard mount option. Add nodiscard mount option to disable discard with remount option. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:17 -08:00
Qu Wenruo	fc0ca9af18	btrfs: Add noautodefrag mount option. Btrfs has autodefrag mount option but no pairing noautodefrag option, which makes it impossible to disable autodefrag without umount. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:16 -08:00
Qu Wenruo	842bef5891	btrfs: Add "barrier" option to support "-o remount,barrier" Btrfs can be remounted without barrier, but there is no "barrier" option so nobody can remount btrfs back with barrier on. Only umount and mount again can re-enable barrier.(Quite awkward) Also the mount options in the document is also changed slightly for the further pairing options changes. Reported-by: Daniel Blueman <daniel@quora.org> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:16 -08:00
Wang Shilong	e8117c26b2	Btrfs: only fua the first superblock when writting supers We only intent to fua the first superblock in every device from comments, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:15 -08:00
Liu Bo	17504584f5	Btrfs: return free space to global_rsv as much as possible @full is not protected within global_rsv.lock, so we may think global_rsv is already full but in fact it's not, so we miss the opportunity to return free space to global_rsv directly when we release other block_rsvs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:14 -08:00
Wang Shilong	1708cc5723	Btrfs: fix an oops when we fail to relocate tree blocks During balance test, we hit an oops: [ 2013.841551] kernel BUG at fs/btrfs/relocation.c:1174! The problem is that if we fail to relocate tree blocks, we should update backref cache, otherwise, some pending nodes are not updated while snapshot check @cache->last_trans is within one transaction and won't update it and then oops happen. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:14 -08:00
Miao Xie	e77751aad1	Btrfs: fix the wrong nocow range check The following warning message was outputed when running the 274th case of xfstests with nodatacow option: BUG: Bad page state in process kswapd0 pfn:1c66f page:ffffea0000636848 count:0 mapcount:0 mapping:(null) index:0x78000 page flags: 0x1000000000100a(error\|uptodate\|private_2) It is because the check of nocow range was wrong, we should compare the start and end position of the extent with the write position to verify if the write position was in the extent, but the current code just used the start postion to do the check, so we got the wrong extent and told the caller that it was a nocow write. And then when we write back the dirty pages, we found we should cow the extent, but at that time, there was no space in the fs, we had to the error flag for the page. When someone reclaimed that page, the above warning outputed. Fix it. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:13 -08:00
Wang Shilong	25e293c2a2	Btrfs: fix an oops when we fail to merge reloc roots Previously, we will free reloc root memory and then force filesystem to be readonly. The problem is that there may be another thread commiting transaction which will try to access freed reloc root during merging reloc roots process. To keep consistency snapshots shared space, we should allow snapshot finished if possible, so here we don't free reloc root memory. signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:13 -08:00
Wang Shilong	dc4103f933	Btrfs: remove unused argument from select_reloc_root() @nr is no longer used, remove it from select_reloc_root() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:12 -08:00
Filipe David Borba Manana	eb653de159	Btrfs: reduce btree node locking duration on item update If we do a btree search with the goal of updating an existing item without changing its size (ins_len == 0 and cow == 1), then we never need to hold locks on upper level nodes (even when slot == 0) after we COW their child nodes/leaves, as we won't have node splits or merges in this scenario (that is, no key additions, removals or shifts on any nodes or leaves). Therefore release the locks immediately after COWing the child nodes/leaves while navigating the btree, even if their parent slot is 0, instead of returning a path to the caller with those nodes locked, which would get released only when the caller releases or frees the path (or if it calls btrfs_unlock_up_safe). This is a common scenario, for example when updating inode items in fs trees and block group items in the extent tree. The following benchmarks were performed on a quad core machine with 32Gb of ram, using a leaf/node size of 4Kb (to generate deeper fs trees more quickly). sysbench --test=fileio --file-num=131072 --file-total-size=8G \ --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \ --max-requests=100000 --file-io-mode=sync [prepare\|run] Before this change: 49.85Mb/s (average of 5 runs) After this change: 50.38Mb/s (average of 5 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:11 -08:00
Wenliang Fan	eb8052e015	fs/btrfs: Integer overflow in btrfs_ioctl_resize() The local variable 'new_size' comes from userspace. If a large number was passed, there would be an integer overflow in the following line: new_size = old_size + new_size; Signed-off-by: Wenliang Fan <fanwlexca@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:11 -08:00
Josef Bacik	c9ea7b24ce	Btrfs: stop caching thread if extent_commit_sem is contended We can starve out the transaction commit with a bunch of caching threads all running at the same time. This is because we will only drop the extent_commit_sem if we need_resched(), which isn't likely to happen since we will be reading a lot from the disk so have already schedule()'ed plenty. Alex observed that he could starve out a transaction commit for up to a minute with 32 caching threads all running at once. This will allow us to drop the extent_commit_sem to allow the transaction commit to swap the commit_root out and then all the cachers will start back up. Here is an explanation provided by Igno So, just to fill in what happens in this loop: mutex_unlock(&caching_ctl->mutex); cond_resched(); goto again; where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem again: again: mutex_lock(&caching_ctl->mutex); /* need to make sure the commit_root doesn't disappear */ down_read(&fs_info->extent_commit_sem); So, if I'm reading the code correct, there can be a fair amount of concurrency here: there may be multiple 'caching kthreads' per filesystem active, while there's one fs_info->extent_commit_sem per filesystem AFAICS. So, what happens if there are a lot of CPUs all busy holding the ->extent_commit_sem rwsem read-locked and a writer arrives? They'd all rush to try to release the fs_info->extent_commit_sem, and they'd block in the down_read() because there's a writer waiting. So there's a guarantee of forward progress. This should answer akpm's concern I think. Thanks, Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:10 -08:00
Miao Xie	67de11769b	Btrfs: introduce the delayed inode ref deletion for the single link inode The inode reference item is close to inode item, so we insert it simultaneously with the inode item insertion when we create a file/directory.. In fact, we also can handle the inode reference deletion by the same way. So we made this patch to introduce the delayed inode reference deletion for the single link inode(At most case, the file doesn't has hard link, so we don't take the hard link into account). This function is based on the delayed inode mechanism. After applying this patch, we can reduce the time of the file/directory deletion by ~10%. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:09 -08:00
Miao Xie	7cf35d91b4	Btrfs: use flags instead of the bool variants in delayed node Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:08 -08:00
Miao Xie	a56dbd8940	Btrfs: remove btrfs_end_transaction_dmeta() Two reasons: - btrfs_end_transaction_dmeta() is the same as btrfs_end_transaction_throttle() so it is unnecessary. - All the delayed items should be dealt in the current transaction, so the workers should not commit the transaction, instead, deal with the delayed items as many as possible. So we can remove btrfs_end_transaction_dmeta() Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:08 -08:00
Miao Xie	0353808cae	Btrfs: cleanup code of btrfs_balance_delayed_items() - move the condition check for wait into a function - use wait_event_interruptible instead of prepare-schedule-finish process Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:07 -08:00
Miao Xie	4dd466d36a	Btrfs: don't run delayed nodes again after all nodes flush If the number of the delayed items is greater than the upper limit, we will try to flush all the delayed items. After that, it is unnecessary to run them again because they are being dealt with by the wokers or the number of them is less than the lower limit. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:06 -08:00
Miao Xie	74c40f925e	Btrfs: remove residual code in delayed inode async helper Before applying the patch commit `de3cb945db` title: Btrfs: improve the delayed inode throttling We need requeue the async work after the current work was done, it introduced a deadlock problem. So we wrote the code that this patch removes to avoid the above problem. But after applying the above patch, the deadlock problem didn't exist. So we should remove that fix code. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:06 -08:00
Frank Holton	efe120a067	Btrfs: convert printk to btrfs_ and fix BTRFS prefix Convert all applicable cases of printk and pr_* to the btrfs_* macros. Fix all uses of the BTRFS prefix. Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:05 -08:00
Filipe David Borba Manana	5de865eebb	Btrfs: fix tree mod logging While running the test btrfs/004 from xfstests in a loop, it failed about 1 time out of 20 runs in my desktop. The failure happened in the backref walking part of the test, and the test's error message was like this: btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad) --- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000 +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000 @@ -1,3 +1,8 @@ QA output created by 004 * test backref walking -* done +unexpected output from + /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got: + ... (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff) Ran: btrfs/004 Failures: btrfs/004 Failed 1 of 1 tests But immediately after the test finished, the btrfs inspect-internal command returned the expected output: $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 inode 405 offset 454656 root 258 inode 405 offset 454656 root 5 It turned out this was because the btrfs_search_old_slot() calls performed during backref walking (backref.c:__resolve_indirect_ref) were not finding anything. The reason for this turned out to be that the tree mod logging code was not logging some node multi-step operations atomically, therefore btrfs_search_old_slot() callers iterated often over an incomplete tree that wasn't fully consistent with any tree state from the past. Besides missing items, this often (but not always) resulted in -EIO errors during old slot searches, reported in dmesg like this: [ 4299.933936] ------------[ cut here ]------------ [ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]() [ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h [ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012 [ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007 [ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70 [ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000 [ 4299.933987] Call Trace: [ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76 [ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0 [ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20 [ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs] [ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50 [ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs] [ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs] [ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs] [ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs] [ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs] [ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50 [ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs] [ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs] [ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00 [ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40 [ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0 [ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570 [ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6 [ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0 [ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13 [ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0 [ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934104] ---[ end trace 48f0cfc902491414 ]--- [ 4299.934378] btrfs bad fsid on block 0 These tree mod log operations that must be performed atomically, tree_mod_log_free_eb, tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to be performed atomically before the following commit: `c8cc634165` (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations) That change removed the atomicity of such operations. This patch restores the atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem structures, so it has to do the allocations using GFP_NOFS before acquiring the mod log lock. This issue has been experienced by several users recently, such as for example: http://www.spinics.net/lists/linux-btrfs/msg28574.html After running the btrfs/004 test for 679 consecutive iterations with this patch applied, I didn't ran into the issue anymore. Cc: stable@vger.kernel.org Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:04 -08:00
David Sterba	66ef7d65c3	btrfs: check balance of send_in_progress Warn if the balance goes below zero, which appears to be unlikely though. Otherwise cleans up the code a bit. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:04 -08:00
Wang Shilong	41ce9970a8	Btrfs: remove transaction from btrfs send Since daivd did the work that force us to use readonly snapshot, we can safely remove transaction protection from btrfs send. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:03 -08:00
Miao Xie	536cd96401	Btrfs: fix double initialization of the raid kobject We met the following oops when doing space balance: kobject (ffff88081b590278): tried to init an initialized object, something is seriously wrong. ... Call Trace: [<ffffffff81937262>] dump_stack+0x49/0x5f [<ffffffff8137d259>] kobject_init+0x89/0xa0 [<ffffffff8137d36a>] kobject_init_and_add+0x2a/0x70 [<ffffffffa009bd79>] ? clear_extent_bit+0x199/0x470 [btrfs] [<ffffffffa005e82c>] __link_block_group+0xfc/0x120 [btrfs] [<ffffffffa006b9db>] btrfs_make_block_group+0x24b/0x370 [btrfs] [<ffffffffa00a899b>] __btrfs_alloc_chunk+0x54b/0x7e0 [btrfs] [<ffffffffa00a8c6f>] btrfs_alloc_chunk+0x3f/0x50 [btrfs] [<ffffffffa0060123>] do_chunk_alloc+0x363/0x440 [btrfs] [<ffffffffa00633d4>] btrfs_check_data_free_space+0x104/0x310 [btrfs] [<ffffffffa0069f4d>] btrfs_write_dirty_block_groups+0x48d/0x600 [btrfs] [<ffffffffa007aad4>] commit_cowonly_roots+0x184/0x250 [btrfs] ... Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o nospace_cache <dev> <mnt> # btrfs balance start <mnt> # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1 The reason of this problem is that we initialized the raid kobject when we added a block group into a empty raid list. As we know, when we mounted a btrfs filesystem, the raid list was empty, we would initialize the raid kobject when we added the first block group. But if there was not data stored in the block group, the block group would be freed when doing balance, and the raid list would be empty. And then if we allocated a new block group and added it into the raid list, we would initialize the raid kobject again, the oops happened. Fix this problem by initializing the raid kobject just when mounting the fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reported-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:03 -08:00
Wang Shilong	180589efde	Btrfs: fix a warning when iput a file See the warning below: [ 1209.102076] [<ffffffffa04721b9>] remove_extent_mapping+0x69/0x70 [btrfs] [ 1209.102084] [<ffffffffa0466b06>] btrfs_evict_inode+0x96/0x4d0 [btrfs] [ 1209.102089] [<ffffffff81073010>] ? wake_atomic_t_function+0x40/0x40 [ 1209.102092] [<ffffffff8118ab2e>] evict+0x9e/0x190 [ 1209.102094] [<ffffffff8118b313>] iput+0xf3/0x180 [ 1209.102101] [<ffffffffa0461fd1>] btrfs_run_delayed_iputs+0xb1/0xd0 [btrfs] [ 1209.102107] [<ffffffffa045d358>] __btrfs_end_transaction+0x268/0x350 [btrfs] clear extent bit here to avoid triggering WARN_ON() in remove_extent_mapping() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:02 -08:00
David Sterba	2c68653787	btrfs: Check read-only status of roots during send All the subvolues that are involved in send must be read-only during the whole operation. The ioctl SUBVOL_SETFLAGS could be used to change the status to read-write and the result of send stream is undefined if the data change unexpectedly. Fix that by adding a refcount for all involved roots and verify that there's no send in progress during SUBVOL_SETFLAGS ioctl call that does read-only -> read-write transition. We need refcounts because there are no restrictions on number of send parallel operations currently run on a single subvolume, be it source, parent or one of the multiple clone sources. Kernel is silent when the RO checks fail and returns EPERM. The same set of checks is done already in userspace before send starts. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:01 -08:00
David Sterba	a8d89f5ba0	btrfs: remove unused mnt from send_ctx Unused since `ed2590953b` "Btrfs: stop using vfs_read in send". Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:01 -08:00
David Sterba	95bc79d50d	btrfs: send: clean up dead code Remove ifdefed code: - tlv_put for 8, 16 and 32, add a generic tempalte if needed in future - tlv_put_timespec - the btrfs_timespec fields are used - fs_path_remove obsoleted long ago Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:20:00 -08:00
Filipe David Borba Manana	3fe81ce206	Btrfs: fix deadlock when iterating inode refs and running delayed inodes While running btrfs/004 from xfstests, after 503 iterations, dmesg reported a deadlock between tasks iterating inode refs and tasks running delayed inodes (during a transaction commit). It turns out that iterating inode refs implies doing one tree search and release all nodes in the path except the leaf node, and then passing that leaf node to btrfs_ref_to_path(), which in turn does another tree search without releasing the lock on the leaf node it received as parameter. This is a problem when other task wants to write to the btree as well and ends up updating the leaf that is read locked - the writer task locks the parent of the leaf and then blocks waiting for the leaf's lock to be released - at the same time, the task executing btrfs_ref_to_path() does a second tree search, without releasing the lock on the first leaf, and wants to access a leaf (the same or another one) that is a child of the same parent, resulting in a deadlock. The trace reported by lockdep follows. [84314.936373] INFO: task fsstress:11930 blocked for more than 120 seconds. [84314.936381] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84314.936383] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84314.936386] fsstress D ffff8806e1bf8000 0 11930 11926 0x00000000 [84314.936393] ffff8804d6d89b78 0000000000000046 ffff8804d6d89b18 ffffffff810bd8bd [84314.936399] ffff8806e1bf8000 ffff8804d6d89fd8 ffff8804d6d89fd8 ffff8804d6d89fd8 [84314.936405] ffff880806308000 ffff8806e1bf8000 ffff8804d6d89c08 ffff8804deb8f190 [84314.936410] Call Trace: [84314.936421] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84314.936428] [<ffffffff81774269>] schedule+0x29/0x70 [84314.936451] [<ffffffffa0715bf5>] btrfs_tree_lock+0x75/0x270 [btrfs] [84314.936457] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84314.936470] [<ffffffffa06ba231>] btrfs_search_slot+0x7f1/0x930 [btrfs] [84314.936489] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936504] [<ffffffffa06d2e1f>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [84314.936510] [<ffffffff810bd6ef>] ? trace_hardirqs_on_caller+0x1f/0x1e0 [84314.936528] [<ffffffffa073173c>] __btrfs_update_delayed_inode+0x4c/0x1d0 [btrfs] [84314.936543] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936558] [<ffffffffa0731c2a>] ? __btrfs_run_delayed_items+0x13a/0x1e0 [btrfs] [84314.936573] [<ffffffffa0731c82>] __btrfs_run_delayed_items+0x192/0x1e0 [btrfs] [84314.936589] [<ffffffffa0731d03>] btrfs_run_delayed_items+0x13/0x20 [btrfs] [84314.936604] [<ffffffffa06dbcd4>] btrfs_flush_all_pending_stuffs+0x24/0x80 [btrfs] [84314.936620] [<ffffffffa06ddc13>] btrfs_commit_transaction+0x223/0xa20 [btrfs] [84314.936630] [<ffffffffa06ae5ae>] btrfs_sync_fs+0x6e/0x110 [btrfs] [84314.936635] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60 [84314.936639] [<ffffffff811d0b50>] ? __sync_filesystem+0x60/0x60 [84314.936643] [<ffffffff811d0b70>] sync_fs_one_sb+0x20/0x30 [84314.936648] [<ffffffff811a3541>] iterate_supers+0xf1/0x100 [84314.936652] [<ffffffff811d0c45>] sys_sync+0x55/0x90 [84314.936658] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [84314.936660] INFO: lockdep is turned off. [84314.936663] INFO: task btrfs:11955 blocked for more than 120 seconds. [84314.936666] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84314.936668] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84314.936670] btrfs D ffff880541729a88 0 11955 11608 0x00000000 [84314.936674] ffff880541729a38 0000000000000046 ffff8805417299d8 ffffffff810bd8bd [84314.936680] ffff88075430c8a0 ffff880541729fd8 ffff880541729fd8 ffff880541729fd8 [84314.936685] ffffffff81c104e0 ffff88075430c8a0 ffff8804de8b00b8 ffff8804de8b0000 [84314.936690] Call Trace: [84314.936695] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84314.936700] [<ffffffff81774269>] schedule+0x29/0x70 [84314.936717] [<ffffffffa0715815>] btrfs_tree_read_lock+0xd5/0x140 [btrfs] [84314.936721] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84314.936733] [<ffffffffa06ba201>] btrfs_search_slot+0x7c1/0x930 [btrfs] [84314.936746] [<ffffffffa06bd505>] btrfs_find_item+0x55/0x160 [btrfs] [84314.936763] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs] [84314.936780] [<ffffffffa073c9ca>] btrfs_ref_to_path+0xba/0x1e0 [btrfs] [84314.936797] [<ffffffffa06f9719>] ? release_extent_buffer+0xb9/0xe0 [btrfs] [84314.936813] [<ffffffffa06ff689>] ? free_extent_buffer+0x49/0xc0 [btrfs] [84314.936830] [<ffffffffa073cb50>] inode_to_path+0x60/0xd0 [btrfs] [84314.936846] [<ffffffffa073d365>] paths_from_inode+0x115/0x3c0 [btrfs] [84314.936851] [<ffffffff8118dd44>] ? kmem_cache_alloc_trace+0x114/0x200 [84314.936868] [<ffffffffa0714494>] btrfs_ioctl+0xf14/0x2030 [btrfs] [84314.936873] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50 [84314.936877] [<ffffffff8116598f>] ? handle_mm_fault+0x34f/0xb00 [84314.936882] [<ffffffff81075563>] ? up_read+0x23/0x40 [84314.936886] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0 [84314.936892] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570 [84314.936896] [<ffffffff81776e23>] ? error_sti+0x5/0x6 [84314.936901] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0 [84314.936906] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13 [84314.936910] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0 [84314.936915] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [84314.936920] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [84314.936922] INFO: lockdep is turned off. [84434.866873] INFO: task btrfs-transacti:11921 blocked for more than 120 seconds. [84434.866881] Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [84434.866883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [84434.866886] btrfs-transacti D ffff880755b6a478 0 11921 2 0x00000000 [84434.866893] ffff8800735b9ce8 0000000000000046 ffff8800735b9c88 ffffffff810bd8bd [84434.866899] ffff8805a1b848a0 ffff8800735b9fd8 ffff8800735b9fd8 ffff8800735b9fd8 [84434.866904] ffffffff81c104e0 ffff8805a1b848a0 ffff880755b6a478 ffff8804cece78f0 [84434.866910] Call Trace: [84434.866920] [<ffffffff810bd8bd>] ? trace_hardirqs_on+0xd/0x10 [84434.866927] [<ffffffff81774269>] schedule+0x29/0x70 [84434.866948] [<ffffffffa06dd2ef>] wait_current_trans.isra.33+0xbf/0x120 [btrfs] [84434.866954] [<ffffffff810715c0>] ? __init_waitqueue_head+0x60/0x60 [84434.866970] [<ffffffffa06dec18>] start_transaction+0x388/0x5a0 [btrfs] [84434.866985] [<ffffffffa06db9b5>] ? transaction_kthread+0xb5/0x280 [btrfs] [84434.866999] [<ffffffffa06dee97>] btrfs_attach_transaction+0x17/0x20 [btrfs] [84434.867012] [<ffffffffa06dba9e>] transaction_kthread+0x19e/0x280 [btrfs] [84434.867026] [<ffffffffa06db900>] ? open_ctree+0x2260/0x2260 [btrfs] [84434.867030] [<ffffffff81070dad>] kthread+0xed/0x100 [84434.867035] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190 [84434.867040] [<ffffffff8177ee6c>] ret_from_fork+0x7c/0xb0 [84434.867044] [<ffffffff81070cc0>] ? flush_kthread_worker+0x190/0x190 Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:59 -08:00
Wang Shilong	663df05330	Btrfs: remove dead comments for read_csums() Chris introduced hleper function read_csums() and this function has been removed, but we forgot to remove its corresponding comments. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:59 -08:00
Filipe David Borba Manana	e223cfcd3e	Btrfs: remove field tree_mod_seq_elem from btrfs_fs_info struct It's not used anywhere, so just drop it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:58 -08:00
Filipe David Borba Manana	fc28b62d64	Btrfs: fix use of uninitialized err variable fs/btrfs/file.c: In function ‘prepare_pages.isra.18’: fs/btrfs/file.c:1265:6: warning: ‘err’ may be used uninitialized in this function [-Wuninitialized] Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:57 -08:00
Wang Shilong	54eb72c05f	Btrfs: remove unnecessary filemap writting and waiting after block group relocation We have commited transaction before, remove redundant filemap writting and waiting here, it can speed up balance relocation process. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:57 -08:00
Tsutomu Itoh	5662344b3c	Btrfs: fix error check of btrfs_lookup_dentry() Clean up btrfs_lookup_dentry() to never return NULL, but PTR_ERR(-ENOENT) instead. This keeps the return value convention consistent. Callers who use btrfs_lookup_dentry() require a trivial update. create_snapshot() in particular looks like it can also lose a BUG_ON(!inode) which is not really needed - there seems less harm in returning ENOENT to userspace at that point in the stack than there is to crash the machine. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:56 -08:00
Filipe David Borba Manana	7835776635	Btrfs: return immediately if tree log mod is not necessary In ctree.c:tree_mod_log_set_node_key() we were calling __tree_mod_log_insert_key() even when the modification doesn't need to be logged. This would allocate a tree_mod_elem structure, fill it and pass it to __tree_mod_log_insert(), which would just acquire the tree mod log write lock and then free the tree_mod_elem structure and return (that is, a no-op). Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert() which just returns immediately if the modification doesn't need to be logged (without allocating the structure, fill it, acquire write lock, free structure). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:56 -08:00
Josef Bacik	f28491e0a6	Btrfs: move the extent buffer radix tree into the fs_info I need to create a fake tree to test qgroups and I don't want to have to setup a fake btree_inode. The fact is we only use the radix tree for the fs_info, so everybody else who allocates an extent_io_tree is just wasting the space anyway. This patch moves the radix tree and its lock into btrfs_fs_info so there is less stuff I have to fake to do qgroup sanity tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:55 -08:00
Josef Bacik	34b41acec1	Btrfs: use a bit to track if we're in the radix tree For creating a dummy in-memory btree I need to be able to use the radix tree to keep track of the buffers like normal extent buffers. With dummy buffers we skip the radix tree step, and we still want to do that for the tree mod log dummy buffers but for my test buffers we need to be able to remove them from the radix tree like normal. This will give me a way to do that. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:54 -08:00
Josef Bacik	a5dee37d39	Btrfs: deal with io_tree->mapping being NULL I need to add infrastructure to allocate dummy extent buffers for running sanity tests, and to do this I need to not have to worry about having an address_mapping for an io_tree, so just fix up the places where we assume that all io_tree's have a non-NULL ->mapping. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:54 -08:00
Filipe David Borba Manana	2ef1fed285	Btrfs: more efficient push_leaf_right Currently when finding the leaf to insert a key into a btree, if the leaf doesn't have enough space to store the item we attempt to move off some items from our leaf to its right neighbor leaf, and if this fails to create enough free space in our leaf, we try to move off more items to the left neighbor leaf as well. When trying to move off items to the right neighbor leaf, if it has enough room to store the new key but not not enough room to move off at least one item from our target leaf, __push_leaf_right returns 1 and we have to attempt to move items to the left neighbor (push_leaf_left function) without touching the right neighbor leaf. For the case where the right leaf has enough room to store at least 1 item from our leaf, we end up modifying (and dirtying) both our leaf and the right leaf. This is non-optimal for the case where the new key is greater than any key in our target leaf because it can be inserted at slot 0 of the right neighbor leaf and we don't need to touch our leaf at all nor to attempt to move off items to the left neighbor leaf. Therefore this change just selects the right neighbor leaf as our new target leaf if it has enough room for the new key without modifying our initial target leaf - we do this only if the new key is higher than any key in the initial target leaf. While running the following test, push_leaf_right was called by split_leaf 4802 times. Out of those 4802 calls, for 2571 calls (53.5%) we hit this special case (right leaf has enough room and new key is higher than any key in the initial target leaf). Test: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=[seqwr\|rndwr] --num-threads=512 --file-block-size=8192 \ --max-requests=100000 --file-io-mode=sync [prepare\|run] Results: sequential writes Throughput before this change: 65.71Mb/sec (average of 10 runs) Throughput after this change: 66.58Mb/sec (average of 10 runs) random writes Throughput before this change: 10.75Mb/sec (average of 10 runs) Throughput after this change: 11.56Mb/sec (average of 10 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:53 -08:00
Wang Shilong	cb7ab02156	Btrfs: wrap repeated code into scrub_blocked_if_needed() Just wrap same code into one function scrub_blocked_if_needed(). This make a change that we will move waiting (@workers_pending = 0) before we can wake up commiting transaction(atomic_inc(@scrub_paused)), we must take carefully to not deadlock here. Thread 1 Thread 2 \|->btrfs_commit_transaction() \|->set trans type(COMMIT_DOING) \|->btrfs_scrub_paused()(blocked) \|->join_transaction(blocked) Move btrfs_scrub_paused() before setting trans type which means we can still join a transaction when commiting_transaction is blocked. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Suggested-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:53 -08:00
Wang Shilong	3cb0929ad2	Btrfs: fix wrong super generation mismatch when scrubbing supers We came a race condition when scrubbing superblocks, the story is: In commiting transaction, we will update @last_trans_commited after writting superblocks, if scrubber start after writting superblocks and before updating @last_trans_commited, generation mismatch happens! We fix this by checking @scrub_pause_req, and we won't start a srubber until commiting transaction is finished.(after btrfs_scrub_continue() finished.) Reported-by: Sebastian Ochmann <ochmann@informatik.uni-bonn.de> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:52 -08:00
Filipe David Borba Manana	5a0f4e2c2b	Btrfs: fix pass of transid with wrong endianness in send.c fs/btrfs/send.c:2190:9: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:2190:9: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:2190:9: got restricted __le64 [usertype] ctransid fs/btrfs/send.c:2195:17: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:2195:17: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:2195:17: got restricted __le64 [usertype] ctransid fs/btrfs/send.c:3716:9: warning: incorrect type in argument 3 (different base types) fs/btrfs/send.c:3716:9: expected unsigned long long [unsigned] [usertype] value fs/btrfs/send.c:3716:9: got restricted __le64 [usertype] ctransid Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:51 -08:00
Filipe David Borba Manana	d527afe1e5	Btrfs: fix extent_map block_len after merging When merging an extent_map with its right neighbor, increment its block_len with the neighbor's block_len. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:51 -08:00
Michal Nazarewicz	11850392ee	btrfs: remove dead code [commit `8185554d`: fix incorrect inode acl reset] introduced a dead code by adding a condition which can never be true to an else branch. The condition can never be true because it is already checked by a previous if statement which causes function to return. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Reviewed-By: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:50 -08:00
Filipe David Borba Manana	878f2d2cb3	Btrfs: fix max dir item size calculation We were accounting for sizeof(struct btrfs_item) twice, once in the data_size variable and another time in the if statement below. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:49 -08:00
Filipe David Borba Manana	12cfbad90e	Btrfs: more efficient extent state insertions Currently we do 2 traversals of an inode's extent_io_tree before inserting an extent state structure: 1 to see if a matching extent state already exists and 1 to do the insertion if the fist traversal didn't found such extent state. This change just combines those tree traversals into a single one. While running sysbench tests (random writes) I captured the number of elements in extent_io_tree trees for a while (into a procfs file backed by a seq_list from seq_file module) and got this histogram: Count: 9310 Range: 51.000 - 21386.000; Mean: 11785.243; Median: 18743.500; Stddev: 8923.688 Percentiles: 90th: 20985.000; 95th: 21155.000; 99th: 21369.000 51.000 - 93.933: 693 ######## 93.933 - 172.314: 938 ########## 172.314 - 315.408: 856 ######### 315.408 - 576.646: 95 # 576.646 - 6415.830: 888 ########## 6415.830 - 11713.809: 1024 ########### 11713.809 - 21386.000: 4816 ##################################################### So traversing such trees can take some significant time that can easily be avoided. Ran the following sysbench tests, 5 times each, for sequential and random writes, and got the following results: sysbench --test=fileio --file-num=1 --file-total-size=2G \ --file-test-mode=seqwr --num-threads=16 --file-block-size=65536 \ --max-requests=0 --max-time=60 --file-io-mode=sync sysbench --test=fileio --file-num=1 --file-total-size=2G \ --file-test-mode=rndwr --num-threads=16 --file-block-size=65536 \ --max-requests=0 --max-time=60 --file-io-mode=sync Before this change: sequential writes: 69.28Mb/sec (average of 5 runs) random writes: 4.14Mb/sec (average of 5 runs) After this change: sequential writes: 69.91Mb/sec (average of 5 runs) random writes: 5.69Mb/sec (average of 5 runs) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:49 -08:00
Filipe David Borba Manana	c42ac0bc95	Btrfs: add missing extent state caching calls When we didn't find a matching extent state, we inserted a new one but didn't cache it in the **cached_state parameter, which makes a subsequent call do a tree lookup to get it. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:48 -08:00
Filipe David Borba Manana	32193c147f	Btrfs: faster and more efficient extent map insertion Before this change, adding an extent map to the extent map tree of an inode required 2 tree nevigations: 1) doing a tree navigation to search for an existing extent map starting at the same offset or an extent map that overlaps the extent map we want to insert; 2) Another tree navigation to add the extent map to the tree (if the former tree search didn't found anything). This change just merges these 2 steps into a single one. While running first few btrfs xfstests I had noticed these trees easily had a few hundred elements, and then with the following sysbench test it reached over 1100 elements very often. Test: sysbench --test=fileio --file-num=32 --file-total-size=10G \ --file-test-mode=seqwr --num-threads=512 --file-block-size=8192 \ --max-requests=1000000 --file-io-mode=sync [prepare\|run] (fs created with mkfs.btrfs -l 4096 -f /dev/sdb3 before each sysbench prepare phase) Before this patch: run 1 - 41.894Mb/sec run 2 - 40.527Mb/sec run 3 - 40.922Mb/sec run 4 - 49.433Mb/sec run 5 - 40.959Mb/sec average - 42.75Mb/sec After this patch: run 1 - 48.036Mb/sec run 2 - 50.21Mb/sec run 3 - 50.929Mb/sec run 4 - 46.881Mb/sec run 5 - 53.192Mb/sec average - 49.85Mb/sec Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:48 -08:00
Filipe David Borba Manana	68ba990f7d	Btrfs: fix extent boundary check in bio_readpage_error Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:47 -08:00
Filipe David Borba Manana	5a4267ca20	Btrfs: try harder to avoid btree node splits When attempting to move items from our target leaf to its neighbor leaves (right and left), we only need to free data_size - free_space bytes from our leaf in order to add the new item (which has size of data_size bytes). Therefore attempt to move items to the right and left leaves if they have at least data_size - free_space bytes free, instead of data_size bytes free. After 5 runs of the following test, I got a smaller number of btree node splits overall: sysbench --test=fileio --file-num=512 --file-total-size=5G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=8192 --max-requests=100000 --file-io-mode=sync Before this change: * 6171 splits (average of 5 test runs) * 61.508Mb/sec of throughput (average of 5 test runs) After this change: * 6036 splits (average of 5 test runs) * 63.533Mb/sec of throughput (average of 5 test runs) An ideal test would not just have multiple threads/processes writing to a file (insertion of file extent items) but also do other operations that result in insertion of items with varied sizes, like file/directory creations, creation of links, symlinks, xattrs, etc. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:46 -08:00
Filipe David Borba Manana	1b8e7e45e5	Btrfs: avoid unnecessary ordered extent cache resets After an ordered extent completes, don't blindly reset the inode's ordered tree last accessed ordered extent pointer. While running the xfstests I noticed that about 29% of the time the ordered extent to which tree->last pointed was not the same as our just completed ordered extent. After that I ran the following sysbench test (after a prepare phase) and noticed that about 68% of the time tree->last pointed to a different ordered extent too. sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=rndwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --max-requests=0 run Therefore reset tree->last on ordered extent removal only if it pointed to the ordered extent we're removing from the tree. Results from 4 runs of the following test before and after applying this patch: $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync prepare $ sysbench --test=fileio --file-num=32 --file-total-size=4G \ --file-test-mode=seqwr --num-threads=512 \ --file-block-size=32768 --max-time=60 --file-io-mode=sync run Before this path: run 1 - 64.049Mb/sec run 2 - 63.455Mb/sec run 3 - 64.656Mb/sec run 4 - 63.833Mb/sec After this patch: run 1 - 66.149Mb/sec run 2 - 68.459Mb/sec run 3 - 66.338Mb/sec run 4 - 66.176Mb/sec With random writes (--file-test-mode=rndwr) I had huge fluctuations on the results (+- 35% easily). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:46 -08:00
Jeff Mahoney	e453d989e0	btrfs: fix leaks during sysfs teardown Filipe noticed that we were leaking the features attribute group after umount. His fix of just calling sysfs_remove_group() wasn't enough since that removes just the supported features and not the unsupported features. This patch changes the unknown feature handling to add them individually so we can skip the kmalloc and uses the same iteration to tear them down later. We also fix the error handling during mount so that we catch the failing creation of the per-super kobject, and handle proper teardown of a half-setup sysfs context. Tested properly with kmemleak enabled this time. Reported-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Tested-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:45 -08:00
Jeff Mahoney	1b8e5df6d9	btrfs: fix static checker warnings This patch fixes the following warnings: fs/btrfs/extent-tree.c:6201:12: sparse: symbol 'get_raid_name' was not declared. Should it be static? fs/btrfs/extent-tree.c:8430:9: error: format not a string literal and no format arguments [-Werror=format-security] get_raid_name(index)); Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:44 -08:00
Filipe David Borba Manana	131e404a2a	Btrfs: fix very slow inode eviction and fs unmount The inode eviction can be very slow, because during eviction we tell the VFS to truncate all of the inode's pages. This results in calls to btrfs_invalidatepage() which in turn does calls to lock_extent_bits() and clear_extent_bit(). These calls result in too many merges and splits of extent_state structures, which consume a lot of time and cpu when the inode has many pages. In some scenarios I have experienced umount times higher than 15 minutes, even when there's no pending IO (after a btrfs fs sync). A quick way to reproduce this issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ cd /mnt/btrfs $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ time btrfs fi sync . FSSync '.' real 0m25.457s user 0m0.000s sys 0m0.092s $ cd .. $ time umount /mnt/btrfs real 1m38.234s user 0m0.000s sys 1m25.760s The same test on ext4 runs much faster: $ mkfs.ext4 /dev/sdb3 $ mount /dev/sdb3 /mnt/ext4 $ cd /mnt/ext4 $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ sync $ cd .. $ time umount /mnt/ext4 real 0m3.626s user 0m0.004s sys 0m3.012s After this patch, the unmount (inode evictions) is much faster: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ cd /mnt/btrfs $ sysbench --test=fileio --file-num=128 --file-total-size=16G \ --file-test-mode=seqwr --num-threads=128 \ --file-block-size=16384 --max-time=60 --max-requests=0 run $ time btrfs fi sync . FSSync '.' real 0m26.774s user 0m0.000s sys 0m0.084s $ cd .. $ time umount /mnt/btrfs real 0m1.811s user 0m0.000s sys 0m1.564s Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:44 -08:00
Wang Shilong	0647bf564f	Btrfs: improve forever loop when doing balance relocation We hit a forever loop when doing balance relocation,the reason is that we firstly reserve 4M(node size is 16k).and within transaction we will try to add extra reservation for snapshot roots,this will return -EAGAIN if there has been a thread flushing space to reserve space.We will do this again and again with filesystem becoming nearly full. If the above '-EAGAIN' case happens, we try to refill reservation more outsize of transaction, and this will return eariler in enospc case,however, this dosen't really hurt because it makes no sense doing balance relocation with the filesystem nearly full. Miao Xie helped a lot to track this issue, thanks. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:43 -08:00
Filipe David Borba Manana	6126e3caf7	Btrfs: fix ordered extent check in btrfs_punch_hole If the ordered extent's last byte was 1 less than our region's start byte, we would unnecessarily wait for the completion of that ordered extent, because it doesn't intersect our target range. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:42 -08:00
Miao Xie	376cc685cb	Btrfs: fix the reserved space leak caused by the race between nonlock dio and buffered io When we ran sysbench on the fs with compression, the following WARN_ONs were triggered: fs/btrfs/inode.c:7829 WARN_ON(BTRFS_I(inode)->outstanding_extents); fs/btrfs/inode.c:7830 WARN_ON(BTRFS_I(inode)->reserved_extents); fs/btrfs/inode.c:7832 WARN_ON(BTRFS_I(inode)->csum_bytes); Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync prepare # cd - # umount <mnt> # mount -o compress <dev> <mnt> # cd <mnt> # sysbench --test=fileio --num-threads=8 --file-total-size=8G \ > --file-block-size=32K --file-io-mode=rndwr --file-fsync-freq=0 \ > --file-fsync-end=no --max-requests=300000 --file-extra-flags=direct \ > --file-test-mode=sync run # cd - # umount <mnt> The reason of this problem is: Task0 Task1 btrfs_direct_IO unlock(&inode->i_mutex) lock(&inode->i_mutex) reserve_space() prepare_pages() lock_extent() clear_extent() unlock_extent() lock_extent() test_extent(uptodate) return false copy_data() set_delalloc_extent() extent need compress go back to buffered write clear_extent(DELALLOC \| DIRTY) unlock_extent() Task 0 and 1 wrote the same place, and task0 cleared the delalloc flag which was set by task1, it made the dirty pages in that extents couldn't be flushed into the disk, so the reserved space for that extent was not released at the end. This patch fixes the above bug by unlocking the extent after the delalloc. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:42 -08:00
Miao Xie	b37392ea86	Btrfs: cleanup unnecessary parameter and variant of prepare_pages() - the caller has gotten the inode object, needn't pass the file object. And if so, we needn't define a inode pointer variant. - the position should be aligned by the page size not sector size, so we also needn't pass the root object into prepare_pages(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:41 -08:00
David Sterba	cc37bb0420	btrfs: replace BUG in can_modify_feature We don't need to crash hard here, it's just reading a sysfs file. The values considered in switch are from a fixed set, the default case should not happen at all. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:41 -08:00
David Sterba	43d87fa231	btrfs: reserve no transaction units in btrfs_feature_attr_store Added in patch "btrfs: add ability to change features via sysfs", modifications to superblock don't need to reserve metadata blocks when starting a transaction. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:40 -08:00
Frank Holton	27a0dd61a5	Btrfs: make btrfs_debug match pr_debug handling related to DEBUG The kernel macro pr_debug is defined as a empty statement when DEBUG is not defined. Make btrfs_debug match pr_debug to avoid spamming the kernel log with debug messages Signed-off-by: Frank Holton <fholton@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:39 -08:00
Sergei Trofimovich	33b98f2271	btrfs: cleanup: removed unused 'btrfs_get_inode_ref_index' Found by uselex.rb: > btrfs_get_inode_ref_index: [R]: exported from: fs/btrfs/inode-item.o fs/btrfs/btrfs.o fs/btrfs/built-in.o Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Reviewed-by: David Stebra <dsterba@suse.cz> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:39 -08:00
Kelley Nielsen	3f870c2899	btrfs: expand btrfs_find_item() to include find_orphan_item functionality This is the third step in bootstrapping the btrfs_find_item interface. The function find_orphan_item(), in orphan.c, is similar to the two functions already replaced by the new interface. It uses two parameters, which are already present in the interface, and is nearly identical to the function brought in in the previous patch. Replace the two calls to find_orphan_item() with calls to btrfs_find_item(), with the defined objectid and type that was used internally by find_orphan_item(), a null path, and a null key. Add a test for a null path to btrfs_find_item, and if it passes, allocate and free the path. Finally, remove find_orphan_item(). Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:37 -08:00
Kelley Nielsen	75ac2dd907	btrfs: expand btrfs_find_item() to include find_root_ref functionality This patch is the second step in bootstrapping the btrfs_find_item interface. The btrfs_find_root_ref() is similar to the former __inode_info(); it accepts four of its parameters, and duplicates the first half of its functionality. Replace the one former call to btrfs_find_root_ref() with a call to btrfs_find_item(), along with the defined key type that was used internally by btrfs_find_root ref, and a null found key. In btrfs_find_item(), add a test for the null key at the place where the functionality of btrfs_find_root_ref() ends; btrfs_find_item() then returns if the test passes. Finally, remove btrfs_find_root_ref(). Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Suggested-by: Zach Brown <zab@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:36 -08:00
Kelley Nielsen	e33d5c3d6d	btrfs: bootstrap generic btrfs_find_item interface There are many btrfs functions that manually search the tree for an item. They all reimplement the same mechanism and differ in the conditions that they use to find the item. __inode_info() is one such example. Zach Brown proposed creating a new interface to take the place of these functions. This patch is the first step to creating the interface. A new function, btrfs_find_item, has been added to ctree.c and prototyped in ctree.h. It is identical to __inode_info, except that the order of the parameters has been rearranged to more closely those of similar functions elsewhere in the code (now, root and path come first, then the objectid, offset and type, and the key to be filled in last). __inode_info's callers have been set to call this new function instead, and __inode_info itself has been removed. Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Suggested-by: Zach Brown <zab@redhat.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:36 -08:00
Valentina Giusti	a3df41ee37	btrfs: fix unused variables in qgroup.c Use otherwise unused local variables slot in update_qgroup_limit_item and in update_qgroup_info_item, and remove unused variable ins from btrfs_qgroup_account_ref. Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:35 -08:00
Valentina Giusti	e94acd86d4	btrfs: replace path->slots[0] with otherwise unused variable 'slot' Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:35 -08:00
Valentina Giusti	ce3e7f1073	btrfs: remove unused variable from scrub_fixup_nodatasum Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:34 -08:00
Valentina Giusti	f0265bb409	btrfs: remove unused variable from setup_cluster_no_bitmap The variable window_start in setup_cluster_no_bitmap is not used since commit `1bb91902dc` (Btrfs: revamp clustered allocation logic) Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:33 -08:00
Valentina Giusti	50892bac3b	btrfs: remove unused variables from extent_io.c Remove unused variables: * tree from end_bio_extent_writepage, * item from extent_fiemap. Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:33 -08:00
Valentina Giusti	4b447bfac6	btrfs: remove unused variable from find_free_extent The variable found_uncached_bg in find_free_extent is not used since commit `285ff5af6c` (Btrfs: remove the ideal caching code) Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:32 -08:00
Valentina Giusti	71db2a7751	btrfs: remove unused variables from disk-io.c Remove unused variables: * tree from csum_dirty_buffer, * tree from btree_readpage_end_io_hook, * tree from btree_writepages, * bytenr from btrfs_create_tree, * fs_info from end_workqueue_fn. Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:31 -08:00
Valentina Giusti	99e22f783b	btrfs: remove unused variable from btrfs_new_inode Variable owner in btrfs_new_inode is unused since commit `d82a6f1d7e` (Btrfs: kill BTRFS_I(inode)->block_group) Signed-off-by: Valentina Giusti <valentina.giusti@microon.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:31 -08:00
Jeff Mahoney	f8ba9c11f8	btrfs: publish fs label in sysfs This adds a writeable attribute which describes the label. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:30 -08:00
Jeff Mahoney	29e5be240a	btrfs: publish device membership in sysfs Now that we have the infrastructure for per-super attributes, we can publish device membership in /sys/fs/btrfs/<fsid>/devices. The information is published as symlinks to the block devices. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:29 -08:00
Jeff Mahoney	6ab0a2029c	btrfs: publish allocation data in sysfs While trying to debug ENOSPC issues, it's helpful to understand what the kernel's view of the available space is. We export this information via ioctl, but sysfs files are more easily used. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:29 -08:00
Jeff Mahoney	01e219e806	btrfs: add ioctl to export size of global metadata reservation btrfs filesystem df output will show the size of the metadata space and how much of it is used, and the user assumes that the difference is all usable space. Since that's not actually the case due to the global metadata reservation, we should provide the full picture to the user. This patch adds an ioctl that exports the size of the global metadata reservation so that btrfs filesystem df can report it. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:28 -08:00
Jeff Mahoney	3b02a68a63	btrfs: use feature attribute names to print better error messages Now that we have the feature name strings available in the kernel via the sysfs attributes, we can use them for printing better failure messages from the ioctl path. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:28 -08:00
Jeff Mahoney	ba631941ef	btrfs: add ability to change features via sysfs This patch adds the ability to change (set/clear) features while the file system is mounted. A bitmask is added for each feature set for the support to set and clear the bits. A message indicating which bit has been set or cleared is issued when it's been changed and also when permission or support for a particular bit has been denied. Since the the attributes can now be writable, we need to introduce another struct attribute to hold the different permissions. If neither set or clear is supported, the file will have 0444 permissions. If either set or clear is supported, the file will have 0644 permissions and the store handler will filter out the write based on the bitmask. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:27 -08:00
Jeff Mahoney	79da4fa4d9	btrfs: publish unknown feature bits in sysfs With the compat and compat-ro bits, it's possible for file systems to exist that have features that aren't supported by the kernel's file system implementation yet still be mountable. This patch publishes read-only info on those features using a prefix:number format, where the number is the bit number rather than the shifted value. e.g. "compat:12" Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:26 -08:00
Jeff Mahoney	510d73600a	btrfs: publish per-super features in sysfs This patch publishes information on which features are enabled in the file system on a per-super basis. At this point, it only publishes information on features supported by the file system implementation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:26 -08:00
Jeff Mahoney	5ac1d209f1	btrfs: publish per-super attributes in sysfs This patch adds per-super attributes to sysfs. It doesn't publish any attributes yet, but does the proper lifetime handling as well as the basic infrastructure to add new attributes. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:25 -08:00
Jeff Mahoney	079b72bca3	btrfs: publish supported featured in sysfs This patch adds the ability to publish supported features to sysfs under /sys/fs/btrfs/features. The files are module-wide and export which features the kernel supports. The content, for now, is just "0\n". Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:25 -08:00
Jeff Mahoney	2eaa055fab	btrfs: add ioctls to query/change feature bits online There are some feature bits that require no offline setup and can be enabled online. I've only reviewed extended irefs, but there will probably be more. We introduce three new ioctls: - BTRFS_IOC_GET_SUPPORTED_FEATURES: query the kernel for supported features. - BTRFS_IOC_GET_FEATURES: query the kernel for enabled features on a per-fs basis, as well as querying for which features are changeable with mounted. - BTRFS_IOC_SET_FEATURES: change features on a per-fs basis. We introduce two new masks per feature set (_SAFE_SET and _SAFE_CLEAR) that allow us to define which features are safe to change at runtime. The failure modes for BTRFS_IOC_SET_FEATURES are as follows: - Enabling a completely unsupported feature: warns and returns -ENOTSUPP - Enabling a feature that can only be done offline: warns and returns -EPERM Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:23 -08:00
Liu Bo	9e5ac13acb	Btrfs: skip merge part for delayed data refs When we have data deduplication on, we'll hang on the merge part because it needs to verify every queued delayed data refs related to this disk offset but we may have millions refs. And in the case of delayed data refs, we don't usually have too much data refs to merge. So it's safe to shut it down for data refs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:23 -08:00
Liu Bo	c46effa601	Btrfs: introduce a head ref rbtree The way how we process delayed refs is 1) get a bunch of head refs, 2) pick up one head ref, 3) go one node back for any delayed ref updates. The head ref is also linked in the same rbtree as the delayed ref is, so in 1) stage, we have to walk one by one including not only head refs, but delayed refs. When we have a great number of delayed refs pending to process, this'll cost time a lot. Here we introduce a head ref specific rbtree, it only has head refs, so troubles go away. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:22 -08:00
Josef Bacik	e20d6c5ba3	Btrfs: fix check-integrity to look at the referenced data properly We were looking at file_extent_num_bytes unconditionally when looking at referenced data bytes, but this isn't correct for compression. Fix this by checking the compression of the file extent we are and setting num_bytes to disk_num_bytes in the case of compression so that we are marking the proper bytes as referenced. This fixes check_int_data freaking out when running btrfs/004. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:21 -08:00
Josef Bacik	16e7549f04	Btrfs: incompatible format change to remove hole extents Btrfs has always had these filler extent data items for holes in inodes. This has made somethings very easy, like logging hole punches and sending hole punches. However for large holey files these extent data items are pure overhead. So add an incompatible feature to no longer add hole extents to reduce the amount of metadata used by these sort of files. This has a few changes for logging and send obviously since they will need to detect holes and log/send the holes if there are any. I've tested this thoroughly with xfstests and it doesn't cause any issues with and without the incompat format set. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-01-28 13:19:21 -08:00
Linus Torvalds	bf3d846b78	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus assorted cleanups and fixes all over the place... There will be another pile later this week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits) __dentry_path() fixes vfs: Remove second variable named error in __dentry_path vfs: Is mounted should be testing mnt_ns for NULL or error. Fix race when checking i_size on direct i/o read hfsplus: remove can_set_xattr nfsd: use get_acl and ->set_acl fs: remove generic_acl nfs: use generic posix ACL infrastructure for v3 Posix ACLs gfs2: use generic posix ACL infrastructure jfs: use generic posix ACL infrastructure xfs: use generic posix ACL infrastructure reiserfs: use generic posix ACL infrastructure ocfs2: use generic posix ACL infrastructure jffs2: use generic posix ACL infrastructure hfsplus: use generic posix ACL infrastructure f2fs: use generic posix ACL infrastructure ext2/3/4: use generic posix ACL infrastructure btrfs: use generic posix ACL infrastructure fs: make posix_acl_create more useful fs: make posix_acl_chmod more useful ...	2014-01-28 08:38:04 -08:00
Christoph Hellwig	996a710d46	btrfs: use generic posix ACL infrastructure Also don't bother to set up a .get_acl method for symlinks as we do not support access control (ACLs or even mode bits) for symlinks in Linux. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-01-25 23:58:18 -05:00
Christoph Hellwig	37bc15392a	fs: make posix_acl_create more useful Rename the current posix_acl_created to __posix_acl_create and add a fully featured helper to set up the ACLs on file creation that uses get_acl(). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-01-25 23:58:18 -05:00
Christoph Hellwig	5bf3258fd2	fs: make posix_acl_chmod more useful Rename the current posix_acl_chmod to __posix_acl_chmod and add a fully featured ACL chmod helper that uses the ->set_acl inode operation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-01-25 23:58:18 -05:00
Al Viro	1c1c8747cd	btrfs: sanitize BTRFS_IOC_FILE_EXTENT_SAME * don't assume that ->dest_count won't change between copy_from_user() and memdup_user() * use fdget instead of fget * don't bother comparing superblocks when we'd already compared vfsmounts * get rid of excessive goto * use file_inode() instead of open-coding the sucker Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2014-01-25 03:13:03 -05:00
Linus Torvalds	1d32bdafaa	xfs: update for v3.14-rc1 For 3.14-rc1 there are fixes in the areas of remote attributes, discard, growfs, memory leaks in recovery, directory v2, quotas, the MAINTAINERS file, allocation alignment, extent list locking, and in xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg, removing unused macros, quotas, setattr, and freeing of inode clusters. The in-memory and on-disk log format have been decoupled, a common helper to calculate the number of blocks in an inode cluster has been added, and handling of i_version has been pulled into the filesystems that use it. - cleanup in xfs_setsize_buftarg - removal of remaining unused flags for vop toss/flush/flushinval - fix for memory corruption in xfs_attrlist_by_handle - fix for out-of-date comment in xfs_trans_dqlockedjoin - fix for discard if range length is less than one block - fix for overrun of agfl buffer using growfs on v4 superblock filesystems - pull i_version handling out into the filesystems that use it - don't leak recovery items on error - fix for memory leak in xfs_dir2_node_removename - several cleanups for quotas - fix bad assertion in xfs_qm_vop_create_dqattach - cleanup for xfs_setattr_mode, and add xfs_setattr_time - fix quota assert in xfs_setattr_nonsize - fix an infinite loop when turning off group/project quota before user quota - fix for temporary buffer allocation failure in xfs_dir2_block_to_sf with large directory block sizes - fix Dave's email address in MAINTAINERS - cleanup calculation of freed inode cluster blocks - fix alignment of initial file allocations to match filesystem geometry - decouple in-memory and on-disk log format - introduce a common helper to calculate the number of filesystem blocks in an inode cluster - fixes for extent list locking - fix for off-by-one in xfs_attr3_rmt_verify - fix for missing destroy_work_on_stack in xfs_bmapi_allocate -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.10 (GNU/Linux) iQIcBAABAgAGBQJS4C2sAAoJENaLyazVq6ZOF/UP/A/FmscsEbgz+KHtsg1UXP+g EMkrD8WOOzLnm7/GhkfvmmFLQrWrrNPmiP54MPJ/rxpfDb7rV60HgS0iHm13U+IR 0uPyk2NIQKG/Sj6FHCrgn4oh19B0tmer6lLE32UPrlNM7w/0NRm32Ms/jKz7WNvc 9Tqw3qNdtmP7leHG8EyD1ISzqUz6FNSC+qGyOTuBQUqG24LqsmZ1qD2Nw/HEz0ir 1YCqS+cjj8c3WaWMsdjEFdCvEIKS/sMYM+ZihATIl/5ggwtVobP7PH/4cG8Biby3 5wvcl3/jM+xjgGDC1YMQQeSADieus6ER4aZTjCvGLn9RajQTOBjP0QZTu2OHntTI kD/52DfYErthGijS2qJcqXy11AaYtWQU2yTZXuMpaACIM1DDSD4NyGFP99BzbBaX D4BgnC+vmfpbXO37PmophkeZwAvj+9K2BBG7X+g4sLynj/BtmvFCxIyK+SLHE3hN kn+Pn03yTw0VGgvj0krXSllbYKE0LyLElaA6h0LejQgcOZM0W6Q8qaj+RODLitbw HkQNKYCOMCzbdNu8rKAfup9UPZloiMukyMdMaw1MeCgCQSQLkZVxTegmVW9gzpIh QJEARDaOnKB/G/CLCwbPZR09DBpg3yBt/eHjt0VnV+z2x+u9DTInAp6f013g7E/W gU+vnBBPY1AYI8iiDraU =t8nB -----END PGP SIGNATURE----- Merge tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs Pull xfs update from Ben Myers: "This is primarily bug fixes, many of which you already have. New stuff includes a series to decouple the in-memory and on-disk log format, helpers in the area of inode clusters, and i_version handling. We decided to try to use more topic branches this release, so there are some merge commits in there on account of that. I'm afraid I didn't do a good job of putting meaningful comments in the first couple of merges. Sorry about that. I think I have the hang of it now. For 3.14-rc1 there are fixes in the areas of remote attributes, discard, growfs, memory leaks in recovery, directory v2, quotas, the MAINTAINERS file, allocation alignment, extent list locking, and in xfs_bmapi_allocate. There are cleanups in xfs_setsize_buftarg, removing unused macros, quotas, setattr, and freeing of inode clusters. The in-memory and on-disk log format have been decoupled, a common helper to calculate the number of blocks in an inode cluster has been added, and handling of i_version has been pulled into the filesystems that use it. - cleanup in xfs_setsize_buftarg - removal of remaining unused flags for vop toss/flush/flushinval - fix for memory corruption in xfs_attrlist_by_handle - fix for out-of-date comment in xfs_trans_dqlockedjoin - fix for discard if range length is less than one block - fix for overrun of agfl buffer using growfs on v4 superblock filesystems - pull i_version handling out into the filesystems that use it - don't leak recovery items on error - fix for memory leak in xfs_dir2_node_removename - several cleanups for quotas - fix bad assertion in xfs_qm_vop_create_dqattach - cleanup for xfs_setattr_mode, and add xfs_setattr_time - fix quota assert in xfs_setattr_nonsize - fix an infinite loop when turning off group/project quota before user quota - fix for temporary buffer allocation failure in xfs_dir2_block_to_sf with large directory block sizes - fix Dave's email address in MAINTAINERS - cleanup calculation of freed inode cluster blocks - fix alignment of initial file allocations to match filesystem geometry - decouple in-memory and on-disk log format - introduce a common helper to calculate the number of filesystem blocks in an inode cluster - fixes for extent list locking - fix for off-by-one in xfs_attr3_rmt_verify - fix for missing destroy_work_on_stack in xfs_bmapi_allocate" * tag 'xfs-for-linus-v3.14-rc1' of git://oss.sgi.com/xfs/xfs: (51 commits) xfs: Calling destroy_work_on_stack() to pair with INIT_WORK_ONSTACK() xfs: fix off-by-one error in xfs_attr3_rmt_verify xfs: assert that we hold the ilock for extent map access xfs: use xfs_ilock_attr_map_shared in xfs_attr_list_int xfs: use xfs_ilock_attr_map_shared in xfs_attr_get xfs: use xfs_ilock_data_map_shared in xfs_qm_dqiterate xfs: use xfs_ilock_data_map_shared in xfs_qm_dqtobp xfs: take the ilock around xfs_bmapi_read in xfs_zero_remaining_bytes xfs: reinstate the ilock in xfs_readdir xfs: add xfs_ilock_attr_map_shared xfs: rename xfs_ilock_map_shared xfs: remove xfs_iunlock_map_shared xfs: no need to lock the inode in xfs_find_handle xfs: use xfs_icluster_size_fsb in xfs_imap xfs: use xfs_icluster_size_fsb in xfs_ifree_cluster xfs: use xfs_icluster_size_fsb in xfs_ialloc_inode_init xfs: use xfs_icluster_size_fsb in xfs_bulkstat xfs: introduce a common helper xfs_icluster_size_fsb xfs: get rid of XFS_IALLOC_BLOCKS macros xfs: get rid of XFS_INODE_CLUSTER_SIZE macros ...	2014-01-23 09:16:20 -08:00
Muthu Kumar	c7b22bb19a	btrfs: fix missing increment of bi_remaining In btrfs_end_bio(), we increment bi_remaining if is_orig_bio. If not, we restore the orig_bio but failed to increment bi_remaining for orig_bio, which triggers a BUG_ON later when bio_endio is called. Fix is to increment bi_remaining when we restore the orig bio as well. Reported-by: Fengguang Wu <fengguang.wu@intel.com> CC: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Muthukumar Ratty <muthur@gmail.com> Reviewed-by: Chris Mason <clm@fb.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2014-01-08 14:19:52 -07:00
Masanari Iida	8faaaead62	treewide: fix comments and printk msgs This patch fixed several typo in printk from various part of kernel source. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-01-07 15:06:07 +01:00
Jens Axboe	b28bc9b38c	Linux 3.13-rc6 -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJSwLfoAAoJEHm+PkMAQRiGi6QH/1U1B7lmHChDTw3jj1lfm9gA 189Si4QJlnxFWCKHvKEL+pcaVuACU+aMGI8+KyMYK4/JfuWVjjj5fr/SvyHH2/8m LdSK8aHMhJ46uBS4WJ/l6v46qQa5e2vn8RKSBAyKm/h4vpt+hd6zJdoFrFai4th7 k/TAwOAEHI5uzexUChwLlUBRTvbq4U8QUvDu+DeifC8cT63CGaaJ4qVzjOZrx1an eP6UXZrKDASZs7RU950i7xnFVDQu4PsjlZi25udsbeiKcZJgPqGgXz5ULf8ZH8RQ YCi1JOnTJRGGjyIOyLj7pyB01h7XiSM2+eMQ0S7g54F2s7gCJ58c2UwQX45vRWU= =/4/R -----END PGP SIGNATURE----- Merge tag 'v3.13-rc6' into for-3.14/core Needed to bring blk-mq uptodate, since changes have been going in since for-3.14/core was established. Fixup merge issues related to the immutable biovec changes. Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: block/blk-flush.c fs/btrfs/check-integrity.c fs/btrfs/extent_io.c fs/btrfs/scrub.c fs/logfs/dev_bdev.c	2013-12-31 09:51:02 -07:00
Masanari Iida	77d84ff87e	treewide: Fix typos in printk Correct spelling typo in various part of kernel Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-12-19 15:10:49 +01:00
Linus Torvalds	e09f67f147	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small collection of fixes. It was rebased this morning, but I was just fixing signed-off-by tags with the wrong email" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix access_ok() check in btrfs_ioctl_send() Btrfs: make sure we cleanup all reloc roots if error happens Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation Btrfs: fix an oops when doing balance relocation Btrfs: don't miss skinny extent items on delayed ref head contention btrfs: call mnt_drop_write after interrupted subvol deletion Btrfs: don't clear the default compression type	2013-12-12 15:25:10 -08:00
Dan Carpenter	700ff4f095	Btrfs: fix access_ok() check in btrfs_ioctl_send() The closing parenthesis is in the wrong place. We want to check "sizeof(arg->clone_sources) arg->clone_sources_count" instead of "sizeof(arg->clone_sources arg->clone_sources_count)". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org	2013-12-12 07:13:02 -08:00
Wang Shilong	467bb1d27c	Btrfs: make sure we cleanup all reloc roots if error happens I hit an oops when merging reloc roots fails, the reason is that new reloc roots may be added and we should make sure we cleanup all reloc roots. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2013-12-12 07:12:51 -08:00
Wang Shilong	6646374863	Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation Quota tree and UUID Tree is only cowed, they can not be snapshoted. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2013-12-12 07:12:36 -08:00
Wang Shilong	c974c4642f	Btrfs: fix an oops when doing balance relocation I hit an oops when inserting reloc root into @reloc_root_tree(it can be easily triggered when forcing cow for relocation root) [ 866.494539] [<ffffffffa0499579>] btrfs_init_reloc_root+0x79/0xb0 [btrfs] [ 866.495321] [<ffffffffa044c240>] record_root_in_trans+0xb0/0x110 [btrfs] [ 866.496109] [<ffffffffa044d758>] btrfs_record_root_in_trans+0x48/0x80 [btrfs] [ 866.496908] [<ffffffffa0494da8>] select_reloc_root+0xa8/0x210 [btrfs] [ 866.497703] [<ffffffffa0495c8a>] do_relocation+0x16a/0x540 [btrfs] This is because reloc root inserted into @reloc_root_tree is not within one transaction,reloc root may be cowed and root block bytenr will be reused then oops happens.We should update reloc root in @reloc_root_tree when cow reloc root node, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2013-12-12 07:12:20 -08:00
Filipe David Borba Manana	639eefc8af	Btrfs: don't miss skinny extent items on delayed ref head contention Currently extent-tree.c:btrfs_lookup_extent_info() can miss the lookup of skinny extent items. This can happen when the execution flow is the following: * We do an extent tree lookup and fail to find a skinny extent item; * As a result, we attempt to see if a non-skinny extent item exists, either by looking at previous item in the leaf or by doing another full extent tree search; * We have a transaction and then we check for a matching delayed ref head in the transaction's delayed refs rbtree; * We find such delayed ref head and then we try to lock it with a call to mutex_trylock(); * The lock was contended so we jump to the label "again", which repeats the extent tree search but for a non-skinny extent item, because we set previously metadata variable to 0 and the search key to look for a non-skinny extent-item; * After the jump (and after releasing the transaction's delayed refs lock), a skinny extent item might have been added to the extent tree but we will miss it because metadata is set to 0 and the search key is set for a non-skinny extent-item. The fix here is to not reset metadata to 0 and to jump to the initial search key setup if the delayed ref head is contended, instead of jumping directly to the extent tree search label ("again"). This issue was found while investigating the issue reported at Bugzilla 64961. David Sterba suspected this function was missing extent items, and that this could be caused by the last change to this function, which was made in the following patch: [PATCH] Btrfs: optimize btrfs_lookup_extent_info() (commit `74be951087`) But in fact this issue already existed before, because after failing to find a skinny extent item, the code set the search key for a non-skinny extent item, and on contention of a matching delayed ref head it would not search the extent tree for a skinny extent item anymore. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2013-12-12 07:11:58 -08:00
David Sterba	e43f998e47	btrfs: call mnt_drop_write after interrupted subvol deletion If btrfs_ioctl_snap_destroy blocks on the mutex and the process is killed, mnt_write count is unbalanced and leads to unmountable filesystem. CC: stable@vger.kernel.org Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2013-12-12 07:11:38 -08:00
Miao Xie	a7e252af5a	Btrfs: don't clear the default compression type We met a oops caused by the wrong compression type: [ 556.512356] BUG: unable to handle kernel NULL pointer dereference at (null) [ 556.512370] IP: [<ffffffff811dbaa0>] __list_del_entry+0x1/0x98 [SNIP] [ 556.512490] [<ffffffff811dbb44>] ? list_del+0xd/0x2b [ 556.512539] [<ffffffffa05dd5ce>] find_workspace+0x97/0x175 [btrfs] [ 556.512546] [<ffffffff813c14b5>] ? _raw_spin_lock+0xe/0x10 [ 556.512576] [<ffffffffa05de276>] btrfs_compress_pages+0x2d/0xa2 [btrfs] [ 556.512601] [<ffffffffa05af060>] compress_file_range.constprop.54+0x1f2/0x4e8 [btrfs] [ 556.512627] [<ffffffffa05af388>] async_cow_start+0x32/0x4d [btrfs] [ 556.512655] [<ffffffffa05cc7a1>] worker_loop+0x144/0x4c3 [btrfs] [ 556.512661] [<ffffffff81059404>] ? finish_task_switch+0x80/0xb8 [ 556.512689] [<ffffffffa05cc65d>] ? btrfs_queue_worker+0x244/0x244 [btrfs] [ 556.512695] [<ffffffff8104fa4e>] kthread+0x8d/0x95 [ 556.512699] [<ffffffff81050000>] ? bit_waitqueue+0x34/0x7d [ 556.512704] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65 [ 556.512709] [<ffffffff813c7eec>] ret_from_fork+0x7c/0xb0 [ 556.512713] [<ffffffff8104f9c1>] ? __kthread_parkme+0x65/0x65 Steps to reproduce: # mkfs.btrfs -f <dev> # mount -o nodatacow <dev> <mnt> # touch <mnt>/<file> # chattr =c <mnt>/<file> # dd if=/dev/zero of=<mnt>/<file> bs=1M count=10 It is because we cleared the default compression type when setting the nodatacow. In fact, we needn't do it because we have used COMPRESS flag to indicate if we need compressed the file data or not, needn't use the variant -- compress_type -- in btrfs_info to do the same thing, and just use it to hold the default compression type. Or we would get a wrong compress type for a file whose own compress flag is set but the compress flag of its filesystem is not set. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2013-12-12 07:11:19 -08:00
Linus Torvalds	5ee540613d	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "A small collection of fixes for the current series. It contains: - A fix for a use-after-free of a request in blk-mq. From Ming Lei - A fix for a blk-mq bug that could attempt to dereference a NULL rq if allocation failed - Two xen-blkfront small fixes - Cleanup of submit_bio_wait() type uses in the kernel, unifying that. From Kent - A fix for 32-bit blkg_rwstat reading. I apologize for this one looking mangled in the shortlog, it's entirely my fault for missing an empty line between the description and body of the text" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: fix use-after-free of request blk-mq: fix dereference of rq->mq_ctx if allocation fails block: xen-blkfront: Fix possible NULL ptr dereference xen-blkfront: Silence pfn maybe-uninitialized warning block: submit_bio_wait() conversions Update of blkg_stat and blkg_rwstat may happen in bh context	2013-12-05 15:33:27 -08:00
Christoph Hellwig	dff6efc326	fs: fix iversion handling Currently notify_change directly updates i_version for size updates, which not only is counter to how all other fields are updated through struct iattr, but also breaks XFS, which need inode updates to happen under its own lock, and synchronized to the structure that gets written to the log. Remove the update in the common code, and it to btrfs and ext4, XFS already does a proper updaste internally and currently gets a double update with the existing code. IMHO this is 3.13 and -stable material and should go in through the XFS tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Myers <bpm@sgi.com>	2013-12-05 16:36:21 -06:00
Kent Overstreet	bc1e79acc1	block: fixup for generic bio chaining btrfs bits got lost in the rebase Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Chris Mason <clm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-12-03 14:30:00 -07:00
Kent Overstreet	c170bbb45f	block: submit_bio_wait() conversions It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-11-24 16:33:41 -07:00
Kent Overstreet	4f024f3797	block: Abstract out bvec iterator Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6	2013-11-23 22:33:47 -08:00
Kent Overstreet	2c30c71bd6	block: Convert various code to bio_for_each_segment() With immutable biovecs we don't want code accessing bi_io_vec directly - the uses this patch changes weren't incorrect since they all own the bio, but it makes the code harder to audit for no good reason - also, this will help with multipage bvecs later. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com>	2013-11-23 22:33:46 -08:00
Kent Overstreet	33879d4512	block: submit_bio_wait() conversions It was being open coded in a few places. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Neil Brown <neilb@suse.de> Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: NeilBrown <neilb@suse.de>	2013-11-23 22:33:38 -08:00
Linus Torvalds	fb0d1eb892	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Almost all of these are bug fixes. Dave Sterba's documentation update is the big exception because he removed our promises to set any machine running Btrfs on fire" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Documentation: filesystems: update btrfs tools section Documentation: filesystems: add new btrfs mount options btrfs: update kconfig help text btrfs: fix bio_size_ok() for max_sectors > 0xffff btrfs: Use trace condition for get_extent tracepoint btrfs: fix typo in the log message Btrfs: fix list delete warning when removing ordered root from the list Btrfs: print bytenr instead of page pointer in check-int Btrfs: remove dead codes from ctree.h Btrfs: don't wait for ordered data outside desired range Btrfs: fix lockdep error in async commit Btrfs: avoid heavy operations in btrfs_commit_super Btrfs: fix __btrfs_start_workers retval Btrfs: disable online raid-repair on ro mounts Btrfs: do not inc uncorrectable_errors counter on ro scrubs Btrfs: only drop modified extents if we logged the whole inode Btrfs: make sure to copy everything if we rename Btrfs: don't BUG_ON() if we get an error walking backrefs	2013-11-22 08:38:55 -08:00
David Sterba	4204617d14	btrfs: update kconfig help text Reflect the current status. Portions of the text taken from the wiki pages. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:49:09 -05:00
Akinobu Mita	475bf36ffb	btrfs: fix bio_size_ok() for max_sectors > 0xffff The data type of max_sectors in queue settings is unsigned int. But this value is stored to the local variable whose type is unsigned short in bio_size_ok(). This can cause unexpected result when max_sectors > 0xffff. Cc: Chris Mason <chris.mason@fusionio.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:48:44 -05:00
Steven Rostedt	4cd8587ce8	btrfs: Use trace condition for get_extent tracepoint Doing an if statement to test some condition to know if we should trigger a tracepoint is pointless when tracing is disabled. This just adds overhead and wastes a branch prediction. This is why the TRACE_EVENT_CONDITION() was created. It places the check inside the jump label so that the branch does not happen unless tracing is enabled. That is, instead of doing: if (em) trace_btrfs_get_extent(root, em); Which is basically this: if (em) if (static_key(trace_btrfs_get_extent)) { Using a TRACE_EVENT_CONDITION() we can just do: trace_btrfs_get_extent(root, em); And the condition trace event will do: if (static_key(trace_btrfs_get_extent)) { if (em) { ... The static key is a non conditional jump (or nop) that is faster than having to check if em is NULL or not. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:44:47 -05:00
Anand Jain	52a1575921	btrfs: fix typo in the log message Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:44:47 -05:00
Miao Xie	931aa87791	Btrfs: fix list delete warning when removing ordered root from the list Commit `b02441999e` "Btrfs: don't wait for the completion of all the ordered extents" introduced a bug that broke the ordered root list: WARNING: CPU: 1 PID: 7119 at lib/list_debug.c:59 __list_del_entry+0x5a/0x98() It is because we forgot to return the roots in the splice list to the ordered list of the fs. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:44:46 -05:00
Stefan Behrens	56d140f5f6	Btrfs: print bytenr instead of page pointer in check-int The page pointer information was useless. The bytenr is what you want when you search for submitted write bios. Additionally, a new bit in the print mask is added that allows to selectively enable the check-int submit_bio verbose mode. Before, the global verbose mode had to be enabled leading to many million useless lines in the kernel log. And a comment is added that explains that LOG_BUF_SHIFT needs to be set to a really high value. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:44:46 -05:00
Wang Shilong	9650e05c07	Btrfs: remove dead codes from ctree.h These two functions are only stated but undefined. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:44:45 -05:00
Filipe David Borba Manana	b52abf1e3b	Btrfs: don't wait for ordered data outside desired range In btrfs_wait_ordered_range(), if we found an extent to the left of the start of our desired wait range and the last byte of that extent is 1 less than the desired range's start, we would would wait for the IO completion of that extent unnecessarily. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:44:45 -05:00
Liu Bo	b1a06a4b57	Btrfs: fix lockdep error in async commit Lockdep complains about btrfs's async commit: [ 2372.462171] [ BUG: bad unlock balance detected! ] [ 2372.462191] 3.12.0+ #32 Tainted: G W [ 2372.462209] ------------------------------------- [ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at: [ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462305] but there are no more locks to release! [ 2372.462324] [ 2372.462324] other info that might help us debug this: [ 2372.462349] no locks held by ceph-osd/14048. [ 2372.462367] [ 2372.462367] stack backtrace: [ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G W 3.12.0+ #32 [ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 2372.462455] ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320 [ 2372.462491] ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650 [ 2372.462526] ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000 [ 2372.462562] Call Trace: [ 2372.462584] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462619] [<ffffffff816f094a>] dump_stack+0x45/0x56 [ 2372.462642] [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100 [ 2372.462677] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462710] [<ffffffff810b01ee>] lock_release+0x18e/0x210 [ 2372.462742] [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs] [ 2372.462783] [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs] [ 2372.462822] [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs] [ 2372.462849] [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0 [ 2372.462873] [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0 [ 2372.462897] [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100 [ 2372.462922] [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0 [ 2372.462946] [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0 [ 2372.462969] [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0 [ 2372.462991] [<ffffffff817045a4>] tracesys+0xdd/0xe2 ==================================================== It's because that we don't do the right thing when checking if it's ok to tell lockdep that we're trying to release the rwsem. If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze rwsem, which makes lockdep complains a lot. Reported-by: Ma Jianpeng <majianpeng@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:44:44 -05:00
Liu Bo	d52c1bcc64	Btrfs: avoid heavy operations in btrfs_commit_super The 'git blame' history shows that, the old transaction commit code has to do twice to ensure roots are updated and we have to flush metadata and super block manually, however, right now all of these can be handled well inside the transaction commit code without extra efforts. And the error handling part remains same with the current code, -- 'return to caller once we get error'. This saves us a transaction commit and a flush of super block, which are both heavy operations according to ftrace output analysis. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:42:16 -05:00
Ilya Dryomov	ba69994a40	Btrfs: fix __btrfs_start_workers retval __btrfs_start_workers returns 0 in case it raced with btrfs_stop_workers and lost the race. This is wrong because worker in this case is not allowed to start and is in fact destroyed. Return -EINVAL instead. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:42:11 -05:00
Ilya Dryomov	908960c6c0	Btrfs: disable online raid-repair on ro mounts This disables the "if needed, write the good copy back before the read is completed" part of the read sequence for read-only mounts. Cc: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:42:05 -05:00
Ilya Dryomov	33ef30add1	Btrfs: do not inc uncorrectable_errors counter on ro scrubs Currently if we discover an error when scrubbing in ro mode we a) blindly increment the uncorrectable_errors counter, and b) spam the dmesg with the 'unable to fixup (regular) error at ...' message, even though a) we haven't tried to determine if the error is correctable or not, and b) we haven't tried to fixup anything. Fix this. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:41:38 -05:00
Josef Bacik	d006a04816	Btrfs: only drop modified extents if we logged the whole inode If we fsync, seek and write, rename and then fsync again we will lose the modified hole extent because the rename will drop all of the modified extents since we didn't do the fast search. We need to only drop the modified extents if we didn't do the fast search and we were logging the entire inode as we don't need them anymore, otherwise this is being premature. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:41:32 -05:00
Josef Bacik	6cfab851f4	Btrfs: make sure to copy everything if we rename If we rename a file that is already in the log and we fsync again we will lose the new name. This is because we just log the inode update and not the new ref. To fix this we just need to check if we are logging the new name of the inode and copy all the metadata instead of just updating the inode itself. With this patch my testcase now passes. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:41:24 -05:00
Josef Bacik	4724b106b9	Btrfs: don't BUG_ON() if we get an error walking backrefs We can just return false for this so we stop doing the snapshot aware defrag stuff. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-20 20:41:16 -05:00
Linus Torvalds	ffd3c0260a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This pull fixes the empty_zero_page bug that Heiko reported, and includes one more cleanup from Al Viro" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: get rid of fdentry() btrfs: fix empty_zero_page misusage	2013-11-16 11:57:05 -08:00
Linus Torvalds	9073e1a804	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "Usual earth-shaking, news-breaking, rocket science pile from trivial.git" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits) doc: usb: Fix typo in Documentation/usb/gadget_configs.txt doc: add missing files to timers/00-INDEX timekeeping: Fix some trivial typos in comments mm: Fix some trivial typos in comments irq: Fix some trivial typos in comments NUMA: fix typos in Kconfig help text mm: update 00-INDEX doc: Documentation/DMA-attributes.txt fix typo DRM: comment: `halve' -> `half' Docs: Kconfig: `devlopers' -> `developers' doc: typo on word accounting in kprobes.c in mutliple architectures treewide: fix "usefull" typo treewide: fix "distingush" typo mm/Kconfig: Grammar s/an/a/ kexec: Typo s/the/then/ Documentation/kvm: Update cpuid documentation for steal time and pv eoi treewide: Fix common typo in "identify" __page_to_pfn: Fix typo in comment Correct some typos for word frequency clk: fixed-factor: Fix a trivial typo ...	2013-11-15 16:47:22 -08:00
Al Viro	54563d41a5	btrfs: get rid of fdentry() 3 of 4 callers actually want file_inode()... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-15 09:18:14 -05:00
Chris Mason	46e0f66a0c	btrfs: fix empty_zero_page misusage Heiko Carstens noticed that btrfs was using empty_zero_page incorrectly. He explained: The definition of empty_zero_page is architecture specific. It is (currently) either a character array, an unsigned long containing the address of the empty_zero_page, or even worse only the address of the struct page belonging to the empty_zero_page. This commit changes btrfs to use a for-loop instead. On x86 the resulting .ko is smaller, and we're no longer worrying about how each arch builds its zeros. Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-15 09:17:47 -05:00
Miao Xie	91aef86f3b	Btrfs: rename btrfs_start_all_delalloc_inodes rename the function -- btrfs_start_all_delalloc_inodes(), and make its name be compatible to btrfs_wait_ordered_roots(), since they are always used at the same place. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:13:58 -05:00
Miao Xie	b02441999e	Btrfs: don't wait for the completion of all the ordered extents It is very likely that there are lots of ordered extents in the filesytem, if we wait for the completion of all of them when we want to reclaim some space for the metadata space reservation, we would be blocked for a long time. The performance would drop down suddenly for a long time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:13:44 -05:00
Miao Xie	9f3a074d10	Btrfs: don't wait for all the async delalloc when shrinking delalloc It was very likely that there were lots of async delalloc pages in the filesystem, if we waited until all the pages were flushed, we would be blocked for a long time, and the performance would also drop down. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:13:37 -05:00
Miao Xie	c61a16a701	Btrfs: fix the confusion between delalloc bytes and metadata bytes In shrink_delalloc(), what we need reclaim is the metadata space, so flushing pages by to_reclaim is not reasonable, it is very likely that the pages we flush are not enough. And then we had to invoke the flush function for several times, at the worst, we need call flush_space for several times. It wasted time. We improve this problem by converting the metadata space size we need reserve to the delalloc bytes, By this way, we can flush the pages by a reasonable number. (Now we use a fixed number to do conversion, it is not flexible, maybe we can find a good way to improve it in the future.) Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:13:30 -05:00
Miao Xie	18cd8ea6df	Btrfs: pick up the code for the item number calculation in flush_space() This patch picked up the code that was used to calculate the number of the items for which we need reserve space, and we will use it in the next patch. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:13:23 -05:00
Miao Xie	38c135af8e	Btrfs: wait for the ordered extent only when we want Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:13:15 -05:00
Miao Xie	d3ee29e396	Btrfs: remove unnecessary initialization and memory barrior in shrink_delalloc() Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:13:07 -05:00
Wang Shilong	3b7a016f44	Btrfs: avoid unnecessary scrub workers allocation We only allocate scrub workers if we pass all the necessary checks, for example, there are no operation in progress. Besides, move mutex lock protection outside of scrub_workers_get() /scrub_workers_put(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:12:58 -05:00
Josef Bacik	007d31f755	Btrfs: check file extent type before anything else I hit this problem with my no holes patch and it made me realize what the problem was for bz 60834. If the first item in the leaf is an inline extent and we try to read anything starting from disk_bytenr onward we will read off the end of the leaf. So we need to check to see what it's type is, and if it's not REG we can just break out. This should fix this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:12:49 -05:00
Rashika	f570e757b5	btrfs: Remove useless variable in write_ctree_super() The function write_ctree_super() in disk-io.c uses variable ret to return the result of function write_all_supers(). Since, this variable serves no purpose, hence the patch removes it and returns the call of the called function. Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:12:40 -05:00
Dulshani Gunawardhana	678712545b	btrfs: Fix checkpatch.pl warning of spacing issues Fix spacing issues detected via checkpatch.pl in accordance with the kernel style guidelines. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:12:31 -05:00
Dulshani Gunawardhana	d9b0d9ba04	btrfs: Replace kmalloc with kmalloc_array Replace kmalloc(size * nr, ) with kmalloc_array(nr, size), thus making it easier to check is that the calculation doesn't wrap or return a smaller allocation Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:12:22 -05:00
Dulshani Gunawardhana	e248e04e77	btrfs: Enclose macros with complex values within parenthesis Enclose macros with complex values within parenthesis in accordance to checkpatch.pl. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:12:06 -05:00
Dulshani Gunawardhana	fae7f21cec	btrfs: Use WARN_ON()'s return value in place of WARN_ON(1) Use WARN_ON()'s return value in place of WARN_ON(1) for cleaner source code that outputs a more descriptive warnings. Also fix the styling warning of redundant braces that came up as a result of this fix. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:11:53 -05:00
Dulshani Gunawardhana	b19e684393	btrfs: Remove redundant local zero structure Remove redundant local zero structure, replacing it by the kernel's global ZERO_PAGE. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:11:39 -05:00
Dulshani Gunawardhana	3c45bfc152	btrfs: Pack struct btrfs_device Pack the structure btrfs_device in volumes.h to eliminate holes detected by pahole, thus reducing binary memory footprint. Signed-off-by: Dulshani Gunawardhana <dulshani.gunawardhana89@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:11:26 -05:00
Rashika	95e94d14b4	btrfs: Replace multiple atomic_inc() with atomic_add() This patch replaces multiple atomic_inc() with atomic_add() in delayed-inode.c to reduce source code and have few instructions for compilation. Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:11:19 -05:00
Rashika	2e9f595497	btrfs: Add helper function for free_root_pointers() The function free_root_pointers() in disk-io.h contains redundant code. Therefore, this patch adds a helper function free_root_extent_buffers() to free_root_pointers() to eliminate redundancy. Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Rashika Kheria <rashika.kheria@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:11:13 -05:00
Liu Bo	48ec47364b	Btrfs: fix a crash when running balance and defrag concurrently Running balance and defrag concurrently can end up with a crash: kernel BUG at fs/btrfs/relocation.c:4528! RIP: 0010:[<ffffffffa01ac33b>] [<ffffffffa01ac33b>] btrfs_reloc_cow_block+ 0x1eb/0x230 [btrfs] Call Trace: [<ffffffffa01398c1>] ? update_ref_for_cow+0x241/0x380 [btrfs] [<ffffffffa0180bad>] ? copy_extent_buffer+0xad/0x110 [btrfs] [<ffffffffa0139da1>] __btrfs_cow_block+0x3a1/0x520 [btrfs] [<ffffffffa013a0b6>] btrfs_cow_block+0x116/0x1b0 [btrfs] [<ffffffffa013ddad>] btrfs_search_slot+0x43d/0x970 [btrfs] [<ffffffffa0153c57>] btrfs_lookup_file_extent+0x37/0x40 [btrfs] [<ffffffffa0172a5e>] __btrfs_drop_extents+0x11e/0xae0 [btrfs] [<ffffffffa013b3fd>] ? generic_bin_search.constprop.39+0x8d/0x1a0 [btrfs] [<ffffffff8117d14a>] ? kmem_cache_alloc+0x1da/0x200 [<ffffffffa0138e7a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [<ffffffffa0173ef0>] btrfs_drop_extents+0x60/0x90 [btrfs] [<ffffffffa016b24d>] relink_extent_backref+0x2ed/0x780 [btrfs] [<ffffffffa0162fe0>] ? btrfs_submit_bio_hook+0x1e0/0x1e0 [btrfs] [<ffffffffa01b8ed7>] ? iterate_inodes_from_logical+0x87/0xa0 [btrfs] [<ffffffffa016b909>] btrfs_finish_ordered_io+0x229/0xac0 [btrfs] [<ffffffffa016c3b5>] finish_ordered_fn+0x15/0x20 [btrfs] [<ffffffffa018cbe5>] worker_loop+0x125/0x4e0 [btrfs] [<ffffffffa018cac0>] ? btrfs_queue_worker+0x300/0x300 [btrfs] [<ffffffff81075ea0>] kthread+0xc0/0xd0 [<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40 [<ffffffff8164796c>] ret_from_fork+0x7c/0xb0 [<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40 ---------------------------------------------------------------------- It turns out to be that balance operation will bump root's @last_snapshot, which enables snapshot-aware defrag path, and backref walking stuff will find data reloc tree as refs' parent, and hit the BUG_ON() during COW. As data reloc tree's data is just for relocation purpose, and will be deleted right after relocation is done, it's unnecessary to walk those refs belonged to data reloc tree, it'd be better to skip them. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:11:07 -05:00
Liu Bo	6f519564d7	Btrfs: do not run snapshot-aware defragment on error If something wrong happens in write endio, running snapshot-aware defragment can end up with undefined results, maybe a crash, so we should avoid it. In order to share similar code, this also adds a helper to free the struct for snapshot-aware defrag. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:11:00 -05:00
Filipe David Borba Manana	269d040ff2	Btrfs: log recovery, don't unlink inode always on error If we get any error while doing a dir index/item lookup in the log tree, we were always unlinking the corresponding inode in the subvolume. It makes sense to unlink only if the lookup failed to find the dir index/item, which corresponds to NULL or -ENOENT, and not when other errors happen (like a transient -ENOMEM or -EIO). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:10:48 -05:00
Filipe David Borba Manana	488111aa0e	Btrfs: fix csum search offset/length calculation in log tree We were setting the csums search offset and length to the right values if the extent is compressed, but later on right before doing the csums lookup we were overriding these two parameters regardless of compression being set or not for the extent. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:10:42 -05:00
Filipe David Borba Manana	e46f5388cd	Btrfs: fix verification of dir_item We were ignoring the name component of the dir_item. Both the name and data must fit within BTRFS_MAX_XATTR_SIZE(root). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:10:36 -05:00
Wang Shilong	9b011adfe1	Btrfs: remove scrub_super_lock holding in btrfs_sync_log() Originally, we introduced scrub_super_lock to synchronize tree log code with scrubbing super. However we can replace scrub_super_lock with device_list_mutex, because writing super will hold this mutex, this will reduce an extra lock holding when writing supers in sync log code. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:10:13 -05:00
Wang Shilong	7fdf4b608d	Btrfs: use 'u64' rather than 'int' to get extent's generation We define a 'int' to get extent's generation by mistake,fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:10:06 -05:00
Miao Xie	9dced186f9	Btrfs: fix the free space write out failure when there is no data space After running space balance on a new fs, the fs check program outputed the following warning message: free space inode generation (0) did not match free space cache generation (20) Steps to reproduce: # mkfs.btrfs -f <dev> # mount <dev> <mnt> # btrfs balance start <mnt> # umount <mnt> # btrfs check <dev> It was because there was no data space after the space balance, and the free space write out task didn't try to allocate a new data chunk for the free space inode when doing the reservation. So the data space reservation failed, and in order to tell the free space loader that this free space inode could not be trusted, the generation of the free space inode wasn't updated. Then the check program found this problem and outputed the above message. But in fact, it is safe that we try to allocate a new data chunk when we find the data space is not enough. The patch fixes the above problem by this way. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:08:49 -05:00
Josef Bacik	9e6a0c52b7	Btrfs: stop committing the transaction so much during relocate I noticed with my horrible snapshot excercisor that we were taking forever to relocate the larger the file system got. This appeared to be because we were committing the transaction _constantly_. There were a few places where we do braindead things with metadata reservation, like start a transaction and then try to refill the block rsv, which not only keeps us from committing a transaction during the enospc stuff, but keeps us from doing some of the harder flushing work which will make us more likely to need to commit the transaction. We also were checking the block rsv and committing the transaction if the block rsv was below a certain threshold, but we were doing this in a place where we don't actually keep anything in the block rsv so this was always ending up false so we always committed the transaction in this case. I tested this to make sure it didn't break anything, but it takes about 10 hours to get the box to this state so I don't know how much of an impact it will really make. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:08:31 -05:00
Josef Bacik	9f23e289ed	Btrfs: make sure the delalloc workers actually flush compressed writes When using delalloc workers in a non-waiting way (like for enospc handling) we can end up not actually waiting for the dirty pages to be started if we have compression. We need to add an extra filemap flush to make sure any async extents that have started are actually moved along before returning. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:08:22 -05:00
Josef Bacik	9385876917	Btrfs: take ordered root lock when removing ordered operations inode A user reported a list corruption warning from btrfs_remove_ordered_extent, it is because we aren't taking the ordered_root_lock when we remove the inode from the ordered operations list. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:08:10 -05:00
Josef Bacik	d788a34929	Btrfs: don't abort transaction in run_delalloc_nocow This is just the write path, the only reason we start a transaction is so we can check cross references, we don't make any actual changes, so there is no reason to abort the transaction if we fail. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:07:58 -05:00
Josef Bacik	02ecd2c278	Btrfs: do not bug_on if we try to cow a free space cache inode We can just return an error and we'll bail out properly. We still want to catch this case to make sure we don't have a bug somewhere, so just warn if this pops up. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:07:49 -05:00
Josef Bacik	0ef8b72607	Btrfs: return an error from btrfs_wait_ordered_range I noticed that if the free space cache has an error writing out it's data it won't actually error out, it will just carry on. This is because it doesn't check the return value of btrfs_wait_ordered_range, which didn't actually return anything. So fix this in order to keep us from making free space cache look valid when it really isnt. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:07:35 -05:00
Josef Bacik	ed2590953b	Btrfs: stop using vfs_read in send Apparently we don't actually close the files until we return to userspace, so stop using vfs_read in send. This is actually better for us since we can avoid all the extra logic of holding the file we're sending open and making sure to clean it up. This will fix people who have been hitting too many files open errors when trying to send. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:07:11 -05:00
Stefan Behrens	301993a4a1	Btrfs: check_int, remove warning for mixed-mode In mixed-mode, when a data-block was later reused for metadata, a warning was printed. This condition is now filtered out and the warning is eliminated in this case. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:04:25 -05:00
Stefan Behrens	a5f519c91d	Btrfs: fix check_int 'leaf item out of bounce' regression Yet another cleanup patch broke code for which no xfstest exists. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:04:06 -05:00
Filipe David Borba Manana	5599488708	Btrfs: optimize extent item search in run_delayed_extent_op Instead of doing another extent tree search if the first search failed to find a metadata item, check if the previous item in the leaf is an extent item and use it if it is, otherwise do the second tree search for an extent item. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:03:53 -05:00
Jeff Mahoney	cab45e22da	btrfs: add tracing for failed reservations When debugging ENOSPC issues, it's nice to be able to see which reservations failed as well as the ones which succeeded. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:03:37 -05:00
Zach Brown	8b558c5f09	btrfs: remove fs/btrfs/compat.h fs/btrfs/compat.h only contained trivial macro wrappers of drop_nlink() and inc_nlink(). This doesn't belong in mainline. Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:03:19 -05:00
Zach Brown	1877e1a747	btrfs: remove move_pages() move_pages() has an inefficient backwards byte copy of regions of two different pages. They're different pages so the regions won't overlap and it could use memcpy(). At that point, though, move_pages() would be a slightly dimmer re-implementation of copy_pages() that lacked the test for overlapping page regions. So remove move_pages() and just call copy_pages(). Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:03:09 -05:00
Zach Brown	4546bcaeba	btrfs: use get_seconds() instead of btrfs wrapper Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:03:00 -05:00
Filipe David Borba Manana	8185554d3e	Btrfs: fix incorrect inode acl reset When a directory has a default ACL and a subdirectory is created under that directory, btrfs_init_acl() is called when the subdirectory's inode is created to initialize the inode's ACL (inherited from the parent directory) but it was clearing the ACL from the inode after setting it if posix_acl_create() returned success, instead of clearing it only if it returned an error. To reproduce this issue: $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt $ mkdir /mnt/acl $ setfacl -d --set u::rwx,g::rwx,o::- /mnt/acl $ getfacl /mnt/acl user::rwx group::rwx other::r-x default:user::rwx default:group::rwx default:other::--- $ mkdir /mnt/acl/dir1 $ getfacl /mnt/acl/dir1 user::rwx group::rwx other::--- After unmounting and mounting again the filesystem, fgetacl returned the expected ACL: $ umount /mnt/acl $ mount /dev/loop0 /mnt $ getfacl /mnt/acl/dir1 user::rwx group::rwx other::--- default:user::rwx default:group::rwx default:other::--- Meaning that the underlying xattr was persisted. Reported-by: Giuseppe Fierro <giuseppe@fierro.org> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:02:51 -05:00
Stefan Behrens	ff76b05655	Btrfs: Don't allocate inode that is already in use Due to an off-by-one error, it is possible to reproduce a bug when the inode cache is used. The same inode number is assigned twice, the second time this leads to an EEXIST in btrfs_insert_empty_items(). The issue can happen when a file is removed right after a subvolume is created and then a new inode number is created before the inodes in free_inode_pinned are processed. unlink() calls btrfs_return_ino() which calls start_caching() in this case which adds [highest_ino + 1, BTRFS_LAST_FREE_OBJECTID] by searching for the highest inode (which already cannot find the unlinked one anymore in btrfs_find_free_objectid()). So if this unlinked inode's number is equal to the highest_ino + 1 (or >= this value instead of > this value which was the off-by-one error), we mustn't add the inode number to free_ino_pinned (caching_thread() does it right). In this case we need to try directly to add the number to the inode_cache which will fail in this case. When this inode number is allocated while it is still in free_ino_pinned, it is allocated and still added to the free inode cache when the pinned inodes are processed, thus one of the following inode number allocations will get an inode that is already in use and fail with EEXIST in btrfs_insert_empty_items(). One example which was created with the reproducer below: Create a snapshot, work in the newly created snapshot for the rest. In unlink(inode 34284) call btrfs_return_ino() which calls start_caching(). start_caching() calls add_free_space [34284, 18446744073709517077]. In btrfs_return_ino(), call start_caching pinned [34284, 1] which is wrong. mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284. btrfs_unpin_free_ino calls add_free_space [34284, 1]. mkdir() call btrfs_find_ino_for_alloc() which returns the number 34284. EEXIST when the new inode is inserted. One possible reproducer is this one: #!/bin/sh # preparation TEST_DEV=/dev/sdc1 TEST_MNT=/mnt umount ${TEST_MNT} 2>/dev/null \|\| true mkfs.btrfs -f ${TEST_DEV} mount ${TEST_DEV} ${TEST_MNT} -o \ rw,relatime,compress=lzo,space_cache,inode_cache btrfs subv create ${TEST_MNT}/s1 for i in `seq 34027`; do touch ${TEST_MNT}/s1/${i}; done btrfs subv snap ${TEST_MNT}/s1 ${TEST_MNT}/s2 FILENAME=`find ${TEST_MNT}/s1/ -inum 4085 \| sed 's\|^./$[^/]$$\|\1\|'` rm ${TEST_MNT}/s2/$FILENAME touch ${TEST_MNT}/s2/$FILENAME # the following steps can be repeated to reproduce the issue again and again [ -e ${TEST_MNT}/s3 ] && btrfs subv del ${TEST_MNT}/s3 btrfs subv snap ${TEST_MNT}/s2 ${TEST_MNT}/s3 rm ${TEST_MNT}/s3/$FILENAME touch ${TEST_MNT}/s3/$FILENAME ls -alFi ${TEST_MNT}/s?/$FILENAME touch ${TEST_MNT}/s3/_1 \|\| logger FAILED ls -alFi ${TEST_MNT}/s?/_1 touch ${TEST_MNT}/s3/_2 \|\| logger FAILED ls -alFi ${TEST_MNT}/s?/_2 touch ${TEST_MNT}/s3/__1 \|\| logger FAILED ls -alFi ${TEST_MNT}/s?/__1 touch ${TEST_MNT}/s3/__2 \|\| logger FAILED ls -alFi ${TEST_MNT}/s?/__2 # if the above is not enough, add the following loop: for i in `seq 3 9`; do touch ${TEST_MNT}/s3/__${i} \|\| logger FAILED; done #for i in `seq 3 34027`; do touch ${TEST_MNT}/s3/__${i} \|\| logger FAILED; done # one of the touch(1) calls in s3 fail due to EEXIST because the inode is # already in use that btrfs_find_ino_for_alloc() returns. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:02:36 -05:00
Filipe David Borba Manana	e8b0d724d5	Btrfs: fix btrfs_prev_leaf() previous key computation If we decrement the key type, we must reset its offset to the largest possible offset (u64)-1. If we decrement the key's objectid, then we must reset the key's type and offset to their largest possible values, (u8)-1 and (u64)-1 respectively. Not doing so can make us miss an items in the tree. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:02:26 -05:00
Filipe David Borba Manana	e93ae26fe1	Btrfs: optimize tree-log.c:count_inode_refs() Avoid repeated tree searches by processing all inode ref items in a leaf at once instead of processing one at a time, followed by a path release and a tree search for a key with a decremented offset. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:02:19 -05:00
Geyslan G. Bem	229eed4348	btrfs: simplify kmalloc+copy_from_user to memdup_user Use memdup_user rather than duplicating its implementation This is a little bit restricted to reduce false positives The semantic patch that makes this report is available in scripts/coccinelle/api/memdup_user.cocci. More information about semantic patching is available at http://coccinelle.lip6.fr/ Signed-off-by: Geyslan G. Bem <geyslan@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:01:51 -05:00
chandan	5ede859b00	Btrfs: btrfs_add_ordered_operation: Fix last modified transaction comparison. Comparison of an inode's last modified transaction with the last committed transaction is incorrect. Fix it. Signed-off-by: chandan <chandan@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:01:37 -05:00
Filipe David Borba Manana	3c77bd94ec	Btrfs: don't leak delayed node on path allocation failure If the path allocation failed, we would return without decrementing the reference count in the delayed node we got before, resulting in a leak. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:01:27 -05:00
Stefan Behrens	361c093d7f	Btrfs: Wait for uuid-tree rebuild task on remount read-only If the user remounts the filesystem read-only while the uuid-tree scan and rebuild task is still running (this happens once after the filesystem was mounted with an old kernel, or when forced with the mount options), the remount should wait on the tasks completion before setting the filesystem read-only. Otherwise the background task continues to write to the filesystem which is apparently not what users expect. The reproducer: TEST_DEV=/dev/sdzzzzz1 TEST_MNT=/mnt mkfs.btrfs -f $TEST_DEV mount $TEST_DEV $TEST_MNT for i in `seq 50000`; do btrfs subvolume create ${TEST_MNT}/$i; done umount $TEST_MNT mount $TEST_DEV $TEST_MNT -o rescan_uuid_tree sleep 1 ps -elf \| fgrep '[btrfs-uuid]' \| grep -v grep mount $TEST_DEV $TEST_MNT -o ro,remount ps -elf \| fgrep '[btrfs-uuid]' \| grep -v grep sleep 1 umount $TEST_MNT Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:01:18 -05:00
Stefan Behrens	27087f3701	Btrfs: init device stats for new devices Device stats are only initialized (read from tree items) on mount. Trying to read device stats after adding or replacing new devices will return errors. btrfs_init_new_device() and btrfs_init_dev_replace_tgtdev() are the two functions that allocate and initialize new btrfs_device structures after a filesystem is mounted. They set the device stats to zero by using kzalloc() which is correct for new devices. The only missing thing was to declare these stats as being valid (device->dev_stats_valid = 1) and this patch adds this missing code. This is the reproducer: TEST_DEV1=/dev/sdzzzzz1 TEST_DEV2=/dev/sdzzzzz2 TEST_DEV3=/dev/sdzzzzz3 TEST_MNT=/mnt mkfs.btrfs $TEST_DEV1 mount $TEST_DEV1 $TEST_MNT btrfs device add $TEST_DEV2 $TEST_MNT btrfs device stat $TEST_MNT btrfs replace start -B $TEST_DEV2 $TEST_DEV3 $TEST_MNT btrfs device stat $TEST_MNT umount $TEST_MNT Reported-by: Ondrej Kunc <kunc88@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:01:09 -05:00
Liu Bo	30d133fc22	Btrfs: fixup error path in __btrfs_inc_extent_ref When we fail to add a reference after a non-inline insertion by some reasons, eg. ENOSPC, we'll abort the transaction, but we don't return this error to the caller who has to walk around again to find something wrong, that's unnecessary. Also fixup other error paths to keep it simple. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:01:00 -05:00
Ilya Dryomov	e649e587cb	Btrfs: disallow 'btrfs {balance,replace} cancel' on ro mounts For both balance and replace, cancelling involves changing the on-disk state and committing a transaction, which is not a good thing to do on read-only filesystems. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:00:50 -05:00
Ilya Dryomov	adfa97cbdf	Btrfs: don't leak ioctl args in btrfs_ioctl_dev_replace struct btrfs_ioctl_dev_replace_args memory is leaked if replace is requested on a read-only filesystem. Fix it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:00:37 -05:00
Ilya Dryomov	f747cab7b7	Btrfs: nuke a bogus rw_devices decrement in __btrfs_close_devices On mount failures, __btrfs_close_devices can be called well before dev-replace state is read and ->is_tgtdev_for_dev_replace is set. This leads to a bogus decrement of ->rw_devices and sets off a WARN_ON in __btrfs_close_devices if replace target device happens to be on the lists and we fail early in the mount sequence. Fix this by checking the devid instead of ->is_tgtdev_for_dev_replace before the decrement: for replace targets devid is always equal to BTRFS_DEV_REPLACE_DEVID. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:00:24 -05:00
Geyslan G. Bem	03b2f08b5f	btrfs: Fix memory leakage in the tree-log.c In add_inode_ref() function: Initializes local pointers. Reduces the logical condition with the __add_inode_ref() return value by using only one 'goto out'. Centralizes the exiting, ensuring the freeing of all used memory. Signed-off-by: Geyslan G. Bem <geyslan@gmail.com> Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 22:00:11 -05:00
Liu Bo	498456d33e	Btrfs: kill unused code in btrfs_search_forward After commit `de78b51a28` (btrfs: remove cache only arguments from defrag path), @blockptr is no more used. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:59:56 -05:00
Liu Bo	8319bfe136	Btrfs: cleanup dead code of defragment @is_extent is no more needed since we don't defrag extent root. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:59:45 -05:00
Filipe David Borba Manana	efd0c4055a	Btrfs: remove unnecessary key copy when logging inode The btrfs_insert_empty_item() function doesn't modify its key argument. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:59:30 -05:00
Chandra Seetharaman	452c75c3d2	Btrfs: Simplify the logic in alloc_extent_buffer() for existing extent buffer case alloc_extent_buffer() uses radix_tree_lookup() when radix_tree_insert() fails with EEXIST. That part of the code is very similar to the code in find_extent_buffer(). This patch replaces radix_tree_lookup() and surrounding code in alloc_extent_buffer() with find_extent_buffer(). Note that radix_tree_lookup() does not need to be protected by tree->buffer_lock. It is protected by eb->refs. While at it, this patch - changes the other usage of radix_tree_lookup() in alloc_extent_buffer() with find_extent_buffer() to reduce redundancy. - removes the unused argument 'len' to find_extent_buffer(). Signed-Off-by: Chandra Seetharaman <sekharan@us.ibm.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:59:11 -05:00
Josef Bacik	7f4ca37c48	Btrfs: fix up seek_hole/seek_data handling Whoever wrote this was braindead. Also it doesn't work right if you have VACANCY's since we assumed you would only have that at the end of the file, which won't be the case in the near future. I tested this with generic/285 and generic/286 as well as the btrfs tests that use fssum since it uses seek_hole/seek_data to verify things are ok. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:58:56 -05:00
Josef Bacik	4277a9c3b3	Btrfs: add an assert to btrfs_lookup_csums_range for alignment I was hitting weird issues when trying to remove hole extents and it turned out it was because I was sending non-aligned offsets down to btrfs_lookup_csums_range. So add an assert for this in case somebody trips over this in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:58:45 -05:00
Josef Bacik	ed9e8af88e	Btrfs: fix hole check in log_one_extent I added an assert to make sure we were looking up aligned offsets for csums and I tripped it when running xfstests. This is because log_one_extent was checking if block_start == 0 for a hole instead of EXTENT_MAP_HOLE. This worked out fine in practice it seems, but it adds a lot of extra work that is uneeded. With this fix I'm no longer tripping my assert. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:58:32 -05:00
Josef Bacik	0e30db86a4	Btrfs: add a sanity test for a vacant extent at the front of a file Btrfs_get_extent was not handling this case properly, add a test to make sure we don't regress. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:58:19 -05:00
Josef Bacik	25a50341b6	Btrfs: handle a missing extent for the first file extent While trying to kill our hole extents I noticed I was seeing problems where we seek into a file and then start writing and then try to fiemap that file later. This is because we search for offset 0, don't find anything and so back up one slot, which puts us at the inode ref or something like that, which means we goto not_found and create an extent map for our entire search area. This isn't quite what we want, we want to move forward one slot and see if there is an extent there so we can limit our hole extent. This patch fixes this problem, I will add a testcase for this as well. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:58:05 -05:00
Josef Bacik	96192499c2	Btrfs: stop all workers after we free block groups Stefan was hitting a panic in the async worker stuff because we had outstanding read bios while we were stopping the worker threads. You could reproduce this easily if you mount -o nospace_cache and ran generic/273. This is because the caching thread stuff is still going and we were stopping all the worker threads. We need to stop the workers after this work is done, and the free block groups code will wait for all the caching threads to stop first so we don't run into this problem. With this patch we no longer panic. Thanks, Cc: stable@vger.kernel.org Reported-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:57:49 -05:00
Josef Bacik	aaedb55bc0	Btrfs: add tests for btrfs_get_extent I'm going to be removing hole extents in the near future so I wanted to make a sanity test for btrfs_get_extent to make sure I don't break anything in the meantime. This patch just puts btrfs_get_extent through its paces by giving it a completely unreasonable mapping to look at and make sure it is giving us back maps that make sense. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:57:30 -05:00
Josef Bacik	294e30fee3	Btrfs: add tests for find_lock_delalloc_range So both Liu and I made huge messes of find_lock_delalloc_range trying to fix stuff, me first by fixing extent size, then him by fixing something I broke and then me again telling him to fix it a different way. So this is obviously a candidate for some testing. This patch adds a pseudo fs so we can allocate fake inodes for tests that need an inode or pages. Then it addes a bunch of tests to make sure find_lock_delalloc_range is acting the way it is supposed to. With this patch and all of our previous patches to find_lock_delalloc_range I am sure it is working as expected now. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:56:51 -05:00
Josef Bacik	857cc2fc29	Btrfs: free reserved space on error in a few places While trying to track down a reserved space leak I noticed a few places where we won't properly clean up reserved space if we have an error, this patch fixes those up. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:56:41 -05:00
Josef Bacik	0be5dc67c4	Btrfs: fixup reserved trace points In trying to track down where we were leaking reserved space I noticed our reserve extent tracepoints are a little off. First we were saying that the reserved space had been alloced in btrfs_reserve_extent, which isn't the case, this needs to be triggered when we actually allocate the space when we run the delayed ref. We were also missing a few places where we should have been tracing the btrfs_reserve_extent_free tracepoint. With these in place I was able to put together where we were leaking reserved space. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:56:31 -05:00
Josef Bacik	2b1360da35	Btrfs: free up block groups after everything If we abort a transaction we will do the tree log cleanup at unmount, but this happens after we free up the block groups. This makes all the leak detection warnings go off because we think we've leaked space but in reality we just haven't cleaned it up yet. So instead do the block group cleanup stuff after free'ing the fs roots so we don't get these warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:56:17 -05:00
Josef Bacik	681ae50917	Btrfs: cleanup reserved space when freeing tree log on error On error we will wait and free the tree log at unmount without a transaction. This means that the actual freeing of the blocks doesn't happen which means we complain about space leaks on unmount. So to fix this just skip the transaction specific cleanup part of the tree log free'ing if we don't have a transaction and that way we can free up our reserved space and our counters stay happy. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:56:11 -05:00
Josef Bacik	eb58bb371a	Btrfs: do not free the dirty bytes from the trans block rsv on cleanup The transactions should be cleaning up their reservations on failure, this just causes us to have warnings on unmount because we go negative by free'ing reservations that have already been free'ed. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:55:58 -05:00
Filipe David Borba Manana	80d94fb3df	Btrfs: fix memory leaks on transaction commit failure Structures of the types tree_mod_elem and qgroup_update are allocated during transaction commit but were not being released if the call to btrfs_run_delayed_items() returned an error. Stack trace reported by kmemleak: unreferenced object 0xffff880679f0b398 (size 128): comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s) hex dump (first 32 bytes): 60 b5 f0 79 06 88 ff ff 00 00 00 00 00 00 00 00 `..y............ 00 00 00 00 00 00 00 00 50 1c 00 00 00 00 00 00 ........P....... backtrace: [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50 [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200 [<ffffffffa046f2d3>] tree_mod_log_insert_key.constprop.45+0x93/0x150 [btrfs] [<ffffffffa04720f9>] __btrfs_cow_block+0x299/0x4f0 [btrfs] [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs] [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs] [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs] [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs] [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs] [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs] [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs] [<ffffffff811cb210>] __sync_filesystem+0x30/0x60 [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70 [<ffffffff8119ce7b>] generic_shutdown_super+0x3b/0xf0 [<ffffffff8119cfc6>] kill_anon_super+0x16/0x30 unreferenced object 0xffff880677e0dd88 (size 32): comm "umount", pid 21508, jiffies 4295967793 (age 36718.112s) hex dump (first 32 bytes): 78 75 11 a9 06 88 ff ff 00 c0 e0 77 06 88 ff ff xu.........w.... 40 c3 a2 70 06 88 ff ff 00 00 00 00 00 00 00 00 @..p............ backtrace: [<ffffffff81742d26>] kmemleak_alloc+0x26/0x50 [<ffffffff811889c2>] kmem_cache_alloc_trace+0x112/0x200 [<ffffffffa04fa54f>] btrfs_qgroup_record_ref+0xf/0x90 [btrfs] [<ffffffffa04e1914>] btrfs_add_delayed_tree_ref+0xf4/0x170 [btrfs] [<ffffffffa048518a>] btrfs_free_tree_block+0x9a/0x220 [btrfs] [<ffffffffa0472163>] __btrfs_cow_block+0x303/0x4f0 [btrfs] [<ffffffffa0472510>] btrfs_cow_block+0x120/0x1f0 [btrfs] [<ffffffffa0476679>] btrfs_search_slot+0x449/0x930 [btrfs] [<ffffffffa048eecf>] btrfs_lookup_inode+0x2f/0xa0 [btrfs] [<ffffffffa04eb49c>] __btrfs_update_delayed_inode+0x1c/0x1d0 [btrfs] [<ffffffffa04eb9e2>] __btrfs_run_delayed_items+0x162/0x1e0 [btrfs] [<ffffffffa04eba63>] btrfs_delayed_inode_exit+0x3/0x20 [btrfs] [<ffffffffa0499c63>] btrfs_commit_transaction+0x203/0xa50 [btrfs] [<ffffffffa046b519>] btrfs_sync_fs+0x69/0x110 [btrfs] [<ffffffff811cb210>] __sync_filesystem+0x30/0x60 [<ffffffff811cb2bb>] sync_filesystem+0x4b/0x70 Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:55:46 -05:00
Ilya Dryomov	539f358a30	Btrfs: fix the dev-replace suspend sequence Replace progresses strictly from lower to higher offsets, and the progress is tracked in chunks, by storing the physical offset of the dev_extent which is being copied in the cursor_left field of btrfs_dev_replace_item. When we are done copying the chunk, left_cursor is updated to point one byte past the dev_extent, so that on resume we can skip the dev_extents that have already been copied. There is a major bug (which goes all the way back to the inception of dev-replace in 3.8) in the way left_cursor is bumped: the bump is done unconditionally, without any regard to the scrub_chunk return value. On suspend (and also on any kind of error) scrub_chunk returns early, i.e. without completing the copy. This leads to us skipping the chunk that hasn't been fully copied yet when resuming. Fix this by doing the cursor_left update only if scrub_chunk ret is 0. (On suspend scrub_chunk returns with -ECANCELED, so this fix covers both suspend and error cases.) Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:55:36 -05:00
Filipe David Borba Manana	778ba82b17	Btrfs: improve inode hash function/inode lookup Currently the hash value used for adding an inode to the VFS's inode hash table consists of the plain inode number, which is a 64 bits integer. This results in hash table buckets (hlist_head lists) with too many elements for at least 2 important scenarios: 1) When we have many subvolumes. Each subvolume has its own btree where its files and directories are added to, and each has its own objectid (inode number) namespace. This means that if we have N subvolumes, and all have inode number X associated to a file or directory, the corresponding inodes all map to the same hash table entry, resulting in a bucket (hlist_head list) with N elements; 2) On 32 bits machines. Th VFS hash values are unsigned longs, which are 32 bits wide on 32 bits machines, and the inode (objectid) numbers are 64 bits unsigned integers. We simply cast the inode numbers to hash values, which means that for all inodes with the same 32 bits lower half, the same hash bucket is used for all of them. For example, all inodes with a number (objectid) between 0x0000_0000_ffff_ffff and 0xffff_ffff_ffff_ffff will end up in the same hash table bucket. This change ensures the inode's hash value depends both on the objectid (inode number) and its subvolume's (btree root) objectid. For 32 bits machines, this change gives better entropy by making the hash value depend on both the upper and lower 32 bits of the 64 bits hash previously computed. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:55:19 -05:00
Filipe David Borba Manana	3d41d70252	Btrfs: remove unnecessary tree search when logging inode In tree-log.c:btrfs_log_inode(), we keep calling btrfs_search_forward() until it returns a key whose objectid is higher than our inode or until the key's type is higher than our maximum allowed type. At the end of the loop, we increment our mininum search key's objectid and type regardless of our desired target objectid and maximum desired type, which causes another loop iteration that will call again btrfs_search_forward() just to figure out we've gone beyond our maximum key and exit the loop. Therefore while incrementing our minimum key, don't do it blindly and exit the loop immiediately if the next search key's objectid or type is beyond what we seek. Also after incrementing the type, set the key's offset to 0, which was missing and could make us loose some of the inode's items. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:55:11 -05:00
Filipe David Borba Manana	6174d3cb43	Btrfs: remove unused max_key arg from btrfs_search_forward It is not used for anything. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:54:57 -05:00
Liu Bo	7d3d1744f8	Btrfs: fix memory leak of chunks' extent map As we're hold a ref on looking up the extent map, we need to drop the ref before returning to callers. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:54:48 -05:00
Miao Xie	fa7c14947a	Btrfs: improve jitter performance of the sequential buffered write The performance was slowed down sometimes when we ran sysbench to measure the performance of the sequential buffered write by 2 or more threads. It was because the write order of the test threads might be confused by the task scheduler, and the coming write would be beyond the end of the file, in this case, we need insert dummy file extents and create a hole for the area we skip. But in order to avoid the ongoing ordered extents which are in the area, we need wait for them. Unfortunately, the current code doesn't check if there are ordered extents in the area or not, try to find and flush the dirty pages directly, but in fact, there is no dirty page in that area, this step of the current code is unnecessary, and just wastes time. Sometimes, it would increase the contention of some locks, and makes the performance slow down suddenly. So we remove the ordered extent flush function before the check, and flush the dirty pages and wait for the ordered extents only when we find them. According to my test, we got 1-2 times of the performance regression when we ran the test by 10 times before applying this patch. After applying this patch, the regression went away. Test Environment: CPU: 1CPU * 4Cores Memory: 6GB Partition: 20GB Test Command: # sysbench --test=fileio --file-total-size=16G --file-test-mode=seqwr \ > --num-threads=512 --file-block-size=16384 --max-time=60 --max-requests=0 run Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:54:38 -05:00
Miao Xie	20dd2cbf01	Btrfs: fix BUG_ON() casued by the reserved space migration When we did space balance and snapshot creation at the same time, we might meet the following oops: kernel BUG at fs/btrfs/inode.c:3038! [SNIP] Call Trace: [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs] [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs] [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs] [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs] [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs] [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379 [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39 [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2 [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8 [<ffffffff81121e88>] SyS_ioctl+0x57/0x83 [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14 [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs] The reason of the problem is that the relocation root creation stole the reserved space, which was reserved for orphan item deletion. There are several ways to fix this problem, one is to increasing the reserved space size of the space balace, and then we can use that space to create the relocation tree for each fs/file trees. But it is hard to calculate the suitable size because we doesn't know how many fs/file trees we need relocate. We fixed this problem by reserving the space for relocation root creation actively since the space it need is very small (one tree block, used for root node copy), then we use that reserved space to create the relocation tree. If we don't reserve space for relocation tree creation, we will use the reserved space of the balance. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:54:28 -05:00
Ross Kirk	0a4e558609	btrfs: remove unused parameter from btrfs_header_fsid Remove unused parameter, 'eb'. Unused since introduction in `5f39d397df` Updated to be rebased against current upstream and correct diff supplied this time! Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:54:16 -05:00
Josef Bacik	724e2315db	Btrfs: fix two use-after-free bugs with transaction cleanup I was noticing the slab redzone stuff going off every once and a while during transaction aborts. This was caused by two things 1) We would walk the pending snapshots and set their error to -ECANCELED. We don't need to do this, the snapshot stuff waits for a transaction commit and if there is a problem we just free our pending snapshot object and exit. Doing this was causing us to touch the pending snapshot object after the thing had already been freed. 2) We were freeing the transaction manually with wanton disregard for it's use_count reference counter. To fix this I cleaned up the transaction freeing loop to either wait for the transaction commit to finish if it was in the middle of that (since it will be cleaned and freed up there) or to do the cleanup oursevles. I also moved the global "kill all things dirty everywhere" stuff outside of the transaction cleanup loop since that only needs to be done once. With this patch I'm no longer seeing slab corruption because of use after frees. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:54:03 -05:00
Josef Bacik	c16ce19014	Btrfs: remove all BUG_ON()'s from commit_cowonly_roots Noticed this when forcing errors to happen during delayed ref running. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:57 -05:00
Josef Bacik	1de2cfde93	Btrfs: don't delete ordered roots from list during cleanup During transaction cleanup after an abort we are just removing roots from the ordered roots list which is incorrect. We have a BUG_ON() to make sure that the root is still part of the ordered roots list when we put our ordered extent which we were tripping in this case. So do like we do everywhere else and just move it to the tail of the ordered roots list and allow the normal cleanup to take care of stuff. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:49 -05:00
Josef Bacik	4e121c06ad	Btrfs: cleanup transaction on abort If we abort not during a transaction commit we won't clean up anything until we unmount. Unfortunately if we abort in the middle of writing out an ordered extent we won't clean it up and if somebody is waiting on that ordered extent they will wait forever. To fix this just make the transaction kthread call the cleanup transaction stuff if it notices theres an error, and make btrfs_end_transaction wake up the transaction kthread if there is an error. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:42 -05:00
Josef Bacik	b6d08f0630	Btrfs: do not release metadata for space cache inodes I've been testing our error paths and I was tripping the BUG_ON() in drop_outstanding_extent because our outstanding_extents is 0 for space cache inodes. This is because we don't reserve metadata space for these inodes since we depend on the global block reserve for our space. To fix this we need to make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for space cache inodes. With this patch I'm no longer panicing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:36 -05:00
Josef Bacik	e0228285a8	Btrfs: reset intwrite on transaction abort If we abort a transaction in the middle of a commit we weren't undoing the intwrite locking. This patch fixes that problem. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:29 -05:00
Josef Bacik	4577b014d1	Btrfs: relocate csums properly with prealloc extents A user reported a problem where they were getting csum errors when running a balance and running systemd's journal. This is because systemd is awesome and fallocate()'s its log space and writes into it. Unfortunately we assume that when we read in all the csums for an extent that they are sequential starting at the bytenr we care about. This obviously isn't the case for prealloc extents, where we could have written to the middle of the prealloc extent only, which means the csum would be for the bytenr in the middle of our range and not the front of our range. Fix this by offsetting the new bytenr we are logging to based on the original bytenr the csum was for. With this patch I no longer see the csum errors I was seeing. Thanks, Cc: stable@vger.kernel.org Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:22 -05:00
Filipe David Borba Manana	e84cc14213	Btrfs: don't leak block group on error In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to write_one_cache_group() failed, we would return without putting the block group first. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:15 -05:00
Filipe David Borba Manana	9b19985986	Btrfs: fix sync fs to actually wait for all data to be persisted Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't wait for delayed work to finish before returning success to the caller. This change fixes this, ensuring that there's no data loss if a power failure happens right after fs sync returns success to the caller and before the next commit happens. Steps to reproduce the data loss issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs Right after the btrfs fi sync command (a second or 2 for example), power off the machine and reboot it. The file will be empty, as it can be verified after mounting the filesystem and through btrfs-debug-tree: $ btrfs-debug-tree /dev/sdb3 \| egrep '$257 INODE_ITEM 0$ itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar checksum tree key (CSUM_TREE ROOT_ITEM 0) leaf 29429760 items 0 free space 3995 generation 7 owner 7 fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae uuid tree key (UUID_TREE ROOT_ITEM 0) After this patch, the data loss no longer happens after a power failure and btrfs-debug-tree shows: $ btrfs-debug-tree /dev/sdb3 \| egrep '$257 INODE_ITEM 0$ itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53 extent data disk byte 12845056 nr 8192 extent data offset 0 nr 8192 ram 8192 extent compression 0 checksum tree key (CSUM_TREE ROOT_ITEM 0) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:08 -05:00
Filipe David Borba Manana	703c88e035	Btrfs: fix tracking of orphan inode count In inode.c:btrfs_orphan_add() if we failed to insert the orphan item, we would return without decrementing the orphan count that we just incremented before attempting the insertion, leaving the orphan inode count wrong. In inode.c:btrfs_orphan_del(), we were decrementing the inode orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set, which is logically wrong because it should be decremented if the bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:53:01 -05:00
Liu Bo	fe09e16cc8	Btrfs: export btrfs space shared info to userspace Similar to ocfs2, btrfs also supports that extents can be shared by different inodes, and there are some userspace tools requesting for this kind of 'space shared infomation'.[1] ocfs2 uses flag FIEMAP_EXTENT_SHARED, so does btrfs. [1]: http://thr3ads.net/ocfs2-devel/2010/09/489052-PATCH-3-3-shared-du-using-fiemap-to-figure-up-the-shared-extents-per-file-and-the-footprint-in Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:52:54 -05:00
Filipe David Borba Manana	7451432394	Btrfs: remove path arg from btrfs_truncate_free_space_cache Not used for anything, and removing it avoids caller's need to allocate a path structure. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:51:33 -05:00
Filipe David Borba Manana	53645a91f4	Btrfs: remove duplicated ino cache's inode lookup We're doing a unnecessary extra lookup of the ino cache's inode when we already have it (and holding a reference) during the process of saving the ino cache contents to disk. Therefore remove this extra lookup. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:51:24 -05:00
Josef Bacik	d4b4087c43	Btrfs: do a full search everytime in btrfs_search_old_slot While running some snashot aware defrag tests I noticed I was panicing every once and a while in key_search. This is because of the optimization that says if we find a key at slot 0 it will be at slot 0 all the way down the rest of the tree. This isn't the case for btrfs_search_old_slot since it will likely replay changes to a buffer if something has changed since we took our sequence number. So short circuit this optimization by setting prev_cmp to -1 every time we call key_search so we will do our normal binary search. With this patch I am no longer seeing the panics I was seeing before. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:51:17 -05:00
Josef Bacik	06ea65a398	Btrfs: add a sanity test for btrfs_split_item While looking at somebodys corruption I became completely convinced that btrfs_split_item was broken, so I wrote this test to verify that it was working as it was supposed to. Thankfully it appears to be working as intended, so just add this test to make sure nobody breaks it in the future. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:51:02 -05:00
Ross Kirk	dd3cc16b87	btrfs: drop unused parameter from btrfs_item_nr Remove unused eb parameter from btrfs_item_nr Signed-off-by: Ross Kirk <ross.kirk@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:50:48 -05:00
Filipe David Borba Manana	f06becc411	Btrfs: don't store NULL byte in symlink extents It is not necessary to store the NULL byte in a symlink inline file extent. There's currently no code that requires the NULL byte to be present in the extent. This change also doesn't break file format compatibility nor the send/receive feature. The VFS also doesn't need the NULL byte to be present in the extent, as it reads up to inode->i_size bytes (which already excluded the NULL byte) and sets the NULL byte for us (in fs/namei.c:page_getlink()). So with this change we save 1 byte per symlink file extent (which is always inlined in the btree leaf) without losing backward and forward compatibility. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:49:51 -05:00
Stefan Behrens	69e9c6c6dc	Btrfs: eliminate the exceptional root_tree refs=0 The fact that btrfs_root_refs() returned 0 for the tree_root caused bugs in the past, therefore it is set to 1 with this patch and (hopefully) all affected code is adapted to this change. I verified this change by temporarily adding WARN_ON() checks everywhere where btrfs_root_refs() is used, checking whether the logic of the code is changed by btrfs_root_refs() returning 1 instead of 0 for root->root_key.objectid == BTRFS_ROOT_TREE_OBJECTID. With these added checks, I ran the xfstests './check -g auto'. The two roots chunk_root and log_root_tree that are only referenced by the superblock and the log_roots below the log_root_tree still have btrfs_root_refs() == 0, only the tree_root is changed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-11-11 21:49:26 -05:00
Linus Torvalds	bdeeab62a6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fix from Chris Mason: "Sage hit a deadlock with ceph on btrfs, and Josef tracked it down to a regression in our initial rc1 pull. When doing nocow writes we were sometimes starting a transaction with locks held" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: release path before starting transaction in can_nocow_extent	2013-10-18 16:46:21 -07:00
Josef Bacik	1bda19eb73	Btrfs: release path before starting transaction in can_nocow_extent We can't be holding tree locks while we try to start a transaction, we will deadlock. Thanks, Reported-by: Sage Weil <sage@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-10-18 12:43:40 -04:00
Michael Witten	a26a8746ce	Docs: Kconfig: `devlopers' ->` developers' Signed-off-by: Michael Witten <mfwitten@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-10-14 15:48:34 +02:00
Linus Torvalds	d64dab903f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've got more bug fixes in my for-linus branch: One of these fixes another corner of the compression oops from last time. Miao nailed down some problems with concurrent snapshot deletion and drive balancing. I kept out one of his patches for more testing, but these are all stable" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix oops caused by the space balance and dead roots Btrfs: insert orphan roots into fs radix tree Btrfs: limit delalloc pages outside of find_delalloc_range Btrfs: use right root when checking for hash collision	2013-10-12 12:54:24 -07:00
Miao Xie	c00869f1ae	Btrfs: fix oops caused by the space balance and dead roots When doing space balance and subvolume destroy at the same time, we met the following oops: kernel BUG at fs/btrfs/relocation.c:2247! RIP: 0010: [<ffffffffa04cec16>] prepare_to_merge+0x154/0x1f0 [btrfs] Call Trace: [<ffffffffa04b5ab7>] relocate_block_group+0x466/0x4e6 [btrfs] [<ffffffffa04b5c7a>] btrfs_relocate_block_group+0x143/0x275 [btrfs] [<ffffffffa0495c56>] btrfs_relocate_chunk.isra.27+0x5c/0x5a2 [btrfs] [<ffffffffa0459871>] ? btrfs_item_key_to_cpu+0x15/0x31 [btrfs] [<ffffffffa048b46a>] ? btrfs_get_token_64+0x7e/0xcd [btrfs] [<ffffffffa04a3467>] ? btrfs_tree_read_unlock_blocking+0xb2/0xb7 [btrfs] [<ffffffffa049907d>] btrfs_balance+0x9c7/0xb6f [btrfs] [<ffffffffa049ef84>] btrfs_ioctl_balance+0x234/0x2ac [btrfs] [<ffffffffa04a1e8e>] btrfs_ioctl+0xd87/0x1ef9 [btrfs] [<ffffffff81122f53>] ? path_openat+0x234/0x4db [<ffffffff813c3b78>] ? __do_page_fault+0x31d/0x391 [<ffffffff810f8ab6>] ? vma_link+0x74/0x94 [<ffffffff811250f5>] vfs_ioctl+0x1d/0x39 [<ffffffff811258c8>] do_vfs_ioctl+0x32d/0x3e2 [<ffffffff811259d4>] SyS_ioctl+0x57/0x83 [<ffffffff813c3bfa>] ? do_page_fault+0xe/0x10 [<ffffffff813c73c2>] system_call_fastpath+0x16/0x1b It is because we returned the error number if the reference of the root was 0 when doing space relocation. It was not right here, because though the root was dead(refs == 0), but the space it held still need be relocated, or we could not remove the block group. So in this case, we should return the root no matter it is dead or not. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-10-10 21:31:02 -04:00
Miao Xie	14927d9546	Btrfs: insert orphan roots into fs radix tree Now we don't drop all the deleted snapshots/subvolumes before the space balance. It means we have to relocate the space which is held by the dead snapshots/subvolumes. So we must into them into fs radix tree, or we would forget to commit the change of them when doing transaction commit, and it would corrupt the metadata. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-10-10 21:30:53 -04:00
Josef Bacik	7bf811a595	Btrfs: limit delalloc pages outside of find_delalloc_range Liu fixed part of this problem and unfortunately I steered him in slightly the wrong direction and so didn't completely fix the problem. The problem is we limit the size of the delalloc range we are looking for to max bytes and then we try to lock that range. If we fail to lock the pages in that range we will shrink the max bytes to a single page and re loop. However if our first page is inside of the delalloc range then we will end up limiting the end of the range to a period before our first page. This is illustrated below [0 -------- delalloc range --------- 256mb] [page] So find_delalloc_range will return with delalloc_start as 0 and end as 128mb, and then we will notice that delalloc_start < *start and adjust it up, but not adjust delalloc_end up, so things go sideways. To fix this we need to not limit the max bytes in find_delalloc_range, but in find_lock_delalloc_range and that way we don't end up with this confusion. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-10-10 21:27:56 -04:00
Josef Bacik	4871c1588f	Btrfs: use right root when checking for hash collision btrfs_rename was using the root of the old dir instead of the root of the new dir when checking for a hash collision, so if you tried to move a file into a subvol it would freak out because it would see the file you are trying to move in its current root. This fixes the bug where this would fail btrfs subvol create test1 btrfs subvol create test2 mv test1 test2. Thanks to Chris Murphy for catching this, Cc: stable@vger.kernel.org Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-10-10 21:27:45 -04:00
Darrick J. Wong	b208c2f7ce	btrfs: Fix crash due to not allocating integrity data for a bioset When btrfs creates a bioset, we must also allocate the integrity data pool. Otherwise btrfs will crash when it tries to submit a bio to a checksumming disk: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150 PGD 2305e4067 PUD 23063d067 PMD 0 Oops: 0000 [#1] PREEMPT SMP Modules linked in: btrfs scsi_debug xfs ext4 jbd2 ext3 jbd mbcache sch_fq_codel eeprom lpc_ich mfd_core nfsd exportfs auth_rpcgss af_packet raid6_pq xor zlib_deflate libcrc32c [last unloaded: scsi_debug] CPU: 1 PID: 4486 Comm: mount Not tainted 3.12.0-rc1-mcsum #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff8802451c9720 ti: ffff880230698000 task.ti: ffff880230698000 RIP: 0010:[<ffffffff8111e28a>] [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150 RSP: 0018:ffff880230699688 EFLAGS: 00010286 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00000000005f8445 RDX: 0000000000000001 RSI: 0000000000000010 RDI: 0000000000000000 RBP: ffff8802306996f8 R08: 0000000000011200 R09: 0000000000000008 R10: 0000000000000020 R11: ffff88009d6e8000 R12: 0000000000011210 R13: 0000000000000030 R14: ffff8802306996b8 R15: ffff8802451c9720 FS: 00007f25b8a16800(0000) GS:ffff88024fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000018 CR3: 0000000230576000 CR4: 00000000000007e0 Stack: ffff8802451c9720 0000000000000002 ffffffff81a97100 0000000000281250 ffffffff81a96480 ffff88024fc99150 ffff880228d18200 0000000000000000 0000000000000000 0000000000000040 ffff880230e8c2e8 ffff8802459dc900 Call Trace: [<ffffffff811b2208>] bio_integrity_alloc+0x48/0x1b0 [<ffffffff811b26fc>] bio_integrity_prep+0xac/0x360 [<ffffffff8111e298>] ? mempool_alloc+0x58/0x150 [<ffffffffa03e8041>] ? alloc_extent_state+0x31/0x110 [btrfs] [<ffffffff81241579>] blk_queue_bio+0x1c9/0x460 [<ffffffff8123e58a>] generic_make_request+0xca/0x100 [<ffffffff8123e639>] submit_bio+0x79/0x160 [<ffffffffa03f865e>] btrfs_map_bio+0x48e/0x5b0 [btrfs] [<ffffffffa03c821a>] btree_submit_bio_hook+0xda/0x110 [btrfs] [<ffffffffa03e7eba>] submit_one_bio+0x6a/0xa0 [btrfs] [<ffffffffa03ef450>] read_extent_buffer_pages+0x250/0x310 [btrfs] [<ffffffff8125eef6>] ? __radix_tree_preload+0x66/0xf0 [<ffffffff8125f1c5>] ? radix_tree_insert+0x95/0x260 [<ffffffffa03c66f6>] btree_read_extent_buffer_pages.constprop.128+0xb6/0x120 [btrfs] [<ffffffffa03c8c1a>] read_tree_block+0x3a/0x60 [btrfs] [<ffffffffa03caefd>] open_ctree+0x139d/0x2030 [btrfs] [<ffffffffa03a282a>] btrfs_mount+0x53a/0x7d0 [btrfs] [<ffffffff8113ab0b>] ? pcpu_alloc+0x8eb/0x9f0 [<ffffffff81167305>] ? __kmalloc_track_caller+0x35/0x1e0 [<ffffffff81176ba0>] mount_fs+0x20/0xd0 [<ffffffff81191096>] vfs_kern_mount+0x76/0x120 [<ffffffff81193320>] do_mount+0x200/0xa40 [<ffffffff81135cdb>] ? strndup_user+0x5b/0x80 [<ffffffff81193bf0>] SyS_mount+0x90/0xe0 [<ffffffff8156d31d>] system_call_fastpath+0x1a/0x1f Code: 4c 8d 75 a8 4c 89 6d e8 45 89 e0 4c 8d 6f 30 48 89 5d d8 41 83 e0 af 48 89 fb 49 83 c6 18 4c 89 7d f8 65 4c 8b 3c 25 c0 b8 00 00 <48> 8b 73 18 44 89 c7 44 89 45 98 ff 53 20 48 85 c0 48 89 c2 74 RIP [<ffffffff8111e28a>] mempool_alloc+0x4a/0x150 RSP <ffff880230699688> CR2: 0000000000000018 ---[ end trace 7a96042017ed21e2 ]--- Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-10-05 10:52:10 -04:00
Chris Mason	1329dfc8bb	Merge branch 'for-linus' into for-linus-3.12	2013-10-05 10:51:32 -04:00
Ilya Dryomov	1357272fc7	Btrfs: fix a use-after-free bug in btrfs_dev_replace_finishing free_device rcu callback, scheduled from btrfs_rm_dev_replace_srcdev, can be processed before btrfs_scratch_superblock is called, which would result in a use-after-free on btrfs_device contents. Fix this by zeroing the superblock before the rcu callback is registered. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-10-04 16:02:14 -04:00
Ilya Dryomov	964fb15acf	Btrfs: eliminate races in worker stopping code The current implementation of worker threads in Btrfs has races in worker stopping code, which cause all kinds of panics and lockups when running btrfs/011 xfstest in a loop. The problem is that btrfs_stop_workers is unsynchronized with respect to check_idle_worker, check_busy_worker and __btrfs_start_workers. E.g., check_idle_worker race flow: btrfs_stop_workers(): check_idle_worker(aworker): - grabs the lock - splices the idle list into the working list - removes the first worker from the working list - releases the lock to wait for its kthread's completion - grabs the lock - if aworker is on the working list, moves aworker from the working list to the idle list - releases the lock - grabs the lock - puts the worker - removes the second worker from the working list ...... btrfs_stop_workers returns, aworker is on the idle list FS is umounted, memory is freed ...... aworker is waken up, fireworks ensue With this applied, I wasn't able to trigger the problem in 48 hours, whereas previously I could reliably reproduce at least one of these races within an hour. Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-10-04 16:02:13 -04:00
Liu Bo	385fe0bede	Btrfs: fix crash of compressed writes The crash[1] is found by xfstests/generic/208 with "-o compress", it's not reproduced everytime, but it does panic. The bug is quite interesting, it's actually introduced by a recent commit (`573aecafca`, Btrfs: actually limit the size of delalloc range). Btrfs implements delay allocation, so during writeback, we (1) get a page A and lock it (2) search the state tree for delalloc bytes and lock all pages within the range (3) process the delalloc range, including find disk space and create ordered extent and so on. (4) submit the page A. It runs well in normal cases, but if we're in a racy case, eg. buffered compressed writes and aio-dio writes, sometimes we may fail to lock all pages in the 'delalloc' range, in which case, we need to fall back to search the state tree again with a smaller range limit(max_bytes = PAGE_CACHE_SIZE - offset). The mentioned commit has a side effect, that is, in the fallback case, we can find delalloc bytes before the index of the page we already have locked, so we're in the case of (delalloc_end <= *start) and return with (found > 0). This ends with not locking delalloc pages but making ->writepage still process them, and the crash happens. This fixes it by just thinking that we find nothing and returning to caller as the caller knows how to deal with it properly. [1]: ------------[ cut here ]------------ kernel BUG at mm/page-writeback.c:2170! [...] CPU: 2 PID: 11755 Comm: btrfs-delalloc- Tainted: G O 3.11.0+ #8 [...] RIP: 0010:[<ffffffff810f5093>] [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83 [...] [ 4934.248731] Stack: [ 4934.248731] ffff8801477e5dc8 ffffea00049b9f00 ffff8801869f9ce8 ffffffffa02b841a [ 4934.248731] 0000000000000000 0000000000000000 0000000000000fff 0000000000000620 [ 4934.248731] ffff88018db59c78 ffffea0005da8d40 ffffffffa02ff860 00000001810016c0 [ 4934.248731] Call Trace: [ 4934.248731] [<ffffffffa02b841a>] extent_range_clear_dirty_for_io+0xcf/0xf5 [btrfs] [ 4934.248731] [<ffffffffa02a8889>] compress_file_range+0x1dc/0x4cb [btrfs] [ 4934.248731] [<ffffffff8104f7af>] ? detach_if_pending+0x22/0x4b [ 4934.248731] [<ffffffffa02a8bad>] async_cow_start+0x35/0x53 [btrfs] [ 4934.248731] [<ffffffffa02c694b>] worker_loop+0x14b/0x48c [btrfs] [ 4934.248731] [<ffffffffa02c6800>] ? btrfs_queue_worker+0x25c/0x25c [btrfs] [ 4934.248731] [<ffffffff810608f5>] kthread+0x8d/0x95 [ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43 [ 4934.248731] [<ffffffff814fe09c>] ret_from_fork+0x7c/0xb0 [ 4934.248731] [<ffffffff81060868>] ? kthread_freezable_should_stop+0x43/0x43 [ 4934.248731] Code: ff 85 c0 0f 94 c0 0f b6 c0 59 5b 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 2c de 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 52 49 8b 84 24 80 00 00 00 f6 40 20 01 75 44 [ 4934.248731] RIP [<ffffffff810f5093>] clear_page_dirty_for_io+0x1e/0x83 [ 4934.248731] RSP <ffff8801869f9c48> [ 4934.280307] ---[ end trace 36f06d3f8750236a ]--- Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-10-04 16:02:11 -04:00
Josef Bacik	60e7cd3a4b	Btrfs: fix transid verify errors when recovering log tree If we crash with a log, remount and recover that log, and then crash before we can commit another transaction we will get transid verify errors on the next mount. This is because we were not zero'ing out the log when we committed the transaction after recovery. This is ok as long as we commit another transaction at some point in the future, but if you abort or something else goes wrong you can end up in this weird state because the recovery stuff says that the tree log should have a generation+1 of the super generation, which won't be the case of the transaction that was started for recovery. Fix this by removing the check and _always_ zero out the log portion of the super when we commit a transaction. This fixes the transid verify issues I was seeing with my force errors tests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-10-04 16:02:09 -04:00
Linus Torvalds	0fbf2cc983	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are mostly bug fixes and a two small performance fixes. The most important of the bunch are Josef's fix for a snapshotting regression and Mark's update to fix compile problems on arm" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits) Btrfs: create the uuid tree on remount rw btrfs: change extent-same to copy entire argument struct Btrfs: dir_inode_operations should use btrfs_update_time also btrfs: Add btrfs: prefix to kernel log output btrfs: refuse to remount read-write after abort Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0 Btrfs: don't leak transaction in btrfs_sync_file() Btrfs: add the missing mutex unlock in write_all_supers() Btrfs: iput inode on allocation failure Btrfs: remove space_info->reservation_progress Btrfs: kill delay_iput arg to the wait_ordered functions Btrfs: fix worst case calculator for space usage Revert "Btrfs: rework the overcommit logic to be based on the total size" Btrfs: improve replacing nocow extents Btrfs: drop dir i_size when adding new names on replay Btrfs: replay dir_index items before other items Btrfs: check roots last log commit when checking if an inode has been logged Btrfs: actually log directory we are fsync()'ing Btrfs: actually limit the size of delalloc range Btrfs: allocate the free space by the existed max extent size when ENOSPC ...	2013-09-22 14:58:49 -07:00
Josef Bacik	94aebfb2e7	Btrfs: create the uuid tree on remount rw Users have been complaining of the uuid tree stuff warning that there is no uuid root when trying to do snapshot operations. This is because if you mount -o ro we will not create the uuid tree. But then if you mount -o rw,remount we will still not create it and then any subsequent snapshot/subvol operations you try to do will fail gloriously. Fix this by creating the uuid_root on remount rw if it was not already there. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:50:43 -04:00
Mark Fasheh	cbf8b8ca3e	btrfs: change extent-same to copy entire argument struct btrfs_ioctl_file_extent_same() uses __put_user_unaligned() to copy some data back to it's argument struct. Unfortunately, not all architectures provide __put_user_unaligned(), so compiles break on them if btrfs is selected. Instead, just copy the whole struct in / out at the start and end of operations, respectively. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:31 -04:00
Guangyu Sun	93fd63c2f0	Btrfs: dir_inode_operations should use btrfs_update_time also Commit `2bc5565286` (Btrfs: don't update atime on RO subvolumes) ensures that the access time of an inode is not updated when the inode lives in a read-only subvolume. However, if a directory on a read-only subvolume is accessed, the atime is updated. This results in a write operation to a read-only subvolume. I believe that access times should never be updated on read-only subvolumes. To reproduce: # mkfs.btrfs -f /dev/dm-3 (...) # mount /dev/dm-3 /mnt # btrfs subvol create /mnt/sub Create subvolume '/mnt/sub' # mkdir /mnt/sub/dir # echo "abc" > /mnt/sub/dir/file # btrfs subvol snapshot -r /mnt/sub /mnt/rosnap Create a readonly snapshot of '/mnt/sub' in '/mnt/rosnap' # stat /mnt/rosnap/dir File: `/mnt/rosnap/dir' Size: 8 Blocks: 0 IO Block: 4096 directory Device: 16h/22d Inode: 257 Links: 1 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2013-09-11 07:21:49.389157126 -0400 Modify: 2013-09-11 07:22:02.330156079 -0400 Change: 2013-09-11 07:22:02.330156079 -0400 # ls /mnt/rosnap/dir file # stat /mnt/rosnap/dir File: `/mnt/rosnap/dir' Size: 8 Blocks: 0 IO Block: 4096 directory Device: 16h/22d Inode: 257 Links: 1 Access: (0755/drwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2013-09-11 07:22:56.797151670 -0400 Modify: 2013-09-11 07:22:02.330156079 -0400 Change: 2013-09-11 07:22:02.330156079 -0400 Reported-by: Koen De Wit <koen.de.wit@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:30 -04:00
Frank Holton	5138cccf34	btrfs: Add btrfs: prefix to kernel log output The kernel log entries for device label %s and device fsid %pU are missing the btrfs: prefix. Add those here. Signed-off-by: Frank Holton <fholton@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:30 -04:00
David Sterba	6ef3de9c92	btrfs: refuse to remount read-write after abort It's still possible to flip the filesystem into RW mode after it's remounted RO due to an abort. There are lots of places that check for the superblock error bit and will not write data, but we should not let the filesystem appear read-write. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:30 -04:00
chandan	1cecf579d1	Btrfs: btrfs_ioctl_default_subvol: Revert back to toplevel subvolume when arg is 0 This patch makes it possible to set BTRFS_FS_TREE_OBJECTID as the default subvolume by passing a subvolume id of 0. Signed-off-by: chandan <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:29 -04:00
Filipe David Borba Manana	a0634be562	Btrfs: don't leak transaction in btrfs_sync_file() In btrfs_sync_file(), if the call to btrfs_log_dentry_safe() returns a negative error (for e.g. -ENOMEM via btrfs_log_inode()), we would return without ending/freeing the transaction. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:29 -04:00
Stefan Behrens	a724b43690	Btrfs: add the missing mutex unlock in write_all_supers() The BUG() was replaced by btrfs_error() and return -EIO with the patch "get rid of one BUG() in write_all_supers()", but the missing mutex_unlock() was overlooked. The 0-DAY kernel build service from Intel reported the missing unlock which was found by the coccinelle tool: fs/btrfs/disk-io.c:3422:2-8: preceding lock on line 3374 Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:28 -04:00
Josef Bacik	f4ab9ea706	Btrfs: iput inode on allocation failure We don't do the iput when we fail to allocate our delayed delalloc work in __start_delalloc_inodes, fix this. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:28 -04:00
Josef Bacik	363e4d354e	Btrfs: remove space_info->reservation_progress This isn't used for anything anymore, just remove it. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:27 -04:00
Josef Bacik	f0de181c9b	Btrfs: kill delay_iput arg to the wait_ordered functions This is a left over of how we used to wait for ordered extents, which was to grab the inode and then run filemap flush on it. However if we have an ordered extent then we already are holding a ref on the inode, and we just use btrfs_start_ordered_extent anyway, so there is no reason to have an extra ref on the inode to start work on the ordered extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:27 -04:00
Josef Bacik	c4fbb4300a	Btrfs: fix worst case calculator for space usage Forever ago I made the worst case calculator say that we could potentially split into 3 blocks for every level on the way down, which isn't right. If we split we're only going to get two new blocks, the one we originally cow'ed and the new one we're going to split. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:27 -04:00
Josef Bacik	14575aef42	Revert "Btrfs: rework the overcommit logic to be based on the total size" This reverts commit `70afa3998c`. It is causing performance issues and wasn't actually correct. There were problems with the way we flushed delalloc and that was the real cause of the early enospc. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:26 -04:00
Josef Bacik	652f25a292	Btrfs: improve replacing nocow extents Various people have hit a deadlock when running btrfs/011. This is because when replacing nocow extents we will take the i_mutex to make sure nobody messes with the file while we are replacing the extent. The problem is we are already holding a transaction open, which is a locking inversion, so instead we need to save these inodes we find and then process them outside of the transaction. Further we can't just lock the inode and assume we are good to go. We need to lock the extent range and then read back the extent cache for the inode to make sure the extent really still points at the physical block we want. If it doesn't we don't have to copy it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:26 -04:00
Josef Bacik	d555438b6e	Btrfs: drop dir i_size when adding new names on replay So if we have dir_index items in the log that means we also have the inode item as well, which means that the inode's i_size is correct. However when we process dir_index'es we call btrfs_add_link() which will increase the directory's i_size for the new entry. To fix this we need to just set the dir items i_size to 0, and then as we find dir_index items we adjust the i_size. btrfs_add_link() will do it for new entries, and if the entry already exists we can just add the name_len to the i_size ourselves. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:25 -04:00
Josef Bacik	dd8e721773	Btrfs: replay dir_index items before other items A user reported a bug where his log would not replay because he was getting -EEXIST back. This was because he had a file moved into a directory that was logged. What happens is the file had a lower inode number, and so it is processed first when replaying the log, and so we add the inode ref in for the directory it was moved to. But then we process the directories DIR_INDEX item and try to add the inode ref for that inode and it fails because we already added it when we replayed the inode. To solve this problem we need to just process any DIR_INDEX items we have in the log first so this all is taken care of, and then we can replay the rest of the items. With this patch my reproducer can remount the file system properly instead of erroring out. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:25 -04:00
Josef Bacik	a5874ce6ce	Btrfs: check roots last log commit when checking if an inode has been logged Liu introduced a local copy of the last log commit for an inode to make sure we actually log an inode even if a log commit has already taken place. In order to make sure we didn't relog the same inode multiple times he set this local copy to the current trans when we log the inode, because usually we log the inode and then sync the log. The exception to this is during rename, we will relog an inode if the name changed and it is already in the log. The problem with this is then we go to sync the inode, and our check to see if the inode has already been logged is tripped and we don't sync the log. To fix this we need to _also_ check against the roots last log commit, because it could be less than what is in our local copy of the log commit. This fixes a bug where we rename a file into a directory and then fsync the directory and then on remount the directory is no longer there. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:24 -04:00
Josef Bacik	de2b530bfb	Btrfs: actually log directory we are fsync()'ing If you just create a directory and then fsync that directory and then pull the power plug you will come back up and the directory will not be there. That is because we won't actually create directories if we've logged files inside of them since they will be created on replay, but in this check we will set our logged_trans of our current directory if it happens to be a directory, making us think it doesn't need to be logged. Fix the logic to only do this to parent directories. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:24 -04:00
Josef Bacik	573aecafca	Btrfs: actually limit the size of delalloc range So forever we have had this thing to limit the amount of delalloc pages we'll setup to be written out to 128mb. This is because we have to lock all the pages in this range, so anything above this gets a bit unweildly, and also without a limit we'll happily allocate gigantic chunks of disk space. Turns out our check for this wasn't quite right, we wouldn't actually limit the chunk we wanted to write out, we'd just stop looking for more space after we went over the limit. So if you do a giant 20gb dd on my box with lots of ram I could get 2gig extents. This is fine normally, except when you go to relocate these extents and we can't find enough space to relocate these moster extents, since we have to be able to allocate exactly the same sized extent to move it around. So fix this by actually enforcing the limit. With this patch I'm no longer seeing giant 1.5gb extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:24 -04:00
Miao Xie	a482039889	Btrfs: allocate the free space by the existed max extent size when ENOSPC By the current code, if the requested size is very large, and all the extents in the free space cache are small, we will waste lots of the cpu time to cut the requested size in half and search the cache again and again until it gets down to the size the allocator can return. In fact, we can know the max extent size in the cache after the first search, so we needn't cut the size in half repeatedly, and just use the max extent size directly. This way can save lots of cpu time and make the performance grow up when there are only fragments in the free space cache. According to my test, if there are only 4KB free space extents in the fs, and the total size of those extents are 256MB, we can reduce the execute time of the following test from 5.4s to 1.4s. dd if=/dev/zero of=<testfile> bs=1MB count=1 oflag=sync Changelog v2 -> v3: - fix the problem that we skip the block group with the space which is less than we need. Changelog v1 -> v2: - address the problem that we return a wrong start position when searching the free space in a bitmap. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 11:05:23 -04:00
David Sterba	13fd8da98f	btrfs: add lockdep and tracing annotations for uuid tree Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:56 -04:00
Stefan Behrens	79556c3d88	btrfs: show compiled-in config features at module load time We want to know if there are debugging features compiled in, this may affect performance. The message is printed before the sanity checks. (This commit message is a copy of David Sterba's commit message when he introduced btrfs_print_info()). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:56 -04:00
Filipe David Borba Manana	cef2193729	Btrfs: more efficient inode tree replace operation Instead of removing the current inode from the red black tree and then add the new one, just use the red black tree replace operation, which is more efficient. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:55 -04:00
Ilya Dryomov	55e50e458e	Btrfs: do not add replace target to the alloc_list If replace was suspended by the umount, replace target device is added to the fs_devices->alloc_list during a later mount. This is obviously wrong. ->is_tgtdev_for_dev_replace is supposed to guard against that, but ->is_tgtdev_for_dev_replace is (and can only ever be) initialized after everything is opened and fs_devices lists are populated. Fix this by checking the devid instead: for replace targets it's always equal to BTRFS_DEV_REPLACE_DEVID. Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:55 -04:00
Josef Bacik	83d4cfd4da	Btrfs: fixup error handling in btrfs_reloc_cow If we failed to actually allocate the correct size of the extent to relocate we will end up in an infinite loop because we won't return an error, we'll just move on to the next extent. So fix this up by returning an error, and then fix all the callers to return an error up the stack rather than BUG_ON()'ing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-21 10:58:54 -04:00
Linus Torvalds	ac4de9543a	Merge branch 'akpm' (patches from Andrew Morton) Merge more patches from Andrew Morton: "The rest of MM. Plus one misc cleanup" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits) mm/Kconfig: add MMU dependency for MIGRATION. kernel: replace strict_strto() with kstrto() mm, thp: count thp_fault_fallback anytime thp fault fails thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() thp: do_huge_pmd_anonymous_page() cleanup thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() mm: cleanup add_to_page_cache_locked() thp: account anon transparent huge pages into NR_ANON_PAGES truncate: drop 'oldsize' truncate_pagecache() parameter mm: make lru_add_drain_all() selective memcg: document cgroup dirty/writeback memory statistics memcg: add per cgroup writeback pages accounting memcg: check for proper lock held in mem_cgroup_update_page_stat memcg: remove MEMCG_NR_FILE_MAPPED memcg: reduce function dereference memcg: avoid overflow caused by PAGE_ALIGN memcg: rename RESOURCE_MAX to RES_COUNTER_MAX memcg: correct RESOURCE_MAX to ULLONG_MAX mm: memcg: do not trap chargers with full callstack on OOM mm: memcg: rework and document OOM waiting and wakeup ...	2013-09-12 15:44:27 -07:00
Kirill A. Shutemov	7caef26767	truncate: drop 'oldsize' truncate_pagecache() parameter truncate_pagecache() doesn't care about old size since commit `cedabed49b` ("vfs: Fix vmtruncate() regression"). Let's drop it. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-09-12 15:38:02 -07:00
Linus Torvalds	b7c09ad401	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This is against 3.11-rc7, but was pulled and tested against your tree as of yesterday. We do have two small incrementals queued up, but I wanted to get this bunch out the door before I hop on an airplane. This is a fairly large batch of fixes, performance improvements, and cleanups from the usual Btrfs suspects. We've included Stefan Behren's work to index subvolume UUIDs, which is targeted at speeding up send/receive with many subvolumes or snapshots in place. It closes a long standing performance issue that was built in to the disk format. Mark Fasheh's offline dedup work is also here. In this case offline means the FS is mounted and active, but the dedup work is not done inline during file IO. This is a building block where utilities are able to ask the FS to dedup a series of extents. The kernel takes care of verifying the data involved really is the same. Today this involves reading both extents, but we'll continue to evolve the patches" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits) Btrfs: optimize key searches in btrfs_search_slot Btrfs: don't use an async starter for most of our workers Btrfs: only update disk_i_size as we remove extents Btrfs: fix deadlock in uuid scan kthread Btrfs: stop refusing the relocation of chunk 0 Btrfs: fix memory leak of uuid_root in free_fs_info btrfs: reuse kbasename helper btrfs: return btrfs error code for dev excl ops err Btrfs: allow partial ordered extent completion Btrfs: convert all bug_ons in free-space-cache.c Btrfs: add support for asserts Btrfs: adjust the fs_devices->missing count on unmount Btrf: cleanup: don't check for root_refs == 0 twice Btrfs: fix for patch "cleanup: don't check the same thing twice" Btrfs: get rid of one BUG() in write_all_supers() Btrfs: allocate prelim_ref with a slab allocater Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl Btrfs: fix race between removing a dev and writing sbs Btrfs: remove ourselves from the cluster list under lock ...	2013-09-12 09:58:51 -07:00
Linus Torvalds	2e515bf096	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "The usual trivial updates all over the tree -- mostly typo fixes and documentation updates" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (52 commits) doc: Documentation/cputopology.txt fix typo treewide: Convert retrun typos to return Fix comment typo for init_cma_reserved_pageblock Documentation/trace: Correcting and extending tracepoint documentation mm/hotplug: fix a typo in Documentation/memory-hotplug.txt power: Documentation: Update s2ram link doc: fix a typo in Documentation/00-INDEX Documentation/printk-formats.txt: No casts needed for u64/s64 doc: Fix typo "is is" in Documentations treewide: Fix printks with 0x%# zram: doc fixes Documentation/kmemcheck: update kmemcheck documentation doc: documentation/hwspinlock.txt fix typo PM / Hibernate: add section for resume options doc: filesystems : Fix typo in Documentations/filesystems scsi/megaraid fixed several typos in comments ppc: init_32: Fix error typo "CONFIG_START_KERNEL" treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks page_isolation: Fix a comment typo in test_pages_isolated() doc: fix a typo about irq affinity ...	2013-09-06 09:36:28 -07:00
Linus Torvalds	45d9a2220f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...	2013-09-05 08:50:26 -07:00
Linus Torvalds	27703bb4a6	PTR_RET() is a weird name, and led to some confusing usage. We ended up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages. This has been sitting in linux-next for a whole cycle. Thanks, Rusty. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJSJo+1AAoJENkgDmzRrbjxIC4QALJK95o8AUXuwUkl+2fmFkUt hh2/PJ1vDYgk4Xt0J6hyoK7XMa0H1RkbBrROuDdsBnorMFpEsGcgdkUZte9ufoAS 97Bg+7N0KPbTB/S8vOwtW1vbERTJIVPN2uf6h1Wqm9Xc2puCh3HbMMr1AWMGu0WQ NqY5+Zz8zecy1UOrMhEP6H1CjeQcL1w1DO6YM5ydeqlKNzAz+JMfDXriLPDwiE7+ XFPDF/O3Vtd2ckA7L70Lio7hfHwxV5U4WwFVfiwls98XB4jcZqDKIoh1r8z4SRgR +0Rae2DN3BaOabGMr//5XdrzQVpwJTh5m2w8BAOHJvCJ9HR7Sq29UIN4u+TowZBy L2xYo4dvFxkympwu5zEd3c7vHYWKIaqmSq5PIjr4gF/uIo2OeOTrpPIK782ZEYb7 e+qUgOEM05V9AmQZCrSZeP9u474Sj8ow3sCtWxfdRtwNfoEIcUXsNNJd/zDHlVtW cEtXqc2xXIpcuUJQWlSaGp8fmRQjVZPzrLKYLM2m39ZcOOJbf5rzQAYS7hHPosIa SK+YVux/+Zzi+Xo/vXq1OlM/SruCr5S7JOgCxLowoQ88vupgXME6uPyC8EO+QQ50 GsrHes5ZNLbk0uVsfcexIyojkUnyvDmmnDpv+1zdC6RgZLJQn8OXp5yNhHhnhrFT BiHX6YFWtDDqRlVv8Q0F =LeaW -----END PGP SIGNATURE----- Merge tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux Pull PTR_RET() removal patches from Rusty Russell: "PTR_RET() is a weird name, and led to some confusing usage. We ended up with PTR_ERR_OR_ZERO(), and replacing or fixing all the usages. This has been sitting in linux-next for a whole cycle" [ There are still some PTR_RET users scattered about, with some of them possibly being new, but most of them existing in Rusty's tree too. We have that #define PTR_RET(p) PTR_ERR_OR_ZERO(p) thing in <linux/err.h>, so they continue to work for now - Linus ] * tag 'PTR_RET-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: GFS2: Replace PTR_RET with PTR_ERR_OR_ZERO Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO drm/cma: Replace PTR_RET with PTR_ERR_OR_ZERO sh_veu: Replace PTR_RET with PTR_ERR_OR_ZERO dma-buf: Replace PTR_RET with PTR_ERR_OR_ZERO drivers/rtc: Replace PTR_RET with PTR_ERR_OR_ZERO mm/oom_kill: remove weird use of ERR_PTR()/PTR_ERR(). staging/zcache: don't use PTR_RET(). remoteproc: don't use PTR_RET(). pinctrl: don't use PTR_RET(). acpi: Replace weird use of PTR_RET. s390: Replace weird use of PTR_RET. PTR_RET is now PTR_ERR_OR_ZERO(): Replace most. PTR_RET is now PTR_ERR_OR_ZERO	2013-09-04 17:31:11 -07:00
Christoph Hellwig	02afc27fae	direct-io: Handle O_(D)SYNC AIO Call generic_write_sync() from the deferred I/O completion handler if O_DSYNC is set for a write request. Also make sure various callers don't call generic_write_sync if the direct I/O code returns -EIOCBQUEUED. Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-09-04 09:23:46 -04:00
Filipe David Borba Manana	d7396f0735	Btrfs: optimize key searches in btrfs_search_slot When the binary search returns 0 (exact match), the target key will necessarily be at slot 0 of all nodes below the current one, so in this case the binary search is not needed because it will always return 0, and we waste time doing it, holding node locks for longer than necessary, etc. Below follow histograms with the times spent on the current approach of doing a binary search when the previous binary search returned 0, and times for the new approach, which directly picks the first item/child node in the leaf/node. Current approach: Count: 6682 Range: 35.000 - 8370.000; Mean: 85.837; Median: 75.000; Stddev: 106.429 Percentiles: 90th: 124.000; 95th: 145.000; 99th: 206.000 35.000 - 61.080: 1235 ################ 61.080 - 106.053: 4207 ##################################################### 106.053 - 183.606: 1122 ############## 183.606 - 317.341: 111 # 317.341 - 547.959: 6 \| 547.959 - 8370.000: 1 \| Approach proposed by this patch: Count: 6682 Range: 6.000 - 135.000; Mean: 16.690; Median: 16.000; Stddev: 7.160 Percentiles: 90th: 23.000; 95th: 27.000; 99th: 40.000 6.000 - 8.418: 58 # 8.418 - 11.670: 1149 ######################### 11.670 - 16.046: 2418 ##################################################### 16.046 - 21.934: 2098 ############################################## 21.934 - 29.854: 744 ################ 29.854 - 40.511: 154 ### 40.511 - 54.848: 41 # 54.848 - 74.136: 5 \| 74.136 - 100.087: 9 \| 100.087 - 135.000: 6 \| These samples were captured during a run of the btrfs tests 001, 002 and 004 in the xfstests, with a leaf/node size of 4Kb. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:42 -04:00
Josef Bacik	45d5fd14d2	Btrfs: don't use an async starter for most of our workers We only need an async starter if we can't make a GFP_NOFS allocation in our current path. This is the case for the endio stuff since it happens in IRQ context, but things like the caching thread workers and the delalloc flushers we can easily make this allocation and start threads right away. Also change the worker count for the caching thread pool. Traditionally we limited this to 2 since we took read locks while caching, but nowadays we do this lockless so there's no reason to limit the number of caching threads. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:41 -04:00
Josef Bacik	7f4f6e0a3f	Btrfs: only update disk_i_size as we remove extents This fixes a problem where if we fail a truncate we will leave the i_size set where we wanted to truncate to instead of where we were able to truncate to. Fix this by making btrfs_truncate_inode_items do the disk_i_size update as it removes extents, that way it will always be consistent with where its extents are. Then if the truncate fails at all we can update the in-ram i_size with what we have on disk and delete the orphan item. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:40 -04:00
Filipe David Borba Manana	f45388f387	Btrfs: fix deadlock in uuid scan kthread If there's an ongoing transaction when the uuid scan kthread attempts to create one, the kthread will block, waiting for that transaction to finish while it's keeping locks on the tree root, and in turn the existing transaction is waiting for those locks to be free. The stack trace reported by the kernel follows. [36700.671601] INFO: task btrfs-uuid:15480 blocked for more than 120 seconds. [36700.671602] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [36700.671602] btrfs-uuid D 0000000000000000 0 15480 2 0x00000000 [36700.671604] ffff880710bd5b88 0000000000000046 ffff8803d36ba850 0000000000030000 [36700.671605] ffff8806d76dc530 ffff880710bd5fd8 ffff880710bd5fd8 ffff880710bd5fd8 [36700.671607] ffff8808098ac530 ffff8806d76dc530 ffff880710bd5b98 ffff8805e4508e40 [36700.671608] Call Trace: [36700.671610] [<ffffffff816f36b9>] schedule+0x29/0x70 [36700.671620] [<ffffffffa05a3bdf>] wait_current_trans.isra.33+0xbf/0x120 [btrfs] [36700.671623] [<ffffffff81066760>] ? add_wait_queue+0x60/0x60 [36700.671629] [<ffffffffa05a5b06>] start_transaction+0x3d6/0x530 [btrfs] [36700.671636] [<ffffffffa05bb1f4>] ? btrfs_get_token_32+0x64/0xf0 [btrfs] [36700.671642] [<ffffffffa05a5fbb>] btrfs_start_transaction+0x1b/0x20 [btrfs] [36700.671649] [<ffffffffa05c8a81>] btrfs_uuid_scan_kthread+0x211/0x3d0 [btrfs] [36700.671655] [<ffffffffa05c8870>] ? __btrfs_open_devices+0x2a0/0x2a0 [btrfs] [36700.671657] [<ffffffff81065fa0>] kthread+0xc0/0xd0 [36700.671659] [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0 [36700.671661] [<ffffffff816fcd1c>] ret_from_fork+0x7c/0xb0 [36700.671662] [<ffffffff81065ee0>] ? flush_kthread_worker+0xb0/0xb0 [36700.671663] INFO: task btrfs:15481 blocked for more than 120 seconds. [36700.671664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [36700.671665] btrfs D 0000000000000000 0 15481 15212 0x00000004 [36700.671666] ffff880248cbf4c8 0000000000000086 ffff8803d36ba700 ffff8801dbd5c280 [36700.671668] ffff880807815c40 ffff880248cbffd8 ffff880248cbffd8 ffff880248cbffd8 [36700.671669] ffff8805e86a0000 ffff880807815c40 ffff880248cbf4d8 ffff8801dbd5c280 [36700.671670] Call Trace: [36700.671672] [<ffffffff816f36b9>] schedule+0x29/0x70 [36700.671679] [<ffffffffa05d9b0d>] btrfs_tree_lock+0x6d/0x230 [btrfs] [36700.671680] [<ffffffff81066760>] ? add_wait_queue+0x60/0x60 [36700.671685] [<ffffffffa0582829>] btrfs_search_slot+0x999/0xb00 [btrfs] [36700.671691] [<ffffffffa05bd9de>] ? btrfs_lookup_first_ordered_extent+0x5e/0xb0 [btrfs] [36700.671698] [<ffffffffa05e3e54>] __btrfs_write_out_cache+0x8c4/0xa80 [btrfs] [36700.671704] [<ffffffffa05e4362>] btrfs_write_out_cache+0xb2/0xf0 [btrfs] [36700.671710] [<ffffffffa05c4441>] ? free_extent_buffer+0x61/0xc0 [btrfs] [36700.671716] [<ffffffffa0594c82>] btrfs_write_dirty_block_groups+0x562/0x650 [btrfs] [36700.671723] [<ffffffffa0610092>] commit_cowonly_roots+0x171/0x24b [btrfs] [36700.671729] [<ffffffffa05a4dde>] btrfs_commit_transaction+0x4fe/0xa10 [btrfs] [36700.671735] [<ffffffffa0610af3>] create_subvol+0x5c0/0x636 [btrfs] [36700.671742] [<ffffffffa05d49ff>] btrfs_mksubvol.isra.60+0x33f/0x3f0 [btrfs] [36700.671747] [<ffffffffa05d4bf2>] btrfs_ioctl_snap_create_transid+0x142/0x190 [btrfs] [36700.671752] [<ffffffffa05d4c6c>] ? btrfs_ioctl_snap_create+0x2c/0x80 [btrfs] [36700.671757] [<ffffffffa05d4c9e>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs] [36700.671759] [<ffffffff8113a764>] ? handle_pte_fault+0x84/0x920 [36700.671764] [<ffffffffa05d87eb>] btrfs_ioctl+0xf0b/0x1d00 [btrfs] [36700.671766] [<ffffffff8113c120>] ? handle_mm_fault+0x210/0x310 [36700.671768] [<ffffffff816f83a4>] ? __do_page_fault+0x284/0x4e0 [36700.671770] [<ffffffff81180aa6>] do_vfs_ioctl+0x96/0x550 [36700.671772] [<ffffffff81170fe3>] ? __sb_end_write+0x33/0x70 [36700.671774] [<ffffffff81180ff1>] SyS_ioctl+0x91/0xb0 [36700.671775] [<ffffffff816fcdc2>] system_call_fastpath+0x16/0x1b Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:39 -04:00
Ilya Dryomov	795a332139	Btrfs: stop refusing the relocation of chunk 0 AFAICT chunk 0 is no longer special, and so it should be restriped just like every other chunk. One reason for this change is us refusing the relocation can lead to filesystems that can only be mounted ro, and never rw -- see the bugzilla [1] for details. The other reason is that device removal code is already doing this: it will happily relocate chunk 0 is part of shrinking the device. [1] https://bugzilla.kernel.org/show_bug.cgi?id=60594 Reported-by: Xavier Bassery <xavier@bartica.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:38 -04:00
Filipe David Borba Manana	d8f980391f	Btrfs: fix memory leak of uuid_root in free_fs_info Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:37 -04:00
Andy Shevchenko	ed84885d1e	btrfs: reuse kbasename helper To get name of the file from a pathname let's use kbasename() helper. It allows to simplify code a bit. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:36 -04:00
Anand Jain	e57138b3e9	btrfs: return btrfs error code for dev excl ops err now threads can return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS as defined in btrfs.h for the dev excl operation error in the FS, which means with this kernel would stop logging (almost an user error) into the /var/log/messages v2: accepts Josef' comment Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:35 -04:00
Josef Bacik	77cef2ec54	Btrfs: allow partial ordered extent completion We currently have this problem where you can truncate pages that have not yet been written for an ordered extent. We do this because the truncate will be coming behind to clean us up anyway so what's the harm right? Well if truncate fails for whatever reason we leave an orphan item around for the file to be cleaned up later. But if the user goes and truncates up the file and tries to read from the area that had been discarded previously they will get a csum error because we never actually wrote that data out. This patch fixes this by allowing us to either discard the ordered extent completely, by which I mean we just free up the space we had allocated and not add the file extent, or adjust the length of the file extent we write. We do this by setting the length we truncated down to in the ordered extent, and then we set the file extent length and ram bytes to this length. The total disk space stays unchanged since we may be compressed and we can't just chop off the disk space, but at least this way the file extent only points to the valid data. Then when the file extent is free'd the extent and csums will be freed normally. This patch is needed for the next series which will give us more graceful recovery of failed truncates. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:34 -04:00
Josef Bacik	b12d6869f6	Btrfs: convert all bug_ons in free-space-cache.c All of these are logic checks to make sure we're not breaking anything, so convert them over to ASSERT(). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:33 -04:00
Josef Bacik	2e17c7c65e	Btrfs: add support for asserts One of the complaints we get a lot is how many BUG_ON()'s we have. So to help with this I'm introducing a kconfig option to enable/disable a new ASSERT() mechanism much like what XFS does. This will allow us developers to still get our nice panics but allow users/distros to compile them out. With this we can go through and convert any BUG_ON()'s that we have to catch actual programming mistakes to the new ASSERT() and then fix everybody else to return errors. This will also allow developers to leave sanity checks in their new code to make sure we don't trip over problems while testing stuff and vetting new features. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:32 -04:00
Josef Bacik	726551ebc7	Btrfs: adjust the fs_devices->missing count on unmount I noticed that if I tried to mount a file system with -o degraded after having done it once already we would fail to mount. This is because the fs_devices->missing count was getting bumped everytime we mounted, but not getting reset whenever we unmounted. To fix this we just drop the missing count as we're closing devices to make sure this doesn't happen. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:31 -04:00
Stefan Behrens	23fa76b0ba	Btrf: cleanup: don't check for root_refs == 0 twice btrfs_read_fs_root_no_name() already checks if btrfs_root_refs() is zero and returns ENOENT in this case. There is no need to do it again in three more places. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:29 -04:00
Stefan Behrens	4847547172	Btrfs: fix for patch "cleanup: don't check the same thing twice" Mitch Harder noticed that the patch `3c64a1a` mentioned in the subject line was causing a kernel BUG() on snapshot deletion. The patch was wrong. It did not handle cached roots correctly. The check for root_refs == 0 was removed everywhere where btrfs_read_fs_root_no_name() had been used to retrieve the root, because this check was already dealt with in btrfs_read_fs_root_no_name(). But in the case when the root was found in the cache, there was no such check. This patch adds the missing check in the case where the root is found in the cache. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:29 -04:00
Stefan Behrens	9d565ba433	Btrfs: get rid of one BUG() in write_all_supers() The second round uses btrfs_error() and return -EIO, the first round can handle write errors the same way. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:28 -04:00
Wang Shilong	b9e9a6cbc6	Btrfs: allocate prelim_ref with a slab allocater struct __prelim_ref is allocated and freed frequently when walking backref tree, using slab allocater can not only speed up allocating but also detect memory leaks. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:27 -04:00
Wang Shilong	742916b885	Btrfs: pass gfp_t to __add_prelim_ref() to avoid always using GFP_ATOMIC Currently, only add_delayed_refs have to allocate with GFP_ATOMIC, So just pass arg 'gfp_t' to decide which allocation mode. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:26 -04:00
Filipe David Borba Manana	f717175024	Btrfs: fix race conditions in BTRFS_IOC_FS_INFO ioctl The handler for the ioctl BTRFS_IOC_FS_INFO was reading the number of devices before acquiring the device list mutex. This could lead to inconsistent results because the update of the device list and the number of devices counter (amongst other counters related to the device list) are updated in volumes.c while holding the device list mutex - except for 2 places, one was volumes.c:btrfs_prepare_sprout() and the other was volumes.c:device_list_add(). For example, if we have 2 devices, with IDs 1 and 2 and then add a new device, with ID 3, and while adding the device is in progress an BTRFS_IOC_FS_INFO ioctl arrives, it could return a number of devices of 2 and a max dev id of 3. This would be incorrect. Also, this ioctl handler was reading the fsid while it can be updated concurrently. This can happen when while a new device is being added and the current filesystem is in seeding mode. Example: $ mkfs.btrfs -f /dev/sdb1 $ mkfs.btrfs -f /dev/sdb2 $ btrfstune -S 1 /dev/sdb1 $ mount /dev/sdb1 /mnt/test $ btrfs device add /dev/sdb2 /mnt/test If during the last step a BTRFS_IOC_FS_INFO ioctl was requested, it could read an fsid that was never valid (some bits part of the old fsid and others part of the new fsid). Also, it could read a number of devices that doesn't match the number of devices in the list and the max device id, as explained before. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:25 -04:00
Filipe David Borba Manana	d730680184	Btrfs: fix race between removing a dev and writing sbs This change fixes an issue when removing a device and writing all super blocks run simultaneously. Here's the steps necessary for the issue to happen: 1) disk-io.c:write_all_supers() gets a number of N devices from the super_copy, so it will not panic if it fails to write super blocks for N - 1 devices; 2) Then it tries to acquire the device_list_mutex, but blocks because volumes.c:btrfs_rm_device() got it first; 3) btrfs_rm_device() removes the device from the list, then unlocks the mutex and after the unlock it updates the number of devices in super_copy to N - 1. 4) write_all_supers() finally acquires the mutex, iterates over all the devices in the list and gets N - 1 errors, that is, it failed to write super blocks to all the devices; 5) Because write_all_supers() thinks there are a total of N devices, it considers N - 1 errors to be ok, and therefore won't panic. So this change just makes sure that write_all_supers() reads the number of devices from super_copy after it acquires the device_list_mutex. Conversely, it changes btrfs_rm_device() to update the number of devices in super_copy before it releases the device list mutex. The code path to add a new device (volumes.c:btrfs_init_new_device), already has the right behaviour: it updates the number of devices in super_copy while holding the device_list_mutex. The only code path that doesn't lock the device list mutex before updating the number of devices in the super copy is disk-io.c:next_root_backup(), called by open_ctree() during mount time where concurrency issues can't happen. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:24 -04:00
Josef Bacik	b8d0c69b94	Btrfs: remove ourselves from the cluster list under lock A user was reporting weird warnings from btrfs_put_delayed_ref() and I noticed that we were doing this list_del_init() on our head ref outside of delayed_refs->lock. This is a problem if we have people still on the list, we could end up modifying old pointers and such. Fix this by removing us from the list before we do our run_delayed_ref on our head ref. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:23 -04:00
Josef Bacik	e8e7cff667	Btrfs: do not clear our orphan item runtime flag on eexist We were unconditionally clearing our runtime flag on the inode on error when trying to insert an orphan item. This is wrong in the case of -EEXIST since we obviously have an orphan item. This was causing us to not do the correct cleanup of our orphan items which caused issues on cleanup. This happens because currently when truncate fails we just leave the orphan item on there so it can be cleaned up, so if we go to remove the file later we will hit this issue. What we do for truncate isn't right either, but we shouldn't screw this sort of thing up on error either, so fix this and then I'll fix truncate in a different patch. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:22 -04:00
Josef Bacik	57cfd46270	Btrfs: fix send to deal with sparse files properly Send was just sending everything it found, even if the extent was a hole. This is unpleasant for users, so just skip holes when we are sending. This will also skip sending prealloc extents since the send spec doesn't have a prealloc command. Eventually we will add a prealloc command and rev the send version so we can send down the prealloc info. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:21 -04:00
Filipe David Borba Manana	bdab49d760	Btrfs: fix printing of non NULL terminated string The name buffer is not terminated by a '\0' character, therefore it needs to be printed with %.*s and use the length of the buffer. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:20 -04:00
Geert Uytterhoeven	8d78eb1663	Btrfs: Use %z to format size_t Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:19 -04:00
Geert Uytterhoeven	fce29364e5	Btrfs: Do not truncate sector_t on 32-bit with CONFIG_LBDAF=y sector_t may be either "u64" (always 64 bit) or "unsigned long" (32 or 64 bit). Casting it to "unsigned long" will truncate it on 32-bit platforms where CONFIG_LBDAF=y. Cast to "unsigned long long" and format using "ll" instead. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:18 -04:00
Geert Uytterhoeven	778746b53b	Btrfs: PAGE_CACHE_SIZE is already unsigned long PAGE_CACHE_SIZE == PAGE_SIZE is "unsigned long" everywhere, so there's no need to cast it to "unsigned long". Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:17 -04:00
Geert Uytterhoeven	b308bc2f05	Btrfs: Make btrfs_header_chunk_tree_uuid() return unsigned long Internally, btrfs_header_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:16 -04:00
Geert Uytterhoeven	fba6aa7565	Btrfs: Make btrfs_header_fsid() return unsigned long Internally, btrfs_header_fsid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:15 -04:00
Geert Uytterhoeven	231e88f410	Btrfs: Make btrfs_dev_extent_chunk_tree_uuid() return unsigned long Internally, btrfs_dev_extent_chunk_tree_uuid() calculates an unsigned long, but casts it to a pointer, while all callers cast it to unsigned long again. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:14 -04:00
Geert Uytterhoeven	1473b24ee0	Btrfs: Make btrfs_device_fsid() return unsigned long All callers of btrfs_device_fsid() cast its return type to unsigned long. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:13 -04:00
Geert Uytterhoeven	410ba3a291	Btrfs: Make btrfs_device_uuid() return unsigned long All callers of btrfs_device_uuid() cast its return type to unsigned long. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:12 -04:00
Geert Uytterhoeven	118a0a2514	Btrfs: Format mirror_num as int mirror_num is always "int", hence don't cast it to "unsigned long long" and format it as a 64-bit number. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:11 -04:00
Geert Uytterhoeven	27f9f02357	Btrfs: Format PAGE_SIZE as unsigned long PAGE_SIZE is "unsigned long" everywhere, so there's no need to cast it to "unsigned long long" and format it as a 64-bit number. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:10 -04:00
Geert Uytterhoeven	6e71c47afe	Btrfs: Make BTRFS_DEV_REPLACE_DEVID an unsigned long long constant The internal btrfs device id is a u64, hence make the constant BTRFS_DEV_REPLACE_DEVID "unsigned long long" as well, so we no longer need a cast to print it. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:09 -04:00
Geert Uytterhoeven	c1c9ff7c94	Btrfs: Remove superfluous casts from u64 to unsigned long long u64 is "unsigned long long" on all architectures now, so there's no need to cast it when formatting it using the "ll" length modifier. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:08 -04:00
Filipe David Borba Manana	1cb048f596	Btrfs: fix memory leak of orphan block rsv This issue is simple to reproduce and observe if kmemleak is enabled. Two simple ways to reproduce it: 1 $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt/btrfs $ btrfs balance start /mnt/btrfs $ umount /mnt/btrfs 2 $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt/btrfs $ touch /mnt/btrfs/foobar $ rm -f /mnt/btrfs/foobar $ umount /mnt/btrfs After a while, kmemleak reports the leak: $ cat /sys/kernel/debug/kmemleak unreferenced object 0xffff880402b13e00 (size 128): comm "btrfs", pid 19621, jiffies 4341648183 (age 70057.844s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de .............N.. backtrace: [<ffffffff817275a6>] kmemleak_alloc+0x26/0x50 [<ffffffff8117832b>] kmem_cache_alloc_trace+0xeb/0x1d0 [<ffffffffa04db499>] btrfs_alloc_block_rsv+0x39/0x70 [btrfs] [<ffffffffa04f8bad>] btrfs_orphan_add+0x13d/0x1b0 [btrfs] [<ffffffffa04e2b13>] btrfs_remove_block_group+0x143/0x500 [btrfs] [<ffffffffa0518158>] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs] [<ffffffffa051bc27>] btrfs_balance+0x8f7/0xe90 [btrfs] [<ffffffffa05240a0>] btrfs_ioctl_balance+0x250/0x550 [btrfs] [<ffffffffa05269ca>] btrfs_ioctl+0xdfa/0x25f0 [btrfs] [<ffffffff8119c936>] do_vfs_ioctl+0x96/0x570 [<ffffffff8119cea1>] SyS_ioctl+0x91/0xb0 [<ffffffff81750242>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff This affects btrfs-next, revision be8e3cd00d7293dd177e3f8a4a1645ce09ca3acb (Btrfs: separate out tests into their own directory). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:07 -04:00
Ilya Dryomov	a1e8780a89	Btrfs: rollback btrfs_device fields on umount It turns out we don't properly rollback in-core btrfs_device state on umount. We zero out ->bdev, ->in_fs_metadata and that's about it. In particular, we don't zero out ->generation, and this can lead to us refusing a mount -- a non-NULL fs_devices->latest_bdev is essential, but btrfs_close_extra_devices will happily assign NULL to ->latest_bdev if the first device on the dev_list happens to be missing and consequently has no bdev attached. This happens because since commit `a6b0d5c8` btrfs_close_extra_devices adjusts ->latest_bdev, and in doing that, relies on the ->generation. Fix this, and possibly other problems, by zeroing out everything except for what device_list_add sets, so that a mount right after insmod and 'btrfs dev scan' is no different from any later mount in this respect. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:06 -04:00
Ilya Dryomov	2208a378f3	Btrfs: add alloc_fs_devices and switch to it In the spirit of btrfs_alloc_device, add a helper for allocating and doing some common initialization of btrfs_fs_devices struct. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:05 -04:00
Ilya Dryomov	12bd2fc0d2	Btrfs: add btrfs_alloc_device and switch to it Currently btrfs_device is allocated ad-hoc in a few different places, and as a result not all fields are initialized properly. In particular, readahead state is only initialized in device_list_add (at scan time), and not in btrfs_init_new_device (when the new device is added with 'btrfs dev add'). Fix this by adding an allocation helper and switch everybody but __btrfs_close_devices to it. (__btrfs_close_devices is dealt with in a later commit.) Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:04 -04:00
Ilya Dryomov	53f10659f9	Btrfs: find_next_devid: root -> fs_info find_next_devid() knows which root to search, so it should take an fs_info instead of an arbitrary root. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:03 -04:00
Stefan Behrens	bbb651e469	Btrfs: don't allow the replace procedure on read only filesystems If you start the replace procedure on a read only filesystem, at the end the procedure fails to write the updated dev_items to the chunk tree. The problem is that this error is not indicated except for a WARN_ON(). If the user now thinks that everything was done as expected and destroys the source device (with mkfs or with a hammer). The next mount fails with "failed to read chunk root" and the filesystem is gone. This commit adds code to fail the attempt to start the replace procedure if the filesystem is mounted read-only. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Cc: <stable@vger.kernel.org> # 3.10+ Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:02 -04:00
Filipe David Borba Manana	633085c79c	Btrfs: reset force_compress on btrfs_file_defrag failure After we set force_compress with a new value (which was not being done while holding the inode mutex), if an error happens and we jump to the label out_ra, the force_compress property of the inode is not set to BTRFS_COMPRESS_NONE (unlike in the case where no errors happen). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:16:00 -04:00
Stefan Behrens	f420ee1e92	Btrfs: add mount option to force UUID tree checking This should never be needed, but since all functions are there to check and rebuild the UUID tree, a mount option is added that allows to force this check and rebuild procedure. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:59 -04:00
Stefan Behrens	70f8017547	Btrfs: check UUID tree during mount if required If the filesystem was mounted with an old kernel that was not aware of the UUID tree, this is detected by looking at the uuid_tree_generation field of the superblock (similar to how the free space cache is doing it). If a mismatch is detected at mount time, a thread is started that does two things: 1. Iterate through the UUID tree, check each entry, delete those entries that are not valid anymore (i.e., the subvol does not exist anymore or the value changed). 2. Iterate through the root tree, for each found subvolume, add the UUID tree entries for the subvolume (if they are not already there). This mechanism is also used to handle and repair errors that happened during the initial creation and filling of the tree. The update of the uuid_tree_generation field (which indicates that the state of the UUID tree is up to date) is blocked until all create and repair operations are successfully completed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:58 -04:00
Stefan Behrens	26432799c9	Btrfs: introduce uuid-tree-gen field In order to be able to detect the case that a filesystem is mounted with an old kernel, add a uuid-tree-gen field like the free space cache is doing it. It is part of the super block and written with each commit. Old kernels do not know this field and don't update it. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:57 -04:00
Stefan Behrens	803b2f54fb	Btrfs: fill UUID tree initially When the UUID tree is initially created, a task is spawned that walks through the root tree. For each found subvolume root_item, the uuid and received_uuid entries in the UUID tree are added. This is such a quick operation so that in case somebody wants to unmount the filesystem while the task is still running, the unmount is delayed until the UUID tree building task is finished. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:56 -04:00
Stefan Behrens	dd5f9615fc	Btrfs: maintain subvolume items in the UUID tree When a new subvolume or snapshot is created, a new UUID item is added to the UUID tree. Such items are removed when the subvolume is deleted. The ioctl to set the received subvolume UUID is also touched and will now also add this received UUID into the UUID tree together with the ID of the subvolume. The latter is also done when read-only snapshots are created which inherit all the send/receive information from the parent subvolume. User mode programs use the BTRFS_IOC_TREE_SEARCH ioctl to search and read in the UUID tree. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:55 -04:00
Stefan Behrens	f7a81ea4cc	Btrfs: create UUID tree if required This tree is not created by mkfs.btrfs. Therefore when a filesystem is mounted writable and the UUID tree does not exist, this tree is created if required. The tree is also added to the fs_info structure and initialized, but this commit does not yet read or write UUID tree elements. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:54 -04:00
Stefan Behrens	8f8ae8e213	Btrfs: support printing UUID tree elements This commit adds support to print UUID tree elements to print-tree.c. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:53 -04:00
Stefan Behrens	07b30a49da	Btrfs: introduce a tree for items that map UUIDs to something Mapping UUIDs to subvolume IDs is an operation with a high effort today. Today, the algorithm even has quadratic effort (based on the number of existing subvolumes), which means, that it takes minutes to send/receive a single subvolume if 10,000 subvolumes exist. But even linear effort would be too much since it is a waste. And these data structures to allow mapping UUIDs to subvolume IDs are created every time a btrfs send/receive instance is started. It is much more efficient to maintain a searchable persistent data structure in the filesystem, one that is updated whenever a subvolume/snapshot is created and deleted, and when the received subvolume UUID is set by the btrfs-receive tool. Therefore kernel code is added with this commit that is able to maintain data structures in the filesystem that allow to quickly search for a given UUID and to retrieve data that is assigned to this UUID, like which subvolume ID is related to this UUID. This commit adds a new tree to hold UUID-to-data mapping items. The key of the items is the full UUID plus the key type BTRFS_UUID_KEY. Multiple data blocks can be stored for a given UUID, a type/length/ value scheme is used. Now follows the lengthy justification, why a new tree was added instead of using the existing root tree: The first approach was to not create another tree that holds UUID items. Instead, the items should just go into the top root tree. Unfortunately this confused the algorithm to assign the objectid of subvolumes and snapshots. The reason is that btrfs_find_free_objectid() calls btrfs_find_highest_objectid() for the first created subvol or snapshot after mounting a filesystem, and this function simply searches for the largest used objectid in the root tree keys to pick the next objectid to assign. Of course, the UUID keys have always been the ones with the highest offset value, and the next assigned subvol ID was wastefully huge. To use any other existing tree did not look proper. To apply a workaround such as setting the objectid to zero in the UUID item key and to implement collision handling would either add limitations (in case of a btrfs_extend_item() approach to handle the collisions) or a lot of complexity and source code (in case a key would be looked up that is free of collisions). Adding new code that introduces limitations is not good, and adding code that is complex and lengthy for no good reason is also not good. That's the justification why a completely new tree was introduced. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:52 -04:00
Sergei Trofimovich	171170c1c5	btrfs: mark some local function as 'static' Cc: Josef Bacik <jbacik@fusionio.com> Cc: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:51 -04:00
Stefan Behrens	35a3621beb	Btrfs: get rid of sparse warnings make C=2 fs/btrfs/ CF=-D__CHECK_ENDIAN__ I tried to filter out the warnings for which patches have already been sent to the mailing list, pending for inclusion in btrfs-next. All these changes should be obviously safe. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:50 -04:00
Filipe David Borba Manana	18674c6cc1	Btrfs: don't miss inode ref items in BTRFS_IOC_INO_LOOKUP If the inode ref key was not found and the current leaf slot was 0 (first item in the leaf) the code would always return -ENOENT. This was not correct because the desired inode ref item might be the last item in the previous leaf. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:49 -04:00
Filipe David Borba Manana	a696cf3529	Btrfs: add missing error code to BTRFS_IOC_INO_LOOKUP handler If the path doesn't fit in the input buffer, return ENAMETOOLONG instead of returning with a success code (0) and a partially filled and right justified buffer. Also removed useless buffer pointer check outside the while loop. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:48 -04:00
Wang Shilong	b006b2e4f9	Btrfs: remove reduplicate check when disabling quota We have checked 'quota_root' with qgroup_ioctl_lock held before,So here the check is reduplicate, remove it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:47 -04:00
Wang Shilong	e685da14af	Btrfs: move btrfs_free_qgroup_config() out of spin_lock and fix comments btrfs_free_qgroup_config() is not only called by open/close_ctree(),but also btrfs_disable_quota().And for btrfs_disable_quota(),we have set 'quota_root' to be null before calling btrfs_free_qgroup_config(),so it is safe to cleanup in-memory structures without lock held. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:46 -04:00
Wang Shilong	4082bd3d73	Btrfs: fix oops when writing dirty qgroups to disk When disabling quota, we should clear out list 'dirty_qgroups',otherwise, we will get oops if enabling quota again. Fix this by abstracting similar code from del_qgroup_rb(). Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:45 -04:00
Josef Bacik	ba5e8f2e2d	Btrfs: fix send issues related to inode number reuse If you are sending a snapshot and specifying a parent snapshot we will walk the trees and figure out where they differ and send the differences only. The way we check for differences are if the leaves aren't the same and if the keys are not the same within the leaves. So if neither leaf is the same (ie the leaf has been cow'ed from the parent snapshot) we walk each item in the send root and check it against the parent root. If the items match exactly then we don't do anything. This doesn't quite work for inode refs, since they will just have the name and the parent objectid. If you move the file from a directory and then remove that directory and re-create a directory with the same inode number as the old directory and then move that file back into that directory we will assume that nothing changed and you will get errors when you try to receive. In order to fix this we need to do extra checking to see if the inode ref really is the same or not. So do this by passing down BTRFS_COMPARE_TREE_SAME if the items match. Then if the key type is an inode ref we can do some extra checking, otherwise we just keep processing. The extra checking is to look up the generation of the directory in the parent volume and compare it to the generation of the send volume. If they match then they are the same directory and we are good to go. If they don't we have to add them to the changed refs list. This means we have to track the generation of the ref we're trying to lookup when we iterate all the refs for a particular inode. So in the case of looking for new refs we have to get the generation from the parent volume, and in the case of looking for deleted refs we have to get the generation from the send volume to compare with. There was also the issue of using a ulist to keep track of the directories we needed to check. Because we can get a deleted ref and a new ref for the same inode number the ulist won't work since it indexes based on the value. So instead just dup any directory ref we find and add it to a local list, and then process that list as normal and do away with using a ulist for this altogether. Before we would fail all of the tests in the far-progs that related to moving directories (test group 32). With this patch we now pass these tests, and all of the tests in the far-progs send testing suite. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:44 -04:00
Josef Bacik	dc11dd5d70	Btrfs: separate out tests into their own directory The plan is to have a bunch of unit tests that run when btrfs is loaded when you build with the appropriate config option. My ultimate goal is to have a test for every non-static function we have, but at first I'm going to focus on the things that cause us the most problems. To start out with this just adds a tests/ directory and moves the existing free space cache tests into that directory and sets up all of the infrastructure. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:15:38 -04:00
Josef Bacik	00361589d2	Btrfs: avoid starting a transaction in the write path I noticed while looking at a deadlock that we are always starting a transaction in cow_file_range(). This isn't really needed since we only need a transaction if we are doing an inline extent, or if the allocator needs to allocate a chunk. So push down all the transaction start stuff to be closer to where we actually need a transaction in all of these cases. This will hopefully reduce our write latency when we are committing often. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:05 -04:00
Josef Bacik	9ffba8cda9	Btrfs: fix heavy delalloc related deadlock I added a patch where we started taking the ordered operations mutex when we waited on ordered extents. We need this because we splice the list and process it, so if a flusher came in during this scenario it would think the list was empty and we'd usually get an early ENOSPC. The problem with this is that this lock is used in transaction committing. So we end up with something like this Transaction commit -> wait on writers Delalloc flusher -> run_ordered_operations (holds mutex) ->wait for filemap-flush to do its thing flush task -> cow_file_range ->wait on btrfs_join_transaction because we're commiting some other task -> commit_transaction because we notice trans->transaction->flush is set -> run_ordered_operations (hang on mutex) We need to disentangle the ordered operations flushing from the delalloc flushing, since they are separate things. This solves the deadlock issue I was seeing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:04 -04:00
Josef Bacik	4ef31a45a0	Btrfs: fix the error handling wrt orphan items There are several places where we BUG_ON() if we fail to remove the orphan items and such, which is not ok, so remove those and either abort or just carry on. This also fixes a problem where if we couldn't start a transaction we wouldn't actually remove the orphan item reserve for the inode. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:03 -04:00
Josef Bacik	175a2b871f	Btrfs: don't allow a subvol to be deleted if it is the default subovl Eric pointed out that btrfs will happily allow you to delete the default subvol. This is a problem obviously since the next time you go to mount the file system it will freak out because it can't find the root. Fix this by adding a check to see if our default subvol points to the subvol we are trying to delete, and if it does not allowing it to happen. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:02 -04:00
Josef Bacik	a05254143c	Btrfs: skip subvol entries when checking if we've created a dir already We have logic to see if we've already created a parent directory by check to see if an inode inside of that directory has a lower inode number than the one we are currently processing. The logic is that if there is a lower inode number then we would have had to made sure the directory was created at that previous point. The problem is that subvols inode numbers count from the lowest objectid in the root tree, which may be less than our current progress. So just skip if our dir item key is a root item. This fixes the original test and the xfstest version I made that added an extra subvol create. Thanks, Reported-by: Emil Karlson <jekarlson@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:01 -04:00
Mark Fasheh	416161db9b	btrfs: offline dedupe This patch adds an ioctl, BTRFS_IOC_FILE_EXTENT_SAME which will try to de-duplicate a list of extents across a range of files. Internally, the ioctl re-uses code from the clone ioctl. This avoids rewriting a large chunk of extent handling code. Userspace passes in an array of file, offset pairs along with a length argument. The ioctl will then (for each dedupe) do a byte-by-byte comparison of the user data before deduping the extent. Status and number of bytes deduped are returned for each operation. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:05:00 -04:00
Mark Fasheh	4b384318a7	btrfs: Introduce extent_read_full_page_nolock() We want this for btrfs_extent_same. Basically readpage and friends do their own extent locking but for the purposes of dedupe, we want to have both files locked down across a set of readpage operations (so that we can compare data). Introduce this variant and a flag which can be set for extent_read_full_page() to indicate that we are already locked. Partial credit for this patch goes to Gabriel de Perthuis <g2p.code@gmail.com> as I have included a fix from him to the original patch which avoids a deadlock on compressed extents. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:59 -04:00
Mark Fasheh	32b7c687c5	btrfs_ioctl_clone: Move clone code into it's own function There's some 250+ lines here that are easily encapsulated into their own function. I don't change how anything works here, just create and document the new btrfs_clone() function from btrfs_ioctl_clone() code. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:58 -04:00
Mark Fasheh	77fe20dc62	btrfs: abtract out range locking in clone ioctl() The range locking in btrfs_ioctl_clone is trivially broken out into it's own function. This reduces the complexity of btrfs_ioctl_clone() by a small bit and makes that locking code available to future functions in fs/btrfs/ioctl.c Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:57 -04:00
Wang Shilong	a4fdb61e81	Btrfs: fix possible memory leak in find_parent_nodes() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:56 -04:00
Filipe David Borba Manana	09fb99a696	Btrfs: return ENOSPC when target space is full In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success) when the target space was full and when chunk allocation is needed. However, later on in that same function we return ENOSPC if btrfs_alloc_chunk() fails (and chunk allocation was needed) and set the space's full flag. This was inconsistent, as -ENOSPC should be returned if the space is full and a chunk allocation needs to performed. If the space is full but no chunk allocation is needed, just return 0 (success). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:55 -04:00
Filipe David Borba Manana	ada9af215c	Btrfs: don't ignore errors from btrfs_run_delayed_items tree-log.c was ignoring the return value from btrfs_run_delayed_items() in several places. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:54 -04:00
Filipe David Borba Manana	2bac325ea8	Btrfs: fix inode leak on kmalloc failure in tree-log.c In tree-log.c:replay_one_name(), if memory allocation for the name fails, ensure we iput the dir inode we got before before we return. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:53 -04:00
Liu Bo	116e0024c4	Btrfs: allow compressed extents to be merged during defragment The rule originally comes from nocow writing, but snapshot-aware defrag is a different case, the extent has been writen and we're not going to change the extent but add a reference on the data. So we're able to allow such compressed extents to be merged into one bigger extent if they're pointing to the same data. Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:52 -04:00
David Sterba	8b87dc17fb	btrfs: add mount option to set commit interval I'ts hardcoded to 30 seconds which is fine for most users. Higher values defer data being synced to permanent storage with obvious consequences when the system crashes. The upper bound is not forced, but a warning is printed if it's more than 300 seconds (5 minutes). Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:51 -04:00
Josef Bacik	9ec7267751	Btrfs: stop using GFP_ATOMIC when allocating rewind ebs There is no reason we can't just set the path to blocking and then do normal GFP_NOFS allocations for these extent buffers. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:50 -04:00
Josef Bacik	db7f3436c1	Btrfs: deal with enomem in the rewind path We can get ENOMEM trying to allocate dummy bufs for the rewind operation of the tree mod log. Instead of BUG_ON()'ing in this case pass up ENOMEM. I looked back through the callers and I'm pretty sure I got everybody who did BUG_ON(ret) in this path. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:49 -04:00
Josef Bacik	ebdad913aa	Btrfs: check our parent dir when doing a compare send When doing a send with a parent subvol we will check to see if the file we are acting on is being overwritten and move it if we think it may be needed further down the line during the send. We check this by checking its directory and making sure it existed in the parent and making sure the file existed in the parent. The problem with this check is that if we create a directory and a file in that directory, and then snapshot, and then remove and re-create that same directory and file with different inode numbers and then try to snapshot and send with the original parent we will try and save the original file inside of that directory. This is a problem because during the receive we move the directory out of the way because it is a completely new inode, which makes us unable to find the old file inside of the directory when we try to move that out of the way for the overwrite. We fix this by checking the parent directory of the inode we think we are overwriting. If the parent directory generation in the send root != the parent directory generation in the parent root then we know it is a completely new directory and we need not bother with moving the file out of the way because it would have been completely destroyed. This fixes bz 60673. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:48 -04:00
Josef Bacik	36cce92287	Btrfs: handle errors when doing slow caching Alex Lyakas reported a bug where wait_block_group_cache_progress() would wait forever if a drive failed. This is because we just bail out if there is an error while trying to cache a block group, we don't update anybody who may be waiting. So this introduces a new enum for the cache state in case of error and makes everybody bail out if we have an error. Alex tested and verified this patch fixed his problem. This fixes bz 59431. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:47 -04:00
Filipe David Borba Manana	0f0fe8f710	Btrfs: add missing error handling to read_tree_block Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:46 -04:00
Dave Jones	eb2067f713	Fix leak in __btrfs_map_block error path If we bail out when the stripe alloc fails, we need to undo the earlier allocation of raid_map. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:45 -04:00
Filipe David Borba Manana	f5929cd814	Btrfs: add missing error check to find_parent_nodes Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:44 -04:00
Filipe David Borba Manana	395927a9d8	Btrfs: optimize function btrfs_read_chunk_tree After reading all device items from the chunk tree, don't exit the loop and then navigate down the tree again to find the chunk items. Instead just read all device items and chunk items with a single tree search. This is possible because all device items are found before any chunk item in the chunks tree. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:43 -04:00
Josef Bacik	6596a92819	Btrfs: don't bug_on when we fail when cleaning up transactions There is no reason for this sort of jackassery. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:42 -04:00
Josef Bacik	b6c60c8018	Btrfs: change how we queue blocks for backref checking Previously we only added blocks to the list to have their backrefs checked if the level of the block is right above the one we are searching for. This is because we want to make sure we don't add the entire path up to the root to the lists to make sure we process things one at a time. This assumes that if any blocks in the path to the root are going to be not checked (shared in other words) then they will be in the level right above the current block on up. This isn't quite right though since we can have blocks higher up the list that are shared because they are attached to a reloc root. But we won't add this block to be checked and then later on we will BUG_ON(!upper->checked). So instead keep track of wether or not we've queued a block to be checked in this current search, and if we haven't go ahead and queue it to be checked. This patch fixed the panic I was seeing where we BUG_ON(!upper->checked). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:41 -04:00
Josef Bacik	d062d13cf1	Btrfs: check to see if we have an inline item properly If our item isn't big enough to have an actual inline item when we have skinny metadata enabled just return 1 in find_inline_backref so we can move on to the next item. This probably wasn't causing a problem since we check the values of ptr and end properly, but just in case this will keep us from doing extra work. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:40 -04:00
Josef Bacik	151a41bc46	Btrfs: fix what bits we clear when erroring out from delalloc First of all we no longer set EXTENT_DIRTY when we dirty an extent so this patch removes the clearing of EXTENT_DIRTY since it confuses me. This patch also adds clearing EXTENT_DEFRAG and also doing EXTENT_DO_ACCOUNTING when we have errors. This is because if we are clearing delalloc without adding an ordered extent then we need to make sure the enospc handling stuff is accounted for. Also if this range was DEFRAG we need to make sure that bit is cleared so we dont leak it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:39 -04:00
Josef Bacik	c2790a2e2b	Btrfs: cleanup arguments to extent_clear_unlock_delalloc This patch removes the io_tree argument for extent_clear_unlock_delalloc since we always use &BTRFS_I(inode)->io_tree, and it separates out the extent tree operations from the page operations. This way we just pass in the extent bits we want to clear and then pass in the operations we want done to the pages. This is because I'm going to fix what extent bits we clear in some cases and rather than add a bunch of new flags we'll just use the actual extent bits we want to clear. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:38 -04:00
Anand Jain	8068a47e2a	btrfs: use BTRFS_SUPER_INFO_SIZE macro at btrfs_read_dev_super() Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:37 -04:00
Miao Xie	125bac016d	Btrfs: cache the extent map struct when reading several pages When we read several pages at once, we needn't get the extent map object every time we deal with a page, and we can cache the extent map object. So, we can reduce the search time of the extent map, and besides that, we also can reduce the lock contention of the extent map tree. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:36 -04:00
Miao Xie	9974090bdd	Btrfs: batch the extent state operation when reading pages In the past, we cached the checksum value in the extent state object, so we had to split the extent state object by the block size, or we had no space to keep this checksum value. But it increased the lock contention of the extent state tree. Now we removed this limit by caching the checksum into the bio object, so it is unnecessary to do the extent state operations by the block size, we can do it in batches, in this way, we can reduce the lock contention of the extent state tree. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:35 -04:00
Miao Xie	883d0de485	Btrfs: batch the extent state operation in the end io handle of the read page Before applying this patch, we set the uptodate flag and unlock the extent by the page size, it is unnecessary, we can do it in batches, it can reduce the lock contention of the extent state tree. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:34 -04:00
Miao Xie	facc8a2247	Btrfs: don't cache the csum value into the extent state tree Before applying this patch, we cached the csum value into the extent state tree when reading some data from the disk, this operation increased the lock contention of the state tree. Now, we just store the csum value into the bio structure or other unshared structure, so we can reduce the lock contention. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:33 -04:00
Miao Xie	f2a09da9d0	Btrfs: add branch prediction hints in the read page end IO function This patch add some branch prediction hints into the end IO function of the read page, it reduced the percentage of the branch misses from 5.5% to 4.9%. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:32 -04:00
Miao Xie	09a7f7a289	Btrfs: remove unnecessary argument of bio_readpage_error() Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:31 -04:00
Wang Shilong	8507d216a4	Btrfs: add missing mounting options in btrfs_show_options() Some options are missing in btrfs_show_options(), this patch adds them. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:30 -04:00
Wang Shilong	1493381f2f	Btrfs: use u64 for subvolid when parsing mount options Although for most time, int is enough for subvolid, we should ensure safety in theory. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:29 -04:00
Wang Shilong	2c334e87f3	Btrfs: add sanity checks regarding to parsing mount options I just notice the following commands succeed: mount <dev> <mnt> -o thread_pool=-1 This is ridiculous, only positive thread_pool makes sense,this patch adds sanity checks for them, and also catches the error of ENOMEM if allocating memory fails. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:28 -04:00
Miao Xie	3cd846d1d7	Btrfs, raid56: fix memory leak when allocating pages for p/q stripes failed Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:27 -04:00
Dan Carpenter	3dc0e818af	btrfs/raid56: fix and cleanup some error paths The alloc_rbio() frees "raid_map" and "bbio" on error, so there is a potential double free bug in raid56_parity_write(). The raid56_parity_write() and raid56_parity_recover() functions should still free "raid_map" and "bbio" on error if other errors occur though, so I have added some more calls to kfree(). Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:26 -04:00
Josef Bacik	2112ac800d	Btrfs: don't bother autodefragging if our root is going away We can end up with inodes on the auto defrag list that exist on roots that are going to be deleted. This is extra work we don't need to do, so just bail if our root has 0 root refs. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:25 -04:00
Josef Bacik	b37b39cd6b	Btrfs: cleanup reloc roots properly on error I was hitting the BUG_ON() at the end of merge_reloc_roots() because we were aborting the transaction at some point previously and then getting an error when we tried to drop the reloc root. I fixed btrfs_drop_snapshot to re-add us to the dead roots list if we failed, but this isn't the right thing to do for reloc roots since it uses root->root_list for it's own stuff in order to know what needs to be cleaned up. So fix btrfs_drop_snapshot to only do the re-add if we aren't dropping for reloc, and handle errors from merge_reloc_root() by dropping the reloc root we are processing since it won't be on the list of roots to cleanup. With this patch my reproducer no longer panics. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:24 -04:00
Josef Bacik	50f1319cb5	Btrfs: reset ret in record_one_backref I was getting warnings when running find ./ -type f -exec btrfs fi defrag -f {} \; from record_one_backref because ret was set. Turns out it was because it was set to 1 because the search slot didn't come out exact and we never reset it. So reset it to 0 right after the search so we don't leak this and get uneccessary warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:23 -04:00
Anand Jain	a1b83ac52d	btrfs: fix get set label blocking against balance btrfs_ioctl_get_fslabel() and btrfs_ioctl_set_fslabel() used root->fs_info->volume_mutex mutex which caused operations like balance to block set/get label operation until its completion and generally balance operation takes a long time to complete, so it will be annoying to the user when cli appears hung also this patch will add a bit of optimization within the btrfs_ioctl_get_falabel() function. v1->v2: use fs_info->super_lock instead of uuid_mutex Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 08:04:15 -04:00
Stefan Behrens	d4c34f6bff	Btrfs: Print key type in decimal everywhere This is confusing, sometimes the key type is printed in hex (without a leading "0x" which makes things even more complicated), sometimes in decimal... Change it to be in decimal everywhere. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:40 -04:00
Liu Bo	599c75ec3f	Btrfs/tracepoint: update delayed ref tracepoints This shows exactly how btrfs processes the delayed refs onto disks, which is very helpful on understanding delayed ref mechanism and debugging related bugs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:39 -04:00
chandan	1095cc0d92	btrfs_read_block_groups: Use enums to index btrfs_space_info->block_groups. The current code uses integer literals to index btrfs_space_info->block_groups[] array. Instead use corresponding enums from 'enum btrfs_raid_types'. Signed-off-by: chandan <chandan@linux.vnet.ibm.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:38 -04:00
Qu Wenruo	3cae210fa5	btrfs: Cleanup for using BTRFS_SETGET_STACK instead of raw convert Some codes still use the cpu_to_lexx instead of the BTRFS_SETGET_STACK_FUNCS declared in ctree.h. Also added some BTRFS_SETGET_STACK_FUNCS for btrfs_header btrfs_timespec and other structures. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:37 -04:00
Wang Shilong	1e7bac1ef7	Btrfs: set qgroup_ulist to be null after calling ulist_free() We call ulist_free(qgroup_ulist) in btrfs_free_qgroup_config(), and btrfs_free_qgroup_config() may be called in two cases: (1)umount filesystem (2)disabling quota However, if we firstly disable quota and then umount filesystem, a double free happens. Fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:36 -04:00
Filipe David Borba Manana	647f63bd36	Btrfs: add missing error checks to add_data_references The function relocation.c:add_data_references() was not checking if all calls to __add_tree_block() and find_data_references() were succeeding or not. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:35 -04:00
David Sterba	ccf39f92f3	btrfs: make errors in btrfs_num_copies less noisy The log message level 'critical' is verbose enough, 'emergency' beeps on all terminals. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:34 -04:00
Liu Bo	52ee28d249	Btrfs: make free space caching faster with many non-inline extent references So to cache free space, we iterate every extent item to gather free space info. When we have say 10,000 non-inline extent refs(such as BTRFS_EXTENT_DATA_REF), it takes quite a long time, and since inline extent refs and non-inline ones have same objectid in their keys, we can just re-search the tree with the next address to skip non-inline references. (This is found by dedup feature because dedup extents can end up with many non-inline extent refs.) Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:24 -04:00
Jeff Mahoney	ee3441b490	btrfs: fall back to global reservation when removing subvolumes I recently did some ENOSPC testing that involved filling the disk while create and removing snapshots in a loop. During the test cycle, I ran into an ENOSPC when trying to remove a snapshot, leaving the fs stuck in ENOSPC even after a umount/mount cycle. This patch allow subvolume removal to fall back onto the global block reservation in order to succeed when it would have failed otherwise. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:23 -04:00
Filipe David Borba Manana	74be951087	Btrfs: optimize btrfs_lookup_extent_info() If we're looking for a metadata item in the tree and the search fails with return value of 1, and the slot doesn't point to the first item in the leaf, check if the previous item in the leaf corresponds to an extent item for the same object id - if it does, then don't do another tree search to get it. This optimization is already done by btrfs-progs. V2: updated commit message. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:22 -04:00
Carey Underwood	d790155457	Btrfs: Release uuid_mutex for shrink during device delete Device scanning waits on the uuid_mutex, which can result in a very long wait if dev delete is shrinking the device. Signed-off-by: Carey Underwood <cwillu@cwillu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:21 -04:00
Josef Bacik	b2aaaa3b8c	Btrfs: set lockdep class before locking new extent buffer We've been seeing spurious complaints out of lockdep because the lock class name changes. This is happening because when we drop a snapshot we will lock a block before we've read it in, which sets the lockdep class to whatever the default is. Then once we read the thing in we reset the lockdep class to what it is supposed to be, which blows lockdeps' mind. This patch should fix the problem, it appears to be the only place where we do this sort of thing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:20 -04:00
Stefan Agner	59516f6017	Btrfs: return -1 when lzo compression makes data bigger With this fix the lzo code behaves like the zlib code by returning an error code when compression does not help reduce the size of the file. This is currently not a bug since the compressed size is checked again in the calling method compress_file_range. Signed-off-by: Stefan Agner <stefan@agner.ch> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:19 -04:00
Josef Bacik	c8cc634165	Btrfs: stop using GFP_ATOMIC for the tree mod log allocations Previously we held the tree mod lock when adding stuff because we use it to check and see if we truly do want to track tree modifications. This is admirable, but GFP_ATOMIC in a critical area that is going to get hit pretty hard and often is not nice. So instead do our basic checks to see if we don't need to track modifications, and if those pass then do our allocation, and then when we go to insert the new modification check if we still care, and if we don't just free up our mod and return. Otherwise we're good to go and we can carry on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-09-01 07:57:17 -04:00
Joe Perches	8be04b9374	treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks Don't emit OOM warnings when k.alloc calls fail when there there is a v.alloc immediately afterwards. Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc. Signed-off-by: Joe Perches <joe@perches.com> Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-08-20 13:06:40 +02:00
Zach Brown	db62efbbf8	btrfs: don't loop on large offsets in readdir When btrfs readdir() hits the last entry it sets the readdir offset to a huge value to stop buggy apps from breaking when the same name is returned by readdir() with concurrent rename()s. But unconditionally setting the offset to INT_MAX causes readdir() to loop returning any entries with offsets past INT_MAX. It only takes a few hours of constant file creation and removal to create entries past INT_MAX. So let's set the huge offset to LLONG_MAX if the last entry has already overflowed 32bit loff_t. Without large offsets behaviour is identical. With large offsets 64bit apps will work and 32bit apps will be no more broken than they currently are if they see large offsets. Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:34:56 -04:00
Josef Bacik	cfad392b22	Btrfs: check to see if root_list is empty before adding it to dead roots A user reported a panic when running with autodefrag and deleting snapshots. This is because we could end up trying to add the root to the dead roots list twice. To fix this check to see if we are empty before adding ourselves to the dead roots list. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:30:23 -04:00
Josef Bacik	f3b15ccdbb	Btrfs: release both paths before logging dir/changed extents The ceph guys tripped over this bug where we were still holding onto the original path that we used to copy the inode with when logging. This is based on Chris's fix which was reported to fix the problem. We need to drop the paths in two cases anyway so just move the drop up so that we don't have duplicate code. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:30:16 -04:00
Josef Bacik	ee20a98314	Btrfs: allow splitting of hole em's when dropping extent cache I noticed while running multi-threaded fsync tests that sometimes fsck would complain about an improper gap. This happens because we fail to add a hole extent to the file, which was happening when we'd split a hole EM because btrfs_drop_extent_cache was just discarding the whole em instead of splitting it. So this patch fixes this by allowing us to split a hole em properly, which means that added holes actually get logged properly and we no longer see this fsck error. Thankfully we're tolerant of these sort of problems so a user would not see any adverse effects of this bug, other than fsck complaining. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:30:09 -04:00
Josef Bacik	ed8c4913da	Btrfs: make sure the backref walker catches all refs to our extent Because we don't mess with the offset into the extent for compressed we will properly find both extents for this case [extent a][extent b][rest of extent a] but because we already added a ref for the front half we won't add the inode information for the second half. This causes us to leak that memory and not print out the other offset when we do logical-resolve. So fix this by calling ulist_add_merge and then add our eie to the existing entry if there is one. With this patch we get both offsets out of logical-resolve. With this and the other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo set. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:30:03 -04:00
Josef Bacik	8ca15e05e6	Btrfs: fix backref walking when we hit a compressed extent If you do btrfs inspect-internal logical-resolve on a compressed extent that has been partly overwritten it won't find anything. This is because we try and match the extent offset we've searched for based on the extent offset in the data extent entry. However this doesn't work for compressed extents because the offsets are for the uncompressed size, not the compressed size. So instead only do this check if we are not compressed, that way we can get an actual entry for the physical offset rather than nothing for compressed. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:29:56 -04:00
Josef Bacik	b76bb70136	Btrfs: do not offset physical if we're compressed xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was offsetting the physical based on the extent offset. This is perfectly fine with uncompressed extents, however the extent offset is into the uncompressed area, not the compressed. So we can return a physical value that isn't at all within the area we have allocated on disk. Fix this by returning the start of the extent if it is compressed no matter what the offset. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:29:50 -04:00
Liu Bo	b5b9b5b318	Btrfs: fix extent buffer leak after backref walking commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks) takes an extra increment on the reference of allocated dummy extent buffer, so now we cannot free this dummy one, and end up with extent buffer leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:29:42 -04:00
Liu Bo	e68afa49ae	Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents For partial extents, snapshot-aware defrag does not work as expected, since a) we use the wrong logical offset to search for parents, which should be disk_bytenr + extent_offset, not just disk_bytenr, b) 'offset' returned by the backref walking just refers to key.offset, not the 'offset' stored in btrfs_extent_data_ref which is (key.offset - extent_offset). The reproducer: $ mkfs.btrfs sda $ mount sda /mnt $ btrfs sub create /mnt/sub $ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done $ btrfs sub snap /mnt/sub /mnt/snap1 $ btrfs sub snap /mnt/sub /mnt/snap2 $ sync; btrfs filesystem defrag /mnt/sub/foo; $ umount /mnt $ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared. This addresses the above two problems. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:29:17 -04:00
Jie Liu	7cddc19392	btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified Create a small file and fallocate it to a big size with FALLOC_FL_KEEP_SIZE option, then truncate it back to the small size again, the disk free space is not changed back in this case. i.e, total 4 -rw-r--r-- 1 root root 512 Jun 28 11:35 test Filesystem Size Used Avail Use% Mounted on .... /dev/sdb1 8.0G 56K 7.2G 1% /mnt -rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test Filesystem Size Used Avail Use% Mounted on .... /dev/sdb1 8.0G 5.1G 2.2G 70% /mnt Filesystem Size Used Avail Use% Mounted on .... /dev/sdb1 8.0G 5.1G 2.2G 70% /mnt With this fix, the truncated up space is back as: Filesystem Size Used Avail Use% Mounted on .... /dev/sdb1 8.0G 56K 7.2G 1% /mnt Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-08-09 19:28:56 -04:00
Stefan Behrens	115930cb2d	Btrfs: fix wrong write offset when replacing a device Miao Xie reported the following issue: The filesystem was corrupted after we did a device replace. Steps to reproduce: # mkfs.btrfs -f -m single -d raid10 <device0>..<device3> # mount <device0> <mnt> # btrfs replace start -rfB 1 <device4> <mnt> # umount <mnt> # btrfsck <device4> The reason for the issue is that we changed the write offset by mistake, introduced by commit `625f1c8dc`. We read the data from the source device at first, and then write the data into the corresponding place of the new device. In order to implement the "-r" option, the source location is remapped using btrfs_map_block(). The read takes place on the mapped location, and the write needs to take place on the unmapped location. Currently the write is using the mapped location, and this commit changes it back by undoing the change to the write address that the aforementioned commit added by mistake. Reported-by: Miao Xie <miaox@cn.fujitsu.com> Cc: <stable@vger.kernel.org> # 3.10+ Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-19 15:07:26 -04:00
Josef Bacik	d29a9f629e	Btrfs: re-add root to dead root list if we stop dropping it If we stop dropping a root for whatever reason we need to add it back to the dead root list so that we will re-start the dropping next transaction commit. The other case this happens is if we recover a drop because we will add a root without adding it to the fs radix tree, so we can leak it's root and commit root extent buffer, adding this to the dead root list makes this cleanup happen. Thanks, Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-19 15:07:19 -04:00
Josef Bacik	fec386ac14	Btrfs: fix lock leak when resuming snapshot deletion We aren't setting path->locks[level] when we resume a snapshot deletion which means we won't unlock the buffer when we free the path. This causes deadlocks if we happen to re-allocate the block before we've evicted the extent buffer from cache. Thanks, Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-19 15:07:11 -04:00
Josef Bacik	3c8f242257	Btrfs: update drop progress before stopping snapshot dropping Alex pointed out a problem and fix that exists in the drop one snapshot at a time patch. If we decide we need to exit for whatever reason (umount for example) we will just exit the snapshot dropping without updating the drop progress. So the next time we go to resume we will BUG_ON() because we can't find the extent we left off at because we never updated it. This patch fixes the problem. Cc: stable@vger.kernel.org Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-19 15:07:03 -04:00
Sachin Kamat	cd633972e1	Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO PTR_RET is now deprecated. Use PTR_ERR_OR_ZERO instead. Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2013-07-16 16:06:02 +09:30
Rusty Russell	8c6ffba0ed	PTR_RET is now PTR_ERR_OR_ZERO(): Replace most. Sweep of the simple cases. Cc: netdev@vger.kernel.org Cc: linuxppc-dev@lists.ozlabs.org Cc: linux-arm-kernel@lists.infradead.org Cc: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>	2013-07-15 11:25:01 +09:30
Linus Torvalds	e3a0dd98e1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "These are the usual mixture of bugs, cleanups and performance fixes. Miao has some really nice tuning of our crc code as well as our transaction commits. Josef is peeling off more and more problems related to early enospc, and has a number of important bug fixes in here too" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits) Btrfs: wait ordered range before doing direct io Btrfs: only do the tree_mod_log_free_eb if this is our last ref Btrfs: hold the tree mod lock in __tree_mod_log_rewind Btrfs: make backref walking code handle skinny metadata Btrfs: fix crash regarding to ulist_add_merge Btrfs: fix several potential problems in copy_nocow_pages_for_inode Btrfs: cleanup the code of copy_nocow_pages_for_inode() Btrfs: fix oops when recovering the file data by scrub function Btrfs: make the chunk allocator completely tree lockless Btrfs: cleanup orphaned root orphan item Btrfs: fix wrong mirror number tuning Btrfs: cleanup redundant code in btrfs_submit_direct() Btrfs: remove btrfs_sector_sum structure Btrfs: check if we can nocow if we don't have data space Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc Btrfs: use a percpu to keep track of possibly pinned bytes Btrfs: check for actual acls rather than just xattrs when caching no acl Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate Btrfs: optimize reada_for_balance Btrfs: optimize read_block_for_search ...	2013-07-09 12:33:09 -07:00
Linus Torvalds	80cc38b163	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree updates from Jiri Kosina: "The usual stuff from trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) treewide: relase -> release Documentation/cgroups/memory.txt: fix stat file documentation sysctl/net.txt: delete reference to obsolete 2.4.x kernel spinlock_api_smp.h: fix preprocessor comments treewide: Fix typo in printk doc: device tree: clarify stuff in usage-model.txt. open firmware: "/aliasas" -> "/aliases" md: bcache: Fixed a typo with the word 'arithmetic' irq/generic-chip: fix a few kernel-doc entries frv: Convert use of typedef ctl_table to struct ctl_table sgi: xpc: Convert use of typedef ctl_table to struct ctl_table doc: clk: Fix incorrect wording Documentation/arm/IXP4xx fix a typo Documentation/networking/ieee802154 fix a typo Documentation/DocBook/media/v4l fix a typo Documentation/video4linux/si476x.txt fix a typo Documentation/virtual/kvm/api.txt fix a typo Documentation/early-userspace/README fix a typo Documentation/video4linux/soc-camera.txt fix a typo lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment ...	2013-07-04 11:40:58 -07:00
Linus Torvalds	790eac5640	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...	2013-07-03 09:10:19 -07:00
Jie Liu	46a1c2c7ae	vfs: export lseek_execute() to modules For those file systems(btrfs/ext4/ocfs2/tmpfs) that support SEEK_DATA/SEEK_HOLE functions, we end up handling the similar matter in lseek_execute() to update the current file offset to the desired offset if it is valid, ceph also does the simliar things at ceph_llseek(). To reduce the duplications, this patch make lseek_execute() public accessible so that we can call it directly from the underlying file systems. Thanks Dave Chinner for this suggestion. [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back] v2->v1: - Add kernel-doc comments for lseek_execute() - Call lseek_execute() in ceph->llseek() Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Cc: Ben Myers <bpm@sgi.com> Cc: Ted Tso <tytso@mit.edu> Cc: Hugh Dickins <hughd@google.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-07-03 16:23:27 +04:00
Linus Torvalds	9e239bb939	Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABCAAGBQJR0XhgAAoJENNvdpvBGATwMXkQAJwTPk5XYLqtAwLziFLvM6wG 0tWa1QAzTNo80tLyM9iGqI6x74X5nddLw5NMICUmPooOa9agMuA4tlYVSss5jWzV yyB7vLzsc/2eZJusuVqfTKrdGybE+M766OI6VO9WodOoIF1l51JXKjktKeaWegfv NkcLKlakD4V+ZASEDB/cOcR/lTwAs9dQ89AZzgPiW+G8Do922QbqkENJB8mhalbg rFGX+lu9W0f3fqdmT3Xi8KGn3EglETdVd6jU7kOZN4vb5LcF5BKHQnnUmMlpeWMT ksOVasb3RZgcsyf5ZOV5feXV601EsNtPBrHAmH22pWQy3rdTIvMv/il63XlVUXZ2 AXT3cHEvNQP0/yVaOTCZ9xQVxT8sL4mI6kENP9PtNuntx7E90JBshiP5m24kzTZ/ zkIeDa+FPhsDx1D5EKErinFLqPV8cPWONbIt/qAgo6663zeeIyMVhzxO4resTS9k U2QEztQH+hDDbjgABtz9M/GjSrohkTYNSkKXzhTjqr/m5huBrVMngjy/F4/7G7RD vSEx5aXqyagnrUcjsupx+biJ1QvbvZWOVxAE/6hNQNRGDt9gQtHAmKw1eG2mugHX +TFDxodNE4iWEURenkUxXW3mDx7hFbGZR0poHG3M/LVhKMAAAw0zoKrrUG5c70G7 XrddRLGlk4Hf+2o7/D7B =SwaI -----END PGP SIGNATURE----- Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 update from Ted Ts'o: "Lots of bug fixes, cleanups and optimizations. In the bug fixes category, of note is a fix for on-line resizing file systems where the block size is smaller than the page size (i.e., file systems 1k blocks on x86, or more interestingly file systems with 4k blocks on Power or ia64 systems.) In the cleanup category, the ext4's punch hole implementation was significantly improved by Lukas Czerner, and now supports bigalloc file systems. In addition, Jan Kara significantly cleaned up the write submission code path. We also improved error checking and added a few sanity checks. In the optimizations category, two major optimizations deserve mention. The first is that ext4_writepages() is now used for nodelalloc and ext3 compatibility mode. This allows writes to be submitted much more efficiently as a single bio request, instead of being sent as individual 4k writes into the block layer (which then relied on the elevator code to coalesce the requests in the block queue). Secondly, the extent cache shrink mechanism, which was introduce in 3.9, no longer has a scalability bottleneck caused by the i_es_lru spinlock. Other optimizations include some changes to reduce CPU usage and to avoid issuing empty commits unnecessarily." * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits) ext4: optimize starting extent in ext4_ext_rm_leaf() jbd2: invalidate handle if jbd2_journal_restart() fails ext4: translate flag bits to strings in tracepoints ext4: fix up error handling for mpage_map_and_submit_extent() jbd2: fix theoretical race in jbd2__journal_restart ext4: only zero partial blocks in ext4_zero_partial_blocks() ext4: check error return from ext4_write_inline_data_end() ext4: delete unnecessary C statements ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree() jbd2: move superblock checksum calculation to jbd2_write_superblock() ext4: pass inode pointer instead of file pointer to punch hole ext4: improve free space calculation for inline_data ext4: reduce object size when !CONFIG_PRINTK ext4: improve extent cache shrink mechanism to avoid to burn CPU time ext4: implement error handling of ext4_mb_new_preallocation() ext4: fix corruption when online resizing a fs with 1K block size ext4: delete unused variables ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents jbd2: remove debug dependency on debug_fs and update Kconfig help text jbd2: use a single printk for jbd_debug() ...	2013-07-02 09:39:34 -07:00
Josef Bacik	0e267c44c3	Btrfs: wait ordered range before doing direct io My recent truncate patch uncovered this bug, but I can reproduce it without the truncate patch. If you mount with -o compress-force, do a direct write to some area, do a buffered write to some other area, and then do a direct read you will get the wrong data for where you did the buffered write. This is because the generic direct io helpers only call filemap_write_and_wait once, and for compression we need it twice. So to be safe add the btrfs_wait_ordered_range to the start of the direct io function to make sure any compressed writes have truly been written. This patch makes xfstests 130 pass when you mount with -o compress-force=lzo. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:51:49 -04:00
Josef Bacik	7fb7d76f96	Btrfs: only do the tree_mod_log_free_eb if this is our last ref There is another bug in the tree mod log stuff in that we're calling tree_mod_log_free_eb every single time a block is cow'ed. The problem with this is that if this block is shared by multiple snapshots we will call this multiple times per block, so if we go to rewind the mod log for this block we'll BUG_ON() in __tree_mod_log_rewind because we try to rewind a free twice. We only want to call tree_mod_log_free_eb if we are actually freeing the block. With this patch I no longer hit the panic in __tree_mod_log_rewind. Thanks, Cc: stable@vger.kernel.org Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:51:20 -04:00
Josef Bacik	f1ca7e98a6	Btrfs: hold the tree mod lock in __tree_mod_log_rewind We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk forward in the tree mod entries, otherwise we'll end up with random entries and trip the BUG_ON() at the front of __tree_mod_log_rewind. This fixes the panics people were seeing when running find /whatever -type f -exec btrfs fi defrag {} \; Thansk, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:51:18 -04:00
Josef Bacik	261c84b662	Btrfs: make backref walking code handle skinny metadata I missed fixing the backref stuff when I introduced the skinny metadata. If you try and do things like snapshot aware defrag with skinny metadata you are going to see tons of warnings related to the backref count being less than 0. This is because the delayed refs will be found for stuff just fine, but it won't find the skinny metadata extent refs. With this patch I'm not seeing warnings anymore. Thanks, Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:51:02 -04:00
Liu Bo	35f0399db6	Btrfs: fix crash regarding to ulist_add_merge Several users reported this crash of NULL pointer or general protection, the story is that we add a rbtree for speedup ulist iteration, and we use krealloc() to address ulist growth, and krealloc() use memcpy to copy old data to new memory area, so it's OK for an array as it doesn't use pointers while it's not OK for a rbtree as it uses pointers. So krealloc() will mess up our rbtree and it ends up with crash. Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:59 -04:00
Miao Xie	edd1400be9	Btrfs: fix several potential problems in copy_nocow_pages_for_inode - It makes no sense that we deal with a inode in the dead tree. - fix the race between dio and page copy by waiting the dio completion - avoid the page copy vs truncate/punch hole - check if the page is in the page cache or not Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:58 -04:00
Miao Xie	826aa0a82c	Btrfs: cleanup the code of copy_nocow_pages_for_inode() - It make no sense that we continue to do something after the error happened, just go back with this patch. - remove some check of copy_nocow_pages_for_inode(), such as page check after write, inode check in the end of the function, because we are sure they exist. - remove the unnecessary goto in the return value check of the write Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:56 -04:00
Miao Xie	26b2589190	Btrfs: fix oops when recovering the file data by scrub function We get oops while running btrfs replace start test, ------------[ cut here ]------------ kernel BUG at mm/filemap.c:608! [SNIP] Call Trace: [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs] [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs] [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs] [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs] [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs] [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0 [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs] [<ffffffff8106f2f0>] kthread+0xc0/0xd0 [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80 [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0 [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80 [SNIP] RIP [<ffffffff8111f4c5>] unlock_page+0x35/0x40 RSP <ffff88010316bb98> ---[ end trace 421e79ad0dd72c7d ]--- it is because we forgot to lock the page again after we read data to the page. Fix it. Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:55 -04:00
Josef Bacik	6df9a95e63	Btrfs: make the chunk allocator completely tree lockless When adjusting the enospc rules for relocation I ran into a deadlock because we were relocating the only system chunk and that forced us to try and allocate a new system chunk while holding locks in the chunk tree, which caused us to deadlock. To fix this I've moved all of the dev extent addition and chunk addition out to the delayed chunk completion stuff. We still keep the in-memory stuff which makes sure everything is consistent. One change I had to make was to search the commit root of the device tree to find a free dev extent, and hold onto any chunk em's that we allocated in that transaction so we do not allocate the same dev extent twice. This has the side effect of fixing a bug with balance that has been there ever since balance existed. Basically you can free a block group and it's dev extent and then immediately allocate that dev extent for a new block group and write stuff to that dev extent, all within the same transaction. So if you happen to crash during a balance you could come back to a completely broken file system. This patch should keep these sort of things from happening in the future since we won't be able to allocate free'd dev extents until after the transaction commits. This has passed all of the xfstests and my super annoying stress test followed by a balance. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:53 -04:00
Josef Bacik	68a7342c51	Btrfs: cleanup orphaned root orphan item I hit a weird problem were my root item had been deleted but the orphan item had not. This isn't necessarily a problem, but it keeps the file system from being mounted. To fix this we just need to axe the orphan item if we can't find the fs root when we're putting them altogether. With this patch I was able to successfully mount my file system. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:52 -04:00
Miao Xie	a70c6172e7	Btrfs: fix wrong mirror number tuning Now reading the data from the target device of the replace operation is allowed, so the mirror number that is greater than the stripes number of a chunk is valid, we will tune it when we find there is no target device later. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:50 -04:00
Miao Xie	e6da5d2ec9	Btrfs: cleanup redundant code in btrfs_submit_direct() Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:48 -04:00
Miao Xie	f51a4a1826	Btrfs: remove btrfs_sector_sum structure Using the structure btrfs_sector_sum to keep the checksum value is unnecessary, because the extents that btrfs_sector_sum points to are continuous, we can find out the expected checksums by btrfs_ordered_sum's bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After removing bytenr, there is only one member in the structure, so it makes no sense to keep the structure, just remove it, and use a u32 array to store the checksum value. By this change, we don't use the while loop to get the checksums one by one. Now, we can get several checksum value at one time, it improved the performance by ~74% on my SSD (31MB/s -> 54MB/s). test command: # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:47 -04:00
Josef Bacik	7ee9e4405f	Btrfs: check if we can nocow if we don't have data space We always just try and reserve data space when we write, but if we are out of space but have prealloc'ed extents we should still successfully write. This patch will try and see if we can write to prealloc'ed space and if we can go ahead and allow the write to continue. With this patch we now pass xfstests generic/274. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:45 -04:00
Josef Bacik	925a6efb8f	Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which is completely fraking useless for us as we need to make sure pages are actually written before we go and check if there are ordered extents. So replace this with an open coding of try_to_writeback_inodes_sb_nr minus the writeback underway check so that we are sure to actually have flushed some dirty pages out and will have ordered extents to use. With this patch xfstests generic/273 now passes. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:43 -04:00
Josef Bacik	b150a4f10d	Btrfs: use a percpu to keep track of possibly pinned bytes There are all of these checks in the ENOSPC code to see if committing the transaction would free up enough space to make the allocation. This is because early on we just committed the transaction and hoped and prayed, which resulted in cases where it took _forever_ to get an ENOSPC when we really were out of space. So we check space_info->bytes_pinned, except this isn't completely true because it doesn't account for space we may free but are stuck in delayed refs. So tests like xfstests 226 would fail because we wouldn't commit the transaction to free up the data space. So instead add a percpu counter that will be a little fuzzier, it will add bytes as soon as we try to free up the space, and remove any space it doesn't actually free up when we get around to doing the actual free. We then 0 out this counter every transaction period so we have a better idea of how much space we will actually free up by committing this transaction. With this patch we now pass xfstests 226. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:42 -04:00
Josef Bacik	f23b5a5995	Btrfs: check for actual acls rather than just xattrs when caching no acl We have an optimization that will go ahead and cache no acls on an inode if there are no xattrs on the inode. This saves us a lookup later to check the acls for writes or any other access. The problem is I use selinux so I always have an xattr on inodes, so make this test a little smarter and check for the actual acl hash on the key and if it isn't there then we still get to cache no acl which makes everybody who uses selinux a little happier. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-02 11:50:40 -04:00
Josef Bacik	a71754fc68	Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate This has plagued us forever and I'm so over working around it. When we truncate down to a non-page aligned offset we will call btrfs_truncate_page to zero out the end of the page and write it back to disk, this will keep us from exposing stale data if we truncate back up from that point. The problem with this is it requires data space to do this, and people don't really expect to get ENOSPC from truncate() for these sort of things. This also tends to bite the orphan cleanup stuff too which keeps people from mounting. To get around this we can just move this into btrfs_cont_expand() to make sure if we are truncating up from a non-page size aligned i_size we will zero out the rest of this page so that we don't expose stale data. This will give ENOSPC if you try to truncate() up or if you try to write past the end of isize, which is much more reasonable. This fixes xfstests generic/083 failing to mount because of the orphan cleanup failing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:33 -04:00
Josef Bacik	0b08851fda	Btrfs: optimize reada_for_balance This patch does two things. First we no longer explicitly read in the blocks we're trying to readahead. For things like balance_level we may never actually use the blocks so this just adds uneeded latency, and balance_level and split_node will both read in the blocks they care about explicitly so if the blocks need to be waited on it will be done there. Secondly we no longer drop the path if we do readahead, we just set the path blocking before we call reada_for_balance() and then we're good to go. Hopefully this will cut down on the number of re-searches. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:32 -04:00
Josef Bacik	bdf7c00e8f	Btrfs: optimize read_block_for_search This patch does two things, first it only does one call to btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then again with gen specified. The other thing is to call btrfs_read_buffer() on the buffer we've found instead of dropping it and then calling read_tree_block(). This will keep us from doing yet another radix tree lookup for a buffer we've already found. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:32 -04:00
Josef Bacik	fdf8e2ea3c	Btrfs: unlock extent range on enospc in compressed submit A user reported a deadlock where the async submit thread was blocked on the lock_extent() lock, and then everybody behind him was locked on the page lock for the page he was holding. Looking at the code I noticed we do not unlock the extent range when we get ENOSPC and goto retry. This is bad because we immediately try to lock that range again to do the cow, which will cause a deadlock. Fix this by unlocking the range. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:31 -04:00
Wang Sheng-Hui	90b6d2830a	Btrfs: fix the comment typo for btrfs_attach_transaction_barrier The comment is for btrfs_attach_transaction_barrier, not for btrfs_attach_transaction. Fix the typo. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Acked-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:30 -04:00
Josef Bacik	aee68ee5f5	Btrfs: fix not being able to find skinny extents during relocate We unconditionally search for the EXTENT_ITEM_KEY for metadata during balance, and then check the key that we found to see if it is actually a METADATA_ITEM_KEY, but this doesn't work right because METADATA is a higher key value, so if what we are looking for happens to be the first item in the leaf the search will dump us out at the previous leaf, and we won't find our item. So instead do what we do everywhere else, search for the skinny extent first and if we don't find it go back and re-search for the extent item. This patch fixes the panic I was hitting when balancing a large file system with skinny extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:30 -04:00
Josef Bacik	da61d31a78	Btrfs: cleanup backref search commit root flag stuff Looking into this backref problem I noticed we're using a macro to what turns out to essentially be a NULL check to see if we need to search the commit root. I'm killing this, let's just do what everybody else does and checks if trans == NULL. I've also made it so we pass in the path to __resolve_indirect_refs which will have the search_commit_root flag set properly already and that way we can avoid allocating another path when we have a perfectly good one to use. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:29 -04:00
Josef Bacik	d88d46c6e0	Btrfs: free csums when we're done scrubbing an extent A user reported scrub taking up an unreasonable amount of ram as it ran. This is because we lookup the csums for the extent we're scrubbing but don't free it up until after we're done with the scrub, which means we can take up a whole lot of ram. This patch fixes this by dropping the csums once we're done with the extent we've scrubbed. The user reported this to fix their problem. Thanks, Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:28 -04:00
Josef Bacik	1be41b78bc	Btrfs: fix transaction throttling for delayed refs Dave has this fs_mark script that can make btrfs abort with sufficient amount of ram. This is because with more ram we can keep more dirty metadata in cache which in a round about way makes for many more pending delayed refs. What happens is we end up not throttling the transaction enough so when we go to commit the transaction when we've completely filled the file system we'll abort() because we use all of the space in the global reserve and we still have delayed refs to run. To fix this we need to make the delayed ref flushing and the transaction throttling dependant upon the number of delayed refs that we have instead of how much reserved space is left in the global reserve. With this patch we not only stop aborting transactions but we also get a smoother run speed with fs_mark and it makes us about 10% faster. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:28 -04:00
Josef Bacik	501407aab8	Btrfs: stop waiting on current trans if we aborted I hit a hang when run_delayed_refs returned an error in the beginning of btrfs_commit_transaction. If we decide we need to commit the transaction in btrfs_end_transaction we'll set BLOCKED and start to commit, but if we get an error this early on we'll just exit without committing. This is fine, except that anybody else who tried to start a transaction will sit in wait_current_trans() since we're set to BLOCKED and we never set it to something else and woke people up. To fix this we want to check for trans->aborted everywhere we wait for the transaction state to change, and make btrfs_abort_transaction() wake up any waiters there may be. All the callers will notice that the transaction has aborted and exit out properly. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:27 -04:00
Josef Bacik	f971fe29b1	Btrfs: wake up delayed ref flushing waiters on abort I hit a deadlock because we aborted when flushing delayed refs but didn't wake any of the other flushers up and so everybody was just sleeping forever. This should fix the problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:26 -04:00
Jie Liu	3fb4037599	btrfs: fix the code comments for LZO compression workspace Fix the code comments for lzo compression workspace. The buf item is used to store the decompressed data and cbuf is used to store the compressed data. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:26 -04:00
Miao Xie	5bc7247ac4	Btrfs: fix broken nocow after balance Balance will create reloc_root for each fs root, and it's going to record last_snapshot to filter shared blocks. The side effect of setting last_snapshot is to break nocow attributes of files. Since the extents are not shared by the relocation tree after the balance, we can recover the old last_snapshot safely if no one snapshoted the source tree. We fix the above problem by this way. Reported-by: Kyle Gates <kylegates@hotmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-07-01 08:52:25 -04:00
Al Viro	6d0379ec49	btrfs: more open-coded file_inode() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-06-29 12:57:24 +04:00
Al Viro	9cdda8d31f	[readdir] convert btrfs Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-06-29 12:57:00 +04:00
Josef Bacik	8c2a1a3028	Btrfs: exclude logged extents before replying when we are mixed With non-mixed block groups we replay the logs before we're allowed to do any writes, so we get away with not pinning/removing the data extents until right when we replay them. However with mixed block groups we allocate out of the same pool, so we could easily allocate a metadata block that was logged in our tree log. To deal with this we just need to notice that we have mixed block groups and do the normal excluding/removal dance during the pin stage of the log replay and that way we don't allocate metadata blocks from areas we have logged data extents. With this patch we now pass xfstests generic/311 with mixed block groups turned on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:17 -04:00
Josef Bacik	01cd33674e	Btrfs: put our inode if orphan cleanup fails When we cross into a different subvol when doing a lookup we will run the orhpan cleanup. If this fails however we do not drop the ref to the inode we were looking up before we return an error, which leads to busy inodes on umount. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:16 -04:00
Josef Bacik	c69b26b011	Btrfs: add some missing iput()'s in btrfs_orphan_cleanup There are some error cases that we don't do an iput() on our inode, fix this. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:15 -04:00
Josef Bacik	e78417d192	Btrfs: do not pin while under spin lock When testing a corrupted fs I noticed I was getting sleep while atomic errors when the transaction aborted. This is because btrfs_pin_extent may need to allocate memory and we are calling this under the spin lock. Fix this by moving it out and doing the pin after dropping the spin lock but before dropping the mutex, the same way it works when delayed refs run normally. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:13 -04:00
Thomas Meyer	a5959bc0a1	Btrfs: Cocci spatch "memdup.spatch" Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:12 -04:00
Thomas Meyer	97a184fe81	Btrfs: Cocci spatch "ptr_ret.spatch" Signed-off-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:11 -04:00
Jan Schmidt	b382a324b6	Btrfs: fix qgroup rescan resume on mount When called during mount, we cannot start the rescan worker thread until open_ctree is done. This commit restuctures the qgroup rescan internals to enable a clean deferral of the rescan resume operation. First of all, the struct qgroup_rescan is removed, saving us a malloc and some initialization synchronizations problems. Its only element (the worker struct) now lives within fs_info just as the rest of the rescan code. Then setting up a rescan worker is split into several reusable stages. Currently we have three different rescan startup scenarios: (A) rescan ioctl (B) rescan resume by mount (C) rescan by quota enable Each case needs its own combination of the four following steps: (1) set the progress [A, C: zero; B: state of umount] (2) commit the transaction [A] (3) set the counters [A, C: zero; B: state of umount] (4) start worker [A, B, C] qgroup_rescan_init does step (1). There's no extra function added to commit a transaction, we've got that already. qgroup_rescan_zero_tracking does step (3). Step (4) is nothing more than a call to the generic btrfs_queue_worker. We also get rid of a double check for the rescan progress during btrfs_qgroup_account_ref, which is no longer required due to having step 2 from the list above. As a side effect, this commit prepares to move the rescan start code from btrfs_run_qgroups (which is run during commit) to a less time critical section. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:10 -04:00
Jan Schmidt	eb1716af88	Btrfs: avoid double free of fs_info->qgroup_ulist When btrfs_read_qgroup_config or btrfs_quota_enable return non-zero, we've already freed the fs_info->qgroup_ulist. The final btrfs_free_qgroup_config called from quota_disable makes another ulist_free(fs_info->qgroup_ulist) call. We set fs_info->qgroup_ulist to NULL on the mentioned error paths, turning the ulist_free in btrfs_free_qgroup_config into a noop. Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:08 -04:00
Jan Schmidt	4373519db4	Btrfs: fix memory patcher through fs_info->qgroup_ulist Commit 5b7c665e introduced fs_info->qgroup_ulist, that is allocated during btrfs_read_qgroup_config and meant to be used later by the qgroup accounting code. However, it is always freed before btrfs_read_qgroup_config returns, becuase the commit mentioned above adds a check for (ret), where a check for (ret < 0) would have been the right choice. This commit fixes the check. Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:07 -04:00
Josef Bacik	d52be818e6	Btrfs: simplify unlink reservations Dave pointed out a problem where if you filled up a file system as much as possible you couldn't remove any files. The whole unlink reservation thing is convoluted because it tries to guess if it's going to add space to unlink something or not, and has all these odd uncommented cases where it simply does not try. So to fix this I've added a way to conditionally steal from the global reserve if we can't make our normal reservation. If we have more than half the space in the global reserve free we will go ahead and steal from the global reserve. With this patch Dave's reproducer now works and I can rm all the files on the file system. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:06 -04:00
Miao Xie	c6adc9cc08	Btrfs: merge pending IO for tree log write back Before applying this patch, we flushed the log tree of the fs/file tree firstly, and then flushed the log root tree. It is ineffective, especially on the hard disk. This patch improved this problem by wrapping the above two flushes by the same blk_plug. By test, the performance of the sync write went up ~60%(2.9MB/s -> 4.6MB/s) on my scsi disk whose disk buffer was enabled. Test step: # mkfs.btrfs -f -m single <disk> # mount <disk> <mnt> # dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:05 -04:00
Liu Bo	a96fbc7288	Btrfs: allow file data clone within a file We did not allow file data clone within the same file because of deadlock issues. However, we now use nested lock to avoid deadlock between the parent directory and the child file. So it's safe to do file clone within the same file when the two ranges are not overlapped. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:03 -04:00
Liu Bo	b7394eb91c	Btrfs: remove unused code in btrfs_del_root 'leaf' and 'ri' is not used somehow. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:02 -04:00
Liu Bo	2da1c669f0	Btrfs: kill replicate code in replay_one_buffer EXTREF is treated same as REF, so we can make the code tidy. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:30:01 -04:00
Liu Bo	33157e05db	Btrfs: check if leaf's parent exists before pushing items around During splitting a leaf, pushing items around to hopefully get some space only works when we have a parent, ie. we have at least one sibling leaf. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:58 -04:00
Liu Bo	fdd99c7294	Btrfs: dont do log_removal in insert_new_root As for splitting a leaf, root is just the leaf, and tree mod log does not apply on leaf, so in this case, we don't do log_removal. As for splitting a node, the old root is kept as a normal node and we have nicely put records in tree mod log for moving keys and items, so in this case we don't do that either. As above, insert_new_root can get rid of log_removal. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:57 -04:00
Wei Yongjun	4b286cd1f5	Btrfs: return error code in btrfs_check_trunc_cache_free_space() Fix to return error code instead always return 0 from function btrfs_check_trunc_cache_free_space(). Introduced by commit `7b61cd9224` (Btrfs: don't use global block reservation for inode cache truncation) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:56 -04:00
Josef Bacik	139f807a1e	Btrfs: fix estale with btrfs send This fixes bugzilla 57491. If we take a snapshot of a fs with a unlink ongoing and then try to send that root we will run into problems. When comparing with a parent root we will search the parents and the send roots commit_root, which if we've just created the snapshot will include the file that needs to be evicted by the orphan cleanup. So when we find a changed extent we will try and copy that info into the send stream, but when we lookup the inode we use the normal root, which no longer has the inode because the orphan cleanup deleted it. The best solution I have for this is to check our otransid with the generation of the commit root and if they match just commit the transaction again, that way we get the changes from the orphan cleanup. With this patch the reproducer I made for this bugzilla no longer returns ESTALE when trying to do the send. Thanks, Cc: stable@vger.kernel.org Reported-by: Chris Wilson <jakdaw@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:55 -04:00
Anand Jain	183860f6a0	btrfs: device delete to get errors from the kernel when user runs command btrfs dev del the raid requisite error if any goes to the /var/log/messages, its not good idea to clutter messages with these user (knowledge) errors, further user don't have to review the system messages to know problem with the cli it should be dropped to the user as part of the cli return. to bring this feature created a set of the ERROR defined BTRFS_ERROR_DEV* error codes and created their error string. I expect this enum to be added with other error which we might want to communicate to the user land v3: moved the code with in the file no logical change v1->v2: introduce error codes for the device mgmt usage v1: adds a parameter in the ioctl arg struct to carry the error string Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:53 -04:00
Josef Bacik	c73e293678	Btrfs: do delay iput in sync_fs We get lock inversion with umount if we allow iputs from sync_fs, so use the delay iput flag to keep this from happening. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:52 -04:00
Miao Xie	4a9d8bdee3	Btrfs: make the state of the transaction more readable We used 3 variants to track the state of the transaction, it was complex and wasted the memory space. Besides that, it was hard to understand that which types of the transaction handles should be blocked in each transaction state, so the developers often made mistakes. This patch improved the above problem. In this patch, we define 6 states for the transaction, enum btrfs_trans_state { TRANS_STATE_RUNNING = 0, TRANS_STATE_BLOCKED = 1, TRANS_STATE_COMMIT_START = 2, TRANS_STATE_COMMIT_DOING = 3, TRANS_STATE_UNBLOCKED = 4, TRANS_STATE_COMPLETED = 5, TRANS_STATE_MAX = 6, } and just use 1 variant to track those state. In order to make the blocked handle types for each state more clear, we introduce a array: unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = { [TRANS_STATE_RUNNING] = 0U, [TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE \| __TRANS_START), [TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE \| __TRANS_START \| __TRANS_ATTACH), [TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE \| __TRANS_START \| __TRANS_ATTACH \| __TRANS_JOIN), [TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE \| __TRANS_START \| __TRANS_ATTACH \| __TRANS_JOIN \| __TRANS_JOIN_NOLOCK), [TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE \| __TRANS_START \| __TRANS_ATTACH \| __TRANS_JOIN \| __TRANS_JOIN_NOLOCK), } it is very intuitionistic. Besides that, because we remove ->in_commit in transaction structure, so the lock ->commit_lock which was used to protect it is unnecessary, remove ->commit_lock. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:51 -04:00
Miao Xie	581227d0d2	Btrfs: remove the time check in btrfs_commit_transaction() We checked the commit time to avoid committing the transaction frequently, but it is unnecessary because: - It made the transaction commit spend more time, and delayed the operation of the external writers(TRANS_START/TRANS_USERSPACE). - Except the space that we have to commit transaction, such as snapshot creation, btrfs doesn't commit the transaction on its own initiative. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:50 -04:00
Miao Xie	3f1e3fa65c	Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure We used ->num_joined track if there were some writers which join the current transaction when the committer was sleeping. If some writers joined the current transaction, we has to continue the while loop to do some necessary stuff, such as flush the ordered operations. But it is unnecessary because we will do it after the while loop. Besides that, tracking ->num_joined would make the committer drop into the while loop when there are lots of internal writers(TRANS_JOIN). So we remove ->num_joined and don't track if there are some writers which join the current transaction when the committer is sleeping. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:48 -04:00
Miao Xie	824366177a	Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set It is unnecessary to flush the delalloc inodes again and again because we don't care the dirty pages which are introduced after the flush, and they will be flush in the transaction commit. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:47 -04:00
Miao Xie	0860adfdb2	Btrfs: don't wait for all the writers circularly during the transaction commit btrfs_commit_transaction has the following loop before we commit the transaction. do { // attempt to do some useful stuff and/or sleep } while (atomic_read(&cur_trans->num_writers) > 1 \|\| (should_grow && cur_trans->num_joined != joined)); This is used to prevent from the TRANS_START to get in the way of a committing transaction. But it does not prevent from TRANS_JOIN, that is we would do this loop for a long time if some writers JOIN the current transaction endlessly. Because we need join the current transaction to do some useful stuff, we can not block TRANS_JOIN here. So we introduce a external writer counter, which is used to count the TRANS_USERSPACE/TRANS_START writers. If the external writer counter is zero, we can break the above loop. In order to make the code more clear, we don't use enum variant to define the type of the transaction handle, use bitmask instead. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:46 -04:00
Miao Xie	25d8c284c7	Btrfs: remove the code for the impossible case in cleanup_transaction() If the transaction is removed from the transaction list, it means the transaction has been committed successfully. So it is impossible to call cleanup_transaction(), otherwise there is something wrong with the code logic. Thus, we use BUG_ON() instead of the original handle. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:45 -04:00
Miao Xie	ac6738792f	Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction When we umount a fs with serious errors, we will invoke btrfs_cleanup_transactions() to clean up the residual transaction. At this time, It is impossible to start a new transaction, so we needn't assign trans_no_join to 1, and also needn't clear running transaction every time we destroy a residual transaction. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:44 -04:00
Miao Xie	6a03843df4	Btrfs: just flush the delalloc inodes in the source tree before snapshot creation Before applying this patch, we need flush all the delalloc inodes in the fs when we want to create a snapshot, it wastes time, and make the transaction commit be blocked for a long time. It means some other user operation would also be blocked for a long time. This patch improves this problem, we just flush the delalloc inodes that in the source trees before snapshot creation, so the transaction commit will complete quickly. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:42 -04:00
Miao Xie	199c2a9c3d	Btrfs: introduce per-subvolume ordered extent list The reason we introduce per-subvolume ordered extent list is the same as the per-subvolume delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:41 -04:00
Miao Xie	eb73c1b7ce	Btrfs: introduce per-subvolume delalloc inode list When we create a snapshot, we need flush all delalloc inodes in the fs, just flushing the inodes in the source tree is OK. So we introduce per-subvolume delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:40 -04:00
Miao Xie	b0feb9d96e	Btrfs: introduce grab/put functions for the root of the fs/file tree The grab/put funtions will be used in the next patch, which need grab the root object and ensure it is not freed. We use reference counter instead of the srcu lock is to aovid blocking the memory reclaim task, which invokes synchronize_srcu(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:38 -04:00
Miao Xie	cb517eabba	Btrfs: cleanup the similar code of the fs root read There are several functions whose code is similar, such as btrfs_find_last_root() btrfs_read_fs_root_no_radix() Besides that, some functions are invoked twice, it is unnecessary, for example, we are sure that all roots which is found in btrfs_find_orphan_roots() have their orphan items, so it is unnecessary to check the orphan item again. So cleanup it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:37 -04:00
Miao Xie	babbf170c7	Btrfs: make the snap/subv deletion end more early when the fs is R/O The snapshot/subvolume deletion might spend lots of time, it would make the remount task wait for a long time. This patch improve this problem, we will break the deletion if the fs is remounted to be R/O. It will make the users happy. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:36 -04:00
Miao Xie	dc7f370c05	Btrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot() If the fs is remounted to be R/O, it is unnecessary to call btrfs_clean_one_deleted_snapshot(), so move the R/O check out of this function. And besides that, it can make the check logic in the caller more clear. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:34 -04:00
Miao Xie	05323cd135	Btrfs: make the cleaner complete early when the fs is going to be umounted Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:33 -04:00
Miao Xie	d027824564	Btrfs: remove unnecessary ->s_umount in cleaner_kthread() In order to avoid the R/O remount, we acquired ->s_umount lock during we deleted the dead snapshots and subvolumes. But it is unnecessary, because we have cleaner_mutex. We use cleaner_mutex to protect the process of the dead snapshots/subvolumes deletion. And when we remount the fs to be R/O, we also acquire this mutex to do cleanup after we change the status of the fs. That is this lock can serialize the above operations, the cleaner can be aware of the status of the fs, and if the cleaner is deleting the dead snapshots/subvolumes, the remount task will wait for it. So it is safe to remove ->s_umount in cleaner_kthread(). Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:32 -04:00
Stefan Behrens	3c64a1aba7	Btrfs: cleanup: don't check the same thing twice btrfs_read_fs_root_no_name() already checks if btrfs_root_refs() is zero and returns ENOENT in this case. There is no need to do it again in six places. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:30 -04:00
Stefan Behrens	b1b195969f	Btrfs: cleanup, btrfs_read_fs_root_no_name() doesn't return NULL No need to check for NULL in send.c and disk-io.c. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:29 -04:00
Stefan Behrens	78a1068b28	Btrfs: delete unused function Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:28 -04:00
Liu Bo	5798b92d2b	Btrfs: remove useless copy in quota_ctl We don't need to copy it back to user side as it remains unchanged. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:27 -04:00
Andreas Philipp	1c89cdd1ce	Minor format cleanup. Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and BTRFS_BLOCK_GROUP_RAID6. Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:25 -04:00
Tsutomu Itoh	924794c936	Btrfs: cleanup unused arguments in send.c sctx is removed from the argument of the function that doesn't use sctx. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:24 -04:00
Stefan Behrens	8f69dbd236	Btrfs: fix a comment The size parameter to btrfs_extend_item() is the number of bytes to add to the item, not the size of the item after the operation (like it is for btrfs_truncate_item(), there the size parameter is not the number of bytes to take away, but the total size of the item after truncation). Fix it in the comment. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:23 -04:00
Jan Schmidt	57254b6ebc	Btrfs: add ioctl to wait for qgroup rescan completion btrfs_qgroup_wait_for_completion waits until the currently running qgroup operation completes. It returns immediately when no rescan process is in progress. This is useful to automate things around the rescan process (e.g. testing). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:22 -04:00
Wang Shilong	1e8f915868	Btrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time when we want to walk qgroup tree. By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free() once. This reduce some sys time to allocate memory, see the measurements below fsstress -p 4 -n 10000 -d $dir With this patch: real 0m50.153s user 0m0.081s sys 0m6.294s real 0m51.113s user 0m0.092s sys 0m6.220s real 0m52.610s user 0m0.096s sys 0m6.125s avg 6.213 ----------------------------------------------------- Without the patch: real 0m54.825s user 0m0.061s sys 0m10.665s real 1m6.401s user 0m0.089s sys 0m11.218s real 1m13.768s user 0m0.087s sys 0m10.665s avg 10.849 we can see the sys time reduce ~43%. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:21 -04:00
David Sterba	85965600f5	btrfs: show compiled-in config features at module load time We want to know if there are debugging features compiled in, this may affect performance. The message is printed before the sanity checks. Also kill version.h file that serves no purpose, we don't use any version tag for kernel module. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:19 -04:00
David Sterba	e6d2960582	btrfs: move ifdef around sanity checks out of init_btrfs_fs Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:18 -04:00
David Sterba	905d0f564e	btrfs: add prefix to sanity tests messages And change the message level to KERN_INFO. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:17 -04:00
David Sterba	8d599ae1bf	btrfs: add debug check for extent_io range alignment The 'end' value must exactly cover the end of the interval, which means one byte less than the expected block alignment, or in case of a file smaller than one block, one byte less than the inode size. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:15 -04:00
Henrik Nordvik	15b0a89d71	Btrfs: fix check on same raid type flag twice Code checked for raid 5 flag in two else-if branches, so code would never be reached. Probably a copy-paste bug. Signed-off-by: Henrik Nordvik <henrikno@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-14 11:29:14 -04:00
Linus Torvalds	a2648ebb7e	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assortment of crash fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: stop all workers before cleaning up roots Btrfs: fix use-after-free bug during umount Btrfs: init relocate extent_io_tree with a mapping btrfs: Drop inode if inode root is NULL Btrfs: don't delete fs_roots until after we cleanup the transaction	2013-06-13 22:34:14 -07:00
Josef Bacik	13e6c37b98	Btrfs: stop all workers before cleaning up roots Dave reported a panic because the extent_root->commit_root was NULL in the caching kthread. That is because we just unset it in free_root_pointers, which is not the correct thing to do, we have to either wait for the caching kthread to complete or hold the extent_commit_sem lock so we know the thread has exited. This patch makes the kthreads all stop first and then we do our cleanup. This should fix the race. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-06-08 15:11:35 -04:00
Liu Bo	2932505abe	Btrfs: fix use-after-free bug during umount Commit `be283b2e67` ( Btrfs: use helper to cleanup tree roots) introduced the following bug, BUG: unable to handle kernel NULL pointer dereference at 0000000000000034 IP: [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs] [...] Pid: 2463, comm: btrfs-cache-1 Tainted: G O 3.9.0+ #4 innotek GmbH VirtualBox/VirtualBox RIP: 0010:[<ffffffffa039368c>] [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs] Process btrfs-cache-1 (pid: 2463, threadinfo ffff880112d60000, task ffff880117679730) [...] Call Trace: [<ffffffffa0398a99>] btrfs_search_slot+0x104/0x64d [btrfs] [<ffffffffa039aea4>] btrfs_next_old_leaf+0xa7/0x334 [btrfs] [<ffffffffa039b141>] btrfs_next_leaf+0x10/0x12 [btrfs] [<ffffffffa039ea13>] caching_thread+0x1a3/0x2e0 [btrfs] [<ffffffffa03d8811>] worker_loop+0x14b/0x48e [btrfs] [<ffffffffa03d86c6>] ? btrfs_queue_worker+0x25c/0x25c [btrfs] [<ffffffff81068d3d>] kthread+0x8d/0x95 [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43 [<ffffffff8151e5ac>] ret_from_fork+0x7c/0xb0 [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43 RIP [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs] We've free'ed commit_root before actually getting to free block groups where caching thread needs valid extent_root->commit_root. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-06-08 15:10:01 -04:00
Josef Bacik	a9995eece3	Btrfs: init relocate extent_io_tree with a mapping Dave reported a NULL pointer deref. This is caused because he thought he'd be smart and add sanity checks to the extent_io bit operations, but he didn't expect a tree to have a NULL mapping. To fix this we just need to init the relocation's processed_blocks with the btree_inode->i_mapping. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-06-08 15:07:53 -04:00
Naohiro Aota	6379ef9fb2	btrfs: Drop inode if inode root is NULL There is a path where btrfs_drop_inode() is called with its inode's root is NULL: In btrfs_new_inode(), when btrfs_set_inode_index() fails, iput() is called. We should handle this case before taking look at the root->root_item. Signed-off-by: Naohiro Aota <naota@elisp.net> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-06-08 15:07:53 -04:00
Josef Bacik	7b5ff90ed0	Btrfs: don't delete fs_roots until after we cleanup the transaction We get a use after free if we had a transaction to cleanup since there could be delayed inodes which refer to their respective fs_root. Thanks Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-06-08 15:07:53 -04:00
Masanari Iida	8b513d0cf6	treewide: Fix typo in printk Correct spelling typo in various part of drivers Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-05-28 12:02:13 +02:00
Stefan Behrens	7e21f14d17	btrfs: fix btrfs_extend_item() comment The size parameter to btrfs_extend_item() is the number of bytes to add to the item, not the size of the item after the operation (like it is for btrfs_truncate_item(), there the size parameter is not the number of bytes to take away, but the total size of the item after truncation). Fix it in the comment. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-05-28 12:02:12 +02:00
Lukas Czerner	d47992f86b	mm: change invalidatepage prototype to accept length Currently there is no way to truncate partial page where the end truncate point is not at the end of the page. This is because it was not needed and the functionality was enough for file system truncate operation to work properly. However more file systems now support punch hole feature and it can benefit from mm supporting truncating page just up to the certain point. Specifically, with this functionality truncate_inode_pages_range() can be changed so it supports truncating partial page at the end of the range (currently it will BUG_ON() if 'end' is not at the end of the page). This commit changes the invalidatepage() address space operation prototype to accept range to be invalidated and update all the instances for it. We also change the block_invalidatepage() in the same way and actually make a use of the new length argument implementing range invalidation. Actual file system implementations will follow except the file systems where the changes are really simple and should not change the behaviour in any way .Implementation for truncate_page_range() which will be able to accept page unaligned ranges will follow as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com>	2013-05-21 23:17:23 -04:00
Linus Torvalds	130901ba33	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Miao Xie has been very busy, fixing races and enospc problems and many other small but important pieces. Alexandre Oliva discovered some problems with how our error handling was interacting with the block layer and for now has disabled our partial handling of sub-page writes. The real sub-page work is in a series of patches from IBM that we still need to integrate and test. The code Alexandre has turned off was really incomplete. Josef has more error handling fixes and an important fix for the new skinny extent format. This also has my fix for the tracepoint crash from late in 3.9. It's the first stage in a larger clean up to get rid of btrfs_bio and make a proper bioset for all the items we need to tack into the bio. For now the bioset only holds our mirror_num and stripe_index, but for the next merge window I'll shuffle more in." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits) Btrfs: use a btrfs bioset instead of abusing bio internals Btrfs: make sure roots are assigned before freeing their nodes Btrfs: explicitly use global_block_rsv for quota_tree btrfs: do away with non-whole_page extent I/O Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix() Btrfs: pause the space balance when remounting to R/O Btrfs: fix unprotected root node of the subvolume's inode rb-tree Btrfs: fix accessing a freed tree root Btrfs: return errno if possible when we fail to allocate memory Btrfs: update the global reserve if it is empty Btrfs: don't steal the reserved space from the global reserve if their space type is different Btrfs: optimize the error handle of use_block_rsv() Btrfs: don't use global block reservation for inode cache truncation Btrfs: don't abort the current transaction if there is no enough space for inode cache Correct allowed raid levels on balance. Btrfs: fix possible memory leak in replace_path() Btrfs: fix possible memory leak in the find_parent_nodes() Btrfs: don't allow device replace on RAID5/RAID6 Btrfs: handle running extent ops with skinny metadata ...	2013-05-18 11:35:28 -07:00
Chris Mason	c5cb6a0573	Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next	2013-05-17 21:53:17 -04:00
Chris Mason	9be3395bcd	Btrfs: use a btrfs bioset instead of abusing bio internals Btrfs has been pointer tagging bi_private and using bi_bdev to store the stripe index and mirror number of failed IOs. As bios bubble back up through the call chain, we use these to decide if and how to retry our IOs. They are also used to count IO failures on a per device basis. Recently a bio tracepoint was added lead to crashes because we were abusing bi_bdev. This commit adds a btrfs bioset, and creates explicit fields for the mirror number and stripe index. The plan is to extend this structure for all of the fields currently in struct btrfs_bio, which will mean one less kmalloc in our IO path. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Tejun Heo <tj@kernel.org>	2013-05-17 21:52:52 -04:00
Josef Bacik	655b09fe54	Btrfs: make sure roots are assigned before freeing their nodes If we fail to load the chunk tree we'll call free_root_pointers, except we may not have assigned the roots for the dev_root/extent_root/csum_root yet, so we could NULL pointer deref at this point. Just add checks to make sure these roots are set to keep us from panicing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:38 -04:00
Stefan Behrens	3a6cad9009	Btrfs: explicitly use global_block_rsv for quota_tree The quota_tree was set up to use the empty_block_rsv before which would be problematic when the filesystem is filled up and ENOSPC happens during internal operations while the quota tree is updated and COWed (when the btrfs_qgroup_info_item items) are written. In fact, use_block_rsv() which is used in btrfs_cow_block() falls back to the global_block_rsv in this case. But just in order to make it more clear what is happening, change it to explicitly use the global_block_rsv. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:36 -04:00
Alexandre Oliva	17a5adccf3	btrfs: do away with non-whole_page extent I/O end_bio_extent_readpage computes whole_page based on bv_offset and bv_len, without taking into account that blk_update_request may modify them when some of the blocks to be read into a page produce a read error. This would cause the read to unlock only part of the file range associated with the page, which would in turn leave the entire page locked, which would not only keep the process blocked instead of returning -EIO to it, but also prevent any further access to the file. It turns out that btrfs always issues whole-page reads and writes. The special handling of non-whole_page appears to be a mistake or a left-over from a time when this wasn't the case. Indeed, end_bio_extent_writepage distinguished between whole_page and non-whole_page writes but behaved identically in both cases! I've replaced the whole_page computations with warnings, just to be sure that we're not issuing partial page reads or writes. The warnings should probably just go away some time. Signed-off-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:35 -04:00
Miao Xie	b216cbfb52	Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context btrfs_invalidate_inodes() may sleep, so we should not invoke it in the spin lock context. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:34 -04:00
Miao Xie	314297c2a3	Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix() We have checked if ->node is NULL or not, so it is unnecessary to use BUG_ON() to check again. Remove it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:32 -04:00
Miao Xie	061594ef17	Btrfs: pause the space balance when remounting to R/O Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:31 -04:00
Miao Xie	e1409cef85	Btrfs: fix unprotected root node of the subvolume's inode rb-tree The root node of the rb-tree may be changed, so we should get it under the lock. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:30 -04:00
Miao Xie	89042e5ad2	Btrfs: fix accessing a freed tree root inode_tree_del() will move the tree root into the dead root list, and then the tree will be destroyed by the cleaner. So if we remove the delayed node which is cached in the inode after inode_tree_del(), we may access a freed tree root. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:29 -04:00
Liu Bo	b9aa55bed1	Btrfs: return errno if possible when we fail to allocate memory We need to set return value explicitly, otherwise we'll lose the error value. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:27 -04:00
Miao Xie	d88033dbf4	Btrfs: update the global reserve if it is empty Before applying this patch, we reserved the space for the global reserve by the minimum unit if we found it is empty, it was unreasonable and inefficient, because if the global reserve space was depleted, it implied that the size of the global reserve was too small. In this case, we shoud update the global reserve and fill it. Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:26 -04:00
Miao Xie	5881cfc924	Btrfs: don't steal the reserved space from the global reserve if their space type is different If the type of the space we need is different with the global reserve, we can not steal the space from the global reserve, because we can not allocate the space from the free space cache that the global reserve points to. Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:25 -04:00
Miao Xie	b586b32374	Btrfs: optimize the error handle of use_block_rsv() cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:24 -04:00
Miao Xie	7b61cd9224	Btrfs: don't use global block reservation for inode cache truncation It is very likely that there are lots of subvolumes/snapshots in the filesystem, so if we use global block reservation to do inode cache truncation, we may hog all the free space that is reserved in global rsv. So it is better that we do the free space reservation for inode cache truncation by ourselves. Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:22 -04:00
Miao Xie	7cfa9e51d2	Btrfs: don't abort the current transaction if there is no enough space for inode cache The filesystem with inode cache was forced to be read-only when we umounted it. Steps to reproduce: # mkfs.btrfs -f ${DEV} # mount -o inode_cache ${DEV} ${MNT} # dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192 # btrfs fi syn ${MNT} # dd if=${MNT}/file1 of=/dev/null bs=1M # rm -f ${MNT}/file1 # btrfs fi syn ${MNT} # umount ${MNT} It is because there was no enough space to do inode cache truncation, and then we aborted the current transaction. But no space error is not a serious problem when we write out the inode cache, and it is safe that we just skip this step if we meet this problem. So we need not abort the current transaction. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:21 -04:00
Andreas Philipp	8250dabedb	Correct allowed raid levels on balance. Raid5 with 3 devices is well defined while the old logic allowed raid5 only with a minimum of 4 devices when converting the block group profile via btrfs balance. Creating a raid5 with just three devices using mkfs.btrfs worked always as expected. This is now fixed and the whole logic is rewritten. Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:20 -04:00
Stefan Behrens	379cde741b	Btrfs: fix possible memory leak in replace_path() In replace_path(), if read_tree_block() fails, we cannot return directly, we should free some allocated memory otherwise memory leak happens. Similar to Wang's "Btrfs: fix possible memory leak in the find_parent_nodes()" patch, the current commit fixes an issue that is related to the "Btrfs: fix all callers of read_tree_block" commit. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:19 -04:00
Wang Shilong	c16c2e2e51	Btrfs: fix possible memory leak in the find_parent_nodes() In the find_parent_nodes(), if read_tree_block() fails, we can not return directly, we should free some allocated memory otherwise memory leak happens. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:17 -04:00
Stefan Behrens	4968810752	Btrfs: don't allow device replace on RAID5/RAID6 This is not yet supported and causes crashes. One sad user reported that it destroyed his filesystem. One failure is in __btrfs_map_block+0xc1f calling kmalloc(0). 0x5f21f is in __btrfs_map_block (fs/btrfs/volumes.c:4923). 4918 num_stripes = map->num_stripes; 4919 max_errors = nr_parity_stripes(map); 4920 4921 raid_map = kmalloc(sizeof(u64) * num_stripes, 4922 GFP_NOFS); 4923 if (!raid_map) { 4924 ret = -ENOMEM; 4925 goto out; 4926 } 4927 There might be more issues. Until this is really tested, don't allow users to start the procedure on RAID5/RAID6 filesystems. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:16 -04:00
Josef Bacik	b1c79e0947	Btrfs: handle running extent ops with skinny metadata Chris hit a bug where we weren't finding extent records when running extent ops. This is because we use the delayed_ref_head when running the extent op, which means we can't use the ->type checks to see if we are metadata. We also lose the level of the metadata we are working on. So to fix this we can just check the ->is_data section of the extent_op, and we can store the level of the buffer we were modifying in the extent_op. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:15 -04:00
Josef Bacik	73e1e61fb8	Btrfs: remove warn on in free space cache writeout This catches block groups that are too large to properly cache. We deal with this case fine, so the warning just confuses users. Remove the warning. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:13 -04:00
Josef Bacik	69a85bd87c	Btrfs: don't null pointer deref on abort I'm sorry, theres no excuse for this sort of work. We need to use root->leafsize since eb may be NULL. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:12 -04:00
Gabriel de Perthuis	03b71c6ca6	btrfs: don't stop searching after encountering the wrong item The search ioctl skips items that are too large for a result buffer, but inline items of a certain size occuring before any search result is found would trigger an overflow and stop the search entirely. Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641 Cc: stable@vger.kernel.org Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 21:40:10 -04:00
Liu Bo	a52f4cd2b1	Btrfs: fix off-by-one in fiemap lock_extent/unlock_extent expect an exclusive end. Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 16:27:26 -04:00
David Sterba	60b62978bc	btrfs: annotate quota tree for lockdep Quota tree has been missing from lockdep annotations, though no warning has been seen in the wild. There's currently one entry that does not belong there, BTRFS_ORPHAN_OBJECTID. No such tree exists, it's probably a copy & paste mistake, the id is defined among tree ids. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-17 16:27:25 -04:00
Linus Torvalds	983a5f84a4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "These are mostly fixes. The biggest exceptions are Josef's skinny extents and Jan Schmidt's code to rebuild our quota indexes if they get out of sync (or you enable quotas on an existing filesystem). The skinny extents are off by default because they are a new variation on the extent allocation tree format. btrfstune -x enables them, and the new format makes the extent allocation tree about 30% smaller. I rebased this a few days ago to rework Dave Sterba's crc checks on the super block, but almost all of these go back to rc6, since I though 3.9 was due any minute. The biggest missing fix is the tracepoint bug that was hit late in 3.9. I ran into problems with that in overnight testing and I'm still tracking it down. I'll definitely have that fixed for rc2." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits) Btrfs: allow superblock mismatch from older mkfs btrfs: enhance superblock checks btrfs: fix misleading variable name for flags btrfs: use unsigned long type for extent state bits Btrfs: improve the loop of scrub_stripe btrfs: read entire device info under lock btrfs: remove unused gfp mask parameter from release_extent_buffer callchain btrfs: handle errors returned from get_tree_block_key btrfs: make static code static & remove dead code Btrfs: deal with errors in write_dev_supers Btrfs: remove almost all of the BUG()'s from tree-log.c Btrfs: deal with free space cache errors while replaying log Btrfs: automatic rescan after "quota enable" command Btrfs: rescan for qgroups Btrfs: split btrfs_qgroup_account_ref into four functions Btrfs: allocate new chunks if the space is not enough for global rsv Btrfs: separate sequence numbers for delayed ref tracking and tree mod log btrfs: move leak debug code to functions Btrfs: return free space in cow error path Btrfs: set UUID in root_item for created trees ...	2013-05-09 13:07:40 -07:00
Linus Torvalds	4de13d7aa8	Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block Pull block core updates from Jens Axboe: - Major bit is Kents prep work for immutable bio vecs. - Stable candidate fix for a scheduling-while-atomic in the queue bypass operation. - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging discard bios. - Tejuns changes to convert the writeback thread pool to the generic workqueue mechanism. - Runtime PM framework, SCSI patches exists on top of these in James' tree. - A few random fixes. * 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits) relay: move remove_buf_file inside relay_close_buf partitions/efi.c: replace useless kzalloc's by kmalloc's fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read() block: fix max discard sectors limit blkcg: fix "scheduling while atomic" in blk_queue_bypass_start Documentation: cfq-iosched: update documentation help for cfq tunables writeback: expose the bdi_wq workqueue writeback: replace custom worker pool implementation with unbound workqueue writeback: remove unused bdi_pending_list aoe: Fix unitialized var usage bio-integrity: Add explicit field for owner of bip_buf block: Add an explicit bio flag for bios that own their bvec block: Add bio_alloc_pages() block: Convert some code to bio_for_each_segment_all() block: Add bio_for_each_segment_all() bounce: Refactor __blk_queue_bounce to not use bi_io_vec raid1: use bio_copy_data() pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage pktcdvd: use bio_copy_data() block: Add bio_copy_data() ...	2013-05-08 10:13:35 -07:00
Kent Overstreet	a27bb332c0	aio: don't include aio.h in sched.h Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-05-07 20:16:25 -07:00
Chris Mason	667e7d94a1	Btrfs: allow superblock mismatch from older mkfs We've added new checks to make sure the super block crc is correct during mount. A fresh filesystem from an older mkfs won't have the crc set. This adds a warning when it finds a newly created filesystem but doesn't fail the mount. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-05-07 11:00:13 -04:00
David Sterba	1104a88551	btrfs: enhance superblock checks The superblock checksum is not verified upon mount. <awkward silence> Add that check and also reorder existing checks to a more logical order. Current mkfs.btrfs does not calculate the correct checksum of super_block and thus a freshly created filesytem will fail to mount when this patch is applied. First transaction commit calculates correct superblock checksum and saves it to disk. Reproducer: $ mfks.btrfs /dev/sda $ mount /dev/sda /mnt $ btrfs scrub start /mnt $ sleep 5 $ btrfs scrub status /mnt ... super:2 ... Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-05-07 10:50:27 -04:00
David Sterba	b6919a58f0	btrfs: fix misleading variable name for flags The variable was named 'data' in btrfs_reserve_extent and that's the only function that actually uses it to let btrfs_get_alloc_profile know what profile we want. Then it's passed down as u64 flags. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:27 -04:00
David Sterba	410748882a	btrfs: use unsigned long type for extent state bits Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:27 -04:00
Liu Bo	625f1c8dc6	Btrfs: improve the loop of scrub_stripe 1) Right now scrub_stripe() is looping in some unnecessary cases: * when the found extent item's objectid has been out of the dev extent's range but we haven't finish scanning all the range within the dev extent * when all the items has been processed but we haven't finish scanning all the range within the dev extent In both cases, we can just finish the loop to save costs. 2) Besides, when the found extent item's length is larger than the stripe len(64k), we don't have to release the path and search again as it'll get at the same key used in the last loop, we can instead increase the logical cursor in place till all space of the extent is scanned. 3) And we use 0 as the key's offset to search btree, then get to previous item to find a smaller item, and again have to move to the next one to get the right item. Setting offset=-1 and previous_item() is the correct way. 4) As we won't find any checksum at offset unless this 'offset' is in a data extent, we can just find checksum when we're really going to scrub an extent. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:26 -04:00
David Sterba	55793c0d03	btrfs: read entire device info under lock There's a theoretical possibility of reading stale (or even more theoretically, freed) data from DEV_INFO ioctl when the device would disappear between an early mutex unlock and data being copied from the device structure. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:25 -04:00
David Sterba	f7a52a40ca	btrfs: remove unused gfp mask parameter from release_extent_buffer callchain It's unused since `0b32f4bbb4`. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:24 -04:00
David Sterba	34c2b29079	btrfs: handle errors returned from get_tree_block_key Signed-off-by: David Sterba <dsterba@suse.cz> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:24 -04:00
Eric Sandeen	48a3b6366f	btrfs: make static code static & remove dead code Big patch, but all it does is add statics to functions which are in fact static, then remove the associated dead-code fallout. removed functions: btrfs_iref_to_path() __btrfs_lookup_delayed_deletion_item() __btrfs_search_delayed_insertion_item() __btrfs_search_delayed_deletion_item() find_eb_for_page() btrfs_find_block_group() range_straddles_pages() extent_range_uptodate() btrfs_file_extent_length() btrfs_scrub_cancel_devid() btrfs_start_transaction_lflush() btrfs_print_tree() is left because it is used for debugging. btrfs_start_transaction_lflush() and btrfs_reada_detach() are left for symmetry. ulist.c functions are left, another patch will take care of those. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:23 -04:00
Josef Bacik	634554dc0a	Btrfs: deal with errors in write_dev_supers If you try to mount -o loop a restored file system it will panic if the file ends up being smaller than the original disk. This is because we go to try and get a block for a super that may be past the EOF which makes __getblk return NULL for a buffer head when we aren't expecting it to. Fix this by dealing with this case and just jacking up the errors count. With this patch we no longer panic when mounting a restored file system loopback. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:22 -04:00
Josef Bacik	3650860b90	Btrfs: remove almost all of the BUG()'s from tree-log.c There were a whole bunch and I was doing it for other things. I haven't tested these error paths but at the very least this is better than panicing. I've only left 2 BUG_ON()'s since they are logic errors and I want to replace them with a ASSERT framework that we can compile out for production users. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:21 -04:00
Josef Bacik	b50c6e250e	Btrfs: deal with free space cache errors while replaying log So everybody who got hit by my fsync bug will still continue to hit this BUG_ON() in the free space cache, which is pretty heavy handed. So I took a file system that had this bug and fixed up all the BUG_ON()'s and leaks that popped up when I tried to mount a broken file system like this. With this patch we just fail to mount instead of panicing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:20 -04:00
Jan Schmidt	3d7b5a2882	Btrfs: automatic rescan after "quota enable" command When qgroup tracking is enabled, we do an automatic cycle of the new rescan mechanism. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:20 -04:00
Jan Schmidt	2f2320360b	Btrfs: rescan for qgroups If qgroup tracking is out of sync, a rescan operation can be started. It iterates the complete extent tree and recalculates all qgroup tracking data. This is an expensive operation and should not be used unless required. A filesystem under rescan can still be umounted. The rescan continues on the next mount. Status information is provided with a separate ioctl while a rescan operation is in progress. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:19 -04:00
Jan Schmidt	46b665ceb1	Btrfs: split btrfs_qgroup_account_ref into four functions The function is separated into a preparation part and the three accounting steps mentioned in the qgroups documentation. The goal is to make steps two and three usable by the rescan functionality. A side effect is that the function is restructured into readable subunits. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:18 -04:00
Miao Xie	3c76cd84e0	Btrfs: allocate new chunks if the space is not enough for global rsv When running the 208th of xfstests, the fs returned the enospc error when there was lots of free space in the disk. By bisect debug, we found it was introduced by commit `96f1bb5777`. This commit makes the space check for the global reservation in can_overcommit() be inconsistent with should_alloc_chunk(). can_overcommit() requires that the free space is 2 times the size of the global reservation, or we can't do overcommit. And instead, we need reclaim some reserved space, and if we still don't have enough free space, we need allocate a new chunk. But unfortunately, should_alloc_chunk() just requires that the free space is 1 time the size of the global reservation, that is we would not try to allocate a new chunk if the free space size is in the middle of these two requires, and just return the enospc error. Fix it. Cc: Jim Schutt <jaschut@sandia.gov> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:17 -04:00
Jan Schmidt	fc36ed7e0b	Btrfs: separate sequence numbers for delayed ref tracking and tree mod log Sequence numbers for delayed refs have been introduced in the first version of the qgroup patch set. To solve the problem of find_all_roots on a busy file system, the tree mod log was introduced. The sequence numbers for that were simply shared between those two users. However, at one point in qgroup's quota accounting, there's a statement accessing the previous sequence number, that's still just doing (seq - 1) just as it would have to in the very first version. To satisfy that requirement, this patch makes the sequence number counter 64 bit and splits it into a major part (used for qgroup sequence number counting) and a minor part (incremented for each tree modification in the log). This enables us to go exactly one major step backwards, as required for qgroups, while still incrementing the sequence counter for tree mod log insertions to keep track of their order. Keeping them in a single variable means there's no need to change all the code dealing with comparisons of two sequence numbers. The sequence number is reset to 0 on commit (not new in this patch), which ensures we won't overflow the two 32 bit counters. Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs from the tree mod log code may happen. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:17 -04:00
Eric Sandeen	6d49ba1b47	btrfs: move leak debug code to functions Clean up the leak debugging in extent_io.c by moving the debug code into functions. This also removes the list_heads used for debugging from the extent_buffer and extent_state structures when debug is not enabled. Since we need a global debug config to do that last part, implement CONFIG_BTRFS_DEBUG to accommodate. Thanks to Dave Sterba for the Kconfig bit. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:16 -04:00
Liu Bo	ace68bac61	Btrfs: return free space in cow error path Replace some BUG_ONs with proper handling and take allocated space back to free space cache for later use. We don't have to worry about extent maps since they'd be freed in releasepage path. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:15 -04:00
Stefan Behrens	6463fe58ea	Btrfs: set UUID in root_item for created trees It is a rare exception that a new tree is created, like the qgroups tree. So far these new trees have an all-zero UUID in their root items. All trees that mkfs.btrfs has created get an UUID during the first mount when btrfs_read_root_item() rewrites the root_item to the v2 structure style. These UUID are never used so far, but anyway, since it is better to have it uniform for all trees, this commit adds some lines that generate and write an UUID for newly created trees. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:14 -04:00
Stefan Behrens	5fbf83c10c	Btrfs: delete unused parameter to btrfs_read_root_item() Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:14 -04:00
Tsutomu Itoh	ecc7ada77b	Btrfs: fix error handling in btrfs_ioctl_send() fget() returns NULL if error. So, we should check NULL or not. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:13 -04:00
Tsutomu Itoh	ba1eeaac99	Btrfs: remove unused variable in __process_changed_new_xattr() Variable 'p' is not used any more. So, remove it. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:12 -04:00
Josef Bacik	54067ae95e	Btrfs: various abort cleanups I have a broken file system that when it aborts leaves all sorts of accounting things wrong and gives you lots of WARN_ON()'s other than the abort. This is because we're not cleaning up various parts of the file system when we abort. The first chunks are specific to mount failures, we weren't cleaning up the block group cached inodes and we weren't cleaning up any transactions that had been aborted, which leaves a bunch of things laying around. The second half of this are related to the cleanup parts. First we don't need to release space for the dirty pages from the trans_block_rsv, that's all handled by the trans handles so this is just plain wrong. The other thing is we need to pin down extents that were set ->must_insert_reserved for delayed refs. This isn't so much for the pinning but more for the cleaning up the cache->reserved counter since we are no longer going to use those reserved bytes. With this patch I no longer see a bunch of WARN_ON()'s when I try to mount this broken file system, just the initial one from the abort. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:11 -04:00
Josef Bacik	fd8b2b6115	Btrfs: cleanup destroy_marked_extents We can just look up the extent_buffers for the range and free stuff that way. This makes the cleanup a bit cleaner and we can make sure to evict the extent_buffers pretty quickly by marking them as stale. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:11 -04:00
Josef Bacik	abefa55ac1	Btrfs: check return value of commit when recovering log We need to check the return value of the commit in case something goes wrong, otherwise we could end up going down the line and doing more stuff (like orphan cleanup) before we notice we should have errored out. We need to do this before we free up the log_tree_root since the caller will handle all of that. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:10 -04:00
Josef Bacik	32b0253803	Btrfs: don't panic if we're trying to drop too many refs This is just obnoxious. Just print a message, abort the transaction, and return an error. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:09 -04:00
Josef Bacik	171f6537ab	Btrfs: cleanup fs roots if we fail to mount We can run the tree logging recovery or the orphan cleanup on mount, so we'll end up looking up a random fs tree in the meantime. So we need to clean this up so we don't leave extent buffers hanging around on the cache. With this patch we no longer leak extent buffers on failure to mount. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:08 -04:00
Josef Bacik	eb384b55ae	Btrfs: fix extent logging with O_DIRECT into prealloc This is the same as the fix from commit Btrfs: fix bad extent logging but for O_DIRECT. I missed this when I fixed the problem originally, we were still using the em for the orig_start and orig_block_len, which would be the merged extent. We need to use the actual extent from the on disk file extent item, which we have to lookup to make sure it's ok to nocow anyway so just pass in some pointers to hold this info. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:07 -04:00
Josef Bacik	416bc6580b	Btrfs: fix all callers of read_tree_block We kept leaking extent buffers when mounting a broken file system and it turns out it's because not everybody uses read_tree_block properly. You need to check and make sure the extent_buffer is uptodate before you use it. This patch fixes everybody who calls read_tree_block directly to make sure they check that it is uptodate and free it and return an error if it is not. With this we no longer leak EB's when things go horribly wrong. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:07 -04:00
Josef Bacik	51bf5f0bc4	Btrfs: only exclude supers in the range of our block group If we fail to load block groups halfway through we can leave extent_state's on the excluded tree. This is because we just lookup the supers and add them to the excluded tree regardless of which block group we are looking at currently. This is a problem because we remove the excluded extents for the range of the block group only, so if we don't ever load a block group for one of the excluded extents we won't ever free it. This fixes the problem by only adding excluded extents if it falls in the block group range we care about. With this patch we're no longer leaking space when we fail to read all of the block groups. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:06 -04:00
Josef Bacik	1c24c3ce6a	Btrfs: add tree block level sanity check With a users corrupted fs I was getting weird behavior and panics and it turns out it was because one of his tree blocks had a bogus header level. So add this to the sanity checks in the endio handler for tree blocks. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:05 -04:00
Josef Bacik	5ec8dca761	Btrfs: don't try and free ebs twice in log replay This work is done by btrfs_free_path() anyway so there's no need for this duplicate work. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:04 -04:00
Josef Bacik	fb7669b5a0	Btrfs: don't BUG_ON() in btrfs_num_copies A user sent me a btrfs-image that was panicing because of some corruption. This is because we pass in a bogus value to btrfs_num_copies, and it panics. Instead just return 1. We only call btrfs_num_copies to see if there are other copies to try and read for things, so if we just return 1 it will make the callers exit out with an appropriate error value. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:04 -04:00
Josef Bacik	79fb65a1f6	Btrfs: don't call readahead hook until we have read the entire eb Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to map. Turns out he is using > PAGESIZE leafsizes. The readahead stuff is called every time we do a completion, but we may not have finished reading in all the pages, so the bytenr we read off the node could be completely bogus. Fix this by only calling the readahead hook once all pages have been read in. Thanks, Reported-by: Martin Steigerwald <Martin@lichtvoll.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:03 -04:00
Josef Bacik	9bb91873e3	Btrfs: deal with bad mappings in btrfs_map_block Martin Steigerwald reported a BUG_ON() in btrfs_map_block where we didn't find a chunk for a particular block we were trying to map. This happened because the block was bogus. We shouldn't be BUG_ON()'ing in this case, just print a message and return an error. This came from reada_add_block and it appears to deal with an error fine so we should be good there. Thanks, Reported-by: Martin Steigerwald <Martin@lichtvoll.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:02 -04:00
Josef Bacik	d4c7ca86b5	Btrfs: use REQ_META for all metadata IO We need to tag metadata io with REQ_META to avoid priority inversion when using io throttling cqroups. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:01 -04:00
Josef Bacik	0a3896d0f5	Btrfs: fix possible infinite loop in slow caching So I noticed there is an infinite loop in the slow caching code. If we return 1 when we hit the end of the tree, so we could end up caching the last block group the slow way and suddenly we're looping forever because we just keep re-searching and trying again. Fix this by only doing btrfs_next_leaf() if we don't need_resched(). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:01 -04:00
Josef Bacik	62dbd7176e	Btrfs: fix lockdep warning The locking order for stuff is __sb_start_write ordered_mutex but with sync() we don't do __sb_start_write for some strange reason, which means that our iput in wait_ordered_extents could start a transaction which does the __sb_start_write while we're holding the ordered_mutex. Fix this by using delayed iput in sync. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:55:00 -04:00
Wang Shilong	534e6623b7	Btrfs: add all ioctl checks before user change for quota operations Since all the quota configurations are loaded in memory, and we can have ioctl checks before operating in the disk. It is safe to do such things because qgroup_ioctl_lock is held outside. Without these extra checks firstly, it should be ok to do user change for quota operations. For example: if we want to add an existed qgroup, we will do: ->add_qgroup_item() ->add_qgroup_rb() add_qgroup_item() will return -EEXIST to us, however, qgroups are all in memory, why not check them in memory firstly. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:59 -04:00
Wang Shilong	3c97185c65	Btrfs: fix missing check about ulist_add() in qgroup.c ulist_add() may return -ENOMEM, fix missing check about return value. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:58 -04:00
Stefan Behrens	70023da276	Btrfs: clear received_uuid field for new writable snapshots For created snapshots, the full root_item is copied from the source root and afterwards selectively modified. The current code forgets to clear the field received_uuid. The only problem is that it is confusing when you look at it with 'btrfs subv list', since for writable snapshots, the contents of the snapshot can be completely unrelated to the previously received snapshot. The receiver ignores such snapshots anyway because he also checks the field stransid in the root_item and that value used to be reset to zero for all created snapshots. This commit changes two things: - clear the received_uuid field for new writable snapshots. - don't clear the send/receive related information like the stransid for read-only snapshots (which makes them useable as a parent for the automatic selection of parents in the receive code). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:58 -04:00
Josef Bacik	b8d7f3ac10	Btrfs: don't force pages under writeback to finish when aborting Dave reported a BUG_ON() that happened in end_page_writeback() after an abort. This happened because we unconditionally call end_page_writeback() in the endio case, which is right. However when we abort the transaction we will call end_page_writeback() on any writeback pages we find, which is wrong. We need to lock the page and wait on page writeback to complete if it is. There is nothing unsafe about this since we are discarding the transaction anyway. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:57 -04:00
Wang Shilong	ccf7f29d1a	Btrfs: remove unused variable in the iterate_extent_inodes() Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:56 -04:00
Liu Bo	0abd5b1724	Btrfs: return error when we specify wrong start to defrag We need such a sanity check for wrong start when we defrag a file, otherwise, even with a wrong start that's larger than file size, we can end up changing not only inode's force compress flag but also FS's incompat flags. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:55 -04:00
Vincent	3c59ccd32a	Btrfs: fix reada debug code compilation This fixes the following errors: fs/btrfs/reada.c: In function ‘btrfs_reada_wait’: fs/btrfs/reada.c:958:42: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’) fs/btrfs/reada.c:961:41: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’) Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net> Cc: Chris Mason <chris.mason@fusionio.com> Cc: linux-btrfs@vger.kernel.org Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:55 -04:00
Tsutomu Itoh	fd279faefa	Btrfs: cleanup of function where btrfs_extend_item() is called Argument 'trans' became unnecessary from setup_inline_extent_backref() that called btrfs_extend_item(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:54 -04:00
Tsutomu Itoh	4b90c68015	Btrfs: remove unused argument of btrfs_extend_item() Argument 'trans' is not used in btrfs_extend_item(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:53 -04:00
Tsutomu Itoh	afe5fea72b	Btrfs: cleanup of function where fixup_low_keys() is called If argument 'trans' is unnecessary in the function where fixup_low_keys() is called, 'trans' is deleted. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:52 -04:00
Tsutomu Itoh	d6a0a12684	Btrfs: remove unused argument of fixup_low_keys() Argument 'trans' is not used in fixup_low_keys(). So, remove it. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:52 -04:00
Wang Shilong	b4fcd6be6b	Btrfs: fix confusing edquot happening case Step to reproduce: mkfs.btrfs <disk> mount <disk> <mnt> dd if=/dev/zero of=/<mnt>/data bs=1M count=10 sync btrfs quota enable <mnt> btrfs qgroup create 0/5 <mnt> btrfs qgroup limit 5M 0/5 <mnt> rm -f /<mnt>/data sync btrfs qgroup show <mnt> dd if=/dev/zero of=data bs=1M count=1 >From the perspective of users, qgroup's referenced or exclusive is negative,but user can not continue to write data! a workaround way is to cast u64 to s64 when doing qgroup reservation. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:51 -04:00
Wang Shilong	e36902d4cc	Btrfs: do not continue if out of memory happens If out of memory happens, we should return -ENOMEM directly to the caller rather than continue the work. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:50 -04:00
Nathaniel Yazdani	9c931c5ab2	btrfs: fix minor typo in comment In the comment describing the sync_writers field of the btrfs_inode struct, "fsyncing" was misspelled "fsycing." Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:49 -04:00
Wang Shilong	98ad43be0a	Btrfs: cleanup to remove reduplicate code in transaction.c Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:49 -04:00
Jan Schmidt	47fb091fb7	Btrfs: fix unlock after free on rewinded tree blocks When tree_mod_log_rewind decides to make a copy of the current tree buffer for its modifications, it subsequently freed the buffer before unlocking it. Obviously, those operations are required in reverse order. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:48 -04:00
Jan Schmidt	30b0463a93	Btrfs: fix accessing the root pointer in tree mod log functions The tree mod log functions were accessing root->node->... directly, without use of btrfs_root_node() or explicit rcu locking. This could lead to an extent buffer reference being leaked and another reference being freed too early when preemtion was enabled. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:47 -04:00
Jan Schmidt	90f8d62ebb	Btrfs: fix tree mod log regression on root split operations Commit `d9abbf1c` changed tree mod log locking around ROOT_REPLACE operations. When a tree root is split, however, we were logging removal of all elements from the root node before logging removal of half of the elements for the split operation. This leads to a BUG_ON when rewinding. This commit removes the erroneous logging of removal of all elements. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:47 -04:00
Miao Xie	ceda086424	Btrfs: use a lock to protect incompat/compat flag of the super block The following case will make the incompat/compat flag of the super block be recovered. Task1 \|Task2 flags = btrfs_super_incompat_flags(); \| \|flags = btrfs_super_incompat_flags(); flags \|= new_flag1; \| \|flags \|= new_flag2; btrfs_set_super_incompat_flags(flags); \| \|btrfs_set_super_incompat_flags(flags); the new_flag1 is recovered. In order to avoid this problem, we introduce a lock named super_lock into the btrfs_fs_info structure. If we want to update incompat/compat flags of the super block, we must hold it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:46 -04:00
Miao Xie	f42a34b2f1	Btrfs: fix unblocked autodefraggers when remount The new mount option is set after parsing the remount arguments, so it is wrong that checking the autodefrag is close or not at btrfs_remount_prepare(). Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:45 -04:00
Wang Shilong	f7f82b81d2	Btrfs: add a rb_tree to improve performance of ulist search Walking backref tree and btrfs quota rely on ulist very much. This patch tries to use rb_tree to speed up search time. The original code always checks whether an element exists before adding a new element, however it costs O(n). I try to add a rb_tree in the ulist,this is only used to speed up search. I also do some measurements with quota enabled. fsstress -p 4 -n 10000 Without this path: real 0m51.058s 2m4.745s 1m28.222s 1m5.137s user 0m0.035s 0m0.041s 0m0.105s 0m0.100s sys 0m12.009s 0m11.246s 0m10.901s 0m10.999s 0m11.287s With this path: real 0m55.295s 0m50.960s 1m2.214s 0m48.273s user 0m0.053s 0m0.095s 0m0.135s 0m0.107s sys 0m7.766s 0m6.013s 0m6.319s 0m6.030s 0m6.532s After applying the patch,the execute time is down by ~42%.(11.287s->6.532s) Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:44 -04:00
Stefan Behrens	c2c71324ec	Btrfs: allow omitting stream header and end-cmd for btrfs send Two new flags are added to allow omitting the stream header and the end command for btrfs send streams. This is used in cases where you send multiple snapshots back-to-back in one stream. This used to be encoded like this (with 2 snapshots in this example): <stream header> + <sequence of commands> + <end cmd> + <stream header> + <sequence of commands> + <end cmd> + EOF The new format (if the two new flags are used) is this one: <stream header> + <sequence of commands> + <sequence of commands> + <end cmd> Note that the currently existing receivers treat <end cmd> only as an indication that a new <stream header> is following. This means, you can just skip the sequence <end cmd> <stream header> without loosing compatibility. As long as an EOF is following, the currently existing receivers handle the new format (if the two new flags are used) exactly as the old one. So what is the benefit of this change? The goal is to be able to use a single stream (one TCP connection) to multiplex a request/response handshake plus Btrfs send streams, all in the same stream. In this case you cannot evaluate an EOF condition as an end of the Btrfs send stream. You need something else, and the <end cmd> is just perfect for this purpose. The summary is: The format change is driven by the need to send several Btrfs send streams over a single TCP connections, with the ability for a repeated request/response handshake in the middle. And this format change does not break any existing tool, it is completely compatible. You could compare the old behaviour of the Btrfs send stream to the one of ftp where you need a seperate request/response channel and newly opened data transfer channels for each file, while the new behaviour is more like http using a single stream for everything. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:44 -04:00
Wang Shilong	692206b153	Btrfs: make __merge_refs() return type be void __merge_refs() always return 0, it is unnecessary for the caller to check the return value. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:43 -04:00
Wang Shilong	1149ab6bd4	Btrfs: remove some BUG_ONs() when walking backref tree The only error return value of __add_prelim_ref() is -ENOMEM, just return errors rather than trigger BUG_ON(). Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:42 -04:00
Wang Shilong	92f183aa5b	Btrfs: use tree_root to avoid edquot when disabling quota Steps to reproduce: mkfs.btrfs <disk> mount <disk> <mnt> btrfs quota enable <mnt> btrfs sub create <mnt>/subv btrfs qgroup limit 10K <mnt>/subv btrfs quota disable <mnt>/subv It is wrong for qgroup to reserve when disabling quota, so just use tree_root to avoid edquot when disabling quota. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:41 -04:00
Wang Shilong	ddb47afa50	Btrfs: fix a warning when updating qgroup limit Step to reproduce: mkfs.btrfs <disk> mount <disk> <mnt> btrfs quota enable <mnt> btrfs qgroup limit 0/1 <mnt> dmesg If the relative qgroup dosen't exist, flag 'BTRFS_QGROUP_STATUS_ FLAG_INCONSISTENT' will be set, and print the noise message. This is wrong, we can just move find_qgroup_rb() before update_qgroup_limit_item().this dosen't change the logic of the function. But it can avoid unnecessary noise message and wrong set of flag. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:41 -04:00
Wang Shilong	3f5e2d3b38	Btrfs: fix missing check in the btrfs_qgroup_inherit() The original code forgot to check 'inherit', we should gurantee that all the qgroups in the struct 'inherit' exist. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:40 -04:00
Wang Shilong	b7fef4f593	Btrfs: fix missing check before creating a qgroup relation Step to reproduce: mkfs.btrfs <disk> mount <disk> <mnt> btrfs quota enable <mnt> btrfs qgroup assign 0/1 1/1 <mnt> umount <mnt> btrfs-debug-tree <disk> \| grep QGROUP If we want to add a qgroup relation, we should gurantee that 'src' and 'dst' exist, otherwise, such qgroup relation should not be allowed to create. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:39 -04:00
Wang Shilong	58400fce5a	Btrfs: remove some unnecessary spin_lock usages We use mutex lock to protect all the user change operations. So when we are calling find_qgroup_rb() to check whether qgroup exists, we don't have to hold spin_lock. Besides, when enabling/disabling quota, it must be single thread when operations come here. spin lock must be firstly used to clear quota_root when disabling quota, while enabling quota, spin lock must be used to complete the last assign work. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:39 -04:00
Wang Shilong	f2f6ed3d54	Btrfs: introduce a mutex lock for btrfs quota operations The original code has one spin_lock 'qgroup_lock' to protect quota configurations in memory. If we want to add a BTRFS_QGROUP_INFO_KEY, it will be added to Btree firstly, and then update configurations in memory,however, a race condition may happen between these operations. For example: ->add_qgroup_info_item() ->add_qgroup_rb() For the above case, del_qgroup_info_item() may happen just before add_qgroup_rb(). What's worse, when we want to add a qgroup relation: ->add_qgroup_relation_item() ->add_qgroup_relations() We don't have any checks whether 'src' and 'dst' exist before add_qgroup_relation_item(), a race condition can also happen for the above case. To avoid race condition and have all the necessary checks, we introduce a mutex lock 'qgroup_ioctl_lock', and we make all the user change operations protected by the mutex lock. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:38 -04:00
Wang Shilong	7708f029dc	Btrfs: creating the subvolume qgroup automatically when enabling quota Creating the subvolume/snapshots(including root subvolume) qgroup auotomatically when enabling quota. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:37 -04:00
Zach Brown	d4e3991b99	btrfs: abort unlink trans in missed error case __btrfs_unlink_inode() aborts its transaction when it sees errors after it removes the directory item. But it missed the case where btrfs_del_dir_entries_in_log() returns an error. If this happens then the unlink appears to fail but the items have been removed without updating the directory size. The directory then has leaked bytes in i_size and can never be removed. Adding the missing transaction abort at least makes this failure consistent with the other failure cases. I noticed this while reading the code after someone on irc reported having a directory with i_size but no entries. I tested it by forcing btrfs_del_dir_entries_in_log() to return -ENOMEM. Signed-off-by: Zach Brown <zab@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:36 -04:00
Eric Sandeen	f63e0cca91	btrfs: ignore device open failures in __btrfs_open_devices This: # mkfs.btrfs /dev/sdb{1,2} ; wipefs -a /dev/sdb1; mount /dev/sdb2 /mnt/test would lead to a blkdev open/close mismatch when the mount fails, and a permanently busy (opened O_EXCL) sdb2: # wipefs -a /dev/sdb2 wipefs: error: /dev/sdb2: probing initialization failed: Device or resource busy It's because btrfs_open_devices() may open some devices, fail on the last one, and return that failure stored in "ret." The mount then fails, but the caller then does not clean up the open devices. Chris assures me that: "btrfs_open_devices just means: go off and open every bdev you can from this uuid. It should return success if we opened any of them at all." So change the logic to ignore any open failures; just skip processing of that device. Later on it's decided whether we have enough devices to continue. Reported-by: Jan Safranek <jsafrane@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:36 -04:00
Miao Xie	e4100d987b	Btrfs: improve the performance of the csums lookup It is very likely that there are several blocks in bio, it is very inefficient if we get their csums one by one. This patch improves this problem by getting the csums in batch. According to the result of the following test, the execute time of __btrfs_lookup_bio_sums() is down by ~28%(300us -> 217us). # dd if=<mnt>/file of=/dev/null bs=1M count=1024 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:35 -04:00
Josef Bacik	09a2a8f96e	Btrfs: fix bad extent logging A user sent me a btrfs-image of a file system that was panicing on mount during the log recovery. I had originally thought these problems were from a bug in the free space cache code, but that was just a symptom of the problem. The problem is if your application does something like this [prealloc][prealloc][prealloc] the internal extent maps will merge those all together into one extent map, even though on disk they are 3 separate extents. So if you go to write into one of these ranges the extent map will be right since we use the physical extent when doing the write, but when we log the extents they will use the wrong sizes for the remainder prealloc space. If this doesn't happen to trip up the free space cache (which it won't in a lot of cases) then you will get bogus entries in your extent tree which will screw stuff up later. The data and such will still work, but everything else is broken. This patch fixes this by not allowing extents that are on the modified list to be merged. This has the side effect that we are no longer adding everything to the modified list all the time, which means we now have to call btrfs_drop_extents every time we log an extent into the tree. So this allows me to drop all this speciality code I was using to get around calling btrfs_drop_extents. With this patch the testcase I've created no longer creates a bogus file system after replaying the log. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:34 -04:00
Josef Bacik	cc95bef635	Btrfs: log ram bytes properly When logging changed extents I was logging ram_bytes as the current length, which isn't correct, it's supposed to be the ram bytes of the original extent. This is for compression where even if we split the extent we need to know the ram bytes so when we uncompress the extent we know how big it will be. This was still working out right with compression for some reason but I think we were getting lucky. It was definitely off for prealloc which is why I noticed it, btrfsck was complaining about it. With this patch btrfsck no longer complains after a log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:33 -04:00
Josef Bacik	98ad69cfd2	Btrfs: don't wait on ordered extents if we have a trans open Dave was hitting a lockdep warning because we're now properly taking the ordered operations mutex in the ordered wait stuff. This is because some cases we will have a trans handle when we are flushing delalloc space, but we can't wait on ordered extents because we could potentially deadlock, so fix this by not doing the wait if we have a trans handle. Thanks Reported-and-tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:32 -04:00
Josef Bacik	8c579fe745	Btrfs: fix error handling in make/read block group I noticed that we will add a block group to the space info before we add it to the block group cache rb tree, so we could potentially allocate from the block group before it's able to be searched for. I don't think this is too much of a problem, the race window is microscopic, but just in case move the tree insertion to above the space info linking. This makes it easier to adjust the error handling as well, so we can remove a couple of BUG_ON(ret)'s and have real error handling setup for these scenarios. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:32 -04:00
Wang Shilong	5c2d867fdc	Btrfs: fix double free in the iterate_extent_inodes() If btrfs_find_all_roots() fails, 'roots' has been freed or 'roots' fails to allocate. We don't need to free it outside btrfs_find_all_roots() again.Fix it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:31 -04:00
Wang Shilong	f172393952	Btrfs: kill some BUG_ONs() in the find_parent_nodes() The reason that BUG_ON() happens in these places is just because of ENOMEM. We try ro return ENOMEM rather than trigger BUG_ON(), the caller will abort the transaction thus avoiding the kernel panic. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:30 -04:00
Josef Bacik	41b0fc4280	Btrfs: compare relevant parts of delayed tree refs A user reported a panic while running a balance. What was happening was he was relocating a block, which added the reference to the relocation tree. Then relocation would walk through the relocation tree and drop that reference and free that block, and then it would walk down a snapshot which referenced the same block and add another ref to the block. The problem is this was all happening in the same transaction, so the parent block was free'ed up when we drop our reference which was immediately available for allocation, and then it was used _again_ to add a reference for the same block from a different snapshot. This resulted in something like this in the delayed ref tree add ref to 90234880, parent=2067398656, ref_root 1766, level 1 del ref to 90234880, parent=2067398656, ref_root 18446744073709551608, level 1 add ref to 90234880, parent=2067398656, ref_root 1767, level 1 as you can see the ref_root's don't match, because when we inc the ref we use the header owner, which is the original tree the block belonged to, instead of the data reloc tree. Then when we remove the extent we use the reloc tree objectid. But none of this matters, since it is a shared reference which means only the parent matters. When the delayed ref stuff runs it adds all the increments first, and then does all the drops, to make sure that we don't delete the ref if we net a positive ref count. But tree blocks aren't allowed to have multiple refs from the same block, so this panics when it tries to add the second ref. We need the add and the drop to cancel each other out in memory so we only do the final add. So to fix this we need to adjust how the delayed refs are added to the tree. Only the ref_root matters when it is a normal backref, and only the parent matters when it is a shared backref. So make our decision based on what ref type we have. This allows us to keep the ref_root in memory in case anybody wants to use it for something else, and it allows the delayed refs to be merged properly so we don't end up with this panic. With this patch the users image no longer panics on mount, and it has a clean fsck after a normal mount/umount cycle. Thanks, Cc: stable@vger.kernel.org Reported-by: Roman Mamedov <rm@romanrm.ru> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:29 -04:00
Josef Bacik	cf79ffb5b7	Btrfs: fix infinite loop when we abort on mount Testing my enospc log code I managed to abort a transaction during mount, which put me into an infinite loop. This is because of two things, first we don't reset trans_no_join if we abort during transaction commit, which will force anybody trying to start a transaction to just loop endlessly waiting for it to be set to 0. But this is still just a symptom, the second issue is we don't set the fs state to error during errors on mount. This is because we don't want to do the flip read only thing during mount, but we still really want to set the fs state to an error to keep us from even getting to the trans_no_join check. So fix both of these things, make sure to reset trans_no_join if we abort during a commit, and make sure we set the fs state to error no matter if we're mounting or not. This should keep us from getting into this infinite loop again. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:29 -04:00
Wang Shilong	c9a9dbf2cb	Btrfs: fix a warning when disabling quota Steps to reproduce: mkfs.btrfs <disk> mount <disk> <mnt> btrfs quota enable <mnt> btrfs sub create <mnt>/subv i=1 while [ $i -le 10000 ] do dd if=/dev/zero of=<mnt>/subv/data_$i bs=1K count=1 i=$(($i+1)) if [ $i -eq 500 ] then btrfs quota disable $mnt fi done dmesg Obviously, this warn_on() is unnecessary, and it will be easily triggered. Just remove it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:28 -04:00
Liu Bo	6b67a32000	Btrfs: pass NULL instead of 0 set_extent_bit()'s (u64 *failed_start) expects NULL not 0. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:27 -04:00
David Sterba	5c50c9b89f	btrfs: make subvol creation/deletion killable in the early stages The subvolume ioctls block on the parent directory mutex that can be held by other concurrent snapshot activity for a long time. Give the user at least some chance to get out of this situation by allowing to send a kill signal. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:26 -04:00
David Sterba	94ef7280e8	btrfs: cover more error codes in btrfs_decode_error Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:25 -04:00
David Sterba	4884b476d7	btrfs: make orphan cleanup less verbose The messages btrfs: unlinked 123 orphans btrfs: truncated 456 orphans are not useful to regular users and raise questions whether there are problems with the filesystem. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:24 -04:00
David Sterba	5e2a4b25da	btrfs: deprecate subvolrootid mount option This mount option was a workaround when subvol= assumed path relative to the default subvolume, not the toplevel one. This was fixed long time ago and subvolrootid has no effect. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:23 -04:00
Simon Kirby	c2cf52eb71	Btrfs: Include the device in most error printk()s With more than one btrfs volume mounted, it can be very difficult to find out which volume is hitting an error. btrfs_error() will print this, but it is currently rigged as more of a fatal error handler, while many of the printk()s are currently for debugging and yet-unhandled cases. This patch just changes the functions where the device information is already available. Some cases remain where the root or fs_info is not passed to the function emitting the error. This may introduce some confusion with volumes backed by multiple devices emitting errors referring to the primary device in the set instead of the one on which the error occurred. Use btrfs_printk(fs_info, format, ...) rather than writing the device string every time, and introduce macro wrappers ala XFS for brevity. Since the function already cannot be used for continuations, print a newline as part of the btrfs_printk() message rather than at each caller. Signed-off-by: Simon Kirby <sim@hostway.ca> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:23 -04:00
David Sterba	aa8259145e	btrfs: update kconfig title The Kconfig title does not make much sense after the cleanup of CONFIG_EXPERIMENTAL option, align the wording with other filesystems. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:22 -04:00
David Sterba	9d1a2a3ad5	btrfs: clean snapshots one by one Each time pick one dead root from the list and let the caller know if it's needed to continue. This should improve responsiveness during umount and balance which at some point waits for cleaning all currently queued dead roots. A new dead root is added to the end of the list, so the snapshots disappear in the order of deletion. The snapshot cleaning work is now done only from the cleaner thread and the others wake it if needed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:21 -04:00
Zhi Yong Wu	6841ebee6b	btrfs: Cleanup some redundant codes in btrfs_log_inode() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:20 -04:00
Zhi Yong Wu	628c8282be	btrfs: Cleanup some redundant codes in btrfs_lookup_csums_range() Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:20 -04:00
Liu Bo	7abadb6431	Btrfs: share stop worker code Share the exactly same code of stopping workers. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:19 -04:00
Josef Bacik	3173a18f70	Btrfs: add a incompatible format change for smaller metadata extent refs We currently store the first key of the tree block inside the reference for the tree block in the extent tree. This takes up quite a bit of space. Make a new key type for metadata which holds the level as the offset and completely removes storing the btrfs_tree_block_info inside the extent ref. This reduces the size from 51 bytes to 33 bytes per extent reference for each tree block. In practice this results in a 30-35% decrease in the size of our extent tree, which means we COW less and can keep more of the extent tree in memory which makes our heavy metadata operations go much faster. This is not an automatic format change, you must enable it at mkfs time or with btrfstune. This patch deals with having metadata stored as either the old format or the new format so it is easy to convert. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:18 -04:00
Liu Bo	be283b2e67	Btrfs: use helper to cleanup tree roots free_root_pointers() has been introduced to cleanup all of tree roots, so just use it instead. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:17 -04:00
Liu Bo	b0496686ba	Btrfs: cleanup unused arguments of btrfs_csum_data Argument 'root' is no more used in btrfs_csum_data(). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:54:14 -04:00
David Sterba	087488109a	btrfs: clean up transaction abort messages The transaction abort stacktrace is printed only once per module lifetime, but we'd like to see it each time it happens per mounted filesystem. Introduce a fs_state flag that records it. Tweak the messages around abort: * add error number to the first abort * print the exact negative errno from btrfs_decode_error * clean up btrfs_decode_error and callers * no dots at the end of the messages Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:52:56 -04:00
David Sterba	bbece8a3f0	btrfs: merge save_error_info helpers into one Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:52:55 -04:00
Josef Bacik	74255aa07d	Btrfs: add some free space cache tests We keep hitting bugs in the tree log replay because btrfs_remove_free_space doesn't account for some corner case. So add a bunch of tests to try and fully test btrfs_remove_free_space since the only time it is called is during tree log replay. These tests all finish successfully, so as we find more of these bugs we need to add to these tests to make sure we don't regress in fixing things. I've hidden the tests behind a Kconfig option, but they take no time to run so all btrfs developers should have this turned on all the time. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-05-06 15:52:54 -04:00
Linus Torvalds	20b4fb4852	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...	2013-05-01 17:51:54 -07:00
Liu Bo	e75206cfdc	Btrfs: cleanup unused function btrfs_abort_devices() is no more used. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-04-29 14:58:34 -04:00
Linus Torvalds	3792a64fde	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull one more btrfs fix from Chris Mason: "This has a recent fix from Josef for our tree log replay code. It fixes problems where the inode counter for the number of bytes in the file wasn't getting updated properly during fsync replay. The commit did get rebased this morning, but it was only to clean up the subject line. The code hasn't changed." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: make sure nbytes are right after log replay	2013-04-14 10:52:54 -07:00
Josef Bacik	4bc4bee459	Btrfs: make sure nbytes are right after log replay While trying to track down a tree log replay bug I noticed that fsck was always complaining about nbytes not being right for our fsynced file. That is because the new fsync stuff doesn't wait for ordered extents to complete, so the inodes nbytes are not necessarily updated properly when we log it. So to fix this we need to set nbytes to whatever it is on the inode that is on disk, so when we replay the extents we can just add the bytes that are being added as we replay the extent. This makes it work for the case that we have the wrong nbytes or the case that we logged everything and nbytes is actually correct. With this I'm no longer getting nbytes errors out of btrfsck. Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-04-13 07:35:06 -04:00
Al Viro	8d71db4f08	lift sb_start_write/sb_end_write out of ->aio_write() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-04-09 14:12:55 -04:00
Jens Axboe	64f8de4da7	Merge branch 'writeback-workqueue' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq into for-3.10/core Tejun writes: ----- This is the pull request for the earlier patchset[1] with the same name. It's only three patches (the first one was committed to workqueue tree) but the merge strategy is a bit involved due to the dependencies. * Because the conversion needs features from wq/for-3.10, block/for-3.10/core is based on rc3, and wq/for-3.10 has conflicts with rc3, I pulled mainline (rc5) into wq/for-3.10 to prevent those workqueue conflicts from flaring up in block tree. * Resolving the issue that Jan and Dave raised about debugging requires arch-wide changes. The patchset is being worked on[2] but it'll have to go through -mm after these changes show up in -next, and not included in this pull request. The three commits are located in the following git branch. git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git writeback-workqueue Pulling it into block/for-3.10/core produces a conflict in drivers/md/raid5.c between the following two commits. `e3620a3ad5` ("MD RAID5: Avoid accessing gendisk or queue structs when not available") `2f6db2a707` ("raid5: use bio_reset()") The conflict is trivial - one removes an "if ()" conditional while the other removes "rbi->bi_next = NULL" right above it. We just need to remove both. The merged branch is available at git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git block-test-merge so that you can use it for verification. The test merge commit has proper merge description. While these changes are a bit of pain to route, they make code simpler and even have, while minute, measureable performance gain[3] even on a workload which isn't particularly favorable to showing the benefits of this conversion. ---- Fixed up the conflict. Conflicts: drivers/md/raid5.c Signed-off-by: Jens Axboe <axboe@kernel.dk>	2013-04-02 10:04:39 +02:00
Linus Torvalds	3615db41c4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've had a busy two weeks of bug fixing. The biggest patches in here are some long standing early-enospc problems (Josef) and a very old race where compression and mmap combine forces to lose writes (me). I'm fairly sure the mmap bug goes all the way back to the introduction of the compression code, which is proof that fsx doesn't trigger every possible mmap corner after all. I'm sure you'll notice one of these is from this morning, it's a small and isolated use-after-free fix in our scrub error reporting. I double checked it here." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: don't drop path when printing out tree errors in scrub Btrfs: fix wrong return value of btrfs_lookup_csum() Btrfs: fix wrong reservation of csums Btrfs: fix double free in the btrfs_qgroup_account_ref() Btrfs: limit the global reserve to 512mb Btrfs: hold the ordered operations mutex when waiting on ordered extents Btrfs: fix space accounting for unlink and rename Btrfs: fix space leak when we fail to reserve metadata space Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes Btrfs: fix race between mmap writes and compression Btrfs: fix memory leak in btrfs_create_tree() Btrfs: fix locking on ROOT_REPLACE operations in tree mod log Btrfs: fix missing qgroup reservation before fallocating Btrfs: handle a bogus chunk tree nicely Btrfs: update to use fs_state bit	2013-03-29 11:13:25 -07:00
Josef Bacik	d8fe29e9de	Btrfs: don't drop path when printing out tree errors in scrub A user reported a panic where we were panicing somewhere in tree_backref_for_extent from scrub_print_warning. He only captured the trace but looking at scrub_print_warning we drop the path right before we mess with the extent buffer to print out a bunch of stuff, which isn't right. So fix this by dropping the path after we use the eb if we need to. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-29 10:18:59 -04:00
Miao Xie	82d130ff39	Btrfs: fix wrong return value of btrfs_lookup_csum() If we don't find the expected csum item, but find a csum item which is adjacent to the specified extent, we should return -EFBIG, or we should return -ENOENT. But btrfs_lookup_csum() return -EFBIG even the csum item is not adjacent to the specified extent. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-28 09:51:31 -04:00
Miao Xie	39847c4d3d	Btrfs: fix wrong reservation of csums We reserve the space for csums only when we write data into a file, in the other cases, such as tree log, log replay, we don't do reservation, so we can use the reservation of the transaction handle just for the former. And for the latter, we should use the tree's own reservation. But the function - btrfs_csum_file_blocks() didn't differentiate between these two types of the cases, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-28 09:51:30 -04:00
Wang Shilong	a7975026ff	Btrfs: fix double free in the btrfs_qgroup_account_ref() The function btrfs_find_all_roots is responsible to allocate memory for 'roots' and free it if errors happen,so the caller should not free it again since the work has been done. Besides,'tmp' is allocated after the function btrfs_find_all_roots, so we can return directly if btrfs_find_all_roots() fails. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-28 09:51:29 -04:00
Josef Bacik	fdf30d1c1b	Btrfs: limit the global reserve to 512mb A user reported a problem where he was getting early ENOSPC with hundreds of gigs of free data space and 6 gigs of free metadata space. This is because the global block reserve was taking up the entire free metadata space. This is ridiculous, we have infrastructure in place to throttle if we start using too much of the global reserve, so instead of letting it get this huge just limit it to 512mb so that users can still get work done. This allowed the user to complete his rsync without issues. Thanks Cc: stable@vger.kernel.org Reported-and-tested-by: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-28 09:51:29 -04:00
Josef Bacik	db1d607d3c	Btrfs: hold the ordered operations mutex when waiting on ordered extents We need to hold the ordered_operations mutex while waiting on ordered extents since we splice and run the ordered extents list. We need to make sure anybody else who wants to wait on ordered extents does actually wait for them to be completed. This will keep us from bailing out of flushing in case somebody is already waiting on ordered extents to complete. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-28 09:51:28 -04:00
Josef Bacik	6e137ed3f3	Btrfs: fix space accounting for unlink and rename We are way over-reserving for unlink and rename. Rename is just some random huge number and unlink accounts for tree log operations that don't actually happen during unlink, not to mention the tree log doesn't take from the trans block rsv anyway so it's completely useless. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-28 09:51:27 -04:00
Josef Bacik	f4881bc7a8	Btrfs: fix space leak when we fail to reserve metadata space Dave reported a warning when running xfstest 275. We have been leaking delalloc metadata space when our reservations fail. This is because we were improperly calculating how much space to free for our checksum reservations. The problem is we would sometimes free up space that had already been freed in another thread and we would end up with negative usage for the delalloc space. This patch fixes the problem by calculating how much space the other threads would have already freed, and then calculate how much space we need to free had we not done the reservation at all, and then freeing any excess space. This makes xfstests 275 no longer have leaked space. Thanks Cc: stable@vger.kernel.org Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-28 09:51:26 -04:00
Jan Schmidt	adaa4b8e4d	Btrfs: fix EIO from btrfs send in is_extent_unchanged for punched holes When you take a snapshot, punch a hole where there has been data, then take another snapshot and try to send an incremental stream, btrfs send would give you EIO. That is because is_extent_unchanged had no support for holes being punched. With this patch, instead of returning EIO we just return 0 (== the extent is not unchanged) and we're good. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Cc: Alexander Block <ablock84@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-28 09:51:26 -04:00
Chris Mason	4adaa61102	Btrfs: fix race between mmap writes and compression Btrfs uses page_mkwrite to ensure stable pages during crc calculations and mmap workloads. We call clear_page_dirty_for_io before we do any crcs, and this forces any application with the file mapped to wait for the crc to finish before it is allowed to change the file. With compression on, the clear_page_dirty_for_io step is happening after we've compressed the pages. This means the applications might be changing the pages while we are compressing them, and some of those modifications might not hit the disk. This commit adds the clear_page_dirty_for_io before compression starts and makes sure to redirty the page if we have to fallback to uncompressed IO as well. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Alexandre Oliva <oliva@gnu.org> cc: stable@vger.kernel.org	2013-03-26 13:19:14 -04:00
Kent Overstreet	aa8b57aa3d	block: Use bio_sectors() more consistently Bunch of places in the code weren't using it where they could be - this'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: "Ed L. Cashin" <ecashin@coraid.com> CC: Nick Piggin <npiggin@kernel.dk> CC: Jiri Kosina <jkosina@suse.cz> CC: Jim Paris <jim@jtan.com> CC: Geoff Levand <geoff@infradead.org> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ed Cashin <ecashin@coraid.com>	2013-03-23 14:15:30 -07:00
Kent Overstreet	f73a1c7d11	block: Add bio_end_sector() Just a little convenience macro - main reason to add it now is preparing for immutable bio vecs, it'll reduce the size of the patch that puts bi_sector/bi_size/bi_idx into a struct bvec_iter. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Jens Axboe <axboe@kernel.dk> CC: Lars Ellenberg <drbd-dev@lists.linbit.com> CC: Jiri Kosina <jkosina@suse.cz> CC: Alasdair Kergon <agk@redhat.com> CC: dm-devel@redhat.com CC: Neil Brown <neilb@suse.de> CC: Martin Schwidefsky <schwidefsky@de.ibm.com> CC: Heiko Carstens <heiko.carstens@de.ibm.com> CC: linux-s390@vger.kernel.org CC: Chris Mason <chris.mason@fusionio.com> CC: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com>	2013-03-23 14:15:29 -07:00
Tsutomu Itoh	1dd05682b3	Btrfs: fix memory leak in btrfs_create_tree() We should free leaf and root before returning from the error handling code. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-21 19:31:52 -04:00
Jan Schmidt	d9abbf1c31	Btrfs: fix locking on ROOT_REPLACE operations in tree mod log To resolve backrefs, ROOT_REPLACE operations in the tree mod log are required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation. Therefore, those operations must be enclosed by tree_mod_log_write_lock() and tree_mod_log_write_unlock() calls. Those calls are private to the tree_mod_log_* functions, which means that removal of the elements of an old root node must be logged from tree_mod_log_insert_root. This partly reverts and corrects commit `ba1bfbd5` (Btrfs: fix a tree mod logging issue for root replacement operations). This fixes the brand-new version of xfstest 276 as of commit cfe73f71. Cc: stable@vger.kernel.org Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-21 19:31:52 -04:00
Wang Shilong	6113077cd3	Btrfs: fix missing qgroup reservation before fallocating Steps to reproduce: mkfs.btrfs <disk> mount <disk> <mnt> btrfs quota enable <mnt> btrfs sub create <mnt>/subv btrfs qgroup limit 10M <mnt>/subv fallocate --length 20M <mnt>/subv/data For the above example, fallocating will return successfully which is not expected, we try to fix it by doing qgroup reservation before fallocating. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-21 19:24:32 -04:00
Josef Bacik	835d974fab	Btrfs: handle a bogus chunk tree nicely If you restore a btrfs-image file system and try to mount that file system we'll panic. That's because btrfs-image restores and just makes one big chunk to envelope the whole disk, since they are really only meant to be messed with by our btrfs-progs. So fix up btrfs_rmap_block and the callers of it for mount so that we no longer panic but instead just return an error and fail to mount. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-21 19:24:31 -04:00
Liu Bo	d763448286	Btrfs: update to use fs_state bit Now that we use bit operation to check fs_state, update btrfs_free_fs_root()'s checker, otherwise we get back to memory leak case. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-21 19:24:31 -04:00
Linus Torvalds	08637024ab	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Eric's rcu barrier patch fixes a long standing problem with our unmount code hanging on to devices in workqueue helpers. Liu Bo nailed down a difficult assertion for in-memory extent mappings." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix warning of free_extent_map Btrfs: fix warning when creating snapshots Btrfs: return as soon as possible when edquot happens Btrfs: return EIO if we have extent tree corruption btrfs: use rcu_barrier() to wait for bdev puts at unmount Btrfs: remove btrfs_try_spin_lock Btrfs: get better concurrency for snapshot-aware defrag work	2013-03-17 11:04:14 -07:00
Liu Bo	3b2775942d	Btrfs: fix warning of free_extent_map Users report that an extent map's list is still linked when it's actually going to be freed from cache. The story is that a) when we're going to drop an extent map and may split this large one into smaller ems, and if this large one is flagged as EXTENT_FLAG_LOGGING which means that it's on the list to be logged, then the smaller ems split from it will also be flagged as EXTENT_FLAG_LOGGING, and this is _not_ expected. b) we'll keep ems from unlinking the list and freeing when they are flagged with EXTENT_FLAG_LOGGING, because the log code holds one reference. The end result is the warning, but the truth is that we set the flag EXTENT_FLAG_LOGGING only during fsync. So clear flag EXTENT_FLAG_LOGGING for extent maps split from a large one. Reported-by: Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de> Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-15 21:51:49 -04:00
Liu Bo	7c2ec3f073	Btrfs: fix warning when creating snapshots Creating snapshot passes extent_root to commit its transaction, but it can lead to the warning of checking root for quota in the __btrfs_end_transaction() when someone else is committing the current transaction. Since we've recorded the needed root in trans_handle, just use it to get rid of the warning. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-14 14:57:30 -04:00
Wang Shilong	720f1e2060	Btrfs: return as soon as possible when edquot happens If one of qgroup fails to reserve firstly, we should return immediately, it is unnecessary to continue check. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-14 14:57:29 -04:00
Josef Bacik	492104c866	Btrfs: return EIO if we have extent tree corruption The callers of lookup_inline_extent_info all handle getting an error back properly, so return an error if we have corruption instead of being a jerk and panicing. Still WARN_ON() since this is kind of crucial and I've been seeing it a bit too much recently for my taste, I think we're doing something wrong somewhere. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-14 14:57:29 -04:00
Eric Sandeen	bc178622d4	btrfs: use rcu_barrier() to wait for bdev puts at unmount Doing this would reliably fail with -EBUSY for me: # mount /dev/sdb2 /mnt/scratch; umount /mnt/scratch; mkfs.btrfs -f /dev/sdb2 ... unable to open /dev/sdb2: Device or resource busy because mkfs.btrfs tries to open the device O_EXCL, and somebody still has it. Using systemtap to track bdev gets & puts shows a kworker thread doing a blkdev put after mkfs attempts a get; this is left over from the unmount path: btrfs_close_devices __btrfs_close_devices call_rcu(&device->rcu, free_device); free_device INIT_WORK(&device->rcu_work, __free_device); schedule_work(&device->rcu_work); so unmount might complete before __free_device fires & does its blkdev_put. Adding an rcu_barrier() to btrfs_close_devices() causes unmount to wait until all blkdev_put()s are done, and the device is truly free once unmount completes. Cc: stable@vger.kernel.org Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-14 14:57:29 -04:00
Liu Bo	d340d2475c	Btrfs: remove btrfs_try_spin_lock Remove a useless function declaration Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-14 14:57:10 -04:00
Liu Bo	a09a0a705d	Btrfs: get better concurrency for snapshot-aware defrag work Using spinning case instead of blocking will result in better concurrency overall. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-14 14:50:19 -04:00
Linus Torvalds	72932611b4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace bugfixes from Eric Biederman: "This is three simple fixes against 3.9-rc1. I have tested each of these fixes and verified they work correctly. The userns oops in key_change_session_keyring and the BUG_ON triggered by proc_ns_follow_link were found by Dave Jones. I am including the enhancement for mount to only trigger requests of filesystem modules here instead of delaying this for the 3.10 merge window because it is both trivial and the kind of change that tends to bit-rot if left untouched for two months." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Use nd_jump_link in proc_ns_follow_link fs: Limit sys_mount to only request filesystem modules (Part 2). fs: Limit sys_mount to only request filesystem modules. userns: Stop oopsing in key_change_session_keyring	2013-03-09 16:51:13 -08:00
Linus Torvalds	0aefda3e81	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are scattered fixes and one performance improvement. The biggest functional change is in how we throttle metadata changes. The new code bumps our average file creation rate up by ~13% in fs_mark, and lowers CPU usage. Stefan bisected out a regression in our allocation code that made balance loop on extents larger than 256MB." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: improve the delayed inode throttling Btrfs: fix a mismerge in btrfs_balance() Btrfs: enforce min_bytes parameter during extent allocation Btrfs: allow running defrag in parallel to administrative tasks Btrfs: avoid deadlock on transaction waiting list Btrfs: do not BUG_ON on aborted situation Btrfs: do not BUG_ON in prepare_to_reloc Btrfs: free all recorded tree blocks on error Btrfs: build up error handling for merge_reloc_roots Btrfs: check for NULL pointer in updating reloc roots Btrfs: fix unclosed transaction handler when the async transaction commitment fails Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails Btrfs: use set_nlink if our i_nlink is 0	2013-03-08 17:33:20 -08:00
Chris Mason	de3cb945db	Btrfs: improve the delayed inode throttling The delayed inode code batches up changes to the btree in hopes of doing them in bulk. As the changes build up, processes kick off worker threads and wait for them to make progress. The current code kicks off an async work queue item for each delayed node, which creates a lot of churn. It also uses a fixed 1 HZ waiting period for the throttle, which allows us to build a lot of pending work and can slow down the commit. This changes us to watch a sequence counter as it is bumped during the operations. We kick off fewer work items and have each work item do more work. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-07 07:52:40 -05:00
Ilya Dryomov	3a01aa7a25	Btrfs: fix a mismerge in btrfs_balance() Raid56 merge (merge commit `e942f88`) had mistakenly removed a call to __cancel_balance(), which resulted in balance not cleaning up after itself after a successful finish. (Cleanup includes switching the state, removing the balance item and releasing mut_ex_op testnset lock.) Bring it back. Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-06 22:03:16 -05:00
Chris Mason	2cc65e3e57	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9	2013-03-06 19:46:29 -05:00
Chris Mason	154ea28930	Btrfs: enforce min_bytes parameter during extent allocation Commit `24542bf7ea` changed preallocation of extents to cap the max size we try to allocate. It's a valid change, but the extent reservation code is also used by balance, and that can't tolerate a smaller extent being allocated. __btrfs_prealloc_file_range already has a min_size parameter, which is used by relocation to request a specific extent size. This commit adds an extra check to enforce that minimum extent size. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>	2013-03-05 11:30:16 -05:00
Stefan Behrens	9b53157aac	Btrfs: allow running defrag in parallel to administrative tasks Commit `5ac00add` added a testnset mutex and code that disallows running administrative tasks in parallel. It is prevented that the device add/delete/balance/replace/resize operations are started in parallel. By mistake, the defragmentation operation was included in the check for mutually exclusiveness as well. This is fixed with this commit. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:24 -05:00
Liu Bo	66b6135b7c	Btrfs: avoid deadlock on transaction waiting list Only let one trans handle to wait for other handles, otherwise we will get ABBA issues. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:23 -05:00
Liu Bo	0f788c5819	Btrfs: do not BUG_ON on aborted situation Btrfs balance can easily hit BUG_ON in these places, but we want to it bail out gracefully after we force the whole filesystem to readonly. So we use btrfs_std_error hook in place of BUG_ON. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:23 -05:00
Liu Bo	288189471d	Btrfs: do not BUG_ON in prepare_to_reloc We can bail out from here gracefully instead of a cold BUG_ON. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:23 -05:00
Liu Bo	e1a1267054	Btrfs: free all recorded tree blocks on error We've missed the 'free blocks' part on ENOMEM error. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:23 -05:00
Liu Bo	aca1bba6f9	Btrfs: build up error handling for merge_reloc_roots We first use btrfs_std_error hook to replace with BUG_ON, and we also need to cleanup what is left, including reloc roots rbtree and reloc roots list. Here we use a helper function to cleanup both rbtree and list, and since this function can also be used in the balance recover path, we also make the change as well to keep code simple. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:22 -05:00
Liu Bo	8f71f3e0e4	Btrfs: check for NULL pointer in updating reloc roots Add a check for NULL pointer to avoid invalid reference. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:22 -05:00
Miao Xie	00d71c9c17	Btrfs: fix unclosed transaction handler when the async transaction commitment fails If the async transaction commitment failed, we need close the current transaction handler, or the current transaction will be blocked to commit because of this orphan handler. We fix the problem by doing sync transaction commitment, that is to invoke btrfs_commit_transaction(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:22 -05:00
Miao Xie	aec8030a87	Btrfs: fix wrong handle at error path of create_snapshot() when the commit fails There are several bugs at error path of create_snapshot() when the transaction commitment failed. - access the freed transaction handler. At the end of the transaction commitment, the transaction handler was freed, so we should not access it after the transaction commitment. - we were not aware of the error which happened during the snapshot creation if we submitted a async transaction commitment. - pending snapshot access vs pending snapshot free. when something wrong happened after we submitted a async transaction commitment, the transaction committer would cleanup the pending snapshots and free them. But the snapshot creators were not aware of it, they would access the freed pending snapshots. This patch fixes the above problems by: - remove the dangerous code that accessed the freed handler - assign ->error if the error happens during the snapshot creation - the transaction committer doesn't free the pending snapshots, just assigns the error number and evicts them before we unblock the transaction. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:33:22 -05:00
Josef Bacik	9bf7a48905	Btrfs: use set_nlink if our i_nlink is 0 We need to inc the nlink of deleted entries when running replay so we can do the unlink on the fs_root and get everything cleaned up and then have the orphan cleanup do the right thing. The problem is inc_nlink complains about this, even thought it still does the right thing. So use set_nlink() if our i_nlink is 0 to keep users from seeing the warnings during log replay. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-04 16:30:06 -05:00
Eric W. Biederman	7f78e03513	fs: Limit sys_mount to only request filesystem modules. Modify the request_module to prefix the file system type with "fs-" and add aliases to all of the filesystems that can be built as modules to match. A common practice is to build all of the kernel code and leave code that is not commonly needed as modules, with the result that many users are exposed to any bug anywhere in the kernel. Looking for filesystems with a fs- prefix limits the pool of possible modules that can be loaded by mount to just filesystems trivially making things safer with no real cost. Using aliases means user space can control the policy of which filesystem modules are auto-loaded by editing /etc/modprobe.d/*.conf with blacklist and alias directives. Allowing simple, safe, well understood work-arounds to known problematic software. This also addresses a rare but unfortunate problem where the filesystem name is not the same as it's module name and module auto-loading would not work. While writing this patch I saw a handful of such cases. The most significant being autofs that lives in the module autofs4. This is relevant to user namespaces because we can reach the request module in get_fs_type() without having any special permissions, and people get uncomfortable when a user specified string (in this case the filesystem type) goes all of the way to request_module. After having looked at this issue I don't think there is any particular reason to perform any filtering or permission checks beyond making it clear in the module request that we want a filesystem module. The common pattern in the kernel is to call request_module() without regards to the users permissions. In general all a filesystem module does once loaded is call register_filesystem() and go to sleep. Which means there is not much attack surface exposed by loading a filesytem module unless the filesystem is mounted. In a user namespace filesystems are not mounted unless .fs_flags = FS_USERNS_MOUNT, which most filesystems do not set today. Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Acked-by: Kees Cook <keescook@chromium.org> Reported-by: Kees Cook <keescook@google.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2013-03-03 19:36:31 -08:00
Linus Torvalds	1c82315a12	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixup from Chris Mason: "Geert and James both sent this one in, sorry guys" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs/raid56: Add missing #include <linux/vmalloc.h>	2013-03-03 13:13:20 -08:00
Geert Uytterhoeven	d7011f5b9d	btrfs/raid56: Add missing #include <linux/vmalloc.h> tilegx_defconfig: fs/btrfs/raid56.c: In function 'btrfs_alloc_stripe_hash_table': fs/btrfs/raid56.c:206:3: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration] fs/btrfs/raid56.c:206:9: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/raid56.c:226:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration] Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-03 06:53:41 -05:00
Linus Torvalds	b695188dd3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "The biggest feature in the pull is the new (and still experimental) raid56 code that David Woodhouse started long ago. I'm still working on the parity logging setup that will avoid inconsistent parity after a crash, so this is only for testing right now. But, I'd really like to get it out to a broader audience to hammer out any performance issues or other problems. scrub does not yet correct errors on raid5/6 either. Josef has another pass at fsync performance. The big change here is to combine waiting for metadata with waiting for data, which is a big latency win. It is also step one toward using atomics from the hardware during a commit. Mark Fasheh has a new way to use btrfs send/receive to send only the metadata changes. SUSE is using this to make snapper more efficient at finding changes between snapshosts. Snapshot-aware defrag is also included. Otherwise we have a large number of fixes and cleanups. Eric Sandeen wins the award for removing the most lines, and I'm hoping we steal this idea from XFS over and over again." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (118 commits) btrfs: fixup/remove module.h usage as required Btrfs: delete inline extents when we find them during logging btrfs: try harder to allocate raid56 stripe cache Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails Btrfs: fix missing deleted items in btrfs_clean_quota_tree btrfs: use only inline_pages from extent buffer Btrfs: fix wrong reserved space when deleting a snapshot/subvolume Btrfs: fix wrong reserved space in qgroup during snap/subv creation Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot btrfs: remove a printk from scan_one_device Btrfs: fix NULL pointer after aborting a transaction Btrfs: fix memory leak of log roots Btrfs: copy everything if we've created an inline extent btrfs: cleanup for open-coded alignment Btrfs: do not change inode flags in rename Btrfs: use reserved space for creating a snapshot clear chunk_alloc flag on retryable failure ...	2013-03-02 16:41:54 -08:00
Paul Gortmaker	180e001cd5	btrfs: fixup/remove module.h usage as required We want to avoid module.h where posible, since it in turn includes nearly all of header space. This means removing it where it is not required, and using export.h where we are only exporting symbols via EXPORT_SYMBOL and friends. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-03-01 15:01:01 -05:00
Josef Bacik	124fe663f9	Btrfs: delete inline extents when we find them during logging Apparently when we do inline extents we allow the data to overlap the last chunk of the btrfs_file_extent_item, which means that we can possibly have a btrfs_file_extent_item that isn't actually as large as a btrfs_file_extent_item. This messes with us when we try to overwrite the extent when logging new extents since we expect for it to be the right size. To fix this just delete the item and try to do the insert again which will give us the proper sized btrfs_file_extent_item. This fixes a panic where map_private_extent_buffer would blow up because we're trying to write past the end of the leaf. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 11:47:21 -05:00
David Sterba	83c8266acc	btrfs: try harder to allocate raid56 stripe cache The stripe hash table is large, starting with allocation order 4 and can go as high as order 7 in case lock debugging is turned on and structure padding happens. Observed mount failure: mount: page allocation failure: order:7, mode:0x200050 Pid: 8234, comm: mount Tainted: G W 3.8.0-default+ #267 Call Trace: [<ffffffff81114353>] warn_alloc_failed+0xf3/0x140 [<ffffffff811171d2>] ? __alloc_pages_direct_compact+0x92/0x250 [<ffffffff81117ac3>] __alloc_pages_nodemask+0x733/0x9d0 [<ffffffff81152878>] ? cache_alloc_refill+0x3f8/0x840 [<ffffffff811528bc>] cache_alloc_refill+0x43c/0x840 [<ffffffff811302eb>] ? is_kernel_percpu_address+0x4b/0x90 [<ffffffffa00a00ac>] ? btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs] [<ffffffff811531d7>] kmem_cache_alloc_trace+0x247/0x270 [<ffffffffa00a00ac>] btrfs_alloc_stripe_hash_table+0x5c/0x130 [btrfs] [<ffffffffa003133f>] open_ctree+0xb2f/0x1f90 [btrfs] [<ffffffff81397289>] ? string+0x49/0xe0 [<ffffffff813987b3>] ? vsnprintf+0x443/0x5d0 [<ffffffffa0007cb6>] btrfs_mount+0x526/0x600 [btrfs] [<ffffffff8115127c>] ? cache_alloc_debugcheck_after+0x4c/0x200 [<ffffffff81162b90>] mount_fs+0x20/0xe0 [<ffffffff8117db26>] vfs_kern_mount+0x76/0x120 [<ffffffff811801b6>] do_mount+0x386/0x980 [<ffffffff8112a5cb>] ? strndup_user+0x5b/0x80 [<ffffffff81180840>] sys_mount+0x90/0xe0 [<ffffffff81962e99>] system_call_fastpath+0x16/0x1b Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 10:13:05 -05:00
Wang Shilong	88e081bf82	Btrfs: cleanup to make the function btrfs_delalloc_reserve_metadata more logic The original code is a little confusing and not clear, The right way to deal with the kernel code like this: [...] if (ret) goto out; [...] So i move the common clean_up code to the place labeled with out_fail, this will be easier to maintain. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 10:13:04 -05:00
Wang Shilong	a9870c0e03	Btrfs: don't call btrfs_qgroup_free if just btrfs_qgroup_reserve fails commit `eb6b88d92c` leads into another bug. If it is just because qgroup_reserve fails, the function btrfs_qgroup_free should not be called, otherwise, it will cause the wrong quota accounting. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 10:13:04 -05:00
Wang Shilong	235fdb8ef2	Btrfs: remove reduplicate check about root in the function btrfs_clean_quota_tree The check work has been done just before the function btrfs_clean_quota_tree is called, it is not necessary to check it again, remove it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 10:13:04 -05:00
Wang Shilong	84cbe2f725	Btrfs: return ENOMEM rather than use BUG_ON when btrfs_alloc_path fails Return ENOMEM rather trigger BUG_ON, fix it. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 10:13:04 -05:00
Wang Shilong	06b3a860dc	Btrfs: fix missing deleted items in btrfs_clean_quota_tree Steps to reproduce: i=0 ncases=100 mkfs.btrfs <disk> mount <disk> <mnt> btrfs quota enable <mnt> btrfs qgroup create 2/1 <mnt> while [ $i -le $ncases ] do btrfs qgroup create 1/$i <mnt> btrfs qgroup assign 1/$i 2/1 <mnt> i=$(($i+1)) done btrfs quota disable <mnt> umount <mnt> btrfsck <mnt> You can also use the commands: btrfs-debug-tree <disk> \| grep QGROUP You will find there are still items existed.The reasons why this happens is because the original code just checks slots[0]==0 and returns. We try to fix it by deleting the leaf one by one. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-03-01 10:13:03 -05:00
Linus Torvalds	de1a2262b0	2 writeback fixes - fix negative (setpoint - dirty) in 32bit archs - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle() -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJRLrFaAAoJECvKgwp+S8JaV2IP/jo34e3Ht0gvIfxz9rh05dvR LqBmSAXXJQYgxUKUjYECuyLahIciniKYZp/fS6s5myOPAayiirB70rC1W85Kz8Sm uR1wDvG0g1AyK39kJas+WZw2fJFicthSSp29jhTH0upbEcMX+/tzsHTsJRH1WqI0 rtV8wHVxDu4+njz44hZIVxmJ9S7XZCuw8D6NfbyobmAqOm35j0VJ7uzQOxrNoJDe lvnwEGXfSU9KTfOIxt4K0d+lovXT6IRfN0qfdgcrWwxx9QJ/cU5F5b6cjdN9BsEF oq2UKSihbU55PdgUk6DfMJ3t7AXS/u2/P5a8PNfoNL9ovKQMJMHPXXDtxXmwCvcI aaYbULbwojMWZyrijViJpkftVKKtM/96X/jyCsof96UhJdah8c9wM44k1LDRBYXi WbQbD+doUII+pEmxUxF3Chrk/Yi3T5q2IWiVsixUEGewrSChOSqMIXOcSpgz97lL eGmNHgC/rn5TdDx8J3u0V+1+QYCvNxC25GG4E2+9QhU+mecLKt+IG1Dhn35xUjq1 kjgfrNWJC6zxlIq7owk8pTI7DxiV/iMqogR5mMDz0umrPrid/J/xb6zxuAcnk3WU j0clNu7gzIYB8NjxBskO3Fg2AWKJxSohpu+r9wcjmjf0T5uEUmLwpI0i4tdDlYNw IvcmOpF1I2Ja5TrW8HWw =j9Sn -----END PGP SIGNATURE----- Merge tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback fixes from Wu Fengguang: "Two writeback fixes - fix negative (setpoint - dirty) in 32bit archs - use down_read_trylock() in writeback_inodes_sb(_nr)_if_idle()" * tag 'writeback-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: Negative (setpoint-dirty) in bdi_position_ratio() vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them	2013-02-28 13:21:44 -08:00
David Sterba	b8dae31388	btrfs: use only inline_pages from extent buffer The nodesize is capped at 64k and there are enough pages preallocated in extent_buffer::inline_pages. The fallback to kmalloc never happened because even on the smallest page size considered (4k) inline_pages covered the needs. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:56 -05:00
Miao Xie	c58aaad2ac	Btrfs: fix wrong reserved space when deleting a snapshot/subvolume When deleting a snapshot/subvolume, we need remove root ref/backref, dir entries and update the dir inode, so we must reserve free space for those operations. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:55 -05:00
Miao Xie	d5c1207017	Btrfs: fix wrong reserved space in qgroup during snap/subv creation There are two problems in the space reservation of the snapshot/ subvolume creation. - don't reserve the space for the root item insertion - the space which is reserved in the qgroup is different with the free space reservation. we need reserve free space for 7 items, but in qgroup reservation, we need reserve space only for 3 items. So we implement new metadata reservation functions for the snapshot/subvolume creation. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:54 -05:00
Miao Xie	e9662f701c	Btrfs: remove unnecessary dget_parent/dput when creating the pending snapshot Since we have grabbed the parent inode at the beginning of the snapshot creation, and both sync and async snapshot creation release it after the pending snapshots are actually created, it is safe to access the parent inode directly during the snapshot creation, we needn't use dget_parent/dput to fix the parent dentry and get the dir inode. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:53 -05:00
David Sterba	2d8946c597	btrfs: remove a printk from scan_one_device Dave pointed out that he saw messages from btrfs although there was no such filesystem on his computers. The automatic device scan is called on every new blockdevice if the usual distro udev rule set is used. The printk introduced in `6f60cbd3ae` was a remainder from copying portions of code from btrfs_get_bdev_and_sb which is used under different conditions and the warning makes sense there. Reported-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:52 -05:00
Liu Bo	f094ac32ab	Btrfs: fix NULL pointer after aborting a transaction While doing cleanup work on an aborted transaction, we've set the global running transaction pointer to NULL _before_ waiting all other transaction handles to finish, so others'd hit NULL pointer crash when referencing the global running transaction pointer. This first sets a hint to avoid new transaction handle joining, then waits other existing handles to abort or finish so that we can safely set the above global pointer to NULL. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:52 -05:00
Liu Bo	3321719ed6	Btrfs: fix memory leak of log roots When we abort a transaction while fsyncing, we'll skip freeing log roots part of committing a transaction, which leads to memory leak. This adds a 'free log roots' in putting super when no more users hold references on log roots, so it's safe and clean. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:51 -05:00
Josef Bacik	bdc20e67e8	Btrfs: copy everything if we've created an inline extent I noticed while looking into a tree logging bug that we aren't logging inline extents properly. Since this requires copying and it shouldn't happen too often just force us to copy everything for the inode into the tree log when we have an inline extent. With this patch we have valid data after a crash when we write an inline extent. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-28 13:33:20 -05:00
Linus Torvalds	d895cb1af1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile (part one) from Al Viro: "Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent locking violations, etc. The most visible changes here are death of FS_REVAL_DOT (replaced with "has ->d_weak_revalidate()") and a new helper getting from struct file to inode. Some bits of preparation to xattr method interface changes. Misc patches by various people sent this cycle and ocfs2 fixes from several cycles ago that should've been upstream right then. PS: the next vfs pile will be xattr stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits) saner proc_get_inode() calling conventions proc: avoid extra pde_put() in proc_fill_super() fs: change return values from -EACCES to -EPERM fs/exec.c: make bprm_mm_init() static ocfs2/dlm: use GFP_ATOMIC inside a spin_lock ocfs2: fix possible use-after-free with AIO ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero target: writev() on single-element vector is pointless export kernel_write(), convert open-coded instances fs: encode_fh: return FILEID_INVALID if invalid fid_type kill f_vfsmnt vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op nfsd: handle vfs_getattr errors in acl protocol switch vfs_getattr() to struct path default SET_PERSONALITY() in linux/elf.h ceph: prepopulate inodes only when request is aborted d_hash_and_lookup(): export, switch open-coded instances 9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate() 9p: split dropping the acls from v9fs_set_create_acl() ...	2013-02-26 20:16:07 -08:00
Qu Wenruo	fda2832feb	btrfs: cleanup for open-coded alignment Though most of the btrfs codes are using ALIGN macro for page alignment, there are still some codes using open-coded alignment like the following: ------ u64 mask = ((u64)root->stripesize - 1); u64 ret = (val + mask) & ~mask; ------ Or even hidden one: ------ num_bytes = (end - start + blocksize) & ~(blocksize - 1); ------ Sometimes these open-coded alignment is not so easy to understand for newbie like me. This commit changes the open-coded alignment to the ALIGN macro for a better readability. Also there is a previous patch from David Sterba with similar changes, but the patch is for 3.2 kernel and seems not merged. http://www.spinics.net/lists/linux-btrfs/msg12747.html Cc: David Sterba <dave@jikos.cz> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-26 11:04:13 -05:00
Liu Bo	8c4ce81e91	Btrfs: do not change inode flags in rename Before we forced to change a file's NOCOW and COMPRESS flag due to the parent directory's, but this ends up a bad idea, because it confuses end users a lot about file's NOCOW status, eg. if someone change a file to NOCOW via 'chattr' and then rename it in the current directory which is without NOCOW attribute, the file will lose the NOCOW flag silently. This diables 'change flags in rename', so from now on we'll only inherit flags from the parent directory on creation stage while in other places we can use 'chattr' to set NOCOW or COMPRESS flags. Reported-by: Marios Titas <redneb8888@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-26 11:01:19 -05:00
Liu Bo	2382c5cc7e	Btrfs: use reserved space for creating a snapshot While inserting dir index and updating inode for a snapshot, we'd add delayed items which consume trans->block_rsv, if we don't have any space reserved in this trans handle, we either just return or reserve space again. But before creating pending snapshots during committing transaction, we've done a release on this trans handle, so we don't have space reserved in it at this stage. What we're using is block_rsv of pending snapshots which has already reserved well enough space for both inserting dir index and updating inode, so we need to set trans handle to indicate that we have space now. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-26 11:00:52 -05:00
Alexandre Oliva	a81cb9a2d9	clear chunk_alloc flag on retryable failure I've experienced filesystem freezes with permanent spikes in the active process count for quite a while, particularly on filesystems whose available raw space has already been fully allocated to chunks. While looking into this, I found a pretty obvious error in do_chunk_alloc: it sets space_info->chunk_alloc, but if btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving that flag set, which causes any other threads waiting for space_info->chunk_alloc to become zero to spin indefinitely. I haven't double-checked that this patch fixes the failure I've observed fully (it's not exactly trivial to trigger), but it surely is a bug and the fix is trivial, so... Please put it in :-) What I saw in that function also happens to explain why in some cases I see filesystems allocate a huge number of chunks that remain unused (leading to the scenario above, of not having more chunks to allocate). It happens for data and metadata, but not necessarily both. I'm guessing some thread sets the force_alloc flag on the corresponding space_info, and then several threads trying to get disk space end up attempting to allocate a new chunk concurrently. All of them will see the force_alloc flag and bump their local copy of force up to the level they see first, and they won't clear it even if another thread succeeds in allocating a chunk, thus clearing the force flag. Then each thread that observed the force flag will, on its turn, force the allocation of a new chunk. And any threads that come in while it does that will see the force flag still set and pick it up, and so on. This sounds like a problem to me, but... what should the correct behavior be? Clear force_flag once we copy it to a local force? Reset force to the incoming value on every loop? Set the flag to our incoming force if we have it at first, clear our local flag, and move it from the space_info when we determined that we are the thread that's going to perform the allocation? btrfs: clear chunk_alloc flag on retryable failure From: Alexandre Oliva <oliva@gnu.org> If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc without clearing chunk_alloc in space_info. As a result, any further calls to do_chunk_alloc on that filesystem will start busy-waiting for chunk_alloc to be cleared, but it never will be. This patch adjusts do_chunk_alloc so that it clears this flag in case of an error. Signed-off-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-26 11:00:51 -05:00
Jan Schmidt	ca60ebfa30	Btrfs: fix backref walking race with tree deletions When a subvolume is removed, we remove the root item from the root tree, while the tree blocks and backrefs remain for a while. When backref walking comes across one of those orphan tree blocks, it can find a backref for a no longer existing root. This is all good, we only must tolerate __resolve_indirect_ref returning an error and continue with the good refs found. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-26 11:00:50 -05:00
Josef Bacik	f2bdf9a8f7	Btrfs: make sure NODATACOW also gets NODATASUM set A user reported hitting the BUG_ON() in btrfs_finished_ordered_io() where we had csums on a NOCOW extent. This can happen if we have NODATACOW set but not NODATASUM set, which can happen in two cases, either we mount with -o nodatacow and then write into preallocated space, or chattr +C a directory and move a file into that directory. Liu has fixed the move case in a different place, but this fixes the mount -o nodatacow case. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-26 10:57:48 -05:00
Namjae Jeon	94e07a7590	fs: encode_fh: return FILEID_INVALID if invalid fid_type This patch is a follow up on below patch: [PATCH] exportfs: add FILEID_INVALID to indicate invalid fid_type commit: `216b6cbdcb` Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com> Signed-off-by: Vivek Trivedi <t.vivek@samsung.com> Acked-by: Steven Whitehouse <swhiteho@redhat.com> Acked-by: Sage Weil <sage@inktank.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-26 02:46:10 -05:00
Al Viro	496ad9aa8e	new helper: file_inode(file) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2013-02-22 23:31:31 -05:00
Linus Torvalds	9afa3195b9	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "Assorted tiny fixes queued in trivial tree" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (22 commits) DocBook: update EXPORT_SYMBOL entry to point at export.h Documentation: update top level 00-INDEX file with new additions ARM: at91/ide: remove unsused at91-ide Kconfig entry percpu_counter.h: comment code for better readability x86, efi: fix comment typo in head_32.S IB: cxgb3: delay freeing mem untill entirely done with it net: mvneta: remove unneeded version.h include time: x86: report_lost_ticks doesn't exist any more pcmcia: avoid static analysis complaint about use-after-free fs/jfs: Fix typo in comment : 'how may' -> 'how many' of: add missing documentation for of_platform_populate() btrfs: remove unnecessary cur_trans set before goto loop in join_transaction sound: soc: Fix typo in sound/codecs treewide: Fix typo in various drivers btrfs: fix comment typos Update ibmvscsi module name in Kconfig. powerpc: fix typo (utilties -> utilities) of: fix spelling mistake in comment h8300: Fix home page URL in h8300/README xtensa: Fix home page URL in Kconfig ...	2013-02-21 17:40:58 -08:00
Linus Torvalds	06991c28f3	Driver core patches for 3.9-rc1 Here is the big driver core merge for 3.9-rc1 There are two major series here, both of which touch lots of drivers all over the kernel, and will cause you some merge conflicts: - add a new function called devm_ioremap_resource() to properly be able to check return values. - remove CONFIG_EXPERIMENTAL If you need me to provide a merged tree to handle these resolutions, please let me know. Other than those patches, there's not much here, some minor fixes and updates. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.19 (GNU/Linux) iEYEABECAAYFAlEmV0cACgkQMUfUDdst+yncCQCfbmnQZju7kzWXk6PjdFuKspT9 weAAoMCzcAtEzzc4LXuUxxG/sXBVBCjW =yWAQ -----END PGP SIGNATURE----- Merge tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core patches from Greg Kroah-Hartman: "Here is the big driver core merge for 3.9-rc1 There are two major series here, both of which touch lots of drivers all over the kernel, and will cause you some merge conflicts: - add a new function called devm_ioremap_resource() to properly be able to check return values. - remove CONFIG_EXPERIMENTAL Other than those patches, there's not much here, some minor fixes and updates" Fix up trivial conflicts * tag 'driver-core-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (221 commits) base: memory: fix soft/hard_offline_page permissions drivercore: Fix ordering between deferred_probe and exiting initcalls backlight: fix class_find_device() arguments TTY: mark tty_get_device call with the proper const values driver-core: constify data for class_find_device() firmware: Ignore abort check when no user-helper is used firmware: Reduce ifdef CONFIG_FW_LOADER_USER_HELPER firmware: Make user-mode helper optional firmware: Refactoring for splitting user-mode helper code Driver core: treat unregistered bus_types as having no devices watchdog: Convert to devm_ioremap_resource() thermal: Convert to devm_ioremap_resource() spi: Convert to devm_ioremap_resource() power: Convert to devm_ioremap_resource() mtd: Convert to devm_ioremap_resource() mmc: Convert to devm_ioremap_resource() mfd: Convert to devm_ioremap_resource() media: Convert to devm_ioremap_resource() iommu: Convert to devm_ioremap_resource() drm: Convert to devm_ioremap_resource() ...	2013-02-21 12:05:51 -08:00
Miao Xie	dc81cdc58a	Btrfs: fix remount vs autodefrag If we remount the fs to close the auto defragment or make the fs R/O, we should stop the auto defragment. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-21 08:11:43 -05:00
Miao Xie	172a50497f	Btrfs: fix wrong outstanding_extents when doing DIO write When running the 083th case of xfstests on the filesystem with "compress-force=lzo", the following WARNINGs were triggered. WARNING: at fs/btrfs/inode.c:7908 WARNING: at fs/btrfs/inode.c:7909 WARNING: at fs/btrfs/inode.c:7911 WARNING: at fs/btrfs/extent-tree.c:4510 WARNING: at fs/btrfs/extent-tree.c:4511 This problem was introduced by the patch "Btrfs: fix deadlock due to unsubmitted". In this patch, there are two bugs which caused the above problem. The 1st one is a off-by-one bug, if the DIO write return 0, it is also a short write, we need release the reserved space for it. But we didn't do it in that patch. Fix it by change "ret > 0" to "ret >= 0". The 2nd one is ->outstanding_extents was increased twice when a short write happened. As we know, ->outstanding_extents is a counter to keep track of the number of extent items we may use duo to delalloc, when we reserve the free space for a delalloc write, we assume that the write will introduce just one extent item, so we increase ->outstanding_extents by 1 at that time. And then we will increase it every time we split the write, it is done at the beginning of btrfs_get_blocks_direct(). So when a short write happens, we needn't increase ->outstanding_extents again. But this patch done. In order to fix the 2nd problem, I re-write the logic for ->outstanding_extents operation. We don't increase it at the beginning of btrfs_get_blocks_direct(), instead, we just increase it when the split actually happens. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-21 08:11:43 -05:00
Liu Bo	38c227d87c	Btrfs: snapshot-aware defrag This comes from one of btrfs's project ideas, As we defragment files, we break any sharing from other snapshots. The balancing code will preserve the sharing, and defrag needs to grow this as well. Now we're able to fill the blank with this patch, in which we make full use of backref walking stuff. Here is the basic idea, o set the writeback ranges started by defragment with flag EXTENT_DEFRAG o at endio, after we finish updating fs tree, we use backref walking to find all parents of the ranges and re-link them with the new COWed file layout by adding corresponding backrefs. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-20 18:06:09 -05:00
Chris Mason	86db25785a	Btrfs: fix max chunk size on raid5/6 We try to limit the size of a chunk to 10GB, which keeps the unit of work reasonable during balance and resize operations. The limit checks were taking into account the number of copies of the data we had but what they really should be doing is comparing against the logical size of the chunk we're creating. This moves the code around a little to use the count of data stripes from raid5/6. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-20 17:08:18 -05:00
Zach Brown	24542bf7ea	btrfs: limit fallocate extent reservation to 256MB Very large fallocate requests are cpu bound and result in extents with a repeating pattern of ever decreasing size: $ time fallocate -l 1T file real 0m13.039s ( an excerpt of the extents from btrfs-debug-tree: ) prealloc data disk byte 1536292564992 nr 397312 prealloc data disk byte 1536292962304 nr 196608 prealloc data disk byte 1536293158912 nr 98304 prealloc data disk byte 1536293257216 nr 49152 prealloc data disk byte 1536293306368 nr 24576 prealloc data disk byte 1536293330944 nr 12288 prealloc data disk byte 1536293343232 nr 8192 prealloc data disk byte 1536293351424 nr 4096 prealloc data disk byte 1536293355520 nr 4096 prealloc data disk byte 1536293359616 nr 4096 The excessive cpu use comes from __btrfs_prealloc_file_range() trying to allocate the entire remaining size after each extent is allocated. btrfs_reserve_extent() repeatedly cuts this requested size in half until it gets down to the size that the allocators can return. We limit the problem for now by capping each reservation at 256 meg. The small extents come from a masking bug when decreasing the requested reservation size. The high 32bits are cleared and the remaining low bits might happen to reserve a small size. Fix this by using round_down() which properly casts the mask. After these fixes huge fallocate requests are fast and result in nice large extents: $ time fallocate -l 1T file real 0m0.082s prealloc data disk byte 1112425889792 nr 268435456 prealloc data disk byte 1112694325248 nr 268435456 prealloc data disk byte 1112962760704 nr 268435456 Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-20 14:06:25 -05:00
Thomas Gleixner	1cba0cdf5e	btrfs: Init io_lock after cloning btrfs device struct __btrfs_close_devices() clones btrfs device structs with memcpy(). Some of the fields in the clone are reinitialized, but it's missing to init io_lock. In mainline this goes unnoticed, but on RT it leaves the plist pointing to the original about to be freed lock struct. Initialize io_lock after cloning, so no references to the original struct are left. Reported-and-tested-by: Mike Galbraith <efault@gmx.de> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-20 14:06:20 -05:00
Chris Mason	e942f883bc	Merge branch 'raid56-experimental' into for-linus-3.9 Signed-off-by: Chris Mason <chris.mason@fusionio.com> Conflicts: fs/btrfs/ctree.h fs/btrfs/extent-tree.c fs/btrfs/inode.c fs/btrfs/volumes.c	2013-02-20 14:06:05 -05:00
Chris Mason	b2c6b3e061	Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9 Signed-off-by: Chris Mason <chris.mason@fusionio.com> Conflicts: fs/btrfs/disk-io.c	2013-02-20 14:05:45 -05:00
Miao Xie	272d26d0ad	Btrfs: fix missing release of qgroup reservation in commit_transaction() We forget to free qgroup reservation in commit_transaction(),fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 13:00:09 -05:00
Wang Shilong	683cebda90	Btrfs: fix missing check before disabling quota The original code forget to check whether quota has been disabled firstly, and it will return 'EINVAL' and return error to users if quota has been disabled,it will be unfriendly and confusing for users to see that. So just return directly if quota has been disabled. Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com> Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 13:00:07 -05:00
Liu Bo	fa6ac8765c	Btrfs: fix cleaner thread not working with inode cache option Right now inode cache inode is treated as the same as space cache inode, ie. keep inode in memory till putting super. But this leads to an awkward situation. If we're going to delete a snapshot/subvolume, btrfs will not actually delete it and return free space, but will add it to dead roots list until the last inode on this snap/subvol being destroyed. Then we'll fetch deleted roots and cleanup them via cleaner thread. So here is the problem, if we enable inode cache option, each snap/subvol has a cached inode which is used to store inode allcation information. And this cache inode will be kept in memory, as the above said. So with inode cache, snap/subvol can only be added into dead roots list during freeing roots stage in umount, so that we can ONLY get space back after another remount(we cleanup dead roots on mount). But the real thing is we'll no more use the snap/subvol if we mark it deleted, so we can safely iput its cache inode when we delete snap/subvol. Another thing is that we need to change the rules of droping inode, we don't keep snap/subvol's cache inode in memory till end so that we can add snap/subvol into dead roots list in time. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 13:00:06 -05:00
Miao Xie	d4edf39bd5	Btrfs: fix uncompleted transaction In some cases, we need commit the current transaction, but don't want to start a new one if there is no running transaction, so we introduce the function - btrfs_attach_transaction(), which can catch the current transaction, and return -ENOENT if there is no running transaction. But no running transaction doesn't mean the current transction completely, because we removed the running transaction before it completes. In some cases, it doesn't matter. But in some special cases, such as freeze fs, we hope the transaction is fully on disk, it will introduce some bugs, for example, we may feeze the fs and dump the data in the disk, if the transction doesn't complete, we would dump inconsistent data. So we need fix the above problem for those cases. We fixes this problem by introducing a function: btrfs_attach_transaction_barrier() if we hope all the transaction is fully on the disk, even they are not running, we can use this function. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 13:00:05 -05:00
Miao Xie	178260b2c1	Btrfs: fix the deadlock between the transaction start/attach and commit Now btrfs_commit_transaction() does this ret = btrfs_run_ordered_operations(root, 0) which async flushes all inodes on the ordered operations list, it introduced a deadlock that transaction-start task, transaction-commit task and the flush workers waited for each other. (See the following URL to get the detail http://marc.info/?l=linux-btrfs&m=136070705732646&w=2) As we know, if ->in_commit is set, it means someone is committing the current transaction, we should not try to join it if we are not JOIN or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid the above problem. In this way, there is another benefit: there is no new transaction handle to block the transaction which is on the way of commit, once we set ->in_commit. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 13:00:03 -05:00
Miao Xie	4b82490649	Btrfs: fix the qgroup reserved space is released prematurely In start_transactio(), we will try to join the transaction again after the current transaction is committed, so we should not release the reserved space of the qgroup. Fix it. Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 13:00:02 -05:00
Zach Brown	cdb4c5748c	btrfs: define BTRFS_MAGIC as a u64 value super.magic is an le64 but it's treated as an unterminated string when compared against BTRFS_MAGIC which is defined as a string. Instead define BTRFS_MAGIC as a normal hex value and use endian helpers to compare it to the super's magic. I tested this by mounting an fs made before the change and made sure that it didn't introduce sparse errors. This matches a similar cleanup that is pending in btrfs-progs. David Sterba pointed out that we should fix the kernel side as well :). Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 13:00:01 -05:00
jeff.liu	a8bfd4abea	Btrfs: set/change the label of a mounted file system With this new ioctl(2) BTRFS_IOC_SET_FSLABEL, we can set/change the label of a mounted file system. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Goffredo Baroncelli <kreijack@inwind.it> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:59 -05:00
jeff.liu	867ab667e7	Btrfs: Add a new ioctl to get the label of a mounted file system Add a new ioctl(2) BTRFS_IOC_GET_FSLABLE, so that we can get the label upon a mounted filesystem. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Goffredo Baroncelli <kreijack@inwind.it> Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:58 -05:00
Josef Bacik	569e0f358c	Btrfs: place ordered operations on a per transaction list Miao made the ordered operations stuff run async, which introduced a deadlock where we could get somebody (sync) racing in and committing the transaction while a commit was already happening. The new committer would try and flush ordered operations which would hang waiting for the commit to finish because it is done asynchronously and no longer inherits the callers trans handle. To fix this we need to make the ordered operations list a per transaction list. We can get new inodes added to the ordered operation list by truncating them and then having another process writing to them, so this makes it so that anybody trying to add an ordered operation _must_ start a transaction in order to add itself to the list, which will keep new inodes from getting added to the ordered operations list after we start committing. This should fix the deadlock and also keeps us from doing a lot more work than we need to during commit. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:57 -05:00
Josef Bacik	dde5740fdd	Btrfs: relax the block group size limit for bitmaps Dave pointed out that xfstests 273 will tell you that it failed to load the space cache for a block group when it remounts. This is because we run out of space writing out the block group cache. This is ok and is working as it should, but let's try to be a bit nicer. This happens because the block group was 100mb, but bitmap entries cover 128mb, so we were only getting extent entries for this block group, which ended up being too many to fit in the free space cache. So relax the bitmap size requirements to block groups that are at least half the size a bitmap will cover or larger, that way we can still keep the amount of space used in the free space cache low enough to be able to write it out. With this patch I no longer fail to write out the free space cache. Thanks, Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:55 -05:00
Ilya Dryomov	3e39cea61c	Btrfs: allow for selecting only completely empty chunks Enhance balance usage filter by making it possible to balance out only completely empty chunks. Today, usage filter properly acts on values from 1 to 99 inclusive, usage=100 selects all chunks, and usage=0 selects no chunks. This commit changes the usage=0 case: the new meaning is to restripe only completely empty chunks and nothing else. Suggested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:54 -05:00
Ilya Dryomov	bf023ecfca	Btrfs: eliminate a use-after-free in btrfs_balance() Commit `5af3e8cc` introduced a use-after-free at volumes.c:3139: bctl is freed above in __cancel_balance() in all cases except for balance pause. Fix this by moving the offending check a couple statements above, the meaning of the check is preserved. Reported-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:53 -05:00
Josef Bacik	c8f2f24bd5	Btrfs: remove unused extent io tree ops V2 Nobody uses these io tree ops anymore so just remove them and clean up the code a bit. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:52 -05:00
David Sterba	210549ebe9	btrfs: add cancellation points to defrag The defrag operation can take very long, we want to have a way how to cancel it. The code checks for a pending signal at safe points in the defrag loops and returns EAGAIN. This means a user can press ^C after running 'btrfs fi defrag', woks for both defrag modes, files and root. Returning from the command was instant in my light tests, but may take longer depending on the aging factor of the filesystem. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:51 -05:00
David Sterba	b069e0c345	btrfs: put some enospc messages under enospc_debug The warning in use_block_rsv is not useful for users and may fill the logs unnecessarily. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:49 -05:00
Miao Xie	38851cc19a	Btrfs: implement unlocked dio write This idea is from ext4. By this patch, we can make the dio write parallel, and improve the performance. But because we can not update isize without i_mutex, the unlocked dio write just can be done in front of the EOF. We needn't worry about the race between dio write and truncate, because the truncate need wait untill all the dio write end. And we also needn't worry about the race between dio write and punch hole, because we have extent lock to protect our operation. I ran fio to test the performance of this feature. == Hardware == CPU: Intel(R) Core(TM)2 Duo CPU E7500 @ 2.93GHz Mem: 2GB SSD: Intel X25-M 120GB (Test Partition: 60GB) == config file == [global] ioengine=psync direct=1 bs=4k size=32G runtime=60 directory=/mnt/btrfs/ filename=testfile group_reporting thread [file1] numjobs=1 # 2 4 rw=randwrite == result (KBps) == write 1 2 4 lock 24936 24738 24726 nolock 24962 30866 32101 == result (iops) == write 1 2 4 lock 6234 6184 6181 nolock 6240 7716 8025 Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:48 -05:00
Miao Xie	2e60a51e62	Btrfs: serialize unlocked dio reads with truncate Currently, we can do unlocked dio reads, but the following race is possible: dio_read_task truncate_task ->btrfs_setattr() ->btrfs_direct_IO ->__blockdev_direct_IO ->btrfs_get_block ->btrfs_truncate() #alloc truncated blocks #to other inode ->submit_io() #INFORMATION LEAK In order to avoid this problem, we must serialize unlocked dio reads with truncate. There are two approaches: - use extent lock to protect the extent that we truncate - use inode_dio_wait() to make sure the truncating task will wait for the read DIO. If we use the 1st one, we will meet the endless truncation problem due to the nonlocked read DIO after we implement the nonlocked write DIO. It is because we still need invoke inode_dio_wait() avoid the race between write DIO and truncation. By that time, we have to introduce btrfs_inode_{block, resume}_nolock_dio() again. That is we have to implement this patch again, so I choose the 2nd way to fix the problem. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:47 -05:00
Miao Xie	0934856d46	Btrfs: fix deadlock due to unsubmitted The deadlock problem happened when running fsstress(a test program in LTP). Steps to reproduce: # mkfs.btrfs -b 100M <partition> # mount <partition> <mnt> # <Path>/fsstress -p 3 -n 10000000 -d <mnt> The reason is: btrfs_direct_IO() \|->do_direct_IO() \|->get_page() \|->get_blocks() \| \|->btrfs_delalloc_resereve_space() \| \|->btrfs_add_ordered_extent() ------- Add a new ordered extent \|->dio_send_cur_page(page0) -------------- We didn't submit bio here \|->get_page() \|->get_blocks() \|->btrfs_delalloc_resereve_space() \|->flush_space() \|->btrfs_start_ordered_extent() \|->wait_event() ---------- Wait the completion of the ordered extent that is mentioned above But because we didn't submit the bio that is mentioned above, the ordered extent can not complete, we would wait for its completion forever. There are two methods which can fix this deadlock problem: 1. submit the bio before we invoke get_blocks() 2. reserve the space before we do dio Though the 1st is the simplest way, we need modify the code of VFS, and it is likely to break contiguous requests, and introduce performance regression for the other filesystems. So we have to choose the 2nd way. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:45 -05:00
Josef Bacik	4a7d0f6854	Btrfs: cleanup orphan reservation if truncate fails I noticed we were getting lots of warnings with xfstest 83 because we have reservations outstanding. This is because we moved the orphan add outside of the truncate, but we don't actually cleanup our reservation if something fails. This fixes the problem and I no longer see warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:44 -05:00
Josef Bacik	5d80366e9b	Btrfs: steal from global reserve if we are cleaning up orphans Sometimes xfstest 83 will fail to remount the scratch device because we've gotten ourselves so full that we cannot cleanup the orphan items. In this case check to see if we're doing the orphan cleanup and if we are allow us to steal our reservation from the global block rsv. With this patch I've not been able to reproduce the failed mount problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:42 -05:00
Miao Xie	8696c53304	Btrfs: fix memory leak of pending_snapshot->inherit The argument "inherit" of btrfs_ioctl_snap_create_transid() was assigned to NULL during we created the snapshots, so we didn't free it though we called kfree() in the caller. But since we are sure the snapshot creation is done after the function - btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't assign the pointer "inherit" to NULL, and just free it in the caller of btrfs_ioctl_snap_create_transid(). In this way, the code can become more readable. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:41 -05:00
Miao Xie	2b8195bb57	Btrfs: fix the race between bio and btrfs_stop_workers open_ctree() need read the metadata to initialize the global information of btrfs. But it may fail after it submit some bio, and then it will jump to the error path. Unfortunately, it doesn't check if there are some bios in flight, and just stop all the worker threads. As a result, when the submitted bios end, they can not find any worker thread which can deal with subsequent work, then oops happen. kernel BUG at fs/btrfs/async-thread.c:605! Fix this problem by invoking invalidate_inode_pages2() before we stop the worker threads. This function will wait until the bio end because it need lock the pages which are going to be invalidated, and if a page is under disk read IO, it must be locked. invalidate_inode_pages2() need wait until end bio handler to unlocked it. Reported-and-Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:40 -05:00
Mark Fasheh	cb95e7bf7b	btrfs: add "no file data" flag to btrfs send ioctl This patch adds the flag, BTRFS_SEND_FLAG_NO_FILE_DATA to the btrfs send ioctl code. When this flag is set, the btrfs send code will never write file data into the stream (thus also avoiding expensive reads of that data in the first place). BTRFS_SEND_C_UPDATE_EXTENT commands will be sent (instead of BTRFS_SEND_C_WRITE) with an offset, length pair indicating the extent in question. This patch does not affect the operation of BTRFS_SEND_C_CLONE commands - they will continue to be sent when a search finds an appropriate extent to clone from. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:39 -05:00
Liu Bo	2f697dc6a6	Btrfs: extend the checksum item as much as possible For write, we also reserve some space for COW blocks during updating the checksum tree, and we calculate the number of blocks by checking if the number of bytes outstanding that are going to need csums needs one more block for csum. When we add these checksum into the checksum tree, we use ordered sums list. Every ordered sum contains csums for each sector, and we'll first try to look up an existing csum item, a) if we don't yet have a proper csum item, then we need to insert one, b) or if we find one but the csum item is not big enough, then we need to extend it. The point is we'll unlock the whole path and then insert or extend. So others can hack in and update the tree. Each insert or extend needs update the tree with COW on, and we may need to insert/extend for many times. That means what we've reserved for updating checksum tree is NOT enough indeed. The case is even more serious with having several write threads at the same time, it can end up eating our reserved space quickly and starting eating globle reserve pool instead. I don't yet come up with a way to calculate the worse case for updating csum, but extending the checksum item as much as possible can be helpful in my test. The idea behind is that it can reduce the times we insert/extend so that it saves us precious reserved space. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:37 -05:00
Eric Sandeen	de78b51a28	btrfs: remove cache only arguments from defrag path The entry point at the defrag ioctl always sets "cache only" to 0; the codepaths haven't run for a long time as far as I can tell. Chris says they're dead code, so remove them. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:36 -05:00
Josef Bacik	e4a2bcaca9	Btrfs: if we aren't committing just end the transaction if we error out I hit a deadlock where transaction commit was waiting on num_writers to be 0. This happened because somebody came into btrfs_commit_transaction and noticed we had aborted and it went to cleanup_transaction. This shouldn't happen because cleanup_transaction is really to fixup a bad commit, it doesn't do the normal trans handle cleanup things. So if we have an error just do the normal btrfs_end_transaction dance and return. Once we are in the actual commit path we can use cleanup_transaction and be good to go. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:35 -05:00
Josef Bacik	3e04e7f10b	Btrfs: handle errors in compression submission path I noticed we would deadlock if we aborted a transaction while doing compressed io. This is because we don't unlock our pages if something goes horribly wrong. To fix this we need to make sure that we call extent_clear_unlock_delalloc in order to unlock all the pages. If we have to cow in the async submission thread we need to make sure to unlock our locked_page as the cow error path will not unlock the locked page as it depends on the caller to unlock that page. With this patch we no longer deadlock on the page lock when we have an aborted transaction. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:33 -05:00
Josef Bacik	70afa3998c	Btrfs: rework the overcommit logic to be based on the total size People have been complaining about random ENOSPC errors that will clear up after a umount or just a given amount of time. Chris was able to reproduce this with stress.sh and lots of processes and so was I. Basically the overcommit stuff would really let us get out of hand, in my tests I saw up to 30 gigs of outstanding reservations with only 2 gigs total of metadata space. This usually worked out fine but with so much outstanding reservation the flushing stuff short circuits to make sure we don't hang forever flushing when we really need ENOSPC. Plus we allocate chunks in order to alleviate the pressure, but this doesn't actually help us since we only use the non-allocated area in our over commit logic. So instead of basing overcommit on the amount of non-allocated space, instead just do it based on how much total space we have, and then limit it to the non-allocated space in case we are short on space to spill over into. This allows us to have the same performance as well as no longer giving random ENOSPC. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:32 -05:00
Josef Bacik	925396ecf2	Btrfs: account for orphan inodes properly during cleanup Dave sent me a panic where we were doing the orphan cleanup and panic'ed trying to release our reservation from the orphan block rsv. The reason for this is because our orphan block rsv had been free'd out from underneath us because the transaction commit found that there were no orphan inodes according to its count and decided to free it. This is incorrect so make sure we inc the orphan inodes count so the accounting is all done properly. This would also cause the warning in the orphan commit code normally if you had any orphans to cleanup as they would only decrement the orphan count so you'd get a negative orphan count which could cause problems during runtime. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:31 -05:00
Josef Bacik	0bec9ef525	Btrfs: unreserve space if our ordered extent fails to work When a transaction aborts or there's an EIO on an ordered extent or any error really we will not free up the space we reserved for this ordered extent. This results in warnings from the block group cache cleanup in the case of a transaction abort, or leaking space in the case of EIO on an ordered extent. Fix this up by free'ing the reserved space if we have an error at all trying to complete an ordered extent. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:29 -05:00
Josef Bacik	779880ef35	Btrfs: fix how we discard outstanding ordered extents on abort When we abort we've been just free'ing up all the ordered extents and hoping for the best. This results in lots of warnings from various places, warnings from btrfs_destroy_inode() because it's ENOSPC accounting isn't fixed. It will also screw up lots of pages who have been set private but never get cleared because the ordered extents are never allowed to be submitted. This patch fixes those warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:28 -05:00
Josef Bacik	eb12db690c	Btrfs: fix freeing delayed ref head while still holding its mutex I hit this error when reproducing a bug that would end in a transaction abort. We take the delayed ref head's mutex to keep anybody from processing it while we're destroying it, but we fail to drop the mutex before we carry on and free the damned thing. Fix this by doing the remove logic for the head ourselves and unlock the mutex, that way we can avoid use after free's or hung tasks waiting on that mutex to come back so they know the delayed ref completed. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:27 -05:00
Eric Sandeen	063d006fa0	btrfs: ensure we don't overrun devices_info[] in __btrfs_alloc_chunk WARN_ON isn't enough, we need to stop the loop if for any reason we would overrun the devices_info array. I tried to track down the connection between the length of the alloc_devices list and the rw_devices counter but it wasn't immediately obvious, so be defensive about it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:26 -05:00
Eric Sandeen	1971e917c8	btrfs: remove unnecessary DEFINE_WAIT() declarations No point in DEFINE_WAIT(wait) if it's not used! Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:24 -05:00
Eric Sandeen	d4c0a7da21	btrfs: remove unused "item" in btrfs_insert_delayed_item() "item" was set but never used in this function. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:23 -05:00
Eric Sandeen	37252a66f3	btrfs: fix varargs in __btrfs_std_error __btrfs_std_error didn't always properly call va_end, and might call va_start even if fmt was NULL. Move all the varargs handling into the block where we have fmt. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:22 -05:00
Eric Sandeen	0e6360274f	btrfs: add missing break in btrfs_print_leaf() I don't think that BTRFS_DEV_EXTENT_KEY is supposed to fall through to BTRFS_DEV_STATS_KEY ... Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:20 -05:00
Eric Sandeen	1c697d4acc	btrfs: annotate intentional switch case fallthroughs This keeps static checkers happy. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:19 -05:00
Eric Sandeen	aa43a17c21	btrfs: handle null fs_info in btrfs_panic() At least backref_tree_panic() can apparently pass in a null fs_info, so handle that in __btrfs_panic to get the message out on the console. The btrfs_panic macro also uses fs_info, but that's largely pointless; it's testing to see if BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is not set. But if it were set, __btrfs_panic() would have, well, paniced and we wouldn't be here, testing it! So just BUG() at this point. And since we only use fs_info once now, just use it directly. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:18 -05:00
Eric Sandeen	5a01604783	btrfs: remove unused fs_info from btrfs_decode_error() Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:17 -05:00
Eric Sandeen	d1d3cd27a3	btrfs: list_entry can't return NULL No need to test the result, we can't get a null pointer from list_entry() Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:15 -05:00
Eric Sandeen	b4c6f7b75c	btrfs: remove unused fd in btrfs_ioctl_send() All we do is set it to NULL and test it :) Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:14 -05:00
Josef Bacik	96f1bb5777	Btrfs: do not overcommit if we don't have enough space for global rsv Because of how little we allocate chunks now we can get really tight on metadata space before we will allocate a new chunk. This resulted in being unable to add device extents when allocating a new metadata chunk as we did not have enough space. This is because we were allowed to overcommit too much metadata without actually making sure we had enough space to make allocations. The idea behind overcommit is that we are allowed to say "sure you can have that reservation" when most of the free space is occupied by reservations, not actual allocations. But in this case where a majority of the total space is in use by actual allocations we can screw ourselves by not being able to make real allocations when it matters. So make sure we have enough real space for our global reserve, and if not then don't allow overcommitting. Thanks, Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:13 -05:00
Josef Bacik	0f5d42b287	Btrfs: remove extent mapping if we fail to add chunk I got a double free error when unmounting a file system that failed to add a chunk during its operation. This is because we will kfree the mapping that we created but leave the extent_map in the em_tree for chunks. So to fix this just remove the extent_map when we error out so we don't run into this problem. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:11 -05:00
Josef Bacik	0448748849	Btrfs: fix chunk allocation error handling If we error out allocating a dev extent we will have already created the block group and such which will cause problems since the allocator may have tried to allocate out of the block group that no longer exists. This will cause BUG_ON()'s in the bio submission path. This also makes a failure to allocate a dev extent a non-abort error, we will just clean up the dev extents we did allocate and exit. Now if we fail to delete the dev extents we will abort since we can't have half of the dev extents hanging around, but this will make us much less likely to abort. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:10 -05:00
Miao Xie	87533c4751	Btrfs: use bit operation for ->fs_state There is no lock to protect fs_info->fs_state, it will introduce some problems, such as the value may be covered by the other task when several tasks modify it. For example: Task0 - CPU0 Task1 - CPU1 mov %fs_state rax or $0x1 rax mov %fs_state rax or $0x2 rax mov rax %fs_state mov rax %fs_state The expected value is 3, but in fact, it is 2. Though this problem doesn't happen now (because there is only one flag currently), the code is error prone, if we add other flags, the above problem will happen to a certainty. Now we use bit operation for it to fix the above problem. In this way, we can make the code more robust and be easy to add new flags. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:09 -05:00
Miao Xie	de98ced9e7	Btrfs: use seqlock to protect fs_info->avail_{data, metadata, system}_alloc_bits There is no lock to protect fs_info->avail_{data, metadata, system}_alloc_bits, it may introduce some problem, such as the wrong profile information, so we add a seqlock to protect them. Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:08 -05:00
Miao Xie	df0af1a57f	Btrfs: use the inode own lock to protect its delalloc_bytes We need not use a global lock to protect the delalloc_bytes of the inode, just use its own lock. In this way, we can reduce the lock contention and ->delalloc_lock will just protect delalloc inode list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:06 -05:00
Miao Xie	963d678b0f	Btrfs: use percpu counter for fs_info->delalloc_bytes fs_info->delalloc_bytes is accessed very frequently, so use percpu counter instead of the u64 variant for it to reduce the lock contention. This patch also fixed the problem that we access the variant without the lock protection.At worst, we would not flush the delalloc inodes, and just return ENOSPC error when we still have some free space in the fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:05 -05:00
Miao Xie	e2d845211e	Btrfs: use percpu counter for dirty metadata count ->dirty_metadata_bytes is accessed very frequently, so use percpu counter instead of the u64 variant to reduce the contention of the lock. This patch also fixed the problem that we access it without lock protection in __btrfs_btree_balance_dirty(), which may cause we skip the dirty pages flush. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:04 -05:00
Miao Xie	c018daecea	Btrfs: protect fs_info->alloc_start fs_info->alloc_start is a 64bits variant, can be accessed by multi-task, but it is not protected strictly, it can be changed while we are accessing it. On 32bit machine, we will get wrong value because we access it by two instructions.(In fact, it is also possible that the same problem happens on the 64bit machine, because the compiler may split the 64bit operation into two 32bit operation.) For example: Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning, then we remount and set ->alloc_start to 0x0000 0100 0000 0000. Task0 Task1 load high 32 bits set high 32 bits set low 32 bits load low 32 bits Task1 will get 0. This patch fixes this problem by using two locks to protect it fs_info->chunk_mutex sb->s_umount On the read side, we just need get one of these two locks, and on the write side, we must lock all of them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:02 -05:00
Miao Xie	8c6a3ee6db	Btrfs: add a comment for fs_info->max_inline Though ->max_inline is a 64bit variant, and may be accessed by multi-task, but it is just suggestive number, so we needn't add anything to protect fs_info->max_inline, just add a comment to explain wny we don't use a lock to protect it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 12:59:01 -05:00
Filipe Brandenburger	55e301fd57	Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h The header file will then be installed under /usr/include/linux so that userspace applications can refer to Btrfs ioctls by name and use the same structs used internally in the kernel. Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:28 -05:00
Kusanagi Kouichi	82b22ac8f6	Btrfs: Check CAP_DAC_READ_SEARCH for BTRFS_IOC_INO_PATHS CAP_DAC_READ_SEARCH overrides read and search permission check on file and directory. It seems fit for BTRFS_IOC_INO_PATHS. Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:27 -05:00
Josef Bacik	fe5fafbebd	Revert "Btrfs: fix permissions of empty files not affected by umask" This reverts commit `2794ed013b`. Wasn't supposed to get used in btrfs_mknod, it was supposed to be in btrfs_create, which was done in commit `9185aa587b`. Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:26 -05:00
Miao Xie	5b947f1ba9	Btrfs: don't traverse the ordered operation list repeatedly btrfs_run_ordered_operations() needn't traverse the ordered operation list repeatedly, it is because the transaction commiter will invoke it again when there is no other writer in this transaction, it can ensure that no one can add new objects into the ordered operation list. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:24 -05:00
Miao Xie	63607cc86a	Btrfs: traverse and flush the delalloc inodes once btrfs_start_delalloc_inodes() needn't traverse and flush the delalloc inodes repeatedly. It is because we can regard the data that the users write after we start delalloc inodes flush as the one which is after the delalloc inodes flush is done, and we can flush it next time. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:23 -05:00
Miao Xie	eebc608406	Btrfs: check the return value of btrfs_run_ordered_operations() We forget to check the return value of btrfs_run_ordered_operations() when flushing all the pending stuffs, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:22 -05:00
Miao Xie	3edb2a68cb	Btrfs: check the return value of btrfs_start_delalloc_inodes() We forget to check the return value of btrfs_start_delalloc_inodes(), fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:21 -05:00
Miao Xie	e6ec716f0d	Btrfs: make raid attr array more readable The current code of raid attr arry is hard to understand and it is easy to introduce some problem if we modify the array. So I changed it and made it more readable. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:19 -05:00
Liu Bo	a1897fddd2	Btrfs: record first logical byte in memory This'd save us a rbtree search which may become expensive in large filesystem. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:18 -05:00
Liu Bo	39f9d028c9	Btrfs: save us a read_lock This does not change the logic of code, but can save us a read_lock. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:17 -05:00
Liu Bo	51fab69347	Btrfs: use token to avoid times mapping extent buffer The API in tree log code has done sort of changes, and it proves that we can benifit from using token, so do the same thing here. function_graph tracer's timer shows that it costs nearly half time of before(39.788us -> 22.391us). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:15 -05:00
Liu Bo	dcfac4156f	Btrfs: kill unused argument of btrfs_pin_extent_for_log_replay Argument 'trans' is not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:14 -05:00
Liu Bo	c53d613e52	Btrfs: kill unused argument of update_block_group Argument 'trans' is not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:13 -05:00
Liu Bo	f6373bf3dc	Btrfs: kill unused arguments of cache_block_group Argument 'trans' and 'root' are not used any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:11 -05:00
Liu Bo	17b85495cf	Btrfs: remove deprecated comments commit `d53ba47484` (Btrfs: use commit root when loading free space cache) has remove the deadlock check, and the related comments can be removed as well. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:10 -05:00
Josef Bacik	c6b305a89b	Btrfs: don't re-enter when allocating a chunk If we start running low on metadata space we will try to allocate a chunk, which could then try to allocate a chunk to add the device entry. The thing is we allocate a chunk before we try really hard to make the allocation, so we should be able to find space for the device entry. Add a flag to the trans handle so we know we're currently allocating a chunk so we can just bail out if we try to allocate another chunk. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:09 -05:00
Josef Bacik	2ab28f322f	Btrfs: wait on ordered extents at the last possible moment Since we don't actually copy the extent information from the source tree in the fast case we don't need to wait for ordered io to be completed in order to fsync, we just need to wait for the io to be completed. So when we're logging our file just attach all of the ordered extents to the log, and then when the log syncs just wait for IO_DONE on the ordered extents and then write the super. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:37:04 -05:00
Miao Xie	dfd79829b7	Btrfs: fix trivial error in btrfs_ioctl_resize() This patch fixes the following problem: - improper return value - unnecessary read-only check Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:44 -05:00
Miao Xie	4eee4fa4f8	Btrfs: use wrapper page_offset Use wrapper page_offset to get byte-offset into filesystem object for page. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:43 -05:00
Miao Xie	da633a4217	Btrfs: flush all dirty inodes if writeback can not start We may try to flush some dirty pages when there is no enough space to reserve. But it is possible that this operation fails, in order to get enough space to reserve successfully, we will sync all the delalloc file. This operation is safe, we needn't worry about the case that the filesystem goes from r/w to r/o. because the filesystem should guarantee all the dirty pages have been written into the disk after it becomes readonly, so the sync operation will do nothing if the filesystem is already readonly. Though it may waste lots of time, as a corner case, we needn't care. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:42 -05:00
Miao Xie	093486c453	Btrfs: make delayed ref lock logic more readable Locking and unlocking delayed ref mutex are in the different functions, and the name of lock functions is not uniform, so the readability is not so good, this patch optimizes the lock logic and makes it more readable. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:41 -05:00
Miao Xie	0e8c36a9fd	Btrfs: fix lots of orphan inodes when the space is not enough We're running into having 50-100 orphans left over with xfstests 83 because of ENOSPC when trying to start the transaction for the inode update. But in fact, it makes no sense in updating the inode for the new size while we're deleting the stupid thing. This patch fixes this problem. Reported-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:39 -05:00
Miao Xie	4ea41ce07d	Btrfs: cleanup similar code in delayed inode The delayed item commit code in several functions is similar, so cleanup it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-20 09:36:38 -05:00
Miao Xie	7892b5afe4	Btrfs: use common work instead of delayed work Since we do not want to delay the async transaction commit, we should use common work, not delayed work. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-02-20 09:36:37 -05:00
Miao Xie	7b5a1c5310	Btrfs: cleanup unnecessary clear when freeing a transaction or a trans handle We clear the transaction object and the trans handle when they are about to be freed, it is unnecessary, cleanup it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-02-20 09:36:35 -05:00
Miao Xie	78a6184a3f	Btrfs: use slabs for delayed reference allocation The delayed reference allocation is in the fast path of the IO, so use slabs to improve the speed of the allocation. And besides that, it can do check for leaked objects when the module is removed. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-02-20 09:36:34 -05:00
David Sterba	6f60cbd3ae	btrfs: access superblock via pagecache in scan_one_device btrfs_scan_one_device is calling set_blocksize() which can race with a concurrent process making dirty page cache pages. It can end up dropping dirty page cache pages on the floor, which isn't very nice when someone is just running btrfs dev scan to find filesystems on the box. Now that udev is registering btrfs devices as it discovers them, we can actually end up racing with our own mkfs program too. When this happens, we drop some of the important blocks written by mkfs. This commit changes scan_one_device to read the super out of the page cache instead of trying to use bread. This way we don't have to care about the blocksize of the device. This also drops the invalidate_bdev() call. It wasn't very polite to invalidate during the scan either. mkfs is putting the super into the page cache, there's no reason to invalidate at this point. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-15 16:57:47 -05:00
Arne Jansen	2a745b14bc	Btrfs: fix crash in log replay with qgroups enabled When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a sanity-check of the number of items against the maximum possible number. It calculates that number with the nodesize of fs_root. Unfortunately fs_root is not yet set at this stage. So instead use the nodesize from tree_root, which is already initialized. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-14 20:47:41 -05:00
Linus Torvalds	8d19514fad	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "We've got corner cases for updating i_size that ceph was hitting, error handling for quotas when we run out of space, a very subtle snapshot deletion race, a crash while removing devices, and one deadlock between subvolume creation and the sb_internal code (thanks lockdep)." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: move d_instantiate outside the transaction during mksubvol Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata Btrfs: fix possible stale data exposure Btrfs: fix missing i_size update Btrfs: fix race between snapshot deletion and getting inode Btrfs: fix missing release of the space/qgroup reservation in start_transaction() Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() Btrfs: do not merge logged extents if we've removed them from the tree btrfs: don't try to notify udev about missing devices	2013-02-08 12:06:46 +11:00
Chris Mason	1a65e24b0b	Btrfs: move d_instantiate outside the transaction during mksubvol Dave Sterba triggered a lockdep complaint about lock ordering between the sb_internal lock and the cleaner semaphore. btrfs_lookup_dentry() checks for orphans if we're looking up the inode for a subvolume, and subvolume creation is triggering the lookup with a transaction running. This commit moves the d_instantiate after the transaction closes. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-06 12:11:10 -05:00
Jan Schmidt	eb6b88d92c	Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata When btrfs_qgroup_reserve returned a failure, we were missing a counter operation for BTRFS_I(inode)->outstanding_extents++, leading to warning messages about outstanding extents and space_info->bytes_may_use != 0. Additionally, the error handling code didn't take into account that we dropped the inode lock which might require more cleanup. Luckily, all the cleanup code we need is already there and can be shared with reserve_metadata_bytes, which is exactly what this patch does. Reported-by: Lev Vainblat <lev@zadarastorage.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-06 09:24:40 -05:00
Chris Mason	24f8ebe918	Merge git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git for-chris into for-linus	2013-02-05 19:24:44 -05:00
Josef Bacik	59fe4f4197	Btrfs: fix possible stale data exposure We specifically do not update the disk i_size if there are ordered extents outstanding for any area between the current disk_i_size and our ordered extent so that we do not expose stale data. The problem is the check we have only checks if the ordered extent starts at or after the current disk_i_size, which doesn't take into account an ordered extent that starts before the current disk_i_size and ends past the disk_i_size. Fix this by checking if the extent ends past the disk_i_size. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:16 -05:00
Josef Bacik	5d1f40202b	Btrfs: fix missing i_size update If we have an ordered extent before the ordered extent we are currently completing that is after the current disk_i_size we will put our i_size update into that ordered extent so that we do not expose stale data. The problem is that if our disk i_size is updated past the previous ordered extent we won't update the i_size with the pending i_size update. So check the pending i_size update and if its above the current disk i_size we need to go ahead and try to update. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:14 -05:00
Liu Bo	6f1c36055f	Btrfs: fix race between snapshot deletion and getting inode While running snapshot testscript created by Mitch and David, the race between autodefrag and snapshot deletion can lead to corruption of dead_root list so that we can get crash on btrfs_clean_old_snapshots(). And besides autodefrag, scrub also does the same thing, ie. read root first and get inode. Here is the story(take autodefrag as an example): (1) when we delete a snapshot or subvolume, it will set its root's refs to zero and do a iput() on its own inode, and if this inode happens to be the only active in-meory one in root's inode rbtree, it will add itself to the global dead_roots list for later cleanup. (2) after (1), the autodefrag thread may read another inode for defrag and the inode is just in the deleted snapshot/subvolume, but all of these are without checking if the root is still valid(refs > 0). So the end up result is adding the deleted snapshot/subvolume's root to the global dead_roots list AGAIN. Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu. So all we need to do is to take the lock to protect 'read root and get inode', since we synchronize to wait for the rcu grace period before adding something to the global dead_roots list. Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:13 -05:00
Miao Xie	843fcf3573	Btrfs: fix missing release of the space/qgroup reservation in start_transaction() When we fail to start a transaction, we need to release the reserved free space and qgroup space, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:11 -05:00
Miao Xie	0a3404dcff	Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write() If the checks at the beginning of btrfs_file_aio_write() fail, we needn't decrease ->sync_writers, because we have not increased it. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:10 -05:00
Josef Bacik	222c81dc38	Btrfs: do not merge logged extents if we've removed them from the tree You can run into this problem where if somebody is fsyncing and writing out the existing extents you will have removed the extent map from the em tree, but it's still valid for the current fsync so we go ahead and write it. The problem is we unconditionally try to merge it back into the em tree, but if we've removed it from the em tree that will cause use after free problems. Fix this to only merge if we are still a part of the tree. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-02-05 16:09:03 -05:00
Chris Mason	0e4e026366	Merge branch 'for-linus' into raid56-experimental Conflicts: fs/btrfs/volumes.c Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 10:04:03 -05:00
Chris Mason	1f0905ec15	Btrfs: remove conflicting check for minimum number of devices in raid56 The device removal code was incorrectly checking against two different limits for raid5 and raid6. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 10:01:42 -05:00
Tomasz Torcz	10e78e3a8a	Btrfs: select XOR_BLOCKS in Kconfig The Btrfs raid56 uses the generic xor helpers. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-05 09:55:30 -05:00
Chris Mason	bb721703aa	Btrfs: reduce CPU contention while waiting for delayed extent operations We batch up operations to the extent allocation tree, which allows us to deal with the recursive nature of using the extent allocation tree to allocate extents to the extent allocation tree. It also provides a mechanism to sort and collect extent operations, which makes it much more efficient to record extents that are close together. The delayed extent operations must all be finished before the running transaction commits, so we have code to make sure and run a few of the batched operations when closing our transaction handles. This creates a great deal of contention for the locks in the delayed extent operation tree, and also contention for the lock on the extent allocation tree itself. All the extra contention just slows down the operations and doesn't get things done any faster. This commit changes things to use a wait queue instead. As procs want to run the delayed operations, one of them races in and gets permission to hit the tree, and the others step back and wait for progress to be made. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:25 -05:00
Chris Mason	242e18c7c1	Btrfs: reduce lock contention on extent buffer locks The extent buffers have a refs_lock which we use to make coordinate freeing the extent buffer with operations on the radix tree. On tree roots and other extent buffers that very cache hot, this can be highly contended. These are also the extent buffers that are basically pinned in memory. This commit adds code to cmpxchg our way through the ref modifications, and as long as the result of the reference change is still pinned in ram, we skip the expensive spinlock. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:25 -05:00
Chris Mason	8de972b4fa	Btrfs: fix cluster alignment for mount -o ssd With the new raid56 code, we want to make sure we're properly aligning our allocation clusters with -o ssd Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:24 -05:00
Chris Mason	6ac0f4884e	Btrfs: add a plugging callback to raid56 writes Buffered writes and DIRECT_IO writes will often break up big contiguous changes to the file into sub-stripe writes. This adds a plugging callback to gather those smaller writes full stripe writes. Example on flash: fio job to do 64K writes in batches of 3 (which makes a full stripe): With plugging: 450MB/s Without plugging: 220MB/s Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:24 -05:00
Chris Mason	4ae10b3a13	Btrfs: Add a stripe cache to raid56 The stripe cache allows us to avoid extra read/modify/write cycles by caching the pages we read off the disk. Pages are cached when: * They are read in during a read/modify/write cycle * They are written during a read/modify/write cycle * They are involved in a parity rebuild Pages are not cached if we're doing a full stripe write. We're assuming that a full stripe write won't be followed by another partial stripe write any time soon. This provides a substantial boost in performance for workloads that synchronously modify adjacent offsets in the file, and for the parity rebuild use case in general. The size of the stripe cache isn't tunable (yet) and is set at 1024 entries. Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct Without the stripe cache -- 2.1MB/s With the stripe cache 21MB/s Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:23 -05:00
David Woodhouse	53b381b3ab	Btrfs: RAID5 and RAID6 This builds on David Woodhouse's original Btrfs raid5/6 implementation. The code has changed quite a bit, blame Chris Mason for any bugs. Read/modify/write is done after the higher levels of the filesystem have prepared a given bio. This means the higher layers are not responsible for building full stripes, and they don't need to query for the topology of the extents that may get allocated during delayed allocation runs. It also means different files can easily share the same stripe. But, it does expose us to incorrect parity if we crash or lose power while doing a read/modify/write cycle. This will be addressed in a later commit. Scrub is unable to repair crc errors on raid5/6 chunks. Discard does not work on raid5/6 (yet) The stripe size is fixed at 64KiB per disk. This will be tunable in a later commit. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 14:24:23 -05:00
David Woodhouse	64a167011b	Btrfs: add rw argument to merge_bio_hook() We'll want to merge writes so they can fill a full RAID[56] stripe, but not necessarily reads. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 11:49:47 -05:00
Eric Sandeen	3c91160808	btrfs: don't try to notify udev about missing devices If we remove a missing device, bdev is null, and if we send that off to btrfs_kobject_uevent we'll panic. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-02-01 11:47:37 -05:00
Jiri Kosina	617677295b	Merge branch 'master' into for-next Conflicts: drivers/devfreq/exynos4_bus.c Sync with Linus' tree to be able to apply patches that are against newer code (mvneta).	2013-01-29 10:48:30 +01:00
Greg Kroah-Hartman	422d26b6ec	Merge 3.8-rc5 into driver-core-next This resolves a gpio driver merge issue pointed out in linux-next. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-25 21:06:30 -08:00
Linus Torvalds	d7df025eb4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "It turns out that we had two crc bugs when running fsx-linux in a loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it all down. Miao also has a new OOM fix in this v2 pull as well. Ilya fixed a regression Liu Bo found in the balance ioctls for pausing and resuming a running balance across drives. Josef's orphan truncate patch fixes an obscure corruption we'd see during xfstests. Arne's patches address problems with subvolume quotas. If the user destroys quota groups incorrectly the FS will refuse to mount. The rest are smaller fixes and plugs for memory leaks." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits) Btrfs: fix repeated delalloc work allocation Btrfs: fix wrong max device number for single profile Btrfs: fix missed transaction->aborted check Btrfs: Add ACCESS_ONCE() to transaction->abort accesses Btrfs: put csums on the right ordered extent Btrfs: use right range to find checksum for compressed extents Btrfs: fix panic when recovering tree log Btrfs: do not allow logged extents to be merged or removed Btrfs: fix a regression in balance usage filter Btrfs: prevent qgroup destroy when there are still relations Btrfs: ignore orphan qgroup relations Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag Btrfs: fix unlock order in btrfs_ioctl_rm_dev Btrfs: fix unlock order in btrfs_ioctl_resize Btrfs: fix "mutually exclusive op is running" error code Btrfs: bring back balance pause/resume logic btrfs: update timestamps on truncate() btrfs: fix btrfs_cont_expand() freeing IS_ERR em Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents Btrfs: fix off-by-one in lseek ...	2013-01-25 10:55:21 -08:00
Miao Xie	1eafa6c737	Btrfs: fix repeated delalloc work allocation btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/ btrfs_queue_worker for this inode, and then it locks the list, checks the head of the list again. But because we don't delete the first inode that it deals with before, it will fetch the same inode. As a result, this function allocates a huge amount of btrfs_delalloc_work structures, and OOM happens. Fix this problem by splice this delalloc list. Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:27 -05:00
Miao Xie	c9f01bfe0c	Btrfs: fix wrong max device number for single profile The max device number of single profile is 1, not 0 (0 means 'as many as possible'). Fix it. Cc: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:26 -05:00
Miao Xie	2cba30f172	Btrfs: fix missed transaction->aborted check First, though the current transaction->aborted check can stop the commit early and avoid unnecessary operations, it is too early, and some transaction handles don't end, those handles may set transaction->aborted after the check. Second, when we commit the transaction, we will wake up some worker threads to flush the space cache and inode cache. Those threads also allocate some transaction handles and may set transaction->aborted if some serious error happens. So we need more check for ->aborted when committing the transaction. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:25 -05:00
Miao Xie	8d25a086eb	Btrfs: Add ACCESS_ONCE() to transaction->abort accesses We may access and update transaction->aborted on the different CPUs without lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating unsolicited accesses and make sure we can get the right value. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:23 -05:00
Josef Bacik	e58dd74bcc	Btrfs: put csums on the right ordered extent I noticed a WARN_ON going off when adding csums because we were going over the amount of csum bytes that should have been allowed for an ordered extent. This is a leftover from when we used to hold the csums privately for direct io, but now we use the normal ordered sum stuff so we need to make sure and check if we've moved on to another extent so that the csums are added to the right extent. Without this we could end up with csums for bytenrs that don't have extents to cover them yet. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:22 -05:00
Liu Bo	192000dda2	Btrfs: use right range to find checksum for compressed extents For compressed extents, the range of checksum is covered by disk length, and the disk length is different with ram length, so we need to use disk length instead to get us the right checksum. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:51:17 -05:00
Josef Bacik	b0175117b9	Btrfs: fix panic when recovering tree log A user reported a BUG_ON(ret) that occured during tree log replay. Ret was -EAGAIN, so what I think happened is that we removed an extent that covered a bitmap entry and an extent entry. We remove the part from the bitmap and return -EAGAIN and then search for the next piece we want to remove, which happens to be an entire extent entry, so we just free the sucker and return. The problem is ret is still set to -EAGAIN so we trip the BUG_ON(). The user used btrfs-zero-log so I'm not 100% sure this is what happened so I've added a WARN_ON() to catch the other possibility. Thanks, Reported-by: Jan Steffens <jan.steffens@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:49:49 -05:00
Josef Bacik	201a903894	Btrfs: do not allow logged extents to be merged or removed We drop the extent map tree lock while we're logging extents, so somebody could come in and merge another extent into this one and screw up our logging, or they could even remove us from the list which would keep us from logging the extent or freeing our ref on it, so we need to make sure to not clear LOGGING until after the extent is logged, and then we can merge it to adjacent extents. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-24 12:49:48 -05:00
Ilya Dryomov	a105bb88f4	Btrfs: fix a regression in balance usage filter Commit `3fed40cc` ("Btrfs: cleanup duplicated division functions"), which was merged into 3.8-rc1, has introduced a regression by removing logic that was guarding us against bad user input. Bring it back. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-01-21 20:40:27 -05:00
Chris Mason	83bfccb5c0	Merge branch 'mutex-ops@next-for-chris' of git://github.com/idryomov/btrfs-unstable into linus	2013-01-21 20:39:06 -05:00
Chris Mason	daf2c08911	Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into linus	2013-01-21 20:26:55 -05:00
Arne Jansen	2cf6870396	Btrfs: prevent qgroup destroy when there are still relations Currently you can just destroy a qgroup even though it is in use by other qgroups or has qgroups assigned to it. This patch prevents destruction of qgroups unless they are completely unused. Otherwise destroy will return EBUSY. Reported-by: Eric Hopper <hopper@omnifarious.org> Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-01-21 20:18:11 -05:00
Arne Jansen	ff24858c65	Btrfs: ignore orphan qgroup relations If a qgroup that has still assignments is deleted by the user, the corresponding relations are left in the tree. This leads to an unmountable filesystem. With this patch, those relations are simple ignored. Reported-by: Eric Hopper <hopper@omnifarious.org> Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2013-01-21 20:18:11 -05:00
Kees Cook	38db331b57	fs/btrfs: remove depends on CONFIG_EXPERIMENTAL The CONFIG_EXPERIMENTAL config item has not carried much meaning for a while now and is almost always enabled by default. As agreed during the Linux kernel summit, remove it from any "depends on" lines in Kconfigs. Cc: Chris Mason <chris.mason@fusionio.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2013-01-21 14:39:05 -08:00
Ilya Dryomov	25122d15e2	Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag Operation-specific check (whether subvol is readonly or not) should go after the mutual exclusiveness check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:22 +02:00
Ilya Dryomov	4ac20c70b0	Btrfs: fix unlock order in btrfs_ioctl_rm_dev Fix unlock order in btrfs_ioctl_rm_dev(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:21 +02:00
Ilya Dryomov	18f39c416d	Btrfs: fix unlock order in btrfs_ioctl_resize Fix unlock order in btrfs_ioctl_resize(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:20 +02:00
Ilya Dryomov	2c0c9da02a	Btrfs: fix "mutually exclusive op is running" error code The error code that is returned in response to starting a mutually exclusive operation when there is one already running got silently changed from EINVAL to EINPROGRESS by `5ac00add`. Returning EINPROGRESS to, say, add_dev, when rm_dev is running is misleading. Furthermore, the operation itself may want to use EINPROGRESS for other purposes. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:18 +02:00
Ilya Dryomov	ed0fb78fb6	Btrfs: bring back balance pause/resume logic Balance pause/resume logic got broken by `5ac00add` (went in into 3.8-rc1 as part of dev-replace merge). Offending commit took a stab at making mutually exclusive volume operations (add_dev, rm_dev, resize, balance, replace_dev) not block behind volume_mutex if another such operation is in progress and instead return an error right away. Balancing front-end relied on the blocking behaviour, so the fix is ugly, but short of a complete rework, it's the best we can do. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2013-01-20 16:21:17 +02:00
Eric Sandeen	3972f2603d	btrfs: update timestamps on truncate() truncate() vs. ftruncate() differ in the VFS; truncate() doesn't set (ATTR_CTIME \| ATTR_MTIME), and it's up to the fs to do the timestamp updates if the size changes. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com>	2013-01-14 13:53:37 -05:00
Zach Brown	f276795627	btrfs: fix btrfs_cont_expand() freeing IS_ERR em btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from btrfs_get_extent() and breaks out of its loop. An instance of -EEXIST was reported in the wild: https://bugzilla.redhat.com/show_bug.cgi?id=874407 I have no idea if that -EEXIST is surprising, or not. Regardless, this error handling should be cleaned up to handle other reasonable errors (ENOMEM, EIO; whatever). This seemed to be the only buggy freeing of the relatively rare IS_ERR em so I opted to fix the caller rather than teach free_extent_map() to use IS_ERR_OR_NULL(). Signed-off-by: Zach Brown <zab@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com>	2013-01-14 13:53:23 -05:00
Liu Bo	f9e4fb5393	Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents xfstests case 285 complains. It it because btrfs did not try to find unwritten delalloc bytes(only dirty pages, not yet writeback) behind prealloc extents, it ends up finding nothing while we're with SEEK_DATA. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:53:22 -05:00
Liu Bo	1214b53f90	Btrfs: fix off-by-one in lseek Lock end is inclusive. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:53:22 -05:00
Liu Bo	3268a2468e	Btrfs: reset path lock state to zero We forgot to reset the path lock state to zero after we unlock the path block, and this can lead to the ASSERT checker in tree unlock API. Reported-by: Slava Barinov <rayslava@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:53 -05:00
Liu Bo	ac5c93005b	Btrfs: let allocation start from the right raid type This'd avoid us empty looping. Say we have only one disk and the metadata raid type will be defaultly DUP, and we do not need to start from index=0(RAID10) and get over two empty loops to index=2(DUP). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:52 -05:00
Josef Bacik	f3fe820c20	Btrfs: add orphan before truncating pagecache Running xfstests 83 in a loop would sometimes fail the fsck. This happens because if we invalidate a page that already has an ordered extent setup for it we will complete the ordered extent ourselves, assuming that the truncate will clean everything up. The problem with this is there is plenty of time for the truncate to fail after we've done this work. So to fix this we need to add the orphan item first to make sure the cleanup gets done properly, and then we can truncate the pagecache and all that stuff and be safe. This fixes the btrfsck failures I was seeing while running 83 in a loop. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:52 -05:00
Josef Bacik	72bcd99d45	Btrfs: set flushing if we're limited flushing We still need to say we're flushing if we're limit flushing to keep somebody from coming in and stealing our reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:51 -05:00
Miao Xie	9754767657	Btrfs: fix missing write access release in btrfs_ioctl_resize() We forget to give up the write access after we find some device operation is going on. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:51 -05:00
Miao Xie	dba60f3f5d	Btrfs: fix resize a readonly device We should not resize a readonly device, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2013-01-14 13:52:49 -05:00
Miao Xie	5c39da5b6c	Btrfs: do not delete a subvolume which is in a R/O subvolume Step to reproduce: # mkfs.btrfs <disk> # mount <disk> <mnt> # btrfs sub create <mnt>/subv0 # btrfs sub snap <mnt> <mnt>/subv0/snap0 # change <mnt>/subv0 from R/W to R/O # btrfs sub del <mnt>/subv0/snap0 We deleted the snapshot successfully. I think we should not be able to delete the snapshot since the parent subvolume is R/O. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-01-14 13:52:32 -05:00
Miao Xie	d86e56cf7d	Btrfs: disable qgroup id 0 Qgroup id 0 is a special number, we should set the id of a qgroup to 0. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2013-01-14 13:52:31 -05:00
Lukas Czerner	cc975eb460	btrfs: get the device in write mode when deleting it When we're deleting the device we should get it in write mode since we're going to re-write the super block magic on that device. And it should fail if the device is read-only. Signed-off-by: Lukas Czerner <lczerner@redhat.com>	2013-01-14 13:52:31 -05:00
Tsutomu Itoh	cfa7a9ccda	Btrfs: fix memory leak in name_cache_insert() We should free name_cache_entry before returning from the error handling code. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2013-01-14 13:52:30 -05:00
Miao Xie	10ee27a06c	vfs: re-implement writeback_inodes_sb(_nr)_if_idle() and rename them writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read() with down_read_trylock() because - If ->s_umount is write locked, then the sb is not idle. That is writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock. - writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start writeback, it may bring us deadlock problem when doing umount. In order to fix the problem, ext4 and btrfs implemented their own writeback functions instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the redundant code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle(). The name of these two functions is cumbersome, so rename them to try_to_writeback_inodes_sb(_nr). This idea came from Christoph Hellwig. Some code is from the patch of Kamal Mostafa. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2013-01-12 10:47:43 +08:00
Wang Sheng-Hui	210b907acf	btrfs: remove unnecessary cur_trans set before goto loop in join_transaction In the big loop, cur_trans will be set fs_info->running_transaction before it's used. And after kmem_cache_free it and goto loop, it will be setup again. No need to setup it immediately after freed. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-01-09 11:48:33 +01:00
Liu Bo	2c016dc2cb	btrfs: fix comment typos Convert 'hepler' to 'helper'. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2013-01-09 11:40:52 +01:00
Linus Torvalds	1f0377ff08	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS update from Al Viro: "fscache fixes, ESTALE patchset, vmtruncate removal series, assorted misc stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits) vfs: make lremovexattr retry once on ESTALE error vfs: make removexattr retry once on ESTALE vfs: make llistxattr retry once on ESTALE error vfs: make listxattr retry once on ESTALE error vfs: make lgetxattr retry once on ESTALE vfs: make getxattr retry once on an ESTALE error vfs: allow lsetxattr() to retry once on ESTALE errors vfs: allow setxattr to retry once on ESTALE errors vfs: allow utimensat() calls to retry once on an ESTALE error vfs: fix user_statfs to retry once on ESTALE errors vfs: make fchownat retry once on ESTALE errors vfs: make fchmodat retry once on ESTALE errors vfs: have chroot retry once on ESTALE error vfs: have chdir retry lookup and call once on ESTALE error vfs: have faccessat retry once on an ESTALE error vfs: have do_sys_truncate retry once on an ESTALE error vfs: fix renameat to retry on ESTALE errors vfs: make do_unlinkat retry once on ESTALE errors vfs: make do_rmdir retry once on ESTALE errors vfs: add a flags argument to user_path_parent ...	2012-12-20 18:14:31 -08:00
Linus Torvalds	1ca22254b3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull two btrfs reverts from Chris Mason: "I had missed that for two of the patches in my last pull, we had included different fixes during 3.7." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Revert "Btrfs: reorder tree mod log operations in deleting a pointer" Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"	2012-12-20 13:57:09 -08:00
Jeff Layton	39e3c9553f	vfs: remove DCACHE_NEED_LOOKUP The code that relied on that flag was ripped out of btrfs quite some time ago, and never added back. Josef indicated that he was going to take a different approach to the problem in btrfs, and that we could just eliminate this flag. Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-12-20 13:57:36 -05:00
Chris Mason	57ba86c00f	Revert "Btrfs: reorder tree mod log operations in deleting a pointer" This reverts commit `6a7a665d78`. This was bug was fixed differently in 3.6, so this commit isn't needed. Conflicts: fs/btrfs/ctree.c Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-18 19:35:32 -05:00
Chris Mason	4c3e696981	Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems" This reverts commit `95c80bb1f6`. The bug addressed by this commit was fixed differently back in 3.6 Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-18 15:43:18 -05:00
Linus Torvalds	a22180d266	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "A big set of fixes and features. In terms of line count, most of the code comes from Stefan, who added the ability to replace a single drive in place. This is different from how btrfs normally replaces drives, and is much much much faster. Josef is plowing through our synchronous write performance. This pull request does not include the DIO_OWN_WAITING patch that was discussed on the list, but it has a number of other improvements to cut down our latencies and CPU time during fsync/O_DIRECT writes. Miao Xie has a big series of fixes and is spreading out ordered operations over more CPUs. This improves performance and reduces contention. I've put in fixes for error handling around hash collisions. These are going back to individual stable kernels as I test against them. Otherwise we have a lot of fixes and cleanups, thanks everyone! raid5/6 is being rebased against the device replacement code. I'll have it posted this Friday along with a nice series of benchmarks." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits) Btrfs: fix a bug of per-file nocow Btrfs: fix hash overflow handling Btrfs: don't take inode delalloc mutex if we're a free space inode Btrfs: fix autodefrag and umount lockup Btrfs: fix permissions of empty files not affected by umask Btrfs: put raid properties into global table Btrfs: fix BUG() in scrub when first superblock reading gives EIO Btrfs: do not call file_update_time in aio_write Btrfs: only unlock and relock if we have to Btrfs: use tokens where we can in the tree log Btrfs: optimize leaf_space_used Btrfs: don't memset new tokens Btrfs: only clear dirty on the buffer if it is marked as dirty Btrfs: move checks in set_page_dirty under DEBUG Btrfs: log changed inodes based on the extent map tree Btrfs: add path->really_keep_locks Btrfs: do not mark ems as prealloc if we are writing to them Btrfs: keep track of the extents original block length Btrfs: inline csums if we're fsyncing Btrfs: don't bother copying if we're only logging the inode ...	2012-12-18 09:42:05 -08:00
Andrew Morton	965c8e59cf	lseek: the "whence" argument is called "whence" But the kernel decided to call it "origin" instead. Fix most of the sites. Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-17 17:15:12 -08:00
Liu Bo	213490b301	Btrfs: fix a bug of per-file nocow Users report a bug, the reproducer is: $ mkfs.btrfs /dev/loop0 $ mount /dev/loop0 /mnt/btrfs/ $ mkdir /mnt/btrfs/dir $ chattr +C /mnt/btrfs/dir/ $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10; $ lsattr /mnt/btrfs/dir/foo ---------------C- /mnt/btrfs/dir/foo $ filefrag /mnt/btrfs/dir/foo /mnt/btrfs/dir/foo: 1 extent found ---> an extent $ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync $ filefrag /mnt/btrfs/dir/foo /mnt/btrfs/dir/foo: 3 extents found ---> with nocow, btrfs breaks the extent into three parts The new created file should not only inherit the NODATACOW flag, but also honor NODATASUM flag, because we must do COW on a file extent with checksum. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-17 14:48:21 -05:00
Chris Mason	9c52057c69	Btrfs: fix hash overflow handling The handling for directory crc hash overflows was fairly obscure, split_leaf returns EOVERFLOW when we try to extend the item and that is supposed to bubble up to userland. For a while it did so, but along the way we added better handling of errors and forced the FS readonly if we hit IO errors during the directory insertion. Along the way, we started testing only for EEXIST and the EOVERFLOW case was dropped. The end result is that we may force the FS readonly if we catch a directory hash bucket overflow. This fixes a few problem spots. First I add tests for EOVERFLOW in the places where we can safely just return the error up the chain. btrfs_rename is harder though, because it tries to insert the new directory item only after it has already unlinked anything the rename was going to overwrite. Rather than adding very complex logic, I added a helper to test for the hash overflow case early while it is still safe to bail out. Snapshot and subvolume creation had a similar problem, so they are using the new helper now too. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Pascal Junod <pascal@junod.info>	2012-12-17 14:48:21 -05:00
Josef Bacik	c64c2bd890	Btrfs: don't take inode delalloc mutex if we're a free space inode This confuses and angers lockdep even though it's ok. We don't really need the lock for free space inodes since only the transaction committer will be reserving space. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:29 -05:00
Josef Bacik	1135d6df22	Btrfs: fix autodefrag and umount lockup This happens because writeback_inodes_sb_nr_if_idle does down_read. This doesn't work for us and it has not been fixed upstream yet, so do it ourselves and use that instead so we can stop having this stupid long standing lockup. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:29 -05:00
Filipe Brandenburger	9185aa587b	Btrfs: fix permissions of empty files not affected by umask When a new file is created with btrfs_create(), the inode will initially be created with permissions 0666 and later on in btrfs_init_acl() it will be adapted to mask out the umask bits. The problem is that this change won't make it into the btrfs_inode unless there's another change to the inode (e.g. writing content changing the size or touching the file changing the mtime.) This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that the change will not get lost if the in-memory inode is flushed before other changes are made to the file. Signed-off-by: Filipe Brandenburger <filbranden@google.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:28 -05:00
Liu Bo	31e502298d	Btrfs: put raid properties into global table Raid properties can be shared among raid calculation code, we can put them into a global table to keep it simple. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:28 -05:00
Stefan Behrens	4ded4f6395	Btrfs: fix BUG() in scrub when first superblock reading gives EIO This fixes a very special case that can be reproduced by just disconnecting a disk at runtime, and without unmounting the filesystem first, start scrub on the filesystem with the disconnected disk. All read and write EIOs are handled correctly, only the first superblock is an exception and gives a BUG() in a subfunction. The BUG() is correct, it would crash later otherwise. The subfunction must not be called for superblocks and this is what the fix changes. Reported-by: Joeri Vanthienen <mail@joerivanthienen.be> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:28 -05:00
Josef Bacik	6c760c0724	Btrfs: do not call file_update_time in aio_write This starts a transaction and dirties the inode everytime we call it, which is super expensive if you have a write heavy workload. We will be updating the inode when the IO completes and we reserve the space for the inode update when we reserve space for the write, so there is no chance of loss of information or enospc issues. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:27 -05:00
Josef Bacik	5124e00ec5	Btrfs: only unlock and relock if we have to I noticed while doing fsync tests that we were always dropping the path and re-searching when we first cow the log root even though we've already gotten the write lock on the root. That's because we don't take into account that there might not be a parent node, so fix the check to make sure there is actually a parent node before we undo all of this work for nothing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:27 -05:00
Josef Bacik	0b1c6ccade	Btrfs: use tokens where we can in the tree log If we are syncing over and over the overhead of doing all those maps in fill_inode_item and log_changed_extents really starts to hurt, so use map tokens so we can avoid all the extra mapping. Since the token maps from our offset to the end of the page make sure to set the first thing in the item first so we really only do one map. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:26 -05:00
Josef Bacik	41be1f3b40	Btrfs: optimize leaf_space_used This gets called at least 4 times for every level while adding an object, and it involves 3 kmapping calls, which on my box take about 5us a piece. So instead use a token, which brings us down to 1 kmap call and makes this function take 1/3 of the time per call. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:26 -05:00
Josef Bacik	ad91455969	Btrfs: don't memset new tokens Our token logic depends on token->kaddr being set, and if it is not it sets everything properly as needed. So instead of memsetting just set token->kaddr to NULL. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:25 -05:00
Josef Bacik	ed7b63eb8a	Btrfs: only clear dirty on the buffer if it is marked as dirty No reason to set the path blocking or loop through all of the pages if the extent buffer isn't actually marked dirty. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:25 -05:00
Josef Bacik	bb146eb265	Btrfs: move checks in set_page_dirty under DEBUG This is a high traffic function, let's try and do as little as possible during normal operations shall we? Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:25 -05:00
Josef Bacik	70c8a91ce2	Btrfs: log changed inodes based on the extent map tree We don't really need to copy extents from the source tree since we have all of the information already available to us in the extent_map tree. So instead just write the extents straight to the log tree and don't bother to copy the extent items from the source tree. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:24 -05:00
Josef Bacik	d6393786cd	Btrfs: add path->really_keep_locks You'd think path->keep_locks would keep all the locks wouldn't you? You'd be wrong. It only keeps them if the slot is pointing to the last item in the node. This is for use with btrfs_next_leaf, which needs this sort of thing. But the horrible horrible things I'm going to do to the tree log means I really need everything held from root to leaf so I can add and delete items in the same search. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:24 -05:00
Josef Bacik	b11e234d21	Btrfs: do not mark ems as prealloc if we are writing to them We are going to use EM's to log extents in the future, so we need to not mark them as prealloc if they aren't actually prealloc extents. Instead mark them with FILLING so we know to ammend mod_start/mod_len and that way we don't confuse the extent logging code. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:23 -05:00
Josef Bacik	b493968096	Btrfs: keep track of the extents original block length If we've written to a prealloc extent we need to know the original block len for the extent. We can't figure this out currently since ->block_len is just set to the extent length. So introduce ->orig_block_len so that we know how many bytes were in the original extent for proper extent logging that future patches will need. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:23 -05:00
Josef Bacik	b812ce2879	Btrfs: inline csums if we're fsyncing The tree logging stuff needs the csums to be on the ordered extents in order to log them properly, so mark that we're sync and inline the csum creation so we don't have to wait on the csumming to be done when logging extents that are still in flight. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:22 -05:00
Josef Bacik	a95249b392	Btrfs: don't bother copying if we're only logging the inode We don't copy inode items anwyay, we just copy them straight into the log from the in memory inode. So if we know we're only logging the inode, don't bother dropping anything, just try to insert it and either if it succeeds or we get EEXIST we can update the inode item in the log and carry on. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:22 -05:00
Josef Bacik	e997615149	Btrfs: only log the inode item if we can get away with it Currently we copy all the file information into the log, inode item, the refs, xattrs etc. Except most of this doesn't change from fsync to fsync, just the inode item changes. So set a flag if an xattr changes or a link is added, and otherwise only log the inode item. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:21 -05:00
Anand Jain	5f3ab90a72	Btrfs: rename root_times_lock to root_item_lock Originally root_times_lock was introduced as part of send/receive code however newly developed patch to label the subvol reused the same lock, so renaming it for a meaningful name. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:21 -05:00
Lukas Czerner	b8b8ff590f	btrfs: Notify udev when removing device Currently udev does not know about the device being removed from the file system. This may result in the situation where we're unable to mount the file system by UUID or by LABEL because the by-uuid and by-label links may still point to the device which is no longer part of the btrfs file system and hence does not have any btrfs super block. It can be easily reproduced by the following: mkfs.btrfs -L bugfs /dev/loop[0-6] mount /dev/loop0 /mnt/test btrfs device delete /dev/loop0 /mnt/test umount /mnt/test mount LABEL=bugfs /mnt/test <---- this fails then see: ls -l /dev/disk/by-label/bugfs which will still point to the /dev/loop0 We did not noticed this before because libblkid would send the udev event for us when it notice that the link does not fit the reality, however it does not do that anymore and completely relies on udev information. Fix this by sending the KOBJ_CHANGE event to the bdev kobject after successful device removal. Note that this does not affect device addition, because we will open the device prior the addition from userspace and udev will notice that and reread the device afterwards. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:21 -05:00
Miao Xie	ac6a2b36f9	Btrfs: fix wrong return value of btrfs_truncate_page() ret variant may be set to 0 if we read page successfully, but it might be released before we lock it again. On this case, if we fail to allocate a new page, we will return 0, it is wrong, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:20 -05:00
Miao Xie	7426cc04d4	Btrfs: punch hole past the end of the file Since we can pre-allocate the space past EOF, we should be able to reclaim that space if we need. This patch implements it by removing the EOF check. Though the manual of fallocate command says we can use truncate command to reclaim the pre-allocated space which past EOF, but because truncate command changes the file size, we must run several commands to reclaim the space if we don't want to change the file size, so it is not a good choice. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:20 -05:00
Miao Xie	0061280d2c	Btrfs: fix the page that is beyond EOF Steps to reproduce: # mkfs.btrfs <disk> # mount <disk> <mnt> # dd if=/dev/zero of=<mnt>/<file> bs=512 seek=5 count=8 # fallocate -p -o 2048 -l 16384 <mnt>/<file> # dd if=/dev/zero of=<mnt>/<file> bs=4096 seek=3 count=8 conv=notrunc,nocreat # umount <mnt> # dmesg WARNING: at fs/btrfs/inode.c:7140 btrfs_destroy_inode+0x2eb/0x330 The reason is that we inputed a range which is beyond the end of the file. And because the end of this range was not page-aligned, we had to truncate the last page in this range, this operation is similar to a buffered file write. In other words, we reserved enough space and clear the data which was in the hole range on that page. But when we expanded that test file, write the data into the same page, we forgot that we have reserved enough space for the buffered write of that page because in most cases there is no page that is beyond the end of the file. As a result, we reserved the space twice. In fact, we needn't truncate the page if it is beyond the end of the file, just release the allocated space in that range. Fix the above problem by this way. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:19 -05:00
Miao Xie	6347b3c433	Btrfs: fix off-by-one error of the same page check in btrfs_punch_hole() (start + len) is the start of the adjacent extent, not the end of the current extent, so we should not use it to check the hole is on the same page or not. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:19 -05:00
Miao Xie	4b5829a8e3	Btrfs: fix missing reserved space release in error path of delalloc reservation We forget to release the reserved space in the error path of delalloc reservatiom, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:18 -05:00
Miao Xie	543eabd5e1	Btrfs: don't auto defrag a file when doing directIO If we runt the direct IO, we should not run auto defrag, because it may introduce buffered IO vs direcIO problem, and make direct IO slow down. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:18 -05:00
Wang Sheng-Hui	960097622d	Btrfs: use ctl->unit for free space calculation instead of block_group->sectorsize We should use ctl->unit for free space calculation instead of block_group->sectorsize even though for free space use_bitmap or free space cluster we only have sectorsize assigned to ctl->unit currently. Also, we can keep it consisten in code style. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:17 -05:00
Filipe Brandenburger	43baa579b3	Btrfs: refactor error handling to drop inode in btrfs_create() Refactor it by checking whether the inode has been created and needs to be dropped (drop_inode_on_err) and also if the err variable is set. That way the variable doesn't need to be set on each and every error handling block. Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:17 -05:00
Filipe Brandenburger	2794ed013b	Btrfs: fix permissions of empty files not affected by umask When a new file is created with btrfs_create(), the inode will initially be created with permissions 0666 and later on in btrfs_init_acl() it will be adapted to mask out the umask bits. The problem is that this change won't make it into the btrfs_inode unless there's another change to the inode (e.g. writing content changing the size or touching the file changing the mtime.) This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that the change will not get lost if the in-memory inode is flushed before other changes are made to the file. Signed-off-by: Filipe Brandenburger <filbranden@google.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:16 -05:00
Tsutomu Itoh	05dadc09f5	Btrfs: add fiemap's flag check When the flag not supported is specified, it is necessary to return the error to the caller. So, we add the validity check of the fiemap's flag. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:16 -05:00
Liu Bo	01e6deb25a	Btrfs: don't add a NULL extended attribute Passing a null extended attribute value means to remove the attribute, but we don't have to add a new NULL extended attribute. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:15 -05:00
Liu Bo	755ac67f83	Btrfs: skip adding an acl attribute if we don't have to If the acl can be exactly represented in the traditional file mode permission bits, we don't set another acl attribute. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:15 -05:00
Miao Xie	0ff6fabdb0	Btrfs: fix off-by-one error of the reserved size of btrfs_allocate() alloc_end is not the real end of the current extent, it is the start of the next adjoining extent. So we needn't +1 when calculating the size the space that is about to be reserved. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:15 -05:00
Miao Xie	797f427711	Btrfs: use existing align macros in btrfs_allocate() The kernel developers have implemented some often-used align macros, we should use them instead of the complex code. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:14 -05:00
Stefan Behrens	af1be4f851	Btrfs: fix a scrub regression in case of write errors This regression was introduced by the device-replace patches. Scrub immediately stops checking those disks that have write errors. This is nothing that happens in the real world, but it is wrong since scrub is the tool to detect and repair defects. Fix it. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:14 -05:00
Stefan Behrens	f9c83748de	Btrfs: fix a build warning for an unused label This issue was detected by the "0-DAY kernel build testing". fs/btrfs/volumes.c: In function 'btrfs_rm_device': fs/btrfs/volumes.c:1505:1: warning: label 'error_close' defined but not used [-Wunused-label] Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:13 -05:00
Stefan Behrens	cb3806ec88	Btrfs: fix race in check-integrity caused by usage of bitfield The structure member mirror_num is modified concurrently to the structure member is_iodone. This doesn't require any locking by design, unless everything is stored in the same 32 bits of a bit field. This was the case and xfstest 284 was able to trigger false warnings from the checker code. This patch seperates the bits and fixes the race. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:13 -05:00
Miao Xie	b66f00da0c	Btrfs: fix freeze vs auto defrag If we freeze the fs, the auto defragment should not run. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:12 -05:00
Miao Xie	26176e7c2a	Btrfs: restructure btrfs_run_defrag_inodes() This patch restructure btrfs_run_defrag_inodes() and make the code of the auto defragment more readable. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:12 -05:00
Miao Xie	8ddc473433	Btrfs: fix unprotected defragable inode insertion We forget to get the defrag lock when we re-add the defragable inode, Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:12 -05:00
Miao Xie	9247f3170b	Btrfs: use slabs for auto defrag allocation The auto defrag allocation is in the fast path of the IO, so use slabs to improve the speed of the allocation. And besides that, it can do check for leaked objects when the module is removed. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:11 -05:00
Miao Xie	905b0dda06	Btrfs: get write access for qgroup operations We need get write access for qgroup operations, or we will modify the R/O fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:11 -05:00
Miao Xie	b8e95489bf	Btrfs: get write access for scrub We need get write access for scrub, or we will modify the R/O fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:10 -05:00
Miao Xie	da24927b1e	Btrfs: get write access when removing a device Steps to reproduce: # mkfs.btrfs -d single -m single <disk0> <disk1> # mount -o ro <disk0> <mnt0> # mount -o ro <disk0> <mnt1> # mount -o remount,rw <mnt0> # umount <mnt0> # btrfs device delete <disk1> <mnt1> We can remove a device from a R/O filesystem. The reason is that we just check the R/O flag of the super block object. It is not enough, because the kernel may set the R/O flag only for the mount point. We need invoke mnt_want_write_file() to do a full check. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:09 -05:00
Miao Xie	198605a8e2	Btrfs: get write access when doing resize fs Steps to reproduce: # mkfs.btrfs <partition> # mount -o ro <partition> <mnt0> # mount -o ro <partition> <mnt1> # mount -o remount,rw <mnt0> # umount <mnt0> # btrfs fi resize 10g <mnt1> We re-sized a R/O filesystem. The reason is that we just check the R/O flag of the super block object. It is not enough, because the kernel may set the R/O flag only for the mount point. We need invoke mnt_want_write_file() to do a full check. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:09 -05:00
Miao Xie	3c04ce0105	Btrfs: get write access when setting the default subvolume When wen want to set the default subvolume, we must get write access, or we will change the R/O file system. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:09 -05:00
Miao Xie	8cd2807f79	Btrfs: fix wrong return value of btrfs_wait_for_commit() If the id of the existed transaction is more than the one we specified, it means the specified transaction was commited, so we should return 0, not EINVAL. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:08 -05:00
Miao Xie	ff7c1d3355	Btrfs: don't start a new transaction when starting sync If there is no running transaction in the fs, we needn't start a new one when we want to start sync. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:08 -05:00
Miao Xie	9a8c28bec1	Btrfs: pass root object into btrfs_ioctl_{start, wait}_sync() Since we have gotten the root in the caller, just pass it into btrfs_ioctl_{start, wait}_sync() directly. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:07 -05:00
Liu Bo	db2254bce4	Btrfs: fix an while-loop of listxattr If we found an invalid xattr dir item, we'd better try the next one instead. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:07 -05:00
Wang Sheng-Hui	071401258a	Btrfs: do not warn_on io_ctl->cur in io_ctl_map_page io_ctl_map_page is called by many functions in free-space-cache. In most scenarios, the ->cur is not null, e.g. io_ctl_add_entry. I think we'd better remove the warn_on here. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:06 -05:00
Stefan Behrens	3f6bcfbd41	Btrfs: add support for device replace ioctls This is the commit that allows to start the device replace procedure. An ioctl() interface is added that supports starting and canceling the device replace procedure, and to retrieve the status and progress. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-16 20:46:06 -05:00
Linus Torvalds	a2013a13e6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial branch from Jiri Kosina: "Usual stuff -- comment/printk typo fixes, documentation updates, dead code elimination." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) HOWTO: fix double words typo x86 mtrr: fix comment typo in mtrr_bp_init propagate name change to comments in kernel source doc: Update the name of profiling based on sysfs treewide: Fix typos in various drivers treewide: Fix typos in various Kconfig wireless: mwifiex: Fix typo in wireless/mwifiex driver messages: i2o: Fix typo in messages/i2o scripts/kernel-doc: check that non-void fcts describe their return value Kernel-doc: Convention: Use a "Return" section to describe return values radeon: Fix typo and copy/paste error in comments doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c various: Fix spelling of "asynchronous" in comments. Fix misspellings of "whether" in comments. eisa: Fix spelling of "asynchronous". various: Fix spelling of "registered" in comments. doc: fix quite a few typos within Documentation target: iscsi: fix comment typos in target/iscsi drivers treewide: fix typo of "suport" in various comments and Kconfig treewide: fix typo of "suppport" in various comments ...	2012-12-13 12:00:02 -08:00
Stefan Behrens	ad6d620e2a	Btrfs: allow repair code to include target disk when searching mirrors Make the target disk of a running device replace operation available for reading. This is only used as a last ressort for the defect repair procedure. And it is dependent on the location of the data block to read, because during an ongoing device replace operation, the target drive is only partially filled with the filesystem data. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:45 -05:00
Stefan Behrens	72d7aefccd	Btrfs: increase BTRFS_MAX_MIRRORS by one for dev replace This change of the define is effective in all modes, it is required and used only in the case when a device replace procedure is running. The reason is that during an active device replace procedure, the target device of the copy operation is a mirror for the filesystem data as well that can be used to read data in order to repair read errors on other disks. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:44 -05:00
Stefan Behrens	30d9861ff9	Btrfs: optionally avoid reads from device replace source drive It is desirable to be able to configure the device replace procedure to avoid reading the source drive (the one to be copied) whenever possible. This is useful when the number of read errors on this disk is high, because it would delay the copy procedure alot. Therefore there is an option to avoid reading from the source disk unless the repair procedure really needs to access it. The regular read req asks for mapping the block with mirror_num == 0, in this case the source disk is avoided whenever possible. The repair code selects the mirror_num explicitly (mirror_num != 0), this case is not changed by this commit. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:44 -05:00
Stefan Behrens	472262f35a	Btrfs: changes to live filesystem are also written to replacement disk During a running dev replace operation, all write requests to the live filesystem are duplicated to also write to the target drive. Therefore btrfs_map_block() is changed to duplicate stripes that are written to the source disk of a device replace procedure to be written to the target disk as well. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:43 -05:00
Stefan Behrens	29a8d9a0bc	Btrfs: introduce GET_READ_MIRRORS functionality for btrfs_map_block() Before this commit, btrfs_map_block() was called with REQ_WRITE in order to retrieve the list of mirrors for a disk block. This needs to be changed for the device replace procedure since it makes a difference whether you are asking for read mirrors or for locations to write to. GET_READ_MIRRORS is introduced as a new interface to call btrfs_map_block(). In the current commit, the functionality is not yet changed, only the interface for GET_READ_MIRRORS is introduced and all the places that should use this new interface are adapted. The reason that REQ_WRITE cannot be abused anymore to retrieve a list of read mirrors is that during a running dev replace operation all write requests to the live filesystem are duplicated to also write to the target drive. Keep in mind that the target disk is only partially a valid copy of the source disk while the operation is ongoing. All writes go to the target disk, but not all reads would return valid data on the target disk. Therefore it is not possible anymore to abuse a REQ_WRITE interface to find valid mirrors for a REQ_READ. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:43 -05:00
Stefan Behrens	8dabb7420f	Btrfs: change core code of btrfs to support the device replace operations This commit contains all the essential changes to the core code of Btrfs for support of the device replace procedure. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:42 -05:00
Stefan Behrens	e93c89c1aa	Btrfs: add new sources for device replace code This adds a new file to the sources together with the header file and the changes to ioctl.h and ctree.h that are required by the new C source file. Additionally, 4 new functions are added to volume.c that deal with device creation and destruction. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:41 -05:00
Stefan Behrens	ff023aac31	Btrfs: add code to scrub to copy read data to another disk The device replace procedure makes use of the scrub code. The scrub code is the most efficient code to read the allocated data of a disk, i.e. it reads sequentially in order to avoid disk head movements, it skips unallocated blocks, it uses read ahead mechanisms, and it contains all the code to detect and repair defects. This commit adds code to scrub to allow the scrub code to copy read data to another disk. One goal is to be able to perform as fast as possible. Therefore the write requests are collected until huge bios are built, and the write process is decoupled from the read process with some kind of flow control, of course, in order to limit the allocated memory. The best performance on spinning disks could by reached when the head movements are avoided as much as possible. Therefore a single worker is used to interface the read process with the write process. The regular scrub operation works as fast as before, it is not negatively influenced and actually it is more or less unchanged. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:41 -05:00
Stefan Behrens	618919236b	Btrfs: handle errors from btrfs_map_bio() everywhere With the addition of the device replace procedure, it is possible for btrfs_map_bio(READ) to report an error. This happens when the specific mirror is requested which is located on the target disk, and the copy operation has not yet copied this block. Hence the block cannot be read and this error state is indicated by returning EIO. Some background information follows now. A new mirror is added while the device replace procedure is running. btrfs_get_num_copies() returns one more, and btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk location is involved that was already handled by the device replace copy operation. The assigned mirror num is the highest mirror number, e.g. the value 3 in case of RAID1. If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select any mirror), the copy on the target drive is never selected because that disk shall be able to perform the write requests as quickly as possible. The parallel execution of read requests would only slow down the disk copy procedure. Second case is that btrfs_map_bio() is called with mirror_num > 0. This is done from the repair code only. In this case, the highest mirror num is assigned to the target disk, since it is used last. And when this mirror is not available because the copy procedure has not yet handled this area, an error is returned. Everywhere in the code the handling of such errors is added now. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:40 -05:00
Stefan Behrens	63a212abc2	Btrfs: disallow some operations on the device replace target device This patch adds some code to disallow operations on the device that is used as the target for the device replace operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:39 -05:00
Stefan Behrens	5ac00addc7	Btrfs: disallow mutually exclusive admin operations from user mode Btrfs admin operations that are manually started from user mode and that cannot be executed at the same time return -EINPROGRESS. A common way to enter and leave this locked section is introduced since it used to be specific to the balance operation. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:38 -05:00
Stefan Behrens	a2bff64025	Btrfs: introduce a btrfs_dev_replace_item type Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:38 -05:00
Stefan Behrens	e922e087a3	Btrfs: enhance btrfs structures for device replace support Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:37 -05:00
Stefan Behrens	1acd6831d9	Btrfs: avoid risk of a deadlock in btrfs_handle_error Remove the attempt to cancel a running scrub or device replace operation in btrfs_handle_error() because it adds the risk of a deadlock. The only penalty of not canceling the operation is that some I/O remains active until the procedure completes. This is basically the same thing that happens to other tasks that are running in user mode context, they are not affected or stopped in btrfs_handle_error(), these tasks just need to handle write errors correctly. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:36 -05:00
Stefan Behrens	aa1b8cd409	Btrfs: pass fs_info instead of root A small number of functions that are used in a device replace procedure when the operation is resumed at mount time are unable to pass the same root pointer that would be used in the regular (ioctl) context. And since the root pointer is not required, only the fs_info is, the root pointer argument is replaced with the fs_info pointer argument. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:36 -05:00
Stefan Behrens	a8a6dab779	Btrfs: add btrfs_scratch_superblock() function This new function is used by the device replace procedure in a later patch. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:35 -05:00
Stefan Behrens	3ec706c831	Btrfs: pass fs_info to btrfs_map_block() instead of mapping_tree This is required for the device replace procedure in a later step. Two calling functions also had to be changed to have the fs_info pointer: repair_io_failure() and scrub_setup_recheck_block(). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:34 -05:00
Stefan Behrens	5d9640517d	Btrfs: Pass fs_info to btrfs_num_copies() instead of mapping_tree This is required for the device replace procedure in a later step. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:34 -05:00
Stefan Behrens	7ba15b7d21	Btrfs: add two more find_device() methods The new function btrfs_find_device_missing_or_by_path() will be used for the device replace procedure. This function itself calls the second new function btrfs_find_device_by_path(). Unfortunately, it is not possible to currently make the rest of the code use these functions as well, since all functions that look similar at first view are all a little bit different in what they are doing. But in the future, new code could benefit from these two new functions, and currently, device replace uses them. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:33 -05:00
Stefan Behrens	beaf8ab3af	Btrfs: move some common code into a subfunction Some code to open block devices, to read the superblock and to handle errors was repeated multiple times in 3 places, and the following patch makes use of it as well. This code is now moved into a subfunction. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:33 -05:00
Stefan Behrens	b6bfebc132	Btrfs: cleanup scrub bio and worker wait code Just move some code into functions to make everything more readable. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:32 -05:00
Stefan Behrens	34f5c8e90b	Btrfs: in scrub repair code, simplify alloc error handling In the scrub repair code, the code is changed to handle memory allocation errors a little bit smarter. The change is to handle it just like a read error. This simplifies the code and removes a couple of lines of code, since the code to handle read errors is there anyway. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:31 -05:00
Stefan Behrens	cb2ced73d8	Btrfs: in scrub repair code, optimize the reading of mirrors In case that disk blocks need to be repaired (rewritten), the current code at first (for simplicity reasons) reads all alternate mirrors in the first step, afterwards selects the best one in a second step. This is now changed to read one alternate mirror after the other and to leave the loop early when a perfect mirror is found. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:31 -05:00
Stefan Behrens	7a9e998768	Btrfs: make the scrub page array dynamically allocated With the modified design (in order to support the devive replace procedure) it is necessary to alloc the page array dynamically. The reason is that pages are reused. At first a page is used for the bio to read the data from the filesystem, then the same page is reused for the bio that writes the data to the target disk. Since the read process and the write process are completely decoupled, this requires a new concept of refcounts and get/put functions for pages, and it requires to use newly created pages for each read bio which are freed after the write operation is finished. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:30 -05:00
Stefan Behrens	a36cf8b893	Btrfs: remove the block device pointer from the scrub context struct The block device is removed from the scrub context state structure. The scrub code as it is used for the device replace procedure reads the source data from whereever it is optimal. The source device might even be gone (disconnected, for instance due to a hardware failure). Or the drive can be so faulty so that the device replace procedure tries to avoid access to the faulty source drive as much as possible, and only if all other mirrors are damaged, as a last resort, the source disk is accessed. The modified scrub code operates as if it would handle the source drive and thereby generates an exact copy of the source disk on the target disk, even if the source disk is not present at all. Therefore the block device pointer to the source disk is removed in the scrub context struct and moved into the lower level scope of scrub_bio, fixup and page structures where the block device context is known. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:30 -05:00
Stefan Behrens	d9d181c1ba	Btrfs: rename the scrub context structure The device replace procedure makes use of the scrub code. The scrub code is the most efficient code to read the allocated data of a disk, i.e. it reads sequentially in order to avoid disk head movements, it skips unallocated blocks, it uses read ahead mechanisms, and it contains all the code to detect and repair defects. This commit is a first preparation step to adapt the scrub code to be shareable for the device replace procedure. The block device will be removed from the scrub context state structure in a later step. It used to be the source block device. The scrub code as it is used for the device replace procedure reads the source data from whereever it is optimal. The source device might even be gone (disconnected, for instance due to a hardware failure). Or the drive can be so faulty so that the device replace procedure tries to avoid access to the faulty source drive as much as possible, and only if all other mirrors are damaged, as a last resort, the source disk is accessed. The modified scrub code operates as if it would handle the source drive and thereby generates an exact copy of the source disk on the target disk, even if the source disk is not present at all. Therefore the block device pointer to the source disk is removed in a later patch, and therefore the context structure is renamed (this is the goal of the current patch) to reflect that no source block device scope is there anymore. Summary: This first preparation step consists of a textual substitution of the term "dev" to the term "ctx" whereever the scrub context is used. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:29 -05:00
Liu Bo	d25628bdd6	Btrfs: protect devices list with its mutex Since we've kill the bigger one volume_mutex, we need to add devices list mutex back. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:28 -05:00
Liu Bo	b53d3f5db2	Btrfs: cleanup for btrfs_btree_balance_dirty - 'nr' is no more used. - btrfs_btree_balance_dirty() and __btrfs_btree_balance_dirty() can share a bunch of code. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:28 -05:00
Alexander Block	3ef5969cd8	Btrfs: merge inode_list in __merge_refs When __merge_refs merges two refs, it is also needed to merge the inode_list of both refs. Otherwise we have missed backrefs and memory leaks. This happens for example if two inodes share an extent and both lie in the same leaf and thus also have the same parent. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:27 -05:00
Tsutomu Itoh	e1f5790e05	Btrfs: set hole punching time properly Even if the hole punching is executed, the modification time of the file is not updated. So, current time is set to inode. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:26 -05:00
Stefan Behrens	d03f918ab9	Btrfs: Don't trust the superblock label and simply printk("%s") it Someone who is root or capable(CAP_SYS_ADMIN) could corrupt the superblock and make Btrfs printk("%s") crash while holding the uuid_mutex since nobody forces a limit on the string. Since the uuid_mutex is significant, the system would be unusable afterwards. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:26 -05:00
Liu Bo	109f2365f1	Btrfs: fix a double free on pending snapshots in error handling When creating a snapshot, failing to commit a transaction can end up with aborting the transaction, following by doing a cleanup for it, where we'll free all snapshots pending to disk. So we check it and avoid double free on pending snapshots. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:25 -05:00
Liu Bo	37c4146d22	Btrfs: fix a deadlock in aborting transaction due to ENOSPC When committing a transaction, we may bail out of running delayed refs due to ENOSPC, and then abort the current transaction to flip into readonly. But we'll hit a deadlock on ref head's lock since we forget to release its lock and other cleanup stuff. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:25 -05:00
Julia Lawall	6c1500f22a	fs/btrfs: drop if around WARN_ON Just use WARN_ON rather than an if containing only WARN_ON(1). A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression e; @@ - if (e) WARN_ON(1); + WARN_ON(e); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:24 -05:00
Julia Lawall	31b1a2bd75	fs/btrfs: use WARN Use WARN rather than printk followed by WARN_ON(1), for conciseness. A simplified version of the semantic patch that makes this transformation is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression list es; @@ -printk( +WARN(1, es); -WARN_ON(1); // </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:23 -05:00
Miao Xie	5269b67e3d	Btrfs: fix missing log when BTRFS_INODE_NEEDS_FULL_SYNC is set If we set BTRFS_INODE_NEEDS_FULL_SYNC, we should log all the extent, but now we forget to take it into account, and set a wrong max key, if so, we will skip the file extent metadata when doing logging. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:22 -05:00
Miao Xie	bbe1426764	Btrfs: fix unprotected extent map operation when logging file extents We forget to protect the modified_extents list, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:22 -05:00
Miao Xie	315a9850da	Btrfs: fix wrong file extent length There are two types of the file extent - inline extent and regular extent, When we log file extents, we didn't take inline extent into account, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:21 -05:00
Miao Xie	ca46963718	Btrfs: fix missing flush when committing a transaction Consider the following case: Task1 Task2 start_transaction commit_transaction check pending snapshots list and the list is empty. add pending snapshot into list skip the delalloc flush end_transaction ... And then the problem that the snapshot is different with the source subvolume happen. This patch fixes the above problem by flush all pending stuffs when all the other tasks end the transaction. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:21 -05:00
Miao Xie	b7d5b0a819	Btrfs: fix joining the same transaction handler more than 2 times If we flush inodes with pending delalloc in a transaction, we may join the same transaction handler more than 2 times. The reason is: Task use_count of trans handle commit_transaction 1 \|-> btrfs_start_delalloc_inodes 1 \|-> run_delalloc_nocow 1 \|-> join_transaction 2 \|-> cow_file_range 2 \|-> join_transaction 3 In fact, cow_file_range needn't join the transaction again because the caller have joined the transaction, so we fix this problem by this way. Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:20 -05:00
Liu Bo	4fde183d8c	Btrfs: cleanup for btrfs_wait_order_range Variable 'found' is no more used. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:19 -05:00
Liu Bo	9f3959c53d	Btrfs: get right arguments for btrfs_wait_ordered_range btrfs_wait_ordered_range expects for 'len' instead of 'end'. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:19 -05:00
Liu Bo	183f37fa35	Btrfs: do not log extents when we only log new names When we log new names, we need to log just enough to recreate the inode during log replay, and there is no need to log extents along with it. This actually fixes a bug revealed by xfstests 241, where it shows that we're logging some extents that have not updated metadata, so we don't get proper EXTENT_DATA items to be copied to log tree. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:18 -05:00
Stefan Behrens	292fd7fc39	Btrfs: don't allow degraded mount if too many devices are missing The current behavior is to allow mounting or remounting a filesystem writeable in degraded mode if at least one writeable device is present. The next failed write access to a missing device which is above the tolerance of the configured level of redundancy results in an read-only enforcement. Even without this, the next time barrier_all_devices() is called and more devices are missing than tolerable, the switch to read-only mode takes place. In order to behave predictably and to provide proper feedback to the user at mount time, this patch compares the number of missing devices with the number of devices that are tolerated to be missing according to the configured RAID level. If more devices are missing than tolerated, e.g. if two devices are missing in case of RAID1, only a read-only mount and remount is allowed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:18 -05:00
Masanari Iida	d142324873	Btrfs: Fix typo in fs/btrfs Correct spelling typo in btrfs. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:17 -05:00
jeff.liu	0253f40ef9	Btrfs: Remove the invalid shrink size check up from btrfs_shrink_dev() Remove an invalid size check up from btrfs_shrink_dev(). The new size should not larger than the device->total_bytes as it was already verified before coming to here(i.e. new_size < old_size). Remove invalid check up for btrfs_shrink_dev(). Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-12 17:15:16 -05:00
Namjae Jeon	d0e1d66b5a	writeback: remove nr_pages_dirtied arg from balance_dirty_pages_ratelimited_nr() There is no reason to pass the nr_pages_dirtied argument, because nr_pages_dirtied value from the caller is unused in balance_dirty_pages_ratelimited_nr(). Signed-off-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-12-11 17:22:21 -08:00
Miao Xie	9afab8820b	Btrfs: make ordered extent be flushed by multi-task Though the process of the ordered extents is a bit different with the delalloc inode flush, but we can see it as a subset of the delalloc inode flush, so we also handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:38 -05:00
Miao Xie	25287e0a16	Btrfs: make ordered operations be handled by multi-task The process of the ordered operations is similar to the delalloc inode flush, so we handle them by flush workers. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:37 -05:00
Miao Xie	8ccf6f19b6	Btrfs: make delalloc inodes be flushed by multi-task This patch introduce a new worker pool named "flush_workers", and if we want to force all the inode with pending delalloc to the disks, we can queue those inodes into the work queue of the worker pool, in this way, those inodes will be flushed by multi-task. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:37 -05:00
Josef Bacik	7b398f8e58	Btrfs: fill the global reserve when unpinning space Dave gave me an image of a very full file system that would abort the transaction because it ran out of space while committing the transaction. This is because we would think there was plenty of room to create a snapshot even though the global reserve was not full. This happens because we calculate the global reserve size before we unpin any space, so after we unpin the space we allow reservations to occur even though we haven't reserved all of the space for our global reserve. Fix this by adding to the global reserve while unpinning in order to make sure we always have enough space to do our work. With this patch we no longer end up with an aborted transaction, we return ENOSPC properly to the person trying to create the snapshot. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:36 -05:00
Liu Bo	32adf09013	Btrfs: cleanup unused arguments 'disk_key' is not used at all. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:35 -05:00
Liu Bo	0e411ecec6	Btrfs: kill unnecessary arguments in del_ptr The argument 'tree_mod_log' is not necessary since all of callers enable it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:35 -05:00
Liu Bo	6a7a665d78	Btrfs: reorder tree mod log operations in deleting a pointer Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:34 -05:00
Liu Bo	95c80bb1f6	Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside an extent buffer node, and the node's number of items remains unchanged (unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that). So we don't need to increase node's number of items during rewinding, otherwise we may get an node larger than leafsize and cause general protection errors later. Here is the details, - If we do memory move for inserting a single pointer, we need to add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding. - If we do memory move for deleting a single pointer, we need to decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for deleting. - If we do memory move for balance left/right, we need to decrease node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:33 -05:00
Miao Xie	de6c4115a2	Btrfs: fix unnecessary while loop when search the free space, cache When we find a bitmap free space entry, we may check the previous extent entry covers the offset or not. But if we find this entry is also a bitmap entry, we will continue to check the previous entry of the current one by a while loop. It is unnecessary because it is impossible that the extent entry which is in front of a bitmap entry can cover the offset of the entry after that bitmap entry. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:33 -05:00
Josef Bacik	de1ee92ac3	Btrfs: recheck bio against block device when we map the bio Alex reported a problem where we were writing between chunks on a rbd device. The thing is we do bio_add_page using logical offsets, but the physical offset may be different. So when we map the bio now check to see if the bio is still ok with the physical offset, and if it is not split the bio up and redo the bio_add_page with the physical sector. This fixes the problem for Alex and doesn't affect performance in the normal case. Thanks, Reported-and-tested-by: Alex Elder <elder@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:32 -05:00
Miao Xie	08e007d2e5	Btrfs: improve the noflush reservation In some places(such as: evicting inode), we just can not flush the reserved space of delalloc, flushing the delayed directory index and delayed inode is OK, but we don't try to flush those things and just go back when there is no enough space to be reserved. This patch fixes this problem. We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL. If we can in the transaction, we should not flush anything, or the deadlock would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used, and we will flush all things. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:31 -05:00
Miao Xie	561c294d4c	Btrfs: fix wrong comment in can_overcommit() The comment is not coincident with the code. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:30 -05:00
Miao Xie	3fed40cc97	Btrfs: cleanup duplicated division functions div_factor{_fine} has been implemented for two times, cleanup it. And I move them into a independent file named math.h because they are common math functions. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-12-11 13:31:30 -05:00
Adam Buchbinder	48fc7f7e78	Fix misspellings of "whether" in comments. "Whether" is misspelled in various comments across the tree; this fixes them. No code changes. Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-11-19 14:31:35 +01:00
Masanari Iida	926ccfef82	Btrfs: Fix printk and variable name Correct spelling typo in btrfs. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-11-01 09:48:04 +01:00
Liu Bo	52b1de91ea	btrfs: unpin_extent_cache: fix the typo and unnecessary arguements - unpint->unpin - prealloc is no more used Signed-off-by: Liu Bo <liub.liubo@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-10-30 10:18:50 +01:00
Linus Torvalds	f48d42773b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has our series of fixes for the next rc. The biggest batch is from Jan Schmidt, fixing up some problems in our subvolume quota code and fixing btrfs send/receive to work with the new extended inode refs." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: do not bug when we fail to commit the transaction Btrfs: fix memory leak when cloning root's node Btrfs: Use btrfs_update_inode_fallback when creating a snapshot Btrfs: Send: preserve ownership (uid and gid) also for symlinks. Btrfs: fix deadlock caused by the nested chunk allocation btrfs: Return EINVAL when length to trim is less than FSB Btrfs: fix memory leak in btrfs_quota_enable() Btrfs: send correct rdev and mode in btrfs-send Btrfs: extended inode refs support for send mechanism Btrfs: Fix wrong error handling code Fix a sign bug causing invalid memory access in the ino_paths ioctl. Btrfs: comment for loop in tree_mod_log_insert_move Btrfs: fix extent buffer reference for tree mod log roots Btrfs: determine level of old roots Btrfs: tree mod log's old roots could still be part of the tree Btrfs: fix a tree mod logging issue for root replacement operations Btrfs: don't put removals from push_node_left into tree mod log twice	2012-10-26 09:34:04 -07:00
Josef Bacik	c37b2b6269	Btrfs: do not bug when we fail to commit the transaction We BUG if we fail to commit the transaction when creating a snapshot, which is just obnoxious. Remove the BUG_ON(). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-25 15:59:57 -04:00
Liu Bo	7bfdcf7fba	Btrfs: fix memory leak when cloning root's node After cloning root's node, we forgot to dec the src's ref which can lead to a memory leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-10-25 15:55:21 -04:00
Chris Mason	c657c3ef1a	Merge branch 'for-chris-fixed' of git://git.jan-o-sch.net/btrfs-unstable	2012-10-25 15:53:10 -04:00
Josef Bacik	be6aef6049	Btrfs: Use btrfs_update_inode_fallback when creating a snapshot On a really full file system I was getting ENOSPC back from btrfs_update_inode when trying to update the parent inode when creating a snapshot. Just use the fallback method so we can update the inode and not have to worry about having a delayed ref. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-25 15:50:18 -04:00
Alex Lyakas	e2d044fe77	Btrfs: Send: preserve ownership (uid and gid) also for symlinks. This patch also requires a change in the user-space part of "receive". We need to use "lchown" instead of "chown". We will do this in the following patch. Signed-off-by: Alex Lyakas <alex.btrfs@zadarastorage.com> if (S_ISREG(sctx->cur_inode_mode)) {	2012-10-25 15:47:31 -04:00
Miao Xie	671415b7db	Btrfs: fix deadlock caused by the nested chunk allocation Steps to reproduce: # mkfs.btrfs -m raid1 <disk1> <disk2> # btrfstune -S 1 <disk1> # mount <disk1> <mnt> # btrfs device add <disk3> <disk4> <mnt> # mount -o remount,rw <mnt> # dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1 Deadlock happened. It is because of the nested chunk allocation. When we wrote the data into the filesystem, we would allocate the data chunk because there was no data chunk in the filesystem. At the end of the data chunk allocation, we should insert the metadata of the data chunk into the extent tree, but there was no raid1 chunk, so we tried to lock the chunk allocation mutex to allocate the new chunk, but we had held the mutex, the deadlock happened. By rights, we would allocate the raid1 chunk when we added the second device because the profile of the seed filesystem is raid1 and we had two devices. But we didn't do that in fact. It is because the last step of the first device insertion didn't commit the transaction. So when we added the second device, we didn't cow the tree, and just inserted the relative metadata into the leaves which were generated by the first device insertion, and its profile was dup. So, I fix this problem by commiting the transaction at the end of the first device insertion. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-25 15:47:00 -04:00
Lukas Czerner	e515c18bfe	btrfs: Return EINVAL when length to trim is less than FSB Currently if len argument in btrfs_ioctl_fitrim() is smaller than one FSB we will continue and finally return 0 bytes discarded. However if the length to discard is smaller then file system block we should really return EINVAL. Signed-off-by: Lukas Czerner <lczerner@redhat.com>	2012-10-25 15:46:22 -04:00
Tsutomu Itoh	5b7ff5b3c4	Btrfs: fix memory leak in btrfs_quota_enable() We should free quota_root before returning from the error handling code. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-25 15:45:43 -04:00
Arne Jansen	d79e50433b	Btrfs: send correct rdev and mode in btrfs-send When sending a device file, the stream was missing the mode. Also the rdev was encoded wrongly. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-10-25 15:45:25 -04:00
Jan Schmidt	96b5bd7771	Btrfs: extended inode refs support for send mechanism This adds support for the new extended inode refs to btrfs send. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-10-25 15:45:16 -04:00
Stefan Behrens	84167d1905	Btrfs: Fix wrong error handling code gcc says "warning: comparison of unsigned expression >= 0 is always true" because i is an unsigned long. And gcc is right this time. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-10-25 15:40:03 -04:00
Gabriel de Perthuis	661bec6ba8	Fix a sign bug causing invalid memory access in the ino_paths ioctl. To see the problem, create many hardlinks to the same file (120 should do it), then look up paths by inode with: ls -i btrfs inspect inode-resolve -v $ino /mnt/btrfs I noticed the memory layout of the fspath->val data had some irregularities (some unnecessary gaps that stop appearing about halfway), so I'm not sure there aren't any bugs left in it.	2012-10-25 15:39:47 -04:00
Jan Schmidt	01763a2e37	Btrfs: comment for loop in tree_mod_log_insert_move Emphasis the way tree_mod_log_insert_move avoids adding MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of the move operation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-10-24 12:36:40 +02:00
Jan Schmidt	d638108484	Btrfs: fix extent buffer reference for tree mod log roots In get_old_root we grab a lock on the extent buffer before we obtain a reference on that buffer. That order is changed now. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-10-24 12:36:39 +02:00
Jan Schmidt	5b6602e762	Btrfs: determine level of old roots In btrfs_find_all_roots' termination condition, we compare the level of the old buffer we got from btrfs_search_old_slot to the level of the current root node. We'd better compare it to the level of the rewinded root node. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-10-24 12:36:38 +02:00
Jan Schmidt	834328a849	Btrfs: tree mod log's old roots could still be part of the tree Tree mod log treated old root buffers as always empty buffers when starting the rewind operations. However, the old root may still be part of the current tree at a lower level, with still some valid entries. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-10-24 12:36:37 +02:00
Jan Schmidt	ba1bfbd592	Btrfs: fix a tree mod logging issue for root replacement operations Avoid the implicit free by tree_mod_log_set_root_pointer, which is wrong in two places. Where needed, we call tree_mod_log_free_eb explicitly now. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-10-23 15:09:14 +02:00
Jan Schmidt	57911b8ba8	Btrfs: don't put removals from push_node_left into tree mod log twice Independant of the check (push_items < src_items) tree_mod_log_eb_copy did log the removal of the old data entries from the source buffer. Therefore, we must not call tree_mod_log_eb_move if the check evaluates to true, as that would log the removal twice, finally resulting in (rewinded) buffers with wrong values for header_nritems. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-10-23 15:09:11 +02:00
Linus Torvalds	09a9ad6a1f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace compile fixes from Eric W Biederman: "This tree contains three trivial fixes. One compiler warning, one thinko fix, and one build fix" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: btrfs: Fix compilation with user namespace support enabled userns: Fix posix_acl_file_xattr_userns gid conversion userns: Properly print bluetooth socket uids	2012-10-13 13:23:39 -07:00
Eric W. Biederman	e9069f4708	btrfs: Fix compilation with user namespace support enabled When compiling with user namespace support btrfs fails like: fs/btrfs/tree-log.c: In function ‘fill_inode_item’: fs/btrfs/tree-log.c:2955:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_uid’ fs/btrfs/ctree.h:2026:1: note: expected ‘u32’ but argument is of type ‘kuid_t’ fs/btrfs/tree-log.c:2956:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_gid’ fs/btrfs/ctree.h:2027:1: note: expected ‘u32’ but argument is of type ‘kgid_t’ Fix this by using i_uid_read and i_gid_read in Cc: Chris Mason <chris.mason@fusionio.com> Cc: Josef Bacik <jbacik@fusionio.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-10-12 15:01:42 -07:00
Jeff Layton	4fa6b5ecbf	audit: overhaul __audit_inode_child to accomodate retrying In order to accomodate retrying path-based syscalls, we need to add a new "type" argument to audit_inode_child. This will tell us whether we're looking for a child entry that represents a create or a delete. If we find a parent, don't automatically assume that we need to create a new entry. Instead, use the information we have to try to find an existing entry first. Update it if one is found and create a new one if not. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-12 00:32:03 -04:00
Jeff Layton	c43a25abba	audit: reverse arguments to audit_inode_child Most of the callers get called with an inode and dentry in the reverse order. The compiler then has to reshuffle the arg registers and/or stack in order to pass them on to audit_inode_child. Reverse those arguments for a micro-optimization. Reported-by: Eric Paris <eparis@redhat.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-12 00:32:00 -04:00
Linus Torvalds	72055425e5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "This is a large pull, with the bulk of the updates coming from: - Hole punching - send/receive fixes - fsync performance - Disk format extension allowing more hardlinks inside a single directory (btrfs-progs patch required to enable the compat bit for this one) I'm cooking more unrelated RAID code, but I wanted to make sure this original batch makes it in. The largest updates here are relatively old and have been in testing for some time." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits) btrfs: init ref_index to zero in add_inode_ref Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer Btrfs: fix page leakage Btrfs: do not warn_on when we cannot alloc a page for an extent buffer Btrfs: don't bug on enomem in readpage Btrfs: cleanup pages properly when ENOMEM in compression Btrfs: make filesystem read-only when submitting barrier fails Btrfs: detect corrupted filesystem after write I/O errors Btrfs: make compress and nodatacow mount options mutually exclusive btrfs: fix message printing Btrfs: don't bother committing delayed inode updates when fsyncing btrfs: move inline function code to header file Btrfs: remove unnecessary IS_ERR in bio_readpage_error() btrfs: remove unused function btrfs_insert_some_items() Btrfs: don't commit instead of overcommitting Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called Btrfs: be smarter about dropping things from the tree log Btrfs: don't lookup csums for prealloc extents Btrfs: cache extent state when writing out dirty metadata pages Btrfs: do not hold the file extent leaf locked when adding extent item ...	2012-10-10 10:49:20 +09:00
Chris Mason	f46dbe3dee	btrfs: init ref_index to zero in add_inode_ref Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-10-09 11:17:20 -04:00
Wang Sheng-Hui	1037a5affc	Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer In csum_dirty_buffer, we first get eb from page->private. Then we check if the page is the first page of eb. Later we check it again. Remove the repeated check here. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>	2012-10-09 09:37:30 -04:00
Josef Bacik	f60b1b49f6	Btrfs: fix page leakage Alloc_dummy_extent_buffer will not free the first page in the eb array if we fail to allocate a page, fix this. Thanks, Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:20:56 -04:00
Josef Bacik	4804b38293	Btrfs: do not warn_on when we cannot alloc a page for an extent buffer It's just annoying and the user will have gotten a nice OOM killer message so they are already fully aware they are screwed :). Thanks, Reported-by: Jérôme Poulin <jeromepoulin@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:20:43 -04:00
Josef Bacik	edd33c99c4	Btrfs: don't bug on enomem in readpage Get rid of the BUG_ON(ret == -ENOMEM) in __extent_read_full_page. Thanks, Reported-by: Jérôme Poulin <jeromepoulin@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:20:31 -04:00
Josef Bacik	15e3004a0e	Btrfs: cleanup pages properly when ENOMEM in compression We were freeing non-existent pages which was causing a panic for a user who was suffering from ENOMEM. This patch fixes the problem. Thanks, Reported-by: Jérôme Poulin <jeromepoulin@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:20:25 -04:00
Stefan Behrens	5af3e8cce8	Btrfs: make filesystem read-only when submitting barrier fails So far the return code of barrier_all_devices() is ignored, which means that errors are ignored. The result can be a corrupt filesystem which is not consistent. This commit adds code to evaluate the return code of barrier_all_devices(). The normal btrfs_error() mechanism is used to switch the filesystem into read-only mode when errors are detected. In order to decide whether barrier_all_devices() should return error or success, the number of disks that are allowed to fail the barrier submission is calculated. This calculation accounts for the worst RAID level of metadata, system and data. If single, dup or RAID0 is in use, a single disk error is already considered to be fatal. Otherwise a single disk error is tolerated. The calculation of the number of disks that are tolerated to fail the barrier operation is performed when the filesystem gets mounted, when a balance operation is started and finished, and when devices are added or removed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-10-09 09:20:19 -04:00
Stefan Behrens	62856a9b73	Btrfs: detect corrupted filesystem after write I/O errors In check-integrity, detect when a superblock is written that points to blocks that have not been written to disk due to I/O write errors. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-10-09 09:20:10 -04:00
Andrei Popa	bedb2cca72	Btrfs: make compress and nodatacow mount options mutually exclusive If a filesystem is mounted with compression and then remounted by adding nodatacow, the compression is disabled but the compress flag is still visible. Also, if a filesystem is mounted with nodatacow and then remounted with compression, nodatacow flag is still present but it's not active. This patch: - removes compress flags and notifies that the compression has been disabled if the filesystem is mounted with nodatacow - removes nodatacow and nodatasum flags if mounted with compress. Signed-off-by: Andrei Popa <andrei.popa@i-neo.ro>	2012-10-09 09:20:03 -04:00
Daniel J Blueman	489406626c	btrfs: fix message printing Fix various messages to include newline and module prefix. Signed-off-by: Daniel J Blueman <daniel@quora.org>	2012-10-09 09:19:57 -04:00
Josef Bacik	94edf4ae43	Btrfs: don't bother committing delayed inode updates when fsyncing We can just copy the in memory inode into the tree log directly, no sense in updating the fs tree so we can copy it into the tree log tree. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:19:50 -04:00
Robin Dong	479ed9abdb	btrfs: move inline function code to header file When building btrfs from kernel code, it will report: fs/btrfs/extent_io.h:281: warning: 'extent_buffer_page' declared inline after being called fs/btrfs/extent_io.h:281: warning: previous declaration of 'extent_buffer_page' was here fs/btrfs/extent_io.h:280: warning: 'num_extent_pages' declared inline after being called fs/btrfs/extent_io.h:280: warning: previous declaration of 'num_extent_pages' was here because of the wrong declaration of inline functions. Signed-off-by: Robin Dong <sanbai@taobao.com>	2012-10-09 09:15:43 -04:00
Tsutomu Itoh	7a2d6a6464	Btrfs: remove unnecessary IS_ERR in bio_readpage_error() Because the value of extent_map is only a correct value or NULL, so IS_ERR is unnecessary. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-09 09:15:43 -04:00
Robin Dong	8d1a1317af	btrfs: remove unused function btrfs_insert_some_items() The function btrfs_insert_some_items() would not be called by any other functions, so remove it. Signed-off-by: Robin Dong <sanbai@taobao.com>	2012-10-09 09:15:43 -04:00
Josef Bacik	44734ed1ca	Btrfs: don't commit instead of overcommitting I don't think we have the same problem that this was supposed to fix originally since we can allocate chunks in the enospc path now. This code is causing us to constantly commit the transaction as we get close to using all of our available space in our currently allocated chunks, instead of allocating another chunk and carrying on with life, which is not nice for performance. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:42 -04:00
Tsutomu Itoh	f0bd95ea72	Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called We should confirm the value of extent_map before calling trace_btrfs_get_extent() because the value of extent_map has the possibility of NULL. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-09 09:15:42 -04:00
Josef Bacik	18ec90d63f	Btrfs: be smarter about dropping things from the tree log When we truncate existing items in the tree log we've been searching for each individual item and removing them. This is unnecessary churn and searching, just keep track of the slot we are on and how many items we need to delete and delete them all at once. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:41 -04:00
Josef Bacik	6f1fed7753	Btrfs: don't lookup csums for prealloc extents The tree logging stuff was looking up csums to copy over for prealloc extents which is just work we don't need to be doing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:41 -04:00
Josef Bacik	e6138876ad	Btrfs: cache extent state when writing out dirty metadata pages Everytime we write out dirty pages we search for an offset in the tree, convert the bits in the state, and then when we wait we search for the offset again and clear the bits. So for every dirty range in the io tree we are doing 4 rb searches, which is suboptimal. With this patch we are only doing 2 searches for every cycle (modulo weird things happening). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:41 -04:00
Josef Bacik	ce19533256	Btrfs: do not hold the file extent leaf locked when adding extent item For some reason we unlock everything except the leaf we are on, set the path blocking and then add the extent item for the extent we just finished writing. I can't for the life of me figure out why we would want to do this, and the history doesn't really indicate that there was a real reason for it, so just remove it. This will reduce our tree lock contention on heavy writes. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:40 -04:00
Josef Bacik	de0022b9da	Btrfs: do not async metadata csumming in certain situations There are a coule scenarios where farming metadata csumming off to an async thread doesn't help. The first is if our processor supports crc32c, in which case the csumming will be fast and so the overhead of the async model is not worth the cost. The other case is for our tree log. We will be making that stuff dirty and writing it out and waiting for it immediately. Even with software crc32c this gives me a ~15% increase in speed with O_SYNC workloads. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:40 -04:00
Zach Brown	221b831835	btrfs: fix min csum item size warnings in 32bit commit `7ca4be45a0` limited csum items to PAGE_CACHE_SIZE. It used min() with incompatible types in 32bit which generates warnings: fs/btrfs/file-item.c: In function ‘btrfs_csum_file_blocks’: fs/btrfs/file-item.c:717: warning: comparison of distinct pointer types lacks a cast This uses min_t(u32,) to fix the warnings. u32 seemed reasonable because btrfs_root->leafsize is u32 and PAGE_CACHE_SIZE is unsigned long. Signed-off-by: Zach Brown <zab@zabbo.net>	2012-10-09 09:15:39 -04:00
Josef Bacik	67b0fd63d5	Btrfs: run delayed refs first when out of space Running delayed refs is faster than running delalloc, so lets do that first to try and reclaim space. This makes my fs_mark test about 20% faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-09 09:15:39 -04:00
Miao Xie	354aa0fb6d	Btrfs: fix orphan transaction on the freezed filesystem With the following debug patch: static int btrfs_freeze(struct super_block sb) { + struct btrfs_fs_info fs_info = btrfs_sb(sb); + struct btrfs_transaction *trans; + + spin_lock(&fs_info->trans_lock); + trans = fs_info->running_transaction; + if (trans) { + printk("Transid %llu, use_count %d, num_writer %d\n", + trans->transid, atomic_read(&trans->use_count), + atomic_read(&trans->num_writers)); + } + spin_unlock(&fs_info->trans_lock); return 0; } I found there was a orphan transaction after the freeze operation was done. It is because the transaction may not be committed when the transaction handle end even though it is the last handle of the current transaction. This design avoid committing the transaction frequently, but also introduce the above problem. So I add btrfs_attach_transaction() which can catch the current transaction and commit it. If there is no transaction, it will return ENOENT, and do not anything. This function also can be used to instead of btrfs_join_transaction_freeze() because it don't increase the writer counter and don't start a new transaction, so it also can fix the deadlock between sync and freeze. Besides that, it is used to instead of btrfs_join_transaction() in transaction_kthread(), because if there is no transaction, the transaction kthread needn't anything. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-09 09:15:39 -04:00
Miao Xie	a698d0755a	Btrfs: add a type field for the transaction handle This patch add a type field into the transaction handle structure, in this way, we needn't implement various end-transaction functions and can make the code more simple and readable. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-09 09:15:38 -04:00
Miao Xie	e8830e606f	Btrfs: fix memory leak in start_transaction() This patch fixes memory leak of the transaction handle which happened when starting transaction failed on a freezed fs. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-09 09:15:38 -04:00
Mark Fasheh	d24bec3ae5	btrfs: extended inode ref iteration The iterate_irefs in backref.c is used to build path components from inode refs. This patch adds code to iterate extended refs as well. I had modify the callback function signature to abstract out some of the differences between ref structures. iref_to_path() also needed similar changes. Signed-off-by: Mark Fasheh <mfasheh@suse.de>	2012-10-09 09:15:01 -04:00
Mark Fasheh	f186373fef	btrfs: extended inode refs This patch adds basic support for extended inode refs. This includes support for link and unlink of the refs, which basically gets us support for rename as well. Inode creation does not need changing - extended refs are only added after the ref array is full. Signed-off-by: Mark Fasheh <mfasheh@suse.de>	2012-10-09 09:14:45 -04:00
Konstantin Khlebnikov	0b173bc4da	mm: kill vma flag VM_CAN_NONLINEAR Move actual pte filling for non-linear file mappings into the new special vma operation: ->remap_pages(). Filesystems must implement this method to get non-linear mapping support, if it uses filemap_fault() then generic_file_remap_pages() can be used. Now device drivers can implement this method and obtain nonlinear vma support. Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-10-09 16:22:17 +09:00
Jan Schmidt	5a1d7843ca	btrfs: improved readablity for add_inode_ref Moved part of the code into a sub function and replaced most of the gotos by ifs, hoping that it will be easier to read now. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Mark Fasheh <mfasheh@suse.de>	2012-10-08 20:09:02 -04:00
Josef Bacik	0aa4a17d82	Btrfs: handle not finding the extent exactly when logging changed extents I started hitting warnings when running xfstest 68 in a loop because there were EM's that were not lined up properly with the physical extents. This is ok, if we do something like punch a hole or write to a preallocated space or something like that we can have an EM that doesn't cover the entire physical extent. So fix the tree logging stuff to cope with this case so we don't just commit the transaction. With this patch I no longer see the warnings from the tree logging code. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-08 20:09:02 -04:00
David Sterba	005d6427ac	btrfs: move transaction aborts to the point of failure Call btrfs_abort_transaction as early as possible when an error condition is detected, that way the line number reported is useful and we're not clueless anymore which error path led to the abort. Signed-off-by: David Sterba <dsterba@suse.cz>	2012-10-08 20:09:02 -04:00
Miao Xie	8732d44f80	Btrfs: fix the missing error information in create_pending_snapshot() The macro btrfs_abort_transaction() can get the line number of the code where the problem happens, so we should invoke it in the place that the error occurs, or we will lose the line number. Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-08 20:07:33 -04:00
Liu Bo	aa42ffd918	Btrfs: fix off-by-one in file clone Btrfs uses inclusive range end for lock_extent(), unlock_extent() and related functions, so we made off-by-one errors in file clone. This fixes it and also fixes some style problems. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-08 20:07:32 -04:00
David Sterba	7e97b8daf6	btrfs: allow setting NOCOW for a zero sized file via ioctl Hi, the patch si simple, but it has user visible impact and I'm not quite sure how to resolve it. In short, $subj says it, chattr -C supports it and we want to use it. The conditions that acutally allow to change the NOCOW flag are clear. What if I try to set the flag on a file that is not empty? Options: 1) whole ioctl will fail, EINVAL 2.1) ioctl will succeed, the NOCOW flag will be silently removed, but the file will stay COW-ed and checksummed 2.2) ioctl will succeed, flag will not be removed and a syslog message will warn that the COW flag has not been changed 2.2.1) dtto, no syslog message Man page of chattr states that "If it is set on a file which already has data blocks, it is undefined when the blocks assigned to the file will be fully stable." Yes, it's undefined and with current implementation it'll never happen. So from this end, the user cannot expect anything. I'm trying to find a reasonable behaviour, so that a command like 'chattr -R -aijS +C' to tweak a broad set of flags in a deep directory does not fail unnecessarily and does not pollute the log. My personal preference is 2.2.1, but my dev's oppinion is skewed, not counting the fact that I know the code and otherwise would look there before consulting the documentation. The patch implements 2.2.1. david -------------8<------------------- From: David Sterba <dsterba@suse.cz> It's safe to turn off checksums for a zero sized file. http://thread.gmane.org/gmane.comp.file-systems.btrfs/18030 "We cannot switch on NODATASUM for a file that already has extents that are checksummed. The invariant here is that either all the extents or none are checksummed. Theoretically it's possible to add/remove all checksums from a given file, but it's a potentially longtime operation, the file has to be in some intermediate state where the checksums partially exist but have to be ignored (for the csum->nocsum) until the file is fully converted, this brings more special cases to extent handling, it has to survive power failure and remain consistent, and probably needs to be restarted after next mount." Signed-off-by: David Sterba <dsterba@suse.cz>	2012-10-04 09:40:00 -04:00
Josef Bacik	c3308f84c1	Btrfs: fix punch hole when no extent exists I saw the warning in btrfs_drop_extent_cache where our end is less than our start while running xfstests 68 in a loop. This is because we unconditionally do drop_end = min(end, extent_end) in __btrfs_drop_extents(), even though we may not have found an extent in the range we were looking to drop. So keep track of wether or not we found something, and if we didn't just use our end. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:40:00 -04:00
Josef Bacik	926ced123b	Btrfs: don't do anything in our ->freeze_fs and ->unfreeze_fs We do not need to do anything special to freeze or unfreeze, it's all taken care of by the generic work, and what we currently have is wrong anyway since we shouldn't be returnning to userspace with mutexes held anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:40:00 -04:00
Josef Bacik	892951a92e	Btrfs: remove unused write cache pages hook The btree inode has it's own write cache pages so we can remove this write cache pages hook as it's not used. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:59 -04:00
Josef Bacik	b5bae2612a	Btrfs: fix race when getting the eb out of page->private We can race when checking wether PagePrivate is set on a page and we actually have an eb saved in the pages private pointer. We could have easily written out this page and released it in the time that we did the pagevec lookup and actually got around to looking at this page. So use mapping->private_lock to ensure we get a consistent view of the page->private pointer. This is inline with the alloc and releasepage paths which use private_lock when manipulating page->private. Thanks, Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:59 -04:00
Josef Bacik	ff44c6e36d	Btrfs: do not hold the write_lock on the extent tree while logging Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This is because I'm an idiot and didn't realize that rwlock's were spin locks, so we've been holding this thing while doing allocations and such which is not good. This patch fixes this by dropping the write lock before we do anything heavy and re-acquire it when it is done. We also need to take a ref on the em's in case their corresponding pages are evicted and mark them as being logged so that releasepage does not remove them and doesn't remove them from our local list. Thanks, Reported-by: Dave Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:58 -04:00
Josef Bacik	98114659e0	Btrfs: fix race with freeze and free space inodes So we start our freeze, somebody comes in and does an fsync() on a file where we have to commit a transaction for whatever reason, and we will deadlock because the freeze is waiting on FS_FREEZE people to stop writing to the file system, but the transaction is waiting for its free space inodes to be written out, which are in turn waiting on sb_start_intwrite while trying to write the file extents. To fix this we'll just skip the sb_start_intwrite() if we TRANS_JOIN_NOLOCK since we're being waited on by a transaction commit so we're safe wrt to freeze and this will keep us from deadlocking. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:58 -04:00
Liu Bo	6bbe3a9c80	Btrfs: kill obsolete arguments in btrfs_wait_ordered_extents nocow_only is now an obsolete argument. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-04 09:39:57 -04:00
Liu Bo	2e90cf858f	Btrfs: cleanup fs_info->hashers fs_info->hashers is now an obsolete one. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-04 09:39:57 -04:00
Liu Bo	ab26e9d6c8	Btrfs: cleanup for duplicated code in find_free_extent There is already an 'add free space' phrase in front of this one, we needn't to redo it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-04 09:39:57 -04:00
Josef Bacik	60376ce4a8	Btrfs: fix race in sync and freeze again I screwed this up, there is a race between checking if there is a running transaction and actually starting a transaction in sync where we could race with a freezer and get ourselves into trouble. To fix this we need to make a new join type to only do the try lock on the freeze stuff. If it fails we'll return EPERM and just return from sync. This fixes a hang Liu Bo reported when running xfstest 68 in a loop. Thanks, Reported-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-04 09:39:56 -04:00
David Sterba	b3ae244e71	btrfs: return EPERM upon rmdir on a subvolume A subvolume cannot be deleted via rmdir, but the error code ENOTEMPTY is confusing. Return EPERM instead, as this is not permitted. Signed-off-by: David Sterba <dsterba@suse.cz>	2012-10-04 09:39:56 -04:00
Wei Yongjun	ebb3dad435	Btrfs: using for_each_set_bit_from to simplify the code Using for_each_set_bit_from() to simplify the code. spatch with a semantic match is used to found this. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>	2012-10-04 09:39:55 -04:00
Anand Jain	1bcea35597	Btrfs: write_buf is now callable outside send.c Developing service cmds needs it. Signed-off-by: Anand Jain <anand.jain@oracle.com>	2012-10-04 09:39:55 -04:00
Tsutomu Itoh	b4f359ab06	Btrfs: remove unnecessary code in btree_get_extent() Unnecessary lookup_extent_mapping() is removed because an error is returned to the caller. This patch was made based on the advice from Stefan Behrens, thanks. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-04 09:39:54 -04:00
Tsutomu Itoh	0433f20d43	Btrfs: cleanup of error processing in btree_get_extent() This patch simplifies a little complex error processing in btree_get_extent(). Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-04 09:39:54 -04:00
Linus Torvalds	aab174f0df	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...	2012-10-02 20:25:04 -07:00
Kirill A. Shutemov	8c0a853770	fs: push rcu_barrier() from deactivate_locked_super() to filesystems There's no reason to call rcu_barrier() on every deactivate_locked_super(). We only need to make sure that all delayed rcu free inodes are flushed before we destroy related cache. Removing rcu_barrier() from deactivate_locked_super() affects some fast paths. E.g. on my machine exit_group() of a last process in IPC namespace takes 0.07538s. rcu_barrier() takes 0.05188s of that time. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Al Viro	99621b44aa	btrfs: reada_extent doesn't need kref for refcount All increments and decrements are under the same spinlock - have to be, since they need to protect the radix_tree it's found in. Just use int, no need to wank with kref... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-10-02 21:35:55 -04:00
Linus Torvalds	437589a74b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "This is a mostly modest set of changes to enable basic user namespace support. This allows the code to code to compile with user namespaces enabled and removes the assumption there is only the initial user namespace. Everything is converted except for the most complex of the filesystems: autofs4, 9p, afs, ceph, cifs, coda, fuse, gfs2, ncpfs, nfs, ocfs2 and xfs as those patches need a bit more review. The strategy is to push kuid_t and kgid_t values are far down into subsystems and filesystems as reasonable. Leaving the make_kuid and from_kuid operations to happen at the edge of userspace, as the values come off the disk, and as the values come in from the network. Letting compile type incompatible compile errors (present when user namespaces are enabled) guide me to find the issues. The most tricky areas have been the places where we had an implicit union of uid and gid values and were storing them in an unsigned int. Those places were converted into explicit unions. I made certain to handle those places with simple trivial patches. Out of that work I discovered we have generic interfaces for storing quota by projid. I had never heard of the project identifiers before. Adding full user namespace support for project identifiers accounts for most of the code size growth in my git tree. Ultimately there will be work to relax privlige checks from "capable(FOO)" to "ns_capable(user_ns, FOO)" where it is safe allowing root in a user names to do those things that today we only forbid to non-root users because it will confuse suid root applications. While I was pushing kuid_t and kgid_t changes deep into the audit code I made a few other cleanups. I capitalized on the fact we process netlink messages in the context of the message sender. I removed usage of NETLINK_CRED, and started directly using current->tty. Some of these patches have also made it into maintainer trees, with no problems from identical code from different trees showing up in linux-next. After reading through all of this code I feel like I might be able to win a game of kernel trivial pursuit." Fix up some fairly trivial conflicts in netfilter uid/git logging code. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (107 commits) userns: Convert the ufs filesystem to use kuid/kgid where appropriate userns: Convert the udf filesystem to use kuid/kgid where appropriate userns: Convert ubifs to use kuid/kgid userns: Convert squashfs to use kuid/kgid where appropriate userns: Convert reiserfs to use kuid and kgid where appropriate userns: Convert jfs to use kuid/kgid where appropriate userns: Convert jffs2 to use kuid and kgid where appropriate userns: Convert hpfs to use kuid and kgid where appropriate userns: Convert btrfs to use kuid/kgid where appropriate userns: Convert bfs to use kuid/kgid where appropriate userns: Convert affs to use kuid/kgid wherwe appropriate userns: On alpha modify linux_to_osf_stat to use convert from kuids and kgids userns: On ia64 deal with current_uid and current_gid being kuid and kgid userns: On ppc convert current_uid from a kuid before printing. userns: Convert s390 getting uid and gid system calls to use kuid and kgid userns: Convert s390 hypfs to use kuid and kgid where appropriate userns: Convert binder ipc to use kuids userns: Teach security_path_chown to take kuids and kgids userns: Add user namespace support to IMA userns: Convert EVM to deal with kuids and kgids in it's hmac computation ...	2012-10-02 11:11:09 -07:00
Miao Xie	90abccf2c6	Revert "Btrfs: do not do filemap_write_and_wait_range in fsync" This reverts commit `0885ef5b56` After applying the above patch, the performance slowed down because the dirty page flush can only be done by one task, so revert it. The following is the test result of sysbench: Before After 24MB/s 39MB/s Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:22 -04:00
Josef Bacik	698d0082c4	Btrfs: remove bytes argument from do_chunk_alloc Everybody is just making stuff up, and it's just used to see if we really do need to alloc a chunk, and since we do this when we already know we really do it's just a waste of space. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:21 -04:00
Josef Bacik	ea658badc4	Btrfs: delay block group item insertion So we have lots of places where we try to preallocate chunks in order to make sure we have enough space as we make our allocations. This has historically meant that we're constantly tweaking when we should allocate a new chunk, and historically we have gotten this horribly wrong so we way over allocate either metadata or data. To try and keep this from happening we are going to make it so that the block group item insertion is done out of band at the end of a transaction. This will allow us to create chunks even if we are trying to make an allocation for the extent tree. With this patch my enospc tests run faster (didn't expect this) and more efficiently use the disk space (this is what I wanted). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:21 -04:00
Kent Overstreet	be3940c0a9	btrfs: Kill some bi_idx references For immutable bio vecs, I've been auditing and removing bi_idx references. These were harmless, but removing them will make auditing easier. scrub_bio_end_io_worker() was open coding a bio_reset() - but this doesn't appear to have been needed for anything as right after it does a bio_put(), and perusing the code it doesn't appear anything else was holding a reference to the bio. The other use end_bio_extent_readpage() was just for a pr_debug() - changed it to something that might be a bit more useful. Signed-off-by: Kent Overstreet <koverstreet@google.com> CC: Chris Mason <chris.mason@oracle.com> CC: Stefan Behrens <sbehrens@giantdisaster.de>	2012-10-01 15:19:21 -04:00
Miao Xie	962197babe	Btrfs: fix unnecessary warning when the fragments make the space alloc fail When we wrote some data by compress mode into a btrfs filesystem which was full of the fragments, the kernel will report: BTRFS warning (device xxx): Aborting unused transaction. The reason is: We can not find a long enough free space to store the compressed data because of the fragmentary free space, and the compressed data can not be splited, so the kernel outputed the above message. In fact, btrfs can deal with this problem very well: it fall back to uncompressed IO, split the uncompressed data into small ones, and then store them into to the fragmentary free space. So we shouldn't output the above warning message. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:20 -04:00
Josef Bacik	69ffb54347	Btrfs: create a pinned em when writing to a prealloc range in DIO Wade Cline reported a problem where he was getting garbage and warnings when writing to a preallocated range via O_DIRECT. This is because we weren't creating our normal pinned extent_map for the range we were writing to, which was causing all sorts of issues. This patch fixes the problem and makes his testcase much happier. Thanks, Reported-by: Wade Cline <clinew@linux.vnet.ibm.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:20 -04:00
Josef Bacik	6df7881a84	Btrfs: move the sb_end_intwrite until after the throttle logic Sage reported the following lockdep backtrace ===================================== [ BUG: bad unlock balance detected! ] 3.6.0-rc2-ceph-00171-gc7ed62d #1 Not tainted ------------------------------------- btrfs-cleaner/7607 is trying to release lock (sb_internal) at: [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] but there are no more locks to release! other info that might help us debug this: 1 lock held by btrfs-cleaner/7607: #0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b405>] cleaner_kthread+0x95/0x120 [btrfs] stack backtrace: Pid: 7607, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00171-gc7ed62d #1 Call Trace: [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff810afa9e>] print_unlock_inbalance_bug+0xfe/0x110 [<ffffffff810b289e>] lock_release_non_nested+0x1ee/0x310 [<ffffffff81172f9b>] ? kmem_cache_free+0x7b/0x160 [<ffffffffa004106c>] ? put_transaction+0x8c/0x130 [btrfs] [<ffffffffa00422ae>] ? btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff810b2a95>] lock_release+0xd5/0x220 [<ffffffff81173071>] ? kmem_cache_free+0x151/0x160 [<ffffffff8117d9ed>] __sb_end_write+0x7d/0x90 [<ffffffffa00422ae>] btrfs_commit_transaction+0xa6e/0xb20 [btrfs] [<ffffffff81079850>] ? __init_waitqueue_head+0x60/0x60 [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40 [<ffffffffa0042758>] __btrfs_end_transaction+0x368/0x3c0 [btrfs] [<ffffffffa0042808>] btrfs_end_transaction_throttle+0x18/0x20 [btrfs] [<ffffffffa00318f0>] btrfs_drop_snapshot+0x410/0x600 [btrfs] [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0 [<ffffffffa00430ef>] btrfs_clean_old_snapshots+0xaf/0x150 [btrfs] [<ffffffffa003b405>] ? cleaner_kthread+0x95/0x120 [btrfs] [<ffffffffa003b419>] cleaner_kthread+0xa9/0x120 [btrfs] [<ffffffffa003b370>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [<ffffffff810791ee>] kthread+0xae/0xc0 [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [<ffffffff81635430>] ? retint_restore_args+0x13/0x13 [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0 [<ffffffff8163e740>] ? gs_change+0x13/0x13 This is because the throttle stuff can commit the transaction, which expects to be the one stopping the intwrite stuff, but we've already done it in the __btrfs_end_transaction. Moving the sb_end_intewrite after this logic makes the lockdep go away. Thanks, Tested-by: Sage Weil <sage@inktank.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:19 -04:00
Liu Bo	425d17a290	Btrfs: use larger limit for translation of logical to inode This is the change of the kernel side. Translation of logical to inode used to have an upper limit 4k on inode container's size, but the limit is not large enough for a data with a great many of refs, so when resolving logical address, we can end up with "ioctl ret=0, bytes_left=0, bytes_missing=19944, cnt=510, missed=2493" This changes to regard 64k as the upper limit and use vmalloc instead of kmalloc to get memory more easily. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:19 -04:00
Liu Bo	df031f0752	Btrfs: use helper for logical resolve We already have a helper, iterate_inodes_from_logical(), for logical resolve, so just use it. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:18 -04:00
Liu Bo	69917e4312	Btrfs: fix a bug in parsing return value in logical resolve In logical resolve, we parse extent_from_logical()'s 'ret' as a kind of flag. It is possible to lose our errors because (-EXXXX & BTRFS_EXTENT_FLAG_TREE_BLOCK) is true. I'm not sure if it is on purpose, it just looks too hacky if it is. I'd rather use a real flag and a 'ret' to catch errors. Acked-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Liu Bo <liub.liubo@gmail.com>	2012-10-01 15:19:18 -04:00
liubo	0647d6bd16	Btrfs: cleanup for unused ref cache stuff As ref cache has been removed from btrfs, there is no user on its lock and its check. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:17 -04:00
Miao Xie	8407aa4643	Btrfs: fix corrupted metadata in the snapshot When we delete a inode, we will remove all the delayed items including delayed inode update, and then truncate all the relative metadata. If there is lots of metadata, we will end the current transaction, and start a new transaction to truncate the left metadata. In this way, we will leave a inode item that its link counter is > 0, and also may leave some directory index items in fs/file tree after the current transaction ends. In other words, the metadata in this fs/file tree is inconsistent. If we create a snapshot for this tree now, we will find a inode with corrupted metadata in the new snapshot, and we won't continue to drop the left metadata, because its link counter is not 0. We fix this problem by updating the inode item before the current transaction ends. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:17 -04:00
David Sterba	837e197283	btrfs: polish names of kmem caches Usecase: watch 'grep btrfs < /proc/slabinfo' easy to watch all caches in one go. Signed-off-by: David Sterba <dsterba@suse.cz>	2012-10-01 15:19:16 -04:00
Josef Bacik	a80c8dcf7e	Btrfs: fix our overcommit math I noticed I was seeing large lags when running my torrent test in a vm on my laptop. While trying to make it lag less I noticed that our overcommit math was taking into account the number of bytes we wanted to reclaim, not the number of bytes we actually wanted to allocate, which means we wouldn't overcommit as often. This patch fixes the overcommit math and makes shrink_delalloc() use that logic so that it will stop looping faster. We still have pretty high spikes of latency, but the test now takes 3 minutes less time (about 5% faster). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:16 -04:00
Josef Bacik	dea31f5233	Btrfs: wait on async pages when shrinking delalloc Mitch reported a problem where you could get an ENOSPC error when untarring a kernel git tree onto a 16gb file system with compress-force=zlib. This is because compression is a huge pain, it will return from ->writepages() without having actually created any ordered extents. To get around this we check to see if the async submit counter is up, and if it is wait until it drops to 0 before doing our normal ordered wait dance. With this patch I can now untar a kernel git tree onto a 16gb file system without getting ENOSPC errors. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:15 -04:00
Liu Bo	9e8a4a8b0b	Btrfs: use flag EXTENT_DEFRAG for snapshot-aware defrag We're going to use this flag EXTENT_DEFRAG to indicate which range belongs to defragment so that we can implement snapshow-aware defrag: We set the EXTENT_DEFRAG flag when dirtying the extents that need defragmented, so later on writeback thread can differentiate between normal writeback and writeback started by defragmentation. Original-Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:15 -04:00
Tsutomu Itoh	3d6b5c3b5c	Btrfs: check return value of ulist_alloc() properly ulist_alloc() has the possibility of returning NULL. So, it is necessary to check the return value. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-01 15:19:14 -04:00
Tsutomu Itoh	f54fb859da	Btrfs: fix error handling in delete_block_group_cache() btrfs_iget() never return NULL. So, NULL check is unnecessary. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-10-01 15:19:14 -04:00
Miao Xie	903889f462	Btrfs: fix wrong size for the reservation when doing, file pre-allocation. When we ran fsstress(a program in xfstests), the filesystem hung up when it is full. It was because the space reserved in btrfs_fallocate() was wrong, btrfs_fallocate() just used the size of the pre-allocation to reserve the space, didn't took the block size aligning into account, so the size of the reserved space was less than the allocated space, it caused the over reserve problem and made the filesystem hung up when invoking cow_file_range(). Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:14 -04:00
Miao Xie	69ce977a17	Btrfs: output more information when aborting a unused transaction handle Though we dump the stack information when aborting a unused transaction handle, we don't know the correct place where we decide to abort the transaction handle if one function has several place where the transaction abort function is invoked and jumps to the same place after this call. And beside that we also don't know the reason why we jump to abort the current handle. So I modify the transaction abort function and make it output the function name, line and error information. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:13 -04:00
Miao Xie	2ecb79239b	Btrfs: fix unprotected ->log_batch We forget to protect ->log_batch when syncing a file, this patch fix this problem by atomic operation. And ->log_batch is used to check if there are parallel sync operations or not, so it is unnecessary to reset it to 0 after the sync operation of the current log tree complete. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	48c03c4bcf	Btrfs: fix wrong size for the reservation of the, snapshot creation We should insert/update 6 items(root ref, root backref, dir item, dir index, root item and parent inode) when creating a snapshot, not 5 items, fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	42874b3db7	Btrfs: fix the snapshot that should not exist The snapshot should be the image of the fs tree before it was created, so the metadata of the snapshot should not exist in the its tree. But now, we found the directory item and directory name index is in both the snapshot tree and the fs tree. It introduces some problems and makes the users feel strange: # mkfs.btrfs /dev/sda1 # mount /dev/sda1 /mnt # mkdir /mnt/1 # cd /mnt/1 # btrfs subvolume snapshot /mnt snap0 # ls -a /mnt/1/snap0/1 . .. [no other file/dir] # ll /mnt/1/snap0/ total 0 drwxr-xr-x 1 root root 10 Ju1 24 12:11 1 ^^^ There is no file/dir in it, but it's size is 10 # cd /mnt/1/snap0/1/snap0 [Enter a unexisted directory successfully...] There is nothing in the directory 1 in snap0, but btrfs told the length of this directory is 10. Beside that, we can enter an unexisted directory, it is very strange to the users. # btrfs subvolume snapshot /mnt/1/snap0 /mnt/snap1 # ll /mnt/1/snap0/1/ total 0 [None] # ll /mnt/snap1/1/ total 0 drwxr-xr-x 1 root root 0 Ju1 24 12:14 snap0 And the source of snap1 did have any directory in Directory 1, but snap1 have a snap0, it is different between the source and the snapshot. So I think we should insert directory item and directory name index and update the parent inode as the last step of snapshot creation, and do not leave the useless metadata in the file tree. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:12 -04:00
Miao Xie	66d8f3dd1c	Btrfs: add a new "type" field into the block reservation structure Sometimes we need choose the method of the reservation according to the type of the block reservation, such as the reservation for the delayed inode update. Now we identify the type just by comparing the address of the reservation variants, it is very ugly if it is a temporary one because we need compare it with all the common reservation variants. So we add a new "type" field to keep the type the reservation variants. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:11 -04:00
Miao Xie	6352b91da1	Btrfs: use a slab for ordered extents allocation The ordered extent allocation is in the fast path of the IO, so use a slab to improve the speed of the allocation. "Size of the struct is 280, so this will fall into the size-512 bucket, giving 8 objects per page, while own slab will pack 14 objects into a page. Another benefit I see is to check for leaked objects when the module is removed (and the cache destroy takes place)." -- David Sterba Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:11 -04:00
Miao Xie	b9a8cc5bef	Btrfs: fix file extent discount problem in the, snapshot If a snapshot is created while we are writing some data into the file, the i_size of the corresponding file in the snapshot will be wrong, it will be beyond the end of the last file extent. And btrfsck will report: root 256 inode 257 errors 100 Steps to reproduce: # mkfs.btrfs <partition> # mount <partition> <mnt> # cd <mnt> # dd if=/dev/zero of=tmpfile bs=4M count=1024 & # for ((i=0; i<4; i++)) > do > btrfs sub snap . $i > done This because the algorithm of disk_i_size update is wrong. Though there are some ordered extents behind the current one which we use to update disk_i_size, it doesn't mean those extents will be dealt with in the same transaction. So We shouldn't use the offset of those extents to update disk_i_size. Or we will get the wrong i_size in the snapshot. We fix this problem by recording the max real i_size. If we find there is a ordered extent which is in front of the current one and doesn't complete, we will record the end of the current one into that ordered extent. Surely, if the current extent holds the end of other extent(it must be greater than the current one because it is behind the current one), we will record the number that the current extent holds. In this way, we can exclude the ordered extents that may not be dealth with in the same transaction, and be easy to know the real disk_i_size. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:10 -04:00
Miao Xie	361048f586	Btrfs: fix full backref problem when inserting shared block reference If we create several snapshots at the same time, the following BUG_ON() will be triggered. kernel BUG at fs/btrfs/extent-tree.c:6047! Steps to reproduce: # mkfs.btrfs <partition> # mount <partition> <mnt> # cd <mnt> # for ((i=0;i<2400;i++)); do touch long_name_to_make_tree_more_deep$i; done # for ((i=0; i<4; i++)) > do > mkdir $i > for ((j=0; j<200; j++)) > do > btrfs sub snap . $i/$j > done & > done The reason is: Before transaction commit, some operations changed the fs tree and new tree blocks were allocated because of COW. We used the implicit non-shared back reference for those newly allocated tree blocks because they were not shared by two or more trees. And then we created the first snapshot for the fs tree, according to the back reference rules, we also used implicit back refs for the child tree blocks of the root node of the fs tree, now those child nodes/leaves were shared by two trees. Then We didn't deal with the delayed references, and continued to change the fs tree(created the second snapshot and inserted the dir item of the new snapshot into the fs tree). According to the rules of the back reference, we added full back refs for those tree blocks whose parents have be shared by two trees. Now some newly allocated tree blocks had two types of the references. As we know, the delayed reference system handles these delayed references from back to front, and the full delayed reference is inserted after the implicit ones. So when we dealt with the back references of those newly allocated tree blocks, the full references was dealt with at first. And if the first reference is a shared back reference and the tree block that the reference points to is newly allocated, It would be considered as a tree block which is shared by two or more trees when it is allocated and should be a full back reference not a implicit one, the flag of its reference also should be set to FULL_BACKREF. But in fact, it was a non-shared tree block with a implicit reference at beginning, so it was not compulsory to set the flags to FULL_BACKREF. So BUG_ON was triggered. We have several methods to fix this bug: 1. deal with delayed references after the snapshot is created and before we change the source tree of the snapshot. This is the easiest and safest way. 2. modify the sort method of the delayed reference tree, make the full delayed references be inserted before the implicit ones. It is also very easy, but I don't know if it will introduce some problems or not. 3. modify select_delayed_ref() and make it select the implicit delayed reference at first. This way is not so good because it may wastes CPU time if we have lots of delayed references. 4. set the flags to FULL_BACKREF, this method is a little complex comparing with the 1st way. I chose the 1st way to fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:10 -04:00
Miao Xie	6fa9700e73	Btrfs: fix error path in create_pending_snapshot() This patch fixes the following problem: - If we failed to deal with the delayed dir items, we should abort transaction, just as its comment said. Fix it. - If root reference or root back reference insertion failed, we should abort transaction. Fix it. - Fix the double free problem of pending->inherit. - Do not restore the trans->rsv if we doesn't change it. - make the error path more clearly. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:09 -04:00
Wei Yongjun	cf93dccea6	Btrfs: fix possible memory leak in scrub_setup_recheck_block() bbio has been malloced in btrfs_map_block() and should be freed before leaving from the error handling cases. spatch with a semantic match is used to found this problem. (http://coccinelle.lip6.fr/) Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>	2012-10-01 15:19:09 -04:00
Josef Bacik	7014cdb493	Btrfs: btrfs_drop_extent_cache should never fail I noticed this when I was doing the fsync stuff, we allocate split extents if we drop an extent range that is in the middle of an existing extent. This BUG()'s if we fail to allocate memory, but the fact is this is just a cache, we will just regenerate the cache if we need it, the important part is that we free the range we are given. This can be done without allocations, so if we fail to allocate splits just skip the splitting stage and free our em and look for more extents to drop. This also makes btrfs_drop_extent_cache a void since nobody was checking the return value anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:09 -04:00
Sage Weil	ac14aed665	Btrfs: do not take cleanup_work_sem in btrfs_run_delayed_iputs() Josef has suggested that this is not necessary. Removing it also avoids this lockdep splat (after the new sb_internal locking stuff was added): [ 604.090449] ====================================================== [ 604.114819] [ INFO: possible circular locking dependency detected ] [ 604.139262] 3.6.0-rc2-ceph-00144-g463b030 #1 Not tainted [ 604.162193] ------------------------------------------------------- [ 604.186139] btrfs-cleaner/6669 is trying to acquire lock: [ 604.209555] (sb_internal#2){.+.+..}, at: [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 604.257100] [ 604.257100] but task is already holding lock: [ 604.300366] (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs] [ 604.352989] [ 604.352989] which lock already depends on the new lock. [ 604.352989] [ 604.427104] [ 604.427104] the existing dependency chain (in reverse order) is: [ 604.478493] [ 604.478493] -> #1 (&fs_info->cleanup_work_sem){.+.+..}: [ 604.529313] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 604.559621] [<ffffffff81632b69>] down_read+0x39/0x4e [ 604.589382] [<ffffffffa004db98>] btrfs_lookup_dentry+0x218/0x550 [btrfs] [ 604.596161] btrfs: unlinked 1 orphans [ 604.675002] [<ffffffffa006aadd>] create_subvol+0x62d/0x690 [btrfs] [ 604.708859] [<ffffffffa006d666>] btrfs_mksubvol.isra.52+0x346/0x3a0 [btrfs] [ 604.772466] [<ffffffffa006d7f2>] btrfs_ioctl_snap_create_transid+0x132/0x190 [btrfs] [ 604.842245] [<ffffffffa006d8ae>] btrfs_ioctl_snap_create+0x5e/0x80 [btrfs] [ 604.912852] [<ffffffffa00708ae>] btrfs_ioctl+0x138e/0x1990 [btrfs] [ 604.951888] [<ffffffff8118e9b8>] do_vfs_ioctl+0x98/0x560 [ 604.989961] [<ffffffff8118ef11>] sys_ioctl+0x91/0xa0 [ 605.026628] [<ffffffff8163d569>] system_call_fastpath+0x16/0x1b [ 605.064404] [ 605.064404] -> #0 (sb_internal#2){.+.+..}: [ 605.126832] [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90 [ 605.163671] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 605.200228] [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0 [ 605.236818] [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 605.274029] [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs] [ 605.340520] [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs] [ 605.378720] [<ffffffff811972c8>] evict+0xb8/0x1c0 [ 605.416057] [<ffffffff811974d5>] iput+0x105/0x210 [ 605.452373] [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs] [ 605.521627] [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs] [ 605.560520] [<ffffffff810791ee>] kthread+0xae/0xc0 [ 605.598094] [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [ 605.636499] [ 605.636499] other info that might help us debug this: [ 605.636499] [ 605.736504] Possible unsafe locking scenario: [ 605.736504] [ 605.801931] CPU0 CPU1 [ 605.835126] ---- ---- [ 605.867093] lock(&fs_info->cleanup_work_sem); [ 605.898594] lock(sb_internal#2); [ 605.931954] lock(&fs_info->cleanup_work_sem); [ 605.965359] lock(sb_internal#2); [ 605.994758] [ 605.994758] * DEADLOCK * [ 605.994758] [ 606.075281] 2 locks held by btrfs-cleaner/6669: [ 606.104528] #0: (&fs_info->cleaner_mutex){+.+...}, at: [<ffffffffa003b5d5>] cleaner_kthread+0x95/0x120 [btrfs] [ 606.165626] #1: (&fs_info->cleanup_work_sem){.+.+..}, at: [<ffffffffa0048002>] btrfs_run_delayed_iputs+0x72/0x130 [btrfs] [ 606.231297] [ 606.231297] stack backtrace: [ 606.287723] Pid: 6669, comm: btrfs-cleaner Not tainted 3.6.0-rc2-ceph-00144-g463b030 #1 [ 606.347823] Call Trace: [ 606.376184] [<ffffffff8162a77c>] print_circular_bug+0x1fb/0x20c [ 606.409243] [<ffffffff810b25e8>] __lock_acquire+0x1ac8/0x1b90 [ 606.441343] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.474583] [<ffffffff810b2c82>] lock_acquire+0xa2/0x140 [ 606.505934] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.539429] [<ffffffff8132babd>] ? do_raw_spin_unlock+0x5d/0xb0 [ 606.571719] [<ffffffff8117dac6>] __sb_start_write+0xc6/0x1b0 [ 606.603498] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.637405] [<ffffffffa0042b84>] ? start_transaction+0x124/0x430 [btrfs] [ 606.670165] [<ffffffff81172e75>] ? kmem_cache_alloc+0xb5/0x160 [ 606.702144] [<ffffffffa0042b84>] start_transaction+0x124/0x430 [btrfs] [ 606.735562] [<ffffffffa00256a6>] ? block_rsv_add_bytes+0x56/0x80 [btrfs] [ 606.769861] [<ffffffffa00431a3>] btrfs_start_transaction+0x13/0x20 [btrfs] [ 606.804575] [<ffffffffa004ccfa>] btrfs_evict_inode+0x19a/0x330 [btrfs] [ 606.838756] [<ffffffff81634c6b>] ? _raw_spin_unlock+0x2b/0x40 [ 606.872010] [<ffffffff811972c8>] evict+0xb8/0x1c0 [ 606.903800] [<ffffffff811974d5>] iput+0x105/0x210 [ 606.935416] [<ffffffffa0048082>] btrfs_run_delayed_iputs+0xf2/0x130 [btrfs] [ 606.970510] [<ffffffffa003b5d5>] ? cleaner_kthread+0x95/0x120 [btrfs] [ 607.005648] [<ffffffffa003b5e1>] cleaner_kthread+0xa1/0x120 [btrfs] [ 607.040724] [<ffffffffa003b540>] ? btrfs_destroy_delayed_refs.isra.102+0x220/0x220 [btrfs] [ 607.104740] [<ffffffff810791ee>] kthread+0xae/0xc0 [ 607.137119] [<ffffffff810b379d>] ? trace_hardirqs_on+0xd/0x10 [ 607.169797] [<ffffffff8163e744>] kernel_thread_helper+0x4/0x10 [ 607.202472] [<ffffffff81635430>] ? retint_restore_args+0x13/0x13 [ 607.235884] [<ffffffff81079140>] ? flush_kthread_work+0x1a0/0x1a0 [ 607.268731] [<ffffffff8163e740>] ? gs_change+0x13/0x13 Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:08 -04:00
Sage Weil	e209db7ace	Btrfs: set journal_info in async trans commit worker We expect current->journal_info to point to the trans handle we are committing. Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:08 -04:00
Sage Weil	6fc4e35485	Btrfs: pass lockdep rwsem metadata to async commit transaction The freeze rwsem is taken by sb_start_intwrite() and dropped during the commit_ or end_transaction(). In the async case, that happens in a worker thread. Tell lockdep the calling thread is releasing ownership of the rwsem and the async thread is picking it up. XFS plays the same trick in fs/xfs/xfs_aops.c. Signed-off-by: Sage Weil <sage@inktank.com>	2012-10-01 15:19:07 -04:00
Josef Bacik	2aaa665581	Btrfs: add hole punching This patch adds hole punching via fallocate. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:07 -04:00
Josef Bacik	2671485d39	Btrfs: remove unused hint byte argument for btrfs_drop_extents I audited all users of btrfs_drop_extents and found that nobody actually uses the hint_byte argument. I'm sure it was used for something at some point but it's not used now, and the way the pinning works the disk bytenr would never be immediately useful anyway so lets just remove it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:06 -04:00
Liu Bo	d279440511	Btrfs: check if an inode has no checksum when logging it This is based on Josef's "Btrfs: turbo charge fsync". If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum items any more. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:06 -04:00
Liu Bo	46d8bc3424	Btrfs: fix a bug in checking whether a inode is already in log This is based on Josef's "Btrfs: turbo charge fsync". The current btrfs checks if an inode is in log by comparing root's last_log_commit to inode's last_sub_trans[2]. But the problem is that this root->last_log_commit is shared among inodes. Say we have N inodes to be logged, after the first inode, root's last_log_commit is updated and the N-1 remained files will be skipped. This fixes the bug by keeping a local copy of root's last_log_commit inside each inode and this local copy will be maintained itself. [1]: we regard each log transaction as a subset of btrfs's transaction, i.e. sub_trans Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:06 -04:00
Miao Xie	321f0e7022	Btrfs: fix wrong orphan count of the fs/file tree If we add a new orphan item, we should increase the atomic counter, not decrease it. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-10-01 15:19:05 -04:00
Liu Bo	4e2f84e63d	Btrfs: improve fsync by filtering extents that we want This is based on Josef's "Btrfs: turbo charge fsync". The above Josef's patch performs very good in random sync write test, because we won't have too much extents to merge. However, it does not performs good on the test: dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync The reason is when we do sequencial sync write, we need to merge the current extent just with the previous one, so that we can get accumulated extents to log: A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ... So we'll have to flush more and more checksum into log tree, which is the bottleneck according to my tests. But we can avoid this by telling fsync the real extents that are needed to be logged. With this, I did the above dd sync write test (size=50m), w/o (orig) w/ (josef's) w/ (this) SATA 104KB/s 109KB/s 121KB/s ramdisk 1.5MB/s 1.5MB/s 10.7MB/s (613%) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:05 -04:00
Josef Bacik	ca7e70f590	Btrfs: do not needlessly restart the transaction for enospc We will stop and restart a transaction every time we move to a different leaf when truncating a file. This is for enospc reasons, but really we could probably get away with doing this a little better by actually working until we hit an ENOSPC. So add a ->failfast flag to the block_rsv and set it when we do truncates which will fail as soon as the block rsv runs out of space, and then at that point we can stop and restart the transaction and refill the block rsv and carry on. This will make rm'ing of a file with lots of extents a bit faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:04 -04:00
Liu Bo	06d3d22b45	Btrfs: cleanup extents after we finish logging inode This is based on Josef's "Btrfs: turbo charge fsync". We should cleanup those extents after we've finished logging inode, otherwise we may do redundant work on them. Signed-off-by: Liu Bo <bo.li.liu@oracle.com>	2012-10-01 15:19:04 -04:00
Josef Bacik	0fa83cdb1d	Btrfs: only warn if we hit an error when doing the tree logging I hit this a couple times while working on my fsync patch (all my bugs, not normal operation), but with my new stuff we could have new errors from cases I have not encountered, so instead of BUG()'ing we should be WARN()'ing so that we are notified there is a problem but the user doesn't lose their data. We can easily commit the transaction in the case that the tree logging fails and still be fine, so let's try and be as nice to the user as possible. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:03 -04:00
Josef Bacik	5dc562c541	Btrfs: turbo charge fsync At least for the vm workload. Currently on fsync we will 1) Truncate all items in the log tree for the given inode if they exist and 2) Copy all items for a given inode into the log The problem with this is that for things like VMs you can have lots of extents from the fragmented writing behavior, and worst yet you may have only modified a few extents, not the entire thing. This patch fixes this problem by tracking which transid modified our extent, and then when we do the tree logging we find all of the extents we've modified in our current transaction, sort them and commit them. We also only truncate up to the xattrs of the inode and copy that stuff in normally, and then just drop any extents in the range we have that exist in the log already. Here are some numbers of a 50 meg fio job that does random writes and fsync()s after every write Original Patched SATA drive 82KB/s 140KB/s Fusion drive 431KB/s 2532KB/s So around 2-6 times faster depending on your hardware. There are a few corner cases, for example if you truncate at all we have to do it the old way since there is no way to be sure what is in the log is ok. This probably could be done smarter, but if you write-fsync-truncate-write-fsync you deserve what you get. All this work is in RAM of course so if your inode gets evicted from cache and you read it in and fsync it we'll do it the slow way if we are still in the same transaction that we last modified the inode in. The biggest cool part of this is that it requires no changes to the recovery code, so if you fsync with this patch and crash and load an old kernel, it will run the recovery and be a-ok. I have tested this pretty thoroughly with an fsync tester and everything comes back fine, as well as xfstests. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:03 -04:00
Josef Bacik	224ecce517	Btrfs: fix possible corruption when fsyncing written prealloced extents While working on my fsync patch my fsync tester kept hitting mismatching md5sums when I would randomly write to a prealloc'ed region, syncfs() and then write to the prealloced region some more and then fsync() and then immediately reboot. This is because the tree logging code will skip writing csums for file extents who's generation is less than the current running transaction. When we mark extents as written we haven't been updating their generation so they were always being skipped. This wouldn't happen if you were to preallocate and then write in the same transaction, but if you for example prealloced a VM you could definitely run into this problem. This patch makes my fsync tester happy again. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Josef Bacik	54338b5cc4	Btrfs: do not allocate chunks as agressively Swinging this pendulum back the other way. We've been allocating chunks up to 2% of the disk no matter how much we actually have allocated. So instead fix this calculation to only allocate chunks if we have more than 80% of the space available allocated. Please test this as it will likely cause all sorts of ENOSPC problems to pop up suddenly. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Josef Bacik	7c735313bd	Btrfs: update last trans if we don't update the inode There is a completely impossible situation to hit where you can preallocate a file, fsync it, write into the preallocated region, have the transaction commit twice and then fsync and then immediately lose power and lose all of the contents of the write. This patch fixes this just so I feel better about the situation and because it is lightweight, we just update the last_trans when we finish an ordered IO and we don't update the inode itself. This way we are completely safe and I feel better. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-10-01 15:19:02 -04:00
Jan Schmidt	995e01b7af	Btrfs: fix gcc warnings for 32bit compiles Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-10-01 15:19:01 -04:00
Chris Mason	74dd17fbe3	Btrfs: fix btrfs send for inline items and compression The btrfs send code was assuming the offset of the file item into the extent translated to bytes on disk. If we're compressed, this isn't true, and so it was off into extents owned by other files. It was also improperly handling inline extents. This solves a crash where we may have gone past the end of the file extent item by not testing early enough for an inline extent. It also solves problems where we have a whole between the end of the inline item and the start of the full extent. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-10-01 15:19:00 -04:00
Alexander Block	6d85ed05e1	Btrfs: don't treat top/root directory inode as deleted/reused We can't do the deleted/reused logic for top/root inodes as it would create a stream that tries to delete and recreate the root dir. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:19:00 -04:00
Alexander Block	2981e225f7	Btrfs: ignore non-FS inodes for send/receive We have to ignore inode/space cache objects in send/receive. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:59 -04:00
Alexander Block	2f28f4787c	Btrfs: pass root instead of parent_root to iterate_inode_ref We need to pass the root that we determined earlier to iterate_inode_ref. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:58 -04:00
Alexander Block	d8347fa444	Btrfs: use <= instead of < in is_extent_unchanged Used the wrong compare operator here. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:58 -04:00
Alexander Block	3954096d4b	Btrfs: fix check for changed extent in is_extent_unchanged The previous check was working fine, but this check should be easier to read. Also, we could theoritically have some exotic bugs with the previous checks. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:57 -04:00
Alexander Block	5dc67d0ba9	Btrfs: free nce and nce_head on error in name_cache_insert Both were leaked in case of error. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:56 -04:00
Alexander Block	3e126f32f8	Btrfs: remove unused tmp_path from iterate_dir_item A leftover from older code and unused now. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:55 -04:00
Alexander Block	e938c8ad54	Btrfs: code cleanups for send/receive Doing some code cleanups as suggested by Arne. Changes do not change any logic. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:55 -04:00
Alexander Block	766702ef49	Btrfs: add/fix comments/documentation for send/receive As the subject already said, add/fix comments. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:54 -04:00
Alexander Block	e479d9bb5f	Btrfs: update send_progress at correct places Updating send_progress in process_recorded_refs was not correct. It got updated too early in the cur_inode_new_gen case. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:53 -04:00
Alexander Block	34d73f54e2	Btrfs: make aux field of ulist 64 bit Btrfs send/receive uses the aux field to store inode numbers. On 32 bit machines this may become a problem. Also fix all users of ulist_add and ulist_add_merged. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:53 -04:00
Alexander Block	7e0926fe5f	Btrfs: fix use of radix_tree for name_cache in send/receive We can't easily use the index of the radix tree for inums as the radix tree uses 32bit indexes on 32bit kernels. For 32bit kernels, we now use the lower 32bit of the inum as index and an additional list to store multiple entries per radix tree entry. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:52 -04:00
Alexander Block	17589bd96e	Btrfs: fix memory leak for name_cache in send/receive When everything is done, name_cache_free is called which however forgot to call kfree on the cache entries. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:51 -04:00
Alexander Block	adbe7fb6c4	Btrfs: don't break in the final loop of find_extent_clone If we break, we may miss the clone from send_root which we prefer over all other clones. Commit is a result of Arne's review. Reported-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:50 -04:00
Alexander Block	52f9e53ede	Btrfs: use normal return path for root == send_root case Don't have a seperate return path for the mentioned case. Now we do the same "take lowest inode/offset" logic for all found clones. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:50 -04:00
Alexander Block	35075bb046	Btrfs: use kmalloc instead of stack for backref_ctx Make sure to never get in trouble due to the backref_ctx which was on the stack before. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:49 -04:00
Alexander Block	ee849c0472	Btrfs: rename backref_ctx::found_in_send_root to found_itself The new name should be easier to understand/read. Commit is a result of Arne's review. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:48 -04:00
Alexander Block	d27aed5e24	Btrfs: remove unused use_list from send/receive code use_list is a leftover and unused. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:48 -04:00
Alexander Block	ccf1626b49	Btrfs: add correct parent to check_dirs when dir got moved We only added the parent for the new position of a moved dir. We also need to add the old parent of the moved dir. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:47 -04:00
Alexander Block	9ea3ef516d	Btrfs: remove unused code with #if 0 fs_path_remove is not used at the moment due to a previous patch. Remove it for now (with #if 0) to avoid compile warnings. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:46 -04:00
Alexander Block	b9291affaa	Btrfs: add missing check for dir != tmp_dir to is_first_ref We missed that check which resultet in all refs with the same name being reported as first_ref. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:45 -04:00
Alexander Block	1f4692da95	Btrfs: fix cur_ino < parent_ino case for send/receive When the current inodes inum is smaller then the inum of the parent directory strange things were happending due to wrong path resolution and other bugs. Fix this with a new approach for the problem. Reported-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:45 -04:00
Alexander Block	85a7b33b96	Btrfs: add rdev to get_inode_info in send/receive We need rdev in the next commit. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-10-01 15:18:44 -04:00
Linus Torvalds	99dbb1632f	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull the trivial tree from Jiri Kosina: "Tiny usual fixes all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits) doc: fix old config name of kprobetrace fs/fs-writeback.c: cleanup riteback_sb_inodes kerneldoc btrfs: fix the commment for the action flags in delayed-ref.h btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID vfs: fix kerneldoc for generic_fh_to_parent() treewide: fix comment/printk/variable typos ipr: fix small coding style issues doc: fix broken utf8 encoding nfs: comment fix platform/x86: fix asus_laptop.wled_type module parameter mfd: printk/comment fixes doc: getdelays.c: remember to close() socket on error in create_nl_socket() doc: aliasing-test: close fd on write error mmc: fix comment typos dma: fix comments spi: fix comment/printk typos in spi Coccinelle: fix typo in memdup_user.cocci tmiofb: missing NULL pointer checks tools: perf: Fix typo in tools/perf tools/testing: fix comment / output typos ...	2012-10-01 09:06:36 -07:00
Al Viro	2903ff019b	switch simple cases of fget_light to fdget Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 22:20:08 -04:00
Al Viro	8319aa9127	switch btrfs_ioctl_clone() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:09 -04:00
Al Viro	ecd188159e	switch btrfs_ioctl_snap_create_transid() to fget_light() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-09-26 21:10:07 -04:00
Eric W. Biederman	2f2f43d3c7	userns: Convert btrfs to use kuid/kgid where appropriate Cc: Chris Mason <chris.mason@fusionio.com> Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2012-09-21 03:13:31 -07:00
Wang Sheng-Hui	44a075bde9	btrfs: fix the commment for the action flags in delayed-ref.h The action field has been merged into struct btrfs_delayed_ref_node, and no struct btrfs_delayed_ref is available now. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-09-21 08:07:12 +02:00
Eric W. Biederman	5f3a4a28ec	userns: Pass a userns parameter into posix_acl_to_xattr and posix_acl_from_xattr - Pass the user namespace the uid and gid values in the xattr are stored in into posix_acl_from_xattr. - Pass the user namespace kuid and kgid values should be converted into when storing uid and gid values in an xattr in posix_acl_to_xattr. - Modify all callers of posix_acl_from_xattr and posix_acl_to_xattr to pass in &init_user_ns. In the short term this change is not strictly needed but it makes the code clearer. In the longer term this change is necessary to be able to mount filesystems outside of the initial user namespace that natively store posix acls in the linux xattr format. Cc: Theodore Tso <tytso@mit.edu> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>	2012-09-18 01:01:35 -07:00
Linus Torvalds	6167f81fd1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull a btrfs revert from Chris Mason: "My for-linus branch has one revert in the new quota code. We're building up more fixes at etc for the next merge window, but I'm keeping them out unless they are bigger regressions or have a huge impact." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Revert "Btrfs: fix some error codes in btrfs_qgroup_inherit()"	2012-09-16 12:58:44 -07:00
Chris Mason	f3a87f1b0c	Revert "Btrfs: fix some error codes in btrfs_qgroup_inherit()" This reverts commit `5986802c2f`. Both paths are not error paths but regular cases where non-qgroup subvols are involved. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-09-14 20:06:30 -04:00
Wang Sheng-Hui	527a136138	btrfs: fix trivial typo for the comment of BTRFS_FREE_INO_OBJECTID It should be storing, not sotring. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-09-06 11:11:14 +02:00
Liu Bo	8bad3c0244	btrfs: fix comment typo in btrfs_finish_ordered_io Fix typo errors in comments of btrfs_finish_ordered_io. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-09-01 08:27:57 -07:00
Linus Torvalds	318e151019	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "I've split out the big send/receive update from my last pull request and now have just the fixes in my for-linus branch. The send/recv branch will wander over to linux-next shortly though. The largest patches in this pull are Josef's patches to fix DIO locking problems and his patch to fix a crash during balance. They are both well tested. The rest are smaller fixes that we've had queued. The last rc came out while I was hacking new and exciting ways to recover from a misplaced rm -rf on my dev box, so these missed rc3." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits) Btrfs: fix that repair code is spuriously executed for transid failures Btrfs: fix ordered extent leak when failing to start a transaction Btrfs: fix a dio write regression Btrfs: fix deadlock with freeze and sync V2 Btrfs: revert checksum error statistic which can cause a BUG() Btrfs: remove superblock writing after fatal error Btrfs: allow delayed refs to be merged Btrfs: fix enospc problems when deleting a subvol Btrfs: fix wrong mtime and ctime when creating snapshots Btrfs: fix race in run_clustered_refs Btrfs: don't run __tree_mod_log_free_eb on leaves Btrfs: increase the size of the free space cache Btrfs: barrier before waitqueue_active Btrfs: fix deadlock in wait_for_more_refs btrfs: fix second lock in btrfs_delete_delayed_items() Btrfs: don't allocate a seperate csums array for direct reads Btrfs: do not strdup non existent strings Btrfs: do not use missing devices when showing devname Btrfs: fix that error value is changed by mistake Btrfs: lock extents as we map them in DIO ...	2012-08-29 11:36:22 -07:00
Stefan Behrens	256dd1bb37	Btrfs: fix that repair code is spuriously executed for transid failures If verify_parent_transid() fails for all mirrors, the current code calls repair_io_failure() anyway which means: - that the disk block is rewritten without repairing anything and - that a kernel log message is printed which misleadingly claims that a read error was corrected. This is an example: parent transid verify failed on 615015833600 wanted 110423 found 110424 parent transid verify failed on 615015833600 wanted 110423 found 110424 btrfs read error corrected: ino 1 off 615015833600 (dev /dev/...) It is wrong to ignore the results from verify_parent_transid() and to call repair_eb_io_failure() when the verification of the transids failed. This commit fixes the issue. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-08-28 16:53:43 -04:00
Liu Bo	d280e5be94	Btrfs: fix ordered extent leak when failing to start a transaction We cannot just return error before freeing ordered extent and releasing reserved space when we fail to start a transacion. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-08-28 16:53:42 -04:00
Liu Bo	24c03fa5cf	Btrfs: fix a dio write regression This bug is introduced by commit 3b8bde746f6f9bd36a9f05f5f3b6e334318176a9 (Btrfs: lock extents as we map them in DIO). In dio write, we should unlock the section which we didn't do IO on in case that we fall back to buffered write. But we need to not only unlock the section but also cleanup reserved space for the section. This bug was found while running xfstests 133, with this 133 no longer complains. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-08-28 16:53:41 -04:00
Josef Bacik	bd7de2c9a4	Btrfs: fix deadlock with freeze and sync V2 We can deadlock with freeze right now because we unconditionally start a transaction in our ->sync_fs() call. To fix this just check and see if we have a running transaction to commit. This saves us from the deadlock because at this point we'll have the umount sem for the sb so we're safe from freezes coming in after we've done our check. With this patch the freeze xfstests no longer deadlocks. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-08-28 16:53:40 -04:00
Stefan Behrens	5ee0844d64	Btrfs: revert checksum error statistic which can cause a BUG() Commit `442a4f6308` added btrfs device statistic counters for detected IO and checksum errors to Linux 3.5. The statistic part that counts checksum errors in end_bio_extent_readpage() can cause a BUG() in a subfunction: "kernel BUG at fs/btrfs/volumes.c:3762!" That part is reverted with the current patch. However, the counting of checksum errors in the scrub context remains active, and the counting of detected IO errors (read, write or flush errors) in all contexts remains active. Cc: stable <stable@vger.kernel.org> # 3.5 Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-08-28 16:53:39 -04:00
Stefan Behrens	68ce9682a4	Btrfs: remove superblock writing after fatal error With commit `acce952b0`, btrfs was changed to flag the filesystem with BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal error happened like a write I/O errors of all mirrors. In such situations, on unmount, the superblock is written in btrfs_error_commit_super(). This is done with the intention to be able to evaluate the error flag on the next mount. A warning is printed in this case during the next mount and the log tree is ignored. The issue is that it is possible that the superblock points to a root that was not written (due to write I/O errors). The result is that the filesystem cannot be mounted. btrfsck also does not start and all the other btrfs-progs tools fail to start as well. However, mount -o recovery is working well and does the right things to recover the filesystem (i.e., don't use the log root, clear the free space cache and use the next mountable root that is stored in the root backup array). This patch removes the writing of the superblock when BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error flag in the mount function. These lines can be used to reproduce the issue (using /dev/sdm): SCRATCH_DEV=/dev/sdm SCRATCH_MNT=/mnt echo 0 25165824 linear $SCRATCH_DEV 0 \| dmsetup create foo ls -alLF /dev/mapper/foo mkfs.btrfs /dev/mapper/foo mount /dev/mapper/foo $SCRATCH_MNT echo bar > $SCRATCH_MNT/foo sync echo 0 25165824 error \| dmsetup reload foo dmsetup resume foo ls -alF $SCRATCH_MNT touch $SCRATCH_MNT/1 ls -alF $SCRATCH_MNT sleep 35 echo 0 25165824 linear $SCRATCH_DEV 0 \| dmsetup reload foo dmsetup resume foo sleep 1 umount $SCRATCH_MNT btrfsck /dev/mapper/foo dmsetup remove foo Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-08-28 16:53:38 -04:00
Josef Bacik	ae1e206b80	Btrfs: allow delayed refs to be merged Daniel Blueman reported a bug with fio+balance on a ramdisk setup. Basically what happens is the balance relocates a tree block which will drop the implicit refs for all of its children and adds a full backref. Once the block is relocated we have to add the implicit refs back, so when we cow the block again we add the implicit refs for its children back. The problem comes when the original drop ref doesn't get run before we add the implicit refs back. The delayed ref stuff will specifically prefer ADD operations over DROP to keep us from freeing up an extent that will have references to it, so we try to add the implicit ref before it is actually removed and we panic. This worked fine before because the add would have just canceled the drop out and we would have been fine. But the backref walking work needs to be able to freeze the delayed ref stuff in time so we have this ever increasing sequence number that gets attached to all new delayed ref updates which makes us not merge refs and we run into this issue. So to fix this we need to merge delayed refs. So everytime we run a clustered ref we need to try and merge all of its delayed refs. The backref walking stuff locks the delayed ref head before processing, so if we have it locked we are safe to merge any refs inside of the sequence number. If there is no sequence number we can merge all refs. Doing this not only fixes our bug but keeps the delayed ref code from adding and removing useless refs and batching together multiple refs into one search instead of one search per delayed ref, which will really help our commit times. I ran this with Daniels test and 276 and I haven't seen any problems. Thanks, Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-08-28 16:53:38 -04:00
Josef Bacik	5a24e84c55	Btrfs: fix enospc problems when deleting a subvol Subvol delete is a special kind of awful where we use the global reserve to cover the ENOSPC requirements. The problem is once we're done removing everything we do a btrfs_update_inode(), which by default will try to do the delayed update stuff which will use it's own reserve. There will be no space in this reserve and we'll return ENOSPC. So instead use btrfs_update_inode_fallback() which will just fallback to updating the inode item in the case of enospc. This is fine because the global reserve covers the space requirements for this. With this patch I can now delete a subvol on a problem image Dave Sterba sent me. Thanks, Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:37 -04:00
Miao Xie	c0f62dedd0	Btrfs: fix wrong mtime and ctime when creating snapshots When we created a new snapshot, the mtime and ctime of its parent directory were not updated. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:36 -04:00
Arne Jansen	22cd2e7de7	Btrfs: fix race in run_clustered_refs With commit commit `d1270cd91f` Author: Arne Jansen <sensille@gmx.net> Date: Tue Sep 13 15:16:43 2011 +0200 Btrfs: put back delayed refs that are too new I added a window where the delayed_ref's head->ref_mod code can diverge from the sum of the remaining refs, because we release the head->mutex in the middle. This leads to btrfs_lookup_extent_info returning wrong numbers. This patch fixes this by adjusting the head's ref_mod with each delayed ref we run. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:35 -04:00
Chris Mason	b12a3b1ea2	Btrfs: don't run __tree_mod_log_free_eb on leaves When we split a leaf, we may end up inserting a new root on top of that leaf. The reflog code was incorrectly assuming the old root was always a node. This makes sure we skip over leaves. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:34 -04:00
Josef Bacik	6fc823b10f	Btrfs: increase the size of the free space cache Arne was complaining about the space cache having mismatching generation numbers when debugging a deadlock. This is because we can run out of space in our preallocated range for our space cache if you have a pretty fragmented amount of space in your pinned space. So just increase the amount of space we preallocate for space cache so we can be sure to have enough space. This will only really affect data ranges since their the only chunks that end up larger than 256MB. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:34 -04:00
Josef Bacik	66657b318e	Btrfs: barrier before waitqueue_active We need a barrir before calling waitqueue_active otherwise we will miss wakeups. So in places that do atomic_dec(); then atomic_read() use atomic_dec_return() which imply a memory barrier (see memory-barriers.txt) and then add an explicit memory barrier everywhere else that need them. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-08-28 16:53:33 -04:00
Arne Jansen	1fa11e265f	Btrfs: fix deadlock in wait_for_more_refs Commit `a168650c` introduced a waiting mechanism to prevent busy waiting in btrfs_run_delayed_refs. This can deadlock with btrfs_run_ordered_operations, where a tree_mod_seq is held while waiting for the io to complete, while the end_io calls btrfs_run_delayed_refs. This whole mechanism is unnecessary. If not enough runnable refs are available to satisfy count, just return as count is more like a guideline than a strict requirement. In case we have to run all refs, commit transaction makes sure that no other threads are working in the transaction anymore, so we just assert here that no refs are blocked. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-28 16:53:32 -04:00
Fengguang Wu	6209526531	btrfs: fix second lock in btrfs_delete_delayed_items() Fix a real bug caught by coccinelle. fs/btrfs/delayed-inode.c:1013:1-11: second lock on line 1013 Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2012-08-28 16:53:31 -04:00
Josef Bacik	c329861da4	Btrfs: don't allocate a seperate csums array for direct reads We've been allocating a big array for csums instead of storing them in the io_tree like we do for buffered reads because previously we were locking the entire range, so we didn't have an extent state for each sector of the range. But now that we do the range locking as we map the buffers we can limit the mapping lenght to sectorsize and use the private part of the io_tree for our csums. This allows us to avoid an extra memory allocation for direct reads which could incur latency. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-08-28 16:53:30 -04:00
Josef Bacik	99f5944b84	Btrfs: do not strdup non existent strings When we close devices we add back empty devices for some reason that escapes me. In the case of a missing dev we don't allocate an rcu_string for it's name, so check to see if the device has a name and if it doesn't don't bother strdup()'ing it. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-08-28 16:53:29 -04:00
Josef Bacik	aa9ddcd4b5	Btrfs: do not use missing devices when showing devname If you do the following mkfs.btrfs /dev/sdb /dev/sdc rmmod btrfs dd if=/dev/zero of=/dev/sdb bs=1M count=1 mount -o degraded /dev/sdc /mnt/btrfs-test the box will panic trying to deref the name for the missing dev since it is the lower numbered devid. So fix show_devname to not use missing devices. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-08-28 16:53:29 -04:00
Stefan Behrens	3627bf4503	Btrfs: fix that error value is changed by mistake In iterate_inodes_from_logical() the error result from extent_from_logical() is patched by mistake. Typically ENOENT is patched to EINVAL because (-ENOENT & BTRFS_EXTENT_FLAG_TREE_BLOCK) evaluates to true. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-08-28 16:53:28 -04:00
Josef Bacik	eb838e73dc	Btrfs: lock extents as we map them in DIO A deadlock in xfstests 113 was uncovered by commit `d187663ef2` This is because we would not return EIOCBQUEUED for short AIO reads, instead we'd wait for the DIO to complete and then return the amount of data we transferred, which would allow our stuff to unlock the remaning amount. But with this change this no longer happens, so if we have a short AIO read (for example if we try to read past EOF), we could leave the section from EOF to the end of where we tried to read locked. Fixing this is tricky since there is no clear way to know exactly how much data DIO truly submitted for IO, so to make this less hard on ourselves and less combersome we need to lock the extents as we try to map them, and then we unlock any areas we didn't actually map. This makes us completely safe from deadlocks and reliance on a particular behavior of the DIO code. This also lays the groundwork for allowing us to use the normal csum storage method for reads which means we can remove an allocation. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-08-28 16:53:27 -04:00
Dan Carpenter	dadd1105ca	Btrfs: fix some endian bugs handling the root times "trans->transid" is cpu endian but we want to store the data as little endian. "item->ctime.nsec" is only 32 bits, not 64. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-08-28 16:53:26 -04:00
Dan Carpenter	55e591ffde	Btrfs: unlock on error in btrfs_delalloc_reserve_metadata() We should release this mutex before returning the error code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-08-28 16:53:25 -04:00
Dan Carpenter	57a5a88203	Btrfs: checking for NULL instead of IS_ERR add_qgroup_rb() never returns NULL, only error pointers. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-08-28 16:53:25 -04:00
Dan Carpenter	5986802c2f	Btrfs: fix some error codes in btrfs_qgroup_inherit() These are returning zero when it should be returning a negative error code. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-08-28 16:53:24 -04:00
Stefan Behrens	aa2ffd0616	Btrfs: fix a misplaced address operator in a condition This should obviously not be "if (&flag)" but "if (flag)". Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-08-28 16:53:23 -04:00
Linus Torvalds	15fc5deb1f	Merge branch 'for-linus-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs merge fix from Chris Mason: "This fixes a merge error in rc1. The calls to mnt_want_write should have been removed." * 'for-linus-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: remove mnt_want_write call in btrfs_mksubvol	2012-08-12 21:28:41 +03:00
Alexander Block	e00da2067b	Btrfs: remove mnt_want_write call in btrfs_mksubvol We got a recursive lock in mksubvol because the caller already held a lock. I think we got into this due to a merge error. Commit `a874a63` removed the mnt_want_write call from btrfs_mksubvol and added a replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid. Commit `e7848683` however tried to move all calls to mnt_want_write above i_mutex. So somewhere while merging this, it got mixed up. The solution is to remove the mnt_want_write call completely from mksubvol. Reported-by: David Sterba <dave@jikos.cz> Signed-off-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-08-09 11:01:54 -04:00
Artem Bityutskiy	b257031408	btrfs: nuke pdflush from comments The pdflush thread is long gone, so this patch removes references to pdflush from btrfs comments. Cc: Chris Mason <chris.mason@fusionio.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-08-04 12:15:35 +04:00
Artem Bityutskiy	34eaadaf22	btrfs: nuke write_super from comments The '->write_super' superblock method is gone, and this patch removes all the references to 'write_super' from btrfs. Cc: Chris Mason <chris.mason@fusionio.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-08-04 12:15:35 +04:00
Linus Torvalds	a0e881b7c1	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second vfs pile from Al Viro: "The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the deadlock reproduced by xfstests 068), symlink and hardlink restriction patches, plus assorted cleanups and fixes. Note that another fsfreeze deadlock (emergency thaw one) is not dealt with - the series by Fernando conflicts a lot with Jan's, breaks userland ABI (FIFREEZE semantics gets changed) and trades the deadlock for massive vfsmount leak; this is going to be handled next cycle. There probably will be another pull request, but that stuff won't be in it." Fix up trivial conflicts due to unrelated changes next to each other in drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c} * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits) delousing target_core_file a bit Documentation: Correct s_umount state for freeze_fs/unfreeze_fs fs: Remove old freezing mechanism ext2: Implement freezing btrfs: Convert to new freezing mechanism nilfs2: Convert to new freezing mechanism ntfs: Convert to new freezing mechanism fuse: Convert to new freezing mechanism gfs2: Convert to new freezing mechanism ocfs2: Convert to new freezing mechanism xfs: Convert to new freezing code ext4: Convert to new freezing mechanism fs: Protect write paths by sb_start_write - sb_end_write fs: Skip atime update on frozen filesystem fs: Add freezing handling to mnt_want_write() / mnt_drop_write() fs: Improve filesystem freezing handling switch the protection of percpu_counter list to spinlock nfsd: Push mnt_want_write() outside of i_mutex btrfs: Push mnt_want_write() outside of i_mutex fat: Push mnt_want_write() outside of i_mutex ...	2012-08-01 10:26:23 -07:00
Jan Kara	b2b5ef5c8e	btrfs: Convert to new freezing mechanism We convert btrfs_file_aio_write() to use new freeze check. We also add proper freeze protection to btrfs_page_mkwrite(). We also add freeze protection to the transaction mechanism to avoid starting transactions on frozen filesystem. At minimum this is necessary to stop iput() of unlinked file to change frozen filesystem during truncation. Checks in cleaner_kthread() and transaction_kthread() can be safely removed since btrfs_freeze() will lock the mutexes and thus block the threads (and they shouldn't have anything to do anyway). CC: linux-btrfs@vger.kernel.org CC: Chris Mason <chris.mason@oracle.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-31 09:45:52 +04:00
Joe Perches	533574c6bc	btrfs: use printk_get_level and printk_skip_level, add __printf, fix fallout Use the generic printk_get_level() to search a message for a kern_level. Add __printf to verify format and arguments. Fix a few messages that had mismatches in format and arguments. Add #ifdef CONFIG_PRINTK blocks to shrink the object size a bit when not using printk. [akpm@linux-foundation.org: whitespace tweak] Signed-off-by: Joe Perches <joe@perches.com> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-30 17:25:14 -07:00
Jan Kara	e7848683ae	btrfs: Push mnt_want_write() outside of i_mutex When mnt_want_write() starts to handle freezing it will get a full lock semantics requiring proper lock ordering. So push mnt_want_write() call consistently outside of i_mutex. CC: Chris Mason <chris.mason@oracle.com> CC: linux-btrfs@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-31 01:02:51 +04:00
Stephen Rothwell	a1857ebe75	Btrfs: using vmalloc and friends needs vmalloc.h On powerpc, we don't get the implicit vmalloc.h include, and as a result the build fails noisily: fs/btrfs/send.c: In function 'fs_path_free': fs/btrfs/send.c:185:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration] fs/btrfs/send.c: In function 'fs_path_ensure_buf': fs/btrfs/send.c:215:4: error: implicit declaration of function 'vmalloc' [-Werror=implicit-function-declaration] fs/btrfs/send.c:215:12: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/send.c:225:12: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/send.c:233:13: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/send.c: In function 'iterate_dir_item': fs/btrfs/send.c:900:10: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/send.c:909:11: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/send.c: In function 'btrfs_ioctl_send': fs/btrfs/send.c:4463:17: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/send.c:4469:17: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/send.c:4475:2: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration] fs/btrfs/send.c:4475:20: warning: assignment makes pointer from integer without a cast [enabled by default] fs/btrfs/send.c:4483:21: warning: assignment makes pointer from integer without a cast [enabled by default] Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-07-26 18:08:30 -07:00
Linus Torvalds	e2aed8dfa5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull large btrfs update from Chris Mason: "This pull request is very large, and the two main features in here have been under testing/devel for quite a while. We have subvolume quotas from the strato developers. This enables full tracking of how many blocks are allocated to each subvolume (and all snapshots) and you can set limits on a per-subvolume basis. You can also create quota groups and toss multiple subvolumes into a big group. It's everything you need to be a web hosting company and give each user their own subvolume. The userland side of the quotas is being refreshed, they'll send out details on where to grab it soon. Next is the kernel side of btrfs send/receive from Alexander Block. This leverages the same infrastructure as the quota code to figure out relationships between blocks and their owners. It can then compute the difference between two snapshots and sends the diffs in a neutral format into userland. The basic model: create a snapshot send that snapshot as the initial backup make changes create a second snapshot send the incremental as a backup delete the first snapshot (use the second snapshot for the next incremental) The receive portion is all in userland, and in the 'next' branch of my btrfs-progs repo. There's still some work to do in terms of optimizing the send side from kernel to userland. The really important part is figuring out how two snapshots are different, and this is where we are concentrating right now. The initial send of a dataset is a little slower than tar, but the incremental sends are dramatically faster than what rsync can do. On top of all of that, we have a nice queue of fixes, cleanups and optimizations." Fix up trivial modify/del conflict in fs/btrfs/ioctl.c Also fix up semantic conflict in fs/btrfs/send.c: the interface to dentry_open() changed in commit `765927b2d5` ("switch dentry_open() to struct path, make it grab references itself"), and since it now grabs whatever references it needs, we should no longer do the mntget() on the mnt (and we need to dput() the dentry reference we took). * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits) Btrfs: uninit variable fixes in send/receive Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive Btrfs: add btrfs_compare_trees function Btrfs: introduce subvol uuids and times Btrfs: make iref_to_path non static Btrfs: add a barrier before a waitqueue_active check Btrfs: call the ordered free operation without any locks held Btrfs: Check INCOMPAT flags on remount and add helper function Btrfs: add helper for tree enumeration btrfs: allow cross-subvolume file clone Btrfs: improve multi-thread buffer read Btrfs: make btrfs's allocation smoothly with preallocation Btrfs: lock the transition from dirty to writeback for an eb Btrfs: fix potential race in extent buffer freeing Btrfs: don't return true in releasepage unless we actually freed the eb Btrfs: suppress printk() if all device I/O stats are zero Btrfs: remove unwanted printk() for btrfs device I/O stats Btrfs: rewrite BTRFS_SETGET_FUNCS Btrfs: zero unused bytes in inode item Btrfs: kill free_space pointer from inode structure ... Conflicts: fs/btrfs/ioctl.c	2012-07-26 14:48:55 -07:00
Chris Mason	b24baf6917	Btrfs: uninit variable fixes in send/receive Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-25 19:21:10 -04:00
Chris Mason	113c1cb530	Merge branch 'send-v2' of git://github.com/ablock84/linux-btrfs into for-linus This is the kernel portion of btrfs send/receive Conflicts: fs/btrfs/Makefile fs/btrfs/backref.h fs/btrfs/ctree.c fs/btrfs/ioctl.c fs/btrfs/ioctl.h Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-25 19:19:10 -04:00
Alexander Block	31db9f7c23	Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive This patch introduces the BTRFS_IOC_SEND ioctl that is required for send. It allows btrfs-progs to implement full and incremental sends. Patches for btrfs-progs will follow. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: David Sterba <dave@jikos.cz> Reviewed-by: Arne Jansen <sensille@gmx.net> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>	2012-07-25 23:30:19 +02:00
Alexander Block	7069830a9e	Btrfs: add btrfs_compare_trees function This function is used to find the differences between two trees. The tree compare skips whole subtrees if it detects shared tree blocks and thus is pretty fast. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: David Sterba <dave@jikos.cz> Reviewed-by: Arne Jansen <sensille@gmx.net> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>	2012-07-25 23:30:14 +02:00
Alexander Block	8ea05e3a42	Btrfs: introduce subvol uuids and times This patch introduces uuids for subvolumes. Each subvolume has it's own uuid. In case it was snapshotted, it also contains parent_uuid. In case it was received, it also contains received_uuid. It also introduces subvolume ctime/otime/stime/rtime. The first two are comparable to the times found in inodes. otime is the origin/creation time and ctime is the change time. stime/rtime are only valid on received subvolumes. stime is the time of the subvolume when it was sent. rtime is the time of the subvolume when it was received. Additionally to the times, we have a transid for each time. They are updated at the same place as the times. btrfs receive uses stransid and rtransid to find out if a received subvolume changed in the meantime. If an older kernel mounts a filesystem with the extented fields, all fields become invalid. The next mount with a new kernel will detect this and reset the fields. Signed-off-by: Alexander Block <ablock84@googlemail.com> Reviewed-by: David Sterba <dave@jikos.cz> Reviewed-by: Arne Jansen <sensille@gmx.net> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>	2012-07-25 23:28:38 +02:00
Alexander Block	91cb916ca2	Btrfs: make iref_to_path non static Make iref_to_path non static (needed in send) and rename it to btrfs_iref_to_path Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-07-25 23:25:06 +02:00
Chris Mason	cd1cfc4915	Btrfs: add a barrier before a waitqueue_active check We were missing wakeups on the delayed ref waitqueue due to races on waitqueue_active. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-25 16:15:08 -04:00
Chris Mason	e9fbcb4220	Btrfs: call the ordered free operation without any locks held Each ordered operation has a free callback, and this was called with the worker spinlock held. Josef made the free callback also call iput, which we can't do with the spinlock. This drops the spinlock for the free operation and grabs it again before moving through the rest of the list. We'll circle back around to this and find a cleaner way that doesn't bounce the lock around so much. Signed-off-by: Chris Mason <chris.mason@fusionio.com> cc: stable@kernel.org	2012-07-25 16:15:07 -04:00
Mitch Harder	2b0ce2c290	Btrfs: Check INCOMPAT flags on remount and add helper function In support of the recently added capability to remount with lzo compression, provide a helper function to check the compression INCOMPAT flags when remounting with lzo compression, and set the flags if necessary. Also, implement the new helper function when defragmenting with explicit lzo compression and when setting the default subvolume. Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-25 16:14:31 -04:00
Chris Mason	b478b2baa3	Merge branch 'qgroup' of git://git.jan-o-sch.net/btrfs-unstable into for-linus Conflicts: fs/btrfs/ioctl.c fs/btrfs/ioctl.h fs/btrfs/transaction.c fs/btrfs/transaction.h Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-25 16:11:38 -04:00
Arne Jansen	e679376911	Btrfs: add helper for tree enumeration Often no exact match is wanted but just the next lower or higher item. There's a lot of duplicated code throughout btrfs to deal with the corner cases. This patch adds a helper function that can facilitate searching. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-25 17:33:18 +02:00
David Sterba	362a20c5e2	btrfs: allow cross-subvolume file clone Lift the EXDEV condition and allow different root trees for files being cloned, then pass source inode's root when searching for extents. Cloning is not allowed to cross vfsmounts, ie. when two subvolumes from one filesystem are mounted separately. Signed-off-by: David Sterba <dsterba@suse.cz>	2012-07-25 17:33:09 +02:00
Linus Torvalds	d14b7a419a	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "Trivial updates all over the place as usual." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits) Fix typo in include/linux/clk.h . pci: hotplug: Fix typo in pci iommu: Fix typo in iommu video: Fix typo in drivers/video Documentation: Add newline at end-of-file to files lacking one arm,unicore32: Remove obsolete "select MISC_DEVICES" module.c: spelling s/postition/position/g cpufreq: Fix typo in cpufreq driver trivial: typo in comment in mksysmap mach-omap2: Fix typo in debug message and comment scsi: aha152x: Fix sparse warning and make printing pointer address more portable. Change email address for Steve Glendinning Btrfs: fix typo in convert_extent_bit via: Remove bogus if check netprio_cgroup.c: fix comment typo backlight: fix memory leak on obscure error path Documentation: asus-laptop.txt references an obsolete Kconfig item Documentation: ManagementStyle: fixed typo mm/vmscan: cleanup comment error in balance_pgdat mm: cleanup on the comments of zone_reclaim_stat ...	2012-07-24 13:34:56 -07:00
Liu Bo	67c9684f48	Btrfs: improve multi-thread buffer read While testing with my buffer read fio jobs[1], I find that btrfs does not perform well enough. Here is a scenario in fio jobs: We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file, and all of them will race on add_to_page_cache_lru(), and if one thread successfully puts its page into the page cache, it takes the responsibility to read the page's data. And what's more, reading a page needs a period of time to finish, in which other threads can slide in and process rest pages: t1 t2 t3 t4 add Page1 read Page1 add Page2 \| read Page2 add Page3 \| \| read Page3 add Page4 \| \| \| read Page4 -----\|------------\|-----------\|-----------\|-------- v v v v bio bio bio bio Now we have four bios, each of which holds only one page since we need to maintain consecutive pages in bio. Thus, we can end up with far more bios than we need. Here we're going to a) delay the real read-page section and b) try to put more pages into page cache. With that said, we can make each bio hold more pages and reduce the number of bios we need. Here is some numbers taken from fio results: w/o patch w patch ------------- -------- --------------- READ: 745MB/s +25% 934MB/s [1]: [global] group_reporting thread numjobs=4 bs=32k rw=read ioengine=sync directory=/mnt/btrfs/ [READ] filename=foobar size=2000M invalidate=1 Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:10 -04:00
Liu Bo	df57dbe6bf	Btrfs: make btrfs's allocation smoothly with preallocation For backref walking, we've introduce delayed ref's sequence. However, it changes our preallocation behavior. The story is that when we preallocate an extent and then mark it written piece by piece, the ideal case should be that we don't need to COW the extent, which is why we use 'preallocate'. But we may not make use of preallocation, since when we check for cross refs on the extent, we may have two ref entries which have the same content except the sequence value, and we recognize them as cross refs and do COW to allocate another extent. So we end up with several pieces of space instead of an whole extent. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:10 -04:00
Josef Bacik	51561ffec9	Btrfs: lock the transition from dirty to writeback for an eb There is a small window where an eb can have no IO bits set on it, which could potentially result in extent_buffer_under_io() returning false when we want it to return true, which could result in not fun things happening. So in order to protect this case we need to hold the refs_lock when we make this transition to make sure we get reliable results out of extent_buffer_udner_io(). Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:09 -04:00
Josef Bacik	594831c4b2	Btrfs: fix potential race in extent buffer freeing This sounds sort of impossible but it is the only thing I can think of and at the very least it is theoretically possible so here it goes. If we are in try_release_extent_buffer we will check that the ref count on the extent buffer is 1 and not under IO, and then go down and clear the tree ref. If between this check and clearing the tree ref somebody else comes in and grabs a ref on the eb and the marks it dirty before try_release_extent_buffer() does it's tree ref clear we can end up with a dirty eb that will be freed while it is still dirty which will result in a panic. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:09 -04:00
Josef Bacik	e64860aa05	Btrfs: don't return true in releasepage unless we actually freed the eb I noticed while looking at an extent_buffer race that we will unconditionally return 1 if we get down to release_extent_buffer after clearing the tree ref. However we can easily race in here and get a ref on the eb and not actually free the eb. So make release_extent_buffer return 1 if it free'd the eb and 0 if not so we can be a little kinder to the vm. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:08 -04:00
Stefan Behrens	a98cdb85b9	Btrfs: suppress printk() if all device I/O stats are zero Code is added to suppress the I/O stats printing at mount time if all statistic values are zero. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-07-23 16:28:07 -04:00
Stefan Behrens	5021976d8d	Btrfs: remove unwanted printk() for btrfs device I/O stats People complained about the annoying kernel log message "btrfs: no dev_stats entry found ... (OK on first mount after mkfs)" everytime a filesystem is mounted for the first time after running mkfs. Since the distribution of the btrfs-progs is not synchronized to the kernel version, mkfs like it is now will be used also in the future. Then this message is not useful to find errors, it is just annoying. This commit removes the printk(). Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-07-23 16:28:07 -04:00
Li Zefan	18077bb413	Btrfs: rewrite BTRFS_SETGET_FUNCS BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and btrfs_foo() functions, which read and write specific fields in the extent buffer. The total number of set/get functions is ~200, but in fact we only need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64. It results in redunction of ~37K bytes. text data bss dec hex filename 629661 12489 216 642366 9cd3e fs/btrfs/btrfs.o.orig 592637 12489 216 605342 93c9e fs/btrfs/btrfs.o Signed-off-by: Li Zefan <lizefan@huawei.com>	2012-07-23 16:28:06 -04:00
Li Zefan	293f7e0740	Btrfs: zero unused bytes in inode item The otime field is not zeroed, so users will see random otime in an old filesystem with a new kernel which has otime support in the future. The reserved bytes are also not zeroed, and we'll have compatibility issue if we make use of those bytes. Signed-off-by: Li Zefan <lizefan@huawei.com>	2012-07-23 16:28:05 -04:00
Li Zefan	b4d7c3c945	Btrfs: kill free_space pointer from inode structure Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type, which means every inode has the same BTRFS_I(inode)->free_space pointer. This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits). Signed-off-by: Li Zefan <lizefan@huawei.com>	2012-07-23 16:28:05 -04:00
Anand Jain	d5b025d510	btrfs read error corrected message floods the console during recovery Changing printk_in_rcu to printk_ratelimited_in_rcu will suffice Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:04 -04:00
Jan Schmidt	e6466e354a	Btrfs: fix buffer leak in btrfs_next_old_leaf When calling btrfs_next_old_leaf, we were leaking an extent buffer in the rare case of using the deadlock avoidance code needed for the tree mod log. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:03 -04:00
Liu Bo	f6175efab1	Btrfs: do not count in readonly bytes If a block group is ro, do not count its entries in when we dump space info. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:03 -04:00
Liu Bo	799ffc3c31	Btrfs: add ro notification to dump_space_info Block group has ro attributes, make dump_space_info show it. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:02 -04:00
Liu Bo	cf7c1ef6e1	Btrfs: fix a bug of writting free space cache during balance Here is the whole story: 1) A free space cache consists of two parts: o free space cache inode, which is special becase it's stored in root tree. o free space info, which is stored as the above inode's file data. But we only build up another new inode and does not flush its free space info onto disk when we _clear and setup_ free space cache, and this ends up with that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN. And holding DC_SETUP means that we will not truncate this free space cache inode, which means the disk offset of its file extent will remain _unchanged_ at least until next transaction finishes committing itself. 2) We can set a block group readonly when we relocate the block group. However, if the readonly block group covers the disk offset where our free space cache inode is going to write, it will force the free space cache inode into cow_file_range() and it'll end up hitting a BUG_ON. 3) Due to the above analysis, we fix this bug by adding the missing dirty flag. 4) However, it's not over, there is still another case, nospace_cache. With nospace_cache, we do not want to set dirty flag, instead we just truncate free space cache inode and bail out with setting cache state DC_WRITTEN. We can benifit from it since it saves us another 'pre-allocation' part which usually costs a lot. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:02 -04:00
Liu Bo	0678938423	Btrfs: do not abort transaction in prealloc case During disk balance, we prealloc new file extent for file data relocation, but we may fail in 'no available space' case, and it leads to flipping btrfs into readonly. It is not necessary to bail out and abort transaction since we do have several ways to rescue ourselves from ENOSPC case. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:01 -04:00
Liu Bo	83eea1f1ba	Btrfs: kill root from btrfs_is_free_space_inode Since root can be fetched via BTRFS_I macro directly, we can save an args for btrfs_is_free_space_inode(). Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:00 -04:00
Liu Bo	51a8cf9d2d	Btrfs: fix btrfs_is_free_space_inode to recognize btree inode For btree inode, its root is also 'tree root', so btree inode can be misunderstood as a free space inode. We should add one more check for btree inode. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:28:00 -04:00
Stefan Behrens	c0901581ad	Btrfs: avoid I/O repair BUG() from btree_read_extent_buffer_pages() From btree_read_extent_buffer_pages(), currently repair_io_failure() can be called with mirror_num being zero when submit_one_bio() returned an error before. This used to cause a BUG_ON(!mirror_num) in repair_io_failure() and indeed this is not a case that needs the I/O repair code to rewrite disk blocks. This commit prevents calling repair_io_failure() in this case and thus avoids the BUG_ON() and malfunction. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:27:59 -04:00
Josef Bacik	f4c738c2e7	Btrfs: rework shrink_delalloc So shrink_delalloc has grown all sorts of cruft over the years thanks to many reworkings of how we track enospc. What happens now as we fill up the disk is we will loop for freaking ever hoping to reclaim a arbitrary amount of space of metadata, this was from when everybody flushed at the same time. Now we only have people flushing one at a time. So instead of trying to reclaim a huge amount of space, just try to flush a decent chunk of space, and stop looping as soon as we have enough free space to satisfy our reservation. This makes xfstests 224 go much faster. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:27:58 -04:00
Liu Bo	b9ca0664dc	Btrfs: do not set subvolume flags in readonly mode $ mkfs.btrfs /dev/sdb7 $ btrfstune -S1 /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs mount: block device /dev/sdb7 is write-protected, mounting read-only $ btrfs dev add /dev/sdb8 /mnt/btrfs/ Now we get a btrfs in which mnt flags has readonly but sb flags does not. So for those ioctls that only check sb flags with MS_RDONLY, it is going to be a problem. Setting subvolume flags is such an ioctl, we should use mnt_want_write_file() to check RO flags. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>	2012-07-23 16:27:58 -04:00
Liu Bo	e54bfa3104	Btrfs: use mnt_want_write_file instead of mnt_want_write mnt_want_write_file is faster when file has been opened for write. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>	2012-07-23 16:27:57 -04:00
Liu Bo	768e9dfe82	Btrfs: remove redundant r/o check for superblock mnt_want_write() and mnt_want_write_file() will check sb->s_flags with MS_RDONLY, and we don't need to do it ourselves. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>	2012-07-23 16:27:56 -04:00
Liu Bo	a874a63e13	Btrfs: check write access to mount earlier while creating snapshots Move check of write access to mount into upper functions so that we can use mnt_want_write_file instead, which is faster than mnt_want_write. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>	2012-07-23 16:27:56 -04:00
Liu Bo	287082b0bd	Btrfs: fix typo in cow_file_range_async and async_cow_submit It should be 10 * 1024 * 1024. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-07-23 16:27:55 -04:00
Josef Bacik	0e72110692	Btrfs: change how we indicate we're adding csums There is weird logic I had to put in place to make sure that when we were adding csums that we'd used the delalloc block rsv instead of the global block rsv. Part of this meant that we had to free up our transaction reservation before we ran the delayed refs since csum deletion happens during the delayed ref work. The problem with this is that when we release a reservation we will add it to the global reserve if it is not full in order to keep us going along longer before we have to force a transaction commit. By releasing our reservation before we run delayed refs we don't get the opportunity to drain down the global reserve for the work we did, so we won't refill it as often. This isn't a problem per-se, it just results in us possibly committing transactions more and more often, and in rare cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran out of space in our block rsv. This also helps us by holding onto space while the delayed refs run so we don't end up with as many people trying to do things at the same time, which again will help us not force commits or hit the use_block_rsv warnings. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:27:55 -04:00
Tsutomu Itoh	b995929515	Btrfs: return error of btrfs_update_inode() to caller We didn't check error of btrfs_update_inode(), but that error looks easy to bubble back up. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:27:54 -04:00
Dan Carpenter	23291a044c	Btrfs: fix error handling in __add_reloc_root() We dereferenced "node" in the error message after freeing it. Also btrfs_panic() can return so we should return an error code instead of continuing. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-07-23 16:27:53 -04:00
Ilya Dryomov	44c44af2f4	Btrfs: do not ignore errors from btrfs_cleanup_fs_roots() when mounting There used to be a BUG_ON(ret) there before EH patch (`79787eaa`) went in. Bail out with EINVAL. Cc: David Sterba <dsterba@suse.cz> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-07-23 16:27:53 -04:00
Ilya Dryomov	fed425c742	Btrfs: do not return EINVAL instead of ENOMEM from open_ctree() When bailing from open_ctree() err is returned, not ret. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-07-23 16:27:52 -04:00
Josef Bacik	02db0844be	Btrfs: add DEVICE_READY ioctl This will be used in conjunction with btrfs device ready <dev>. This is needed for initrd's to have a nice and lightweight way to tell if all of the devices needed for a file system are in the cache currently. This keeps them from having to do mount+sleep loops waiting for devices to show up. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 16:27:42 -04:00
Josef Bacik	96c3f4331a	Btrfs: flush delayed inodes if we're short on space Those crazy gentoo guys have been complaining about ENOSPC errors on their portage volumes. This is because doing things like untar tends to create lots of new files which will soak up all the reservation space in the delayed inodes. Usually this gets papered over by the fact that we will try and commit the transaction, however if this happens in the wrong spot or we choose not to commit the transaction you will be screwed. So add the ability to expclitly flush delayed inodes to free up space. Please test this out guys to make sure it works since as usual I cannot reproduce. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-23 15:41:40 -04:00
David Sterba	b27f7c0c15	btrfs: join DEV_STATS ioctls to one Commit `c11d2c236c` (Btrfs: add ioctl to get and reset the device stats) introduced two ioctls doing almost the same thing distinguished by just the ioctl number which encodes "do reset after read". I have suggested http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html to implement it via the ioctl args. This hasn't happen, and I think we should use a more clean way to pass flags and should not waste ioctl numbers. CC: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: David Sterba <dsterba@suse.cz>	2012-07-23 15:41:40 -04:00
Andrew Mahone	a43a211133	btrfs: ignore unfragmented file checks in defrag when compression enabled - rebased Rebased on btrfs-next and retested. Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip checks for adjacent extents and extent size when deciding whether to defrag, as these can prevent an uncompressed and unfragmented file from being compressed as requested. Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>	2012-07-23 15:41:39 -04:00
Dan Carpenter	e4b50e14c8	Btrfs: small naming cleanup in join_transaction() "root->fs_info" and "fs_info" are the same, but "fs_info" is prefered because it is shorter and that's what is used in the rest of the function. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-07-23 15:41:39 -04:00
Alexander Block	2bc5565286	Btrfs: don't update atime on RO subvolumes Before the update_time inode operation was indroduced, it was not possible to prevent updates of atime on RO subvolumes. VFS was only able to check for RO on the mount, but did not know anything about btrfs subvolumes. btrfs_update_time does now check if the root is RO and skip updating of times. Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-07-23 15:41:38 -04:00
Arnd Hannemann	063849eafd	Btrfs: allow mount -o remount,compress=no Btrfs allows to turn on compression on a mounted and used filesystem by issuing mount -o remount,compress=lzo. This patch allows to turn compression off again while the filesystem is mounted. As suggested by David Sterba if the compress-force option was set, it is implicitly cleared if compression is turned off. Tested-by: David Sterba <dsterba@suse.cz> Signed-off-by: Arnd Hannemann <arnd@arndnet.de>	2012-07-23 15:41:38 -04:00
Josef Bacik	c5c3c5f31e	Btrfs: remove ->dirty_inode We do all of our inode updating when we change it, and now that we do ->update_time we don't need ->dirty_inode for atime updates anymore, so just remove it. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-07-23 15:41:38 -04:00
Chris Mason	cbea5ac1ee	Btrfs: reduce calls to wake_up on uncontended locks The btrfs locks were unconditionally calling wake_up as the locks were released. This lead to extra thrashing on the waitqueue, especially for locks that were dominated by readers. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-23 15:36:18 -04:00
Chris Mason	e39e64ac0c	Btrfs: don't wait around for new log writers on an SSD Waiting on spindles improves performance, but ssds want all the IO as quickly as we can push it down. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-07-23 15:36:17 -04:00
Al Viro	11e62a8fab	btrfs: switch btrfs_ioctl_balance() to mnt_want_write_file() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-23 00:01:43 +04:00
David Howells	9249e17fe0	VFS: Pass mount flags to sget() Pass mount flags to sget() so that it can use them in initialising a new superblock before the set function is called. They could also be passed to the compare function. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:38:34 +04:00
Al Viro	ebfc3b49a7	don't pass nameidata to ->create() boolean "does it have to be exclusive?" flag is passed instead; Local filesystem should just ignore it - the object is guaranteed not to be there yet. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:47 +04:00
Al Viro	00cd8dd3bf	stop passing nameidata to ->lookup() Just the flags; only NFS cares even about that, but there are legitimate uses for such argument. And getting rid of that completely would require splitting ->lookup() into a couple of methods (at least), so let's leave that alone for now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:34:32 +04:00
Al Viro	b3d9b7a3c7	vfs: switch i_dentry/d_alias to hlist Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-07-14 16:32:55 +04:00
Liu Bo	10983f2e8d	Btrfs: fix typo in convert_extent_bit It should be convert_extent_bit. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2012-07-12 11:27:34 +02:00
Arne Jansen	6f72c7e20d	Btrfs: add qgroup inheritance When creating a subvolume or snapshot, it is necessary to initialize the qgroup account with a copy of some other (tracking) qgroup. This patch adds parameters to the ioctls to pass the information from which qgroup to inherit. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-12 10:54:40 +02:00
Arne Jansen	5d13a37bd5	Btrfs: add qgroup ioctls Ioctls to control the qgroup feature like adding and removing qgroups and assigning qgroups. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-12 10:54:39 +02:00
Arne Jansen	c556723794	Btrfs: hooks to reserve qgroup space Like block reserves, reserve a small piece of space on each transaction start and for delalloc. These are the hooks that can actually return EDQUOT to the user. The amount of space reserved is tracked in the transaction handle. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-12 10:54:39 +02:00
Jan Schmidt	546adb0d81	Btrfs: hooks for qgroup to record delayed refs Hooks into qgroup code to record refs and into transaction commit. This is the main entry point for qgroup. Basically every change in extent backrefs got accounted to the appropriate qgroups. Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-07-12 10:54:38 +02:00
Arne Jansen	bcef60f249	Btrfs: quota tree support and startup Init the quota tree along with the others on open_ctree and close_ctree. Add the quota tree to the list of well known trees in btrfs_read_fs_root_no_name. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-12 10:54:38 +02:00
Jan Schmidt	edf39272db	Btrfs: call the qgroup accounting functions Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-07-12 10:54:37 +02:00
Arne Jansen	bed92eae26	Btrfs: qgroup implementation and prototypes Signed-off-by: Arne Jansen <sensille@gmx.net> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-07-12 10:54:21 +02:00
Arne Jansen	709c0486b9	Btrfs: Test code to change the order of delayed-ref processing Normally delayed refs get processed in ascending bytenr order. This correlates in most cases to the order added. To expose dependencies on this order, we start to process the tree in the middle instead of the beginning. This code is only effective when SCRAMBLE_DELAYED_REFS is defined. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-10 15:14:44 +02:00
Arne Jansen	416ac51da9	Btrfs: qgroup state and initialization Add state to fs_info. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-10 15:14:44 +02:00
Arne Jansen	20897f5c86	Btrfs: added helper to create new trees This creates a brand new tree. Will be used to create the quota tree. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-10 15:14:43 +02:00
Arne Jansen	d13603ef6e	Btrfs: check the root passed to btrfs_end_transaction This patch only add a consistancy check to validate that the same root is passed to start_transaction and end_transaction. Subvolume quota depends on this. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-10 15:14:43 +02:00
Arne Jansen	2f38b3e190	Btrfs: add helper for tree enumeration Often no exact match is wanted but just the next lower or higher item. There's a lot of duplicated code throughout btrfs to deal with the corner cases. This patch adds a helper function that can facilitate searching. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-10 15:14:42 +02:00
Arne Jansen	630dc772ea	Btrfs: qgroup on-disk format Not all features are in use by the current version and thus may change in the future. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-07-10 15:14:42 +02:00
Jan Schmidt	097b8a7c9e	Btrfs: join tree mod log code with the code holding back delayed refs We've got two mechanisms both required for reliable backref resolving (tree mod log and holding back delayed refs). You cannot make use of one without the other. So instead of requiring the user of this mechanism to setup both correctly, we join them into a single interface. Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list as we did before, which was of no value. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-07-10 15:14:41 +02:00
Jan Schmidt	cf5388307a	Btrfs: fix buffer leak in btrfs_next_old_leaf When calling btrfs_next_old_leaf, we were leaking an extent buffer in the rare case of using the deadlock avoidance code needed for the tree mod log. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-07-10 15:14:41 +02:00
Linus Torvalds	5eecb9cc90	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "I held off on my rc5 pull because I hit an oops during log recovery after a crash. I wanted to make sure it wasn't a regression because we have some logging fixes in here. It turns out that a commit during the merge window just made it much more likely to trigger directory logging instead of full commits, which exposed an old bug. The new backref walking code got some additional fixes. This should be the final set of them. Josef fixed up a corner where our O_DIRECT writes and buffered reads could expose old file contents (not stale, just not the most recent). He and Liu Bo fixed crashes during tree log recover as well. Ilya fixed errors while we resume disk balancing operations on readonly mounts." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: run delayed directory updates during log replay Btrfs: hold a ref on the inode during writepages Btrfs: fix tree log remove space corner case Btrfs: fix wrong check during log recovery Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS Btrfs: resume balance on rw (re)mounts properly Btrfs: restore restriper state on all mounts Btrfs: fix dio write vs buffered read race Btrfs: don't count I/O statistic read errors for missing devices Btrfs: resolve tree mod log locking issue in btrfs_next_leaf Btrfs: fix tree mod log rewind of ADD operations Btrfs: leave critical region in btrfs_find_all_roots as soon as possible Btrfs: always put insert_ptr modifications into the tree mod log Btrfs: fix tree mod log for root replacements at leaf level Btrfs: support root level changes in __resolve_indirect_ref Btrfs: avoid waiting for delayed refs when we must not	2012-07-05 13:06:25 -07:00
Chris Mason	b6305567e7	Btrfs: run delayed directory updates during log replay While we are resolving directory modifications in the tree log, we are triggering delayed metadata updates to the filesystem btrees. This commit forces the delayed updates to run so the replay code can find any modifications done. It stops us from crashing because the directory deleltion replay expects items to be removed immediately from the tree. Signed-off-by: Chris Mason <chris.mason@fusionio.com> cc: stable@kernel.org	2012-07-02 15:39:19 -04:00
Josef Bacik	7fd1a3f73f	Btrfs: hold a ref on the inode during writepages We can race with unlink and not actually be able to do our igrab in btrfs_add_ordered_extent. This will result in all sorts of problems. Instead of doing the complicated work to try and handle returning an error properly from btrfs_add_ordered_extent, just hold a ref to the inode during writepages. If we cannot grab a ref we know we're freeing this inode anyway and can just drop the dirty pages on the floor, because screw them we're going to invalidate them anyway. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-02 15:39:18 -04:00
Josef Bacik	bdb7d303b3	Btrfs: fix tree log remove space corner case The tree log stuff can have allocated space that we end up having split across a bitmap and a real extent. The free space code does not deal with this, it assumes that if it finds an extent or bitmap entry that the entire range must fall within the entry it finds. This isn't necessarily the case, so rework the remove function so it can handle this case properly. This fixed two panics the user hit, first in the case where the space was initially in a bitmap and then in an extent entry, and then the reverse case. Thanks, Reported-and-tested-by: Shaun Reich <sreich@kde.org> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-02 15:39:18 -04:00
Liu Bo	6bf02314d9	Btrfs: fix wrong check during log recovery When we're evicting an inode during log recovery, we need to ensure that the inode is not in orphan state any more, which means inode's run_time flags has _no_ BTRFS_INODE_HAS_ORPHAN_ITEM. Thus, the BUG_ON was triggered because of a wrong check for the flags. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com>	2012-07-02 15:39:17 -04:00
Alexander Block	d3a94048c9	Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS We used the wrong ioctl macro for the getflags ioctl before. As we don't have the set/getflags ioctls in the user space ioctl.h at the moment, it's safe to fix it now. Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Alexander Block <ablock84@googlemail.com>	2012-07-02 15:39:17 -04:00
Ilya Dryomov	2b6ba629b5	Btrfs: resume balance on rw (re)mounts properly This introduces btrfs_resume_balance_async(), which, given that restriper state was recovered earlier by btrfs_recover_balance(), resumes balance in btrfs-balance kthread. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-07-02 15:39:17 -04:00
Ilya Dryomov	68310a5e42	Btrfs: restore restriper state on all mounts Fix a bug that triggered asserts in btrfs_balance() in both normal and resume modes -- restriper state was not properly restored on read-only mounts. This factors out resuming code from btrfs_restore_balance(), which is now also called earlier in the mount sequence to avoid the problem of some early writes getting the old profile. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-07-02 15:39:16 -04:00
Josef Bacik	c3473e8300	Btrfs: fix dio write vs buffered read race Miao pointed out there's a problem with mixing dio writes and buffered reads. If the read happens between us invalidating the page range and actually locking the extent we can bring in pages into page cache. Then once the write finishes if somebody tries to read again it will just find uptodate pages and we'll read stale data. So we need to lock the extent and check for uptodate bits in the range. If there are uptodate bits we need to unlock and invalidate again. This will keep this race from happening since we will hold the extent locked until we create the ordered extent, and then teh read side always waits for ordered extents. There was also a race in how we updated i_size, previously we were relying on the generic DIO stuff to adjust the i_size after the DIO had completed, but this happens outside of the extent lock which means reads could come in and not see the updated i_size. So instead move this work into where we create the extents, and then this way the update ordered i_size stuff works properly in the endio handlers. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-07-02 15:36:23 -04:00
Stefan Behrens	597a60fade	Btrfs: don't count I/O statistic read errors for missing devices It is normal behaviour of the low level btrfs function btrfs_map_bio() to complete a bio with -EIO if the device is missing, instead of just preventing the bio creation in an earlier step. This used to cause I/O statistic read error increments and annoying printk_ratelimited messages. This commit fixes the issue. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Reported-by: Carey Underwood <cwillu@cwillu.com>	2012-07-02 15:36:23 -04:00
Jan Schmidt	d42244a0d3	Btrfs: resolve tree mod log locking issue in btrfs_next_leaf With the tree mod log, we may end up with two roots (the current root and a rewinded version of it) both pointing to two leaves, l1 and l2, of which l2 had already been cow-ed in the current transaction. If we don't rewind any tree blocks, we cannot have two roots both pointing to an already cowed tree block. Now there is btrfs_next_leaf, which has a leaf locked and wants a lock on the next (right) leaf. And there is push_leaf_left, which has a (cowed!) leaf locked and wants a lock on the previous (left) leaf. In order to solve this dead lock situation, we use try_lock in btrfs_next_leaf (only in case it's called with a tree mod log time_seq paramter) and if we fail to get a lock on the next leaf, we give up our lock on the current leaf and retry from the very beginning. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-27 16:34:40 +02:00
Jan Schmidt	19956c7e94	Btrfs: fix tree mod log rewind of ADD operations When a MOD_LOG_KEY_ADD operation is rewinded, we remove the key from the tree block. If its not the last key, removal involves a move operation. This move operation was explicitly done before this commit. However, at insertion time, there's a move operation before the actual addition to make room for the new key, which is recorded in the tree mod log as well. This means, we must drop the move operation when rewinding the add operation, because the next operation we'll be rewinding will be the corresponding MOD_LOG_MOVE_KEYS operation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-27 16:34:40 +02:00
Jan Schmidt	155725c9c0	Btrfs: leave critical region in btrfs_find_all_roots as soon as possible When delayed refs exist, btrfs_find_all_roots used to hold the delayed ref mutex way longer than actually required. We ought to drop it immediately after we're done collecting all the delayed refs. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-27 16:34:39 +02:00
Jan Schmidt	c3e0696523	Btrfs: always put insert_ptr modifications into the tree mod log Several callers of insert_ptr set the tree_mod_log parameter to 0 to avoid addition to the tree mod log. In fact, we need all of those operations. This commit simply removes the additional parameter and makes addition to the tree mod log unconditional. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-27 16:34:39 +02:00
Jan Schmidt	28da9fb446	Btrfs: fix tree mod log for root replacements at leaf level For the tree mod log, we don't log any operations at leaf level. If the root is at the leaf level (i.e. the tree consists only of the root), then __tree_mod_log_oldest_root will find a ROOT_REPLACE operation in the log (because we always log that one no matter which level), but no other operations. With this patch __tree_mod_log_oldest_root exits cleanly instead of BUGging in this situation. get_old_root checks if its really a root at leaf level in case we don't have any operations and WARNs if this assumption breaks. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-27 16:34:38 +02:00
Jan Schmidt	9345457f4a	Btrfs: support root level changes in __resolve_indirect_ref With the tree mod log, we can have a tree that's two levels high, but btrfs_search_old_slot may still return a path with the tree root at level one instead. __resolve_indirect_ref must care for this and accept parents in a lower level than expected. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-27 16:34:38 +02:00
Jan Schmidt	8ca78f3eda	Btrfs: avoid waiting for delayed refs when we must not We track two conditions to decide if we should sleep while waiting for more delayed refs, the number of delayed refs (num_refs) and the first entry in the list of blockers (first_seq). When we suspect staleness, we save num_refs and do one more cycle. If nothing changes, we then save first_seq for later comparison and do wait_event. We ought to save first_seq the very same moment we're saving num_refs. Otherwise we cannot be sure that nothing has changed and we might start waiting when we shouldn't, which could lead to starvation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-27 16:34:35 +02:00
Linus Torvalds	8874e812fe	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small pull with btrfs fixes. The biggest of the bunch is another fix for the new backref walking code. We're still hammering out one btrfs dio vs buffered reads problem, but that one will have to wait for the next rc." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: delay iput with async extents Btrfs: add a missing spin_lock Btrfs: don't assume to be on the correct extent in add_all_parents Btrfs: introduce btrfs_next_old_item	2012-06-21 13:41:07 -07:00
Josef Bacik	cb77fcd885	Btrfs: delay iput with async extents There is some concern that these iput()'s could be the final iputs and could induce lockups on people waiting on writeback. This would happen in the rare case that we don't create ordered extents because of an error, but it is theoretically possible and we already have a mechanism to deal with this so just make them delayed iputs to negate any worry. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-21 07:19:36 -04:00
Josef Bacik	e18fca7342	Btrfs: add a missing spin_lock When fixing up the locking in the delayed ref destruction work I accidently broke the locking myself ;(. Add back a spin_lock that should be there and we are now all set. Thanks, Btrfs: add a missing spin_lock When fixing up the locking in the delayed ref destruction work I accidently broke the locking myself ;(. Add back a spin_lock that should be there and we are now all set. Thanks, Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-21 07:19:35 -04:00
Alexander Block	69bca40d41	Btrfs: don't assume to be on the correct extent in add_all_parents add_all_parents did assume that path is already at a correct extent data item, which may not be true in case of data extents that were partly rewritten and splitted. We need to check if we're on a matching extent for every item and only for the ones after the first. The loop is changed to do this now. This patch also fixes a bug introduced with commit 3b127fd8 "Btrfs: remove obsolete btrfs_next_leaf call from __resolve_indirect_ref". The removal of next_leaf did sometimes result in slot==nritems when the above described case happens, and thus resulting in invalid values (e.g. wanted_obejctid) in add_all_parents (leading to missed backrefs or even crashes). Signed-off-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-21 07:19:34 -04:00
Alexander Block	1c8f52a5e9	Btrfs: introduce btrfs_next_old_item We introduce btrfs_next_old_item that uses btrfs_next_old_leaf instead of btrfs_next_leaf. btrfs_next_item is also changed to simply call btrfs_next_old_item with time_seq being 0. Signed-off-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-21 07:19:34 -04:00
Linus Torvalds	d865983292	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs compile warning fixes from Chris Mason. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: cast devid to unsigned long long for printk %llu Btrfs: init old_generation in get_old_root	2012-06-16 17:01:41 -07:00
Chris Mason	a8c4a33b98	Btrfs: cast devid to unsigned long long for printk %llu Avoid warning in 32 bit machines Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-15 20:07:17 -04:00
Chris Mason	4325edd078	Btrfs: init old_generation in get_old_root gcc was giving an uninit variable warning here. Strictly speaking we don't need to init it, but this will make things much less error prone. Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-15 20:06:54 -04:00
Linus Torvalds	718f58ad61	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs update from Chris Mason: "The dates look like I had to rebase this morning because there was a compiler warning for a printk arg that I had missed earlier. These are all fixes, including one to prevent using stale pointers for device names, and lots of fixes around transaction abort cleanups (Josef, Liu Bo). Jan Schmidt also sent in a number of fixes for the new reference number tracking code. Liu Bo beat me to updating the MAINTAINERS file. Since he thought to also fix the git url, I kept his commit." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits) Btrfs: update MAINTAINERS info for BTRFS FILE SYSTEM Btrfs: destroy the items of the delayed inodes in error handling routine Btrfs: make sure that we've made everything in pinned tree clean Btrfs: avoid memory leak of extent state in error handling routine Btrfs: do not resize a seeding device Btrfs: fix missing inherited flag in rename Btrfs: fix incompat flags setting Btrfs: fix defrag regression Btrfs: call filemap_fdatawrite twice for compression Btrfs: keep inode pinned when compressing writes Btrfs: implement ->show_devname Btrfs: use rcu to protect device->name Btrfs: unlock everything properly in the error case for nocow Btrfs: fix btrfs_destroy_marked_extents Btrfs: abort the transaction if the commit fails Btrfs: wake up transaction waiters when aborting a transaction Btrfs: fix locking in btrfs_destroy_delayed_refs Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error Btrfs: fix race in tree mod log addition Btrfs: add btrfs_next_old_leaf ...	2012-06-15 16:04:37 -07:00
Miao Xie	67cde3448d	Btrfs: destroy the items of the delayed inodes in error handling routine the items of the delayed inodes were forgotten to be freed, this patch fixes it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-15 11:42:28 -04:00
Liu Bo	ed0eaa1498	Btrfs: make sure that we've made everything in pinned tree clean Since we have two trees for recording pinned extents, we need to go through both of them to make sure that we've done everything clean. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-15 11:42:27 -04:00
Liu Bo	6e841e32b1	Btrfs: avoid memory leak of extent state in error handling routine We've forgotten to clear extent states in pinned tree, which will results in space counter mismatch and memory leak: WARNING: at fs/btrfs/extent-tree.c:7537 btrfs_free_block_groups+0x1f3/0x2e0 [btrfs]() ... space_info 2 has 8380416 free, is not full space_info total=12582912, used=4096, pinned=4096, reserved=0, may_use=0, readonly=4194304 btrfs state leak: start 29364224 end 29376511 state 1 in tree ffff880075f20090 refs 1 ... Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-15 11:42:27 -04:00
Liu Bo	4e42ae1bdc	Btrfs: do not resize a seeding device Seeding devices are not supposed to change any more. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-15 11:42:26 -04:00
Liu Bo	bc1782374b	Btrfs: fix missing inherited flag in rename When we move a file into a directory with compression flag, we need to inherite BTRFS_INODE_COMPRESS and clear BTRFS_INODE_NOCOMPRESS as well. But if we move a file into a directory without compression flag, we need to clear both of them. It is the way how our setflags deals with compression flag, so keep the same behaviour here. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>	2012-06-15 11:33:30 -04:00
Chris Mason	acbcabd2de	Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into for-linus	2012-06-15 11:33:16 -04:00
Li Zefan	69e380d176	Btrfs: fix incompat flags setting It's a bug, but it happens to work, as BTRFS_COMPRESS_LZO == 2, which has only one bit set. Signed-off-by: Li Zefan <lizefan@huawei.com>	2012-06-14 21:30:57 -04:00
Li Zefan	6c282eb40e	Btrfs: fix defrag regression If a file has 3 small extents: \| ext1 \| ext2 \| ext3 \| Running "btrfs fi defrag" will only defrag the last two extents, if those extent mappings hasn't been read into memory from disk. This bug was introduced by commit `17ce6ef8d7` ("Btrfs: add a check to decide if we should defrag the range") The cause is, that commit looked into previous and next extents using lookup_extent_mapping() only. While at it, remove the code that checks the previous extent, since it's sufficient to check the next extent. Signed-off-by: Li Zefan <lizefan@huawei.com>	2012-06-14 21:30:55 -04:00
Josef Bacik	7ddf5a42d3	Btrfs: call filemap_fdatawrite twice for compression I removed this in an earlier commit and I was wrong. Because compression can return from filemap_fdatawrite() without having actually set any of it's pages as writeback() it can make filemap_fdatawait() do essentially nothing, and then we won't find any ordered extents because they may not have been created yet. So not only does this make fsync() completely useless, but it will also screw up if you truncate on a non-page aligned offset since we zero out the end and then wait on ordered extents and then call drop caches. We can drop the cache before the io completes and then we try to unpin the extent we just wrote we won't find it and everything goes sideways. So fix this by putting it back and put a giant comment there to keep me from trying to remove it in the future. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:30:54 -04:00
Josef Bacik	8180ef8894	Btrfs: keep inode pinned when compressing writes A user reported lots of problems using compression on the new code and it turns out part of the problem was that igrab() was failing when we added a new ordered extent. This is because when writing out an inode under compression we immediately return without actually doing anything to the pages, and then in another thread at some point down the line actually do the ordered dance. The problem is between the point that we start writeback and we actually add the ordered extent we could be trying to reclaim the inode, which makes igrab() return NULL. So we need to do an igrab() when we create the async extent and then drop it when we are done with it. This makes sure we stay pinned in memory until the ordered extent can get a reference on it and we are good to go. With this patch we no longer panic in btrfs_finish_ordered_io(). Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:30:53 -04:00
Josef Bacik	9c5085c147	Btrfs: implement ->show_devname Because btrfs can remove the device that was mounted we need to have a ->show_devname so that in this case we can print out some other device in the file system to /proc/mount. So if there are multiple devices in a btrfs file system we will just print the device with the lowest devid that we can find. This will make everything consistent and deal with device removal properly. The drawback is if you mount with a device that is higher than the lowest devicd it won't show up as the mounted device in /proc/mounts, but this is a small price to pay. This was inspired by Miao Xie's patch. Thanks, Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:30:37 -04:00
Josef Bacik	606686eeac	Btrfs: use rcu to protect device->name Al pointed out that we can just toss out the old name on a device and add a new one arbitrarily, so anybody who uses device->name in printk could possibly use free'd memory. Instead of adding locking around all of this he suggested doing it with RCU, so I've introduced a struct rcu_string that does just that and have gone through and protected all accesses to device->name that aren't under the uuid_mutex with rcu_read_lock(). This protects us and I will use it for dealing with removing the device that we used to mount the file system in a later patch. Thanks, Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:29:16 -04:00
Josef Bacik	17ca04aff7	Btrfs: unlock everything properly in the error case for nocow I was getting hung on umount when a transaction was aborted because a range of one of the free space inodes was still locked. This is because the nocow stuff doesn't unlock anything on error. This fixed the problem and I verified that is what was happening. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:29:15 -04:00
Josef Bacik	ee670f0af3	Btrfs: fix btrfs_destroy_marked_extents So we're forcing the eb's to have their ref count set to 1 so invalidatepage works but this breaks lots of things, for example root nodes, and is just plain wrong, we don't need to just evict all of this stuff. Also drop the invalidatepage altogether and add a page_cache_release(). With this patch we no longer hang when trying to access the root nodes after an aborted transaction and we no longer leak memory. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:29:14 -04:00
Josef Bacik	7b8b92af58	Btrfs: abort the transaction if the commit fails If a transaction commit fails we don't abort it so we don't set an error on the file system. This patch fixes that by actually calling the abort stuff and then adding a check for a fs error in the transaction start stuff to make sure it is caught properly. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:29:13 -04:00
Josef Bacik	d7096fc3ef	Btrfs: wake up transaction waiters when aborting a transaction I was getting lots of hung tasks and a NULL pointer dereference because we are not cleaning up the transaction properly when it aborts. First we need to reset the running_transaction to NULL so we don't get a bad dereference for any start_transaction callers after this. Also we cannot rely on waitqueue_active() since it's just a list_empty(), so just call wake_up() directly since that will do the barrier for us and such. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:29:12 -04:00
Josef Bacik	b939d1ab76	Btrfs: fix locking in btrfs_destroy_delayed_refs The transaction abort stuff was throwing warnings from the list debugging code because we do a list_del_init outside of the delayed_refs spin lock. The delayed refs locking makes baby Jesus cry so it's not hard to get wrong, but we need to take the ref head mutex to make sure it's not being processed currently, and so if it is we need to drop the spin lock and then take and drop the mutex and do the search again. If we can take the mutex then we can safely remove the head from the list and carry on. Now when the transaction aborts I don't get the list debugging warnings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:29:11 -04:00
Josef Bacik	beb42dd793	Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error While doing my enospc work I got a transaction abortion that resulted in a panic when we tried to unlock_page() an already unlocked page. This is because we aren't calling extent_clear_unlock_delalloc with the locked page so it was unlocking all the pages in the range. This is wrong since __extent_writepage expects to have the page locked still unless we return *page_started as 1. This should keep us from panicing. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-14 21:29:09 -04:00
Jan Schmidt	3310c36eef	Btrfs: fix race in tree mod log addition When adding to the tree modification log, we grab two locks at different stages. We must not drop the outer lock until we're done with section protected by the inner lock. This moves the unlock call for the outer lock to the appropriate position. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-14 18:52:39 +02:00
Jan Schmidt	3d7806eca4	Btrfs: add btrfs_next_old_leaf To make sense of the tree mod log, the backref walker not only needs btrfs_search_old_slot, but it also called btrfs_next_leaf, which in turn was calling btrfs_search_slot. This obviously didn't give the correct result. This commit adds btrfs_next_old_leaf, a drop-in replacement for btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot with this time_seq parameter. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-14 18:52:09 +02:00
Jan Schmidt	a95236d99f	Btrfs: fix return value for __tree_mod_log_oldest_root In __tree_mod_log_oldest_root() we must return the found operation even if it's not a ROOT_REPLACE operation. Otherwise, the caller assumes that there are no operations to be rewinded and returns immediately. The code in the caller is modified to improve readability. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-14 18:44:22 +02:00
Jan Schmidt	8ba97a15e7	Btrfs: use btrfs_read_lock_root_node in get_old_root get_old_root could race with root node updates because we weren't locking the node early enough. Use btrfs_read_lock_root_node to grab the root locked in the very beginning and release the lock as soon as possible (just like btrfs_search_slot does). Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-14 18:44:21 +02:00
Jan Schmidt	f617e2fd52	Btrfs: remove obsolete btrfs_next_leaf call from __resolve_indirect_ref When resolving indirect refs, we used to call btrfs_next_leaf in case we didn't find an exact match. While we should find exact matches most of the time, in case we don't, we must continue searching. Treating those matches differently depending on the level we're searching doesn't make sense. Even worse, we might end up searching for a key larger than the largest, in which case there is no next_leaf and subsequent jobs would fail. This commit drops the bogous lines. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-14 18:44:20 +02:00
Jan Schmidt	4d5a0565ce	Btrfs: remove call to btrfs_header_nritems with no effect This is a leftover from cleanup patch `559af821`. Before the cleanup, btrfs_header_nritems was called inside an if condition. As it has no side effects we need to preserve here, it should simply be dropped. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-06-04 14:35:29 +02:00
Linus Torvalds	1193755ac6	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs changes from Al Viro. "A lot of misc stuff. The obvious groups: * Miklos' atomic_open series; kills the damn abuse of ->d_revalidate() by NFS, which was the major stumbling block for all work in that area. * ripping security_file_mmap() and dealing with deadlocks in the area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in general. * ->encode_fh() switched to saner API; insane fake dentry in mm/cleancache.c gone. * assorted annotations in fs (endianness, __user) * parts of Artem's ->s_dirty work (jff2 and reiserfs parts) * ->update_time() work from Josef. * other bits and pieces all over the place. Normally it would've been in two or three pull requests, but signal.git stuff had eaten a lot of time during this cycle ;-/" Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the 'truncate_range' inode method was removed by the VM changes, the VFS update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due to sparse fix added twice, with other changes nearby). * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits) nfs: don't open in ->d_revalidate vfs: retry last component if opening stale dentry vfs: nameidata_to_filp(): don't throw away file on error vfs: nameidata_to_filp(): inline __dentry_open() vfs: do_dentry_open(): don't put filp vfs: split __dentry_open() vfs: do_last() common post lookup vfs: do_last(): add audit_inode before open vfs: do_last(): only return EISDIR for O_CREAT vfs: do_last(): check LOOKUP_DIRECTORY vfs: do_last(): make ENOENT exit RCU safe vfs: make follow_link check RCU safe vfs: do_last(): use inode variable vfs: do_last(): inline walk_component() vfs: do_last(): make exit RCU safe vfs: split do_lookup() Btrfs: move over to use ->update_time fs: introduce inode operation ->update_time reiserfs: get rid of resierfs_sync_super reiserfs: mark the superblock as dirty a bit later ...	2012-06-01 10:34:35 -07:00
Josef Bacik	e41f941a23	Btrfs: move over to use ->update_time Btrfs had been doing it's own file_update_time so we could catch ENOSPC properly, so just update our btrfs_update_time to work with the new stuff and then we'll be fancy later. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-06-01 12:07:52 -04:00
Linus Torvalds	51eab603f5	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This includes a fairly large change from Josef around data writeback completion. Before, the writeback wasn't completed until the metadata insertions for the extent were done, and this made for fairly large latency spikes on the last page of each ordered extent. We already had a separate mechanism for tracking pending metadata insertions, so Josef just needed to tweak things a little to end writeback earlier on the page. Overall it makes us much friendly to memory reclaim and lowers latencies quite a lot for synchronous IO. Jan Schmidt has finished some background work required to track btree blocks as they go through changes in ownership. It's the missing piece he needed for both btrfs send/receive and subvolume quotas. Neither of those are ready yet, but the new tracking code is included here. Most of the time, the new code is off. It is only used by scrub and other backref walkers. Stefan Behrens has added io failure tracking. This includes counters for which drives are causing the most trouble so the admin (or an automated tool) can choose to kick them out. We're tracking IO errors, crc errors, and generation checks we do on each metadata block. RAID5/6 did miss the cut this time because I'm having trouble with corruptions. I'll nail it down next week and post as a beta testing before 3.6" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (58 commits) Btrfs: fix tree mod log rewinded level and rewinding of moved keys Btrfs: fix tree mod log del_ptr Btrfs: add tree_mod_dont_log helper Btrfs: add missing spin_lock for insertion into tree mod log Btrfs: add inodes before dropping the extent lock in find_all_leafs Btrfs: use delayed ref sequence numbers for all fs-tree updates Btrfs: fix false positive in check-integrity on unmount Btrfs: fix runtime warning in check-integrity check data mode Btrfs: set ioprio of scrub readahead to idle Btrfs: fix return code in drop_objectid_items Btrfs: check to see if the inode is in the log before fsyncing Btrfs: return value of btrfs_read_buffer is checked correctly Btrfs: read device stats on mount, write modified ones during commit Btrfs: add ioctl to get and reset the device stats Btrfs: add device counters for detected IO and checksum errors btrfs: Drop unused function btrfs_abort_devices() Btrfs: fix the same inode id problem when doing auto defragment Btrfs: fall back to non-inline if we don't have enough space Btrfs: fix how we deal with the orphan block rsv Btrfs: convert the inode bit field to use the actual bit operations ...	2012-06-01 08:37:31 -07:00
Chris Mason	1e20932a23	Merge branch 'for-chris' of git://git.jan-o-sch.net/btrfs-unstable into for-linus Conflicts: fs/btrfs/ulist.h Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-05-31 16:49:53 -04:00
Jan Schmidt	c31931088f	Btrfs: fix tree mod log rewinded level and rewinding of moved keys When we rewind REMOVE_WHILE_FREEING operations, there's code that allocates a fresh buffer instead of cloning the old one. Setting that buffer's level correctly was missing in this case. When rewinding a MOVE_KEYS operation, btrfs_node_key_ptr_offset(slot) was missing for memmove_extent_buffer()'s arguments. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-31 19:56:19 +02:00
Jan Schmidt	f395694c2c	Btrfs: fix tree mod log del_ptr Logging for del_ptr when we're not deleting the last pointer was wrong. This fixes both, duplicate log entries and log sequence. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-31 19:56:19 +02:00
Jan Schmidt	e9b7fd4d8b	Btrfs: add tree_mod_dont_log helper Replace duplicate code by small inline helper function. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-31 19:56:18 +02:00
Jan Schmidt	926dd8a640	Btrfs: add missing spin_lock for insertion into tree mod log tree_mod_alloc calls __get_tree_mod_seq and must acquire a spinlock before doing so. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-31 19:56:18 +02:00
Jan Schmidt	3301958b7c	Btrfs: add inodes before dropping the extent lock in find_all_leafs We must build up the inode list with the extent lock held after following indirect refs. This also requires an extension to ulists, which allows to modify the stored aux value in case a key already exists in the list. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-31 19:53:08 +02:00
Jan Schmidt	95a06077f7	Btrfs: use delayed ref sequence numbers for all fs-tree updates The sequence number for delayed refs is needed to postpone certain delayed refs for a very short period while walking backrefs. Before the tree modification log, we thought we'd only have to hold back those references that don't have a counter operation. While now we've the tree mod log, we're rewinding fs tree blocks to a defined consistent state. We cannot know in advance for which tree block we'll be doing rewind operations later. Therefore, we must postpone all the delayed refs for fs-tree blocks, even those having a counter operation. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-30 18:18:21 +02:00
Stefan Behrens	48235a68a3	Btrfs: fix false positive in check-integrity on unmount During unmount, it could happen that the integrity checker printed a warning message "attempt to free ... on umount which is not yet iodone" which turned out to be a false positive. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-05-30 10:23:44 -04:00
Stefan Behrens	86ff7ffce0	Btrfs: fix runtime warning in check-integrity check data mode If a file_extent_item was located at the very end of a leaf and there was not enough space to hold a full item, but there was enough space to hold one of type BTRFS_FILE_EXTENT_INLINE or PREALLOC, and it was only such a short item, a warning was printed anyway. This check is now fixed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-05-30 10:23:43 -04:00
Stefan Behrens	3d136a1131	Btrfs: set ioprio of scrub readahead to idle Reduce ioprio class of scrub readahead threads to idle priority. This setting is fixed. This priority has shown the best performance during all measurements. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-05-30 10:23:43 -04:00
Josef Bacik	5bdbeb2187	Btrfs: fix return code in drop_objectid_items So dpkg fsync()'s the file and the directory containing the file whenever it writes to a file which is really slow in btrfs. This is partly because fsync()'ing a directory _always_ committed the transaction instead of just going to the tree log. This is because drop_objectid_items() would return 1 since it does a btrfs_search_slot() which returns 1. In tree-log jargon this means that we have to commit the transaction to be safe. So just check if ret is greater than 0 and set it to 0 if it does. With this patch we now use the tree-log instead of committing the entire transaction, which is twice as fast on my box. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:42 -04:00
Josef Bacik	22ee6985de	Btrfs: check to see if the inode is in the log before fsyncing We have this check down in the actual logging code, but this is after we start a transaction and all that good stuff. So move the helper inode_in_log() out so we can call it in fsync() and avoid starting a transaction altogether and just exit if we've already fsync()'ed this file recently. You would notice this issue if you fsync()'ed a file over and over again until the transaction committed. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:42 -04:00
Tsutomu Itoh	018642a1f1	Btrfs: return value of btrfs_read_buffer is checked correctly btrfs_read_buffer() has the possibility of returning the error. Therefore, I add the code in which the return value of btrfs_read_buffer() is checked. Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>	2012-05-30 10:23:41 -04:00
Stefan Behrens	733f4fbbc1	Btrfs: read device stats on mount, write modified ones during commit The device statistics are written into the device tree with each transaction commit. Only modified statistics are written. When a filesystem is mounted, the device statistics for each involved device are read from the device tree and used to initialize the counters. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-05-30 10:23:41 -04:00
Stefan Behrens	c11d2c236c	Btrfs: add ioctl to get and reset the device stats An ioctl interface is added to get the device statistic counters. A second ioctl is added to atomically get and reset these counters. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-05-30 10:23:40 -04:00
Stefan Behrens	442a4f6308	Btrfs: add device counters for detected IO and checksum errors The goal is to detect when drives start to get an increased error rate, when drives should be replaced soon. Therefore statistic counters are added that count IO errors (read, write and flush). Additionally, the software detected errors like checksum errors and corrupted blocks are counted. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-05-30 10:23:39 -04:00
Asias He	d07eb91170	btrfs: Drop unused function btrfs_abort_devices() 1) This function is not used anywhere. 2) Using the blk_abort_queue() to abort the queue seems not correct. blk_abort_queue() is used for timeout handling (block/blk-timeout.c). Cc: Chris Mason <chris.mason@oracle.com> Cc: linux-btrfs@vger.kernel.org Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-kernel@vger.kernel.org Signed-off-by: Asias He <asias@redhat.com>	2012-05-30 10:23:39 -04:00
Miao Xie	762f226326	Btrfs: fix the same inode id problem when doing auto defragment Two files in the different subvolumes may have the same inode id, so The rb-tree which is used to manage the defragment object must take it into account. This patch fix this problem. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>	2012-05-30 10:23:38 -04:00
Josef Bacik	2adcac1a73	Btrfs: fall back to non-inline if we don't have enough space If cow_file_range_inline fails with ENOSPC we abort the transaction which isn't very nice. This really shouldn't be happening anyways but there's no sense in making it a horrible error when we can easily just go allocate normal data space for this stuff. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:38 -04:00
Josef Bacik	8a35d95ff4	Btrfs: fix how we deal with the orphan block rsv Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Btrfs: fix how we deal with the orphan block rsv Ceph was hitting this race where we would remove an inode from the per-root orphan list before we would release the space we had reserved for the inode. We actually don't need a list or anything, we just need to make sure the root doesn't try to free up the orphan reserve until after the inodes have released their reservations. So use an atomic counter instead of a list on the root and only decrement the counter after we've released our reservation. I've tested this as well as several others and we no longer see the warnings that you would see while running ceph. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:37 -04:00
Josef Bacik	72ac3c0d79	Btrfs: convert the inode bit field to use the actual bit operations Miao pointed this out while I was working on an orphan problem that messing with a bitfield where different ranges are protected by different locks doesn't work out right. Turns out we've been doing this forever where we have different parts of the bit field protected by either no lock at all or different locks which could cause all sorts of weird problems including the issue I was hitting. So instead make a runtime_flags thing that we use the normal bit operations on that are all atomic so we can keep having our no/different locking for the different flags and then make force_compress it's own thing so it can be treated normally. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:36 -04:00
Josef Bacik	cd023e7b17	Btrfs: merge contigous regions when loading free space cache When we write out the free space cache we will write out everything that is in our in memory tree, and then we will just walk the pinned extents tree and write anything we see there. The problem with this is that during normal operations the pinned extents will be merged back into the free space tree normally, and then we can allocate space from the merged areas and commit them to the tree log. If we crash and replay the tree log we will crash again because the tree log will try to free up space from what looks like 2 seperate but contiguous entries, since one entry is from the original free space cache and the other was a pinned extent that was merged back. To fix this we just need to walk the free space tree after we load it and merge contiguous entries back together. This will keep the tree log stuff from breaking and it will make the allocator behave more nicely. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:36 -04:00
Liu Bo	9ba1f6e44e	Btrfs: do not do balance in readonly mode In normal cases, we would not be allowed to do balance in RO mode. However, when we're using a seeding device and adding another device to sprout, things will change: $ mkfs.btrfs /dev/sdb7 $ btrfstune -S 1 /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs -o ro $ btrfs fi bal /mnt/btrfs -----------------------> fail. $ btrfs dev add /dev/sdb8 /mnt/btrfs $ btrfs fi bal /mnt/btrfs -----------------------> works! It should not be designed as an exception, and we'd better add another check for mnt flags. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:35 -04:00
Liu Bo	d1ac6e41d5	Btrfs: use fastpath in extent state ops as much as possible Fully utilize our extent state's new helper functions to use fastpath as much as possible. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:34 -04:00
Liu Bo	f8c5d0b443	Btrfs: fix wrong error returned by adding a device Reproduce: $ mkfs.btrfs /dev/sdb7 $ mount /dev/sdb7 /mnt/btrfs -o ro $ btrfs dev add /dev/sdb8 /mnt/btrfs ERROR: error adding the device '/dev/sdb8' - Invalid argument Since we mount with readonly options, and /dev/sdb7 is not a seeding one, a readonly notification is preferred. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com> Reviewed-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:34 -04:00
Josef Bacik	5fd0204355	Btrfs: finish ordered extents in their own thread We noticed that the ordered extent completion doesn't really rely on having a page and that it could be done independantly of ending the writeback on a page. This patch makes us not do the threaded endio stuff for normal buffered writes and direct writes so we can end page writeback as soon as possible (in irq context) and only start threads to do the ordered work when it is actually done. Compression needs to be reworked some to take advantage of this as well, but atm it has to do a find_get_page in its endio handler so it must be done in its own thread. This makes direct writes quite a bit faster. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:33 -04:00
Josef Bacik	4e89915220	Btrfs: do not check delalloc when updating disk_i_size We are checking delalloc to see if it is ok to update the i_size. There are 2 cases it stops us from updating 1) If there is delalloc between our current disk_i_size and this ordered extent 2) If there is delalloc between our current ordered extent and the next ordered extent These tests are racy however since we can set delalloc for these ranges at any time. Also for the first case if we notice there is delalloc between disk_i_size and our ordered extent we will not update disk_i_size and assume that when that delalloc bit gets written out it will update everything properly. However if we crash before that we will have file extents outside of our i_size, which is not good, so this test is dangerous as well as racy. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:33 -04:00
Jim Meyering	f60d16a892	Btrfs: avoid buffer overrun in mount option handling There is an off-by-one error: allocating room for a maximal result string but without room for a trailing NUL. That, can lead to returning a transformed string that is not NUL-terminated, and then to a caller reading beyond end of the malloc'd buffer. Rewrite to s/kzalloc/kmalloc/, remove unwarranted use of strncpy (the result is guaranteed to fit), remove dead strlen at end, and change a few variable names and comments. Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Jim Meyering <meyering@redhat.com>	2012-05-30 10:23:32 -04:00
Jim Meyering	a27202fbe9	Btrfs: NUL-terminate path buffer in DEV_INFO ioctl result A device with name of length BTRFS_DEVICE_PATH_NAME_MAX or longer would not be NUL-terminated in the DEV_INFO ioctl result buffer. Signed-off-by: Jim Meyering <meyering@redhat.com>	2012-05-30 10:23:31 -04:00
Jim Meyering	f07c9a79f0	Btrfs: avoid buffer overrun in btrfs_printk The buffer read-overrun would be triggered by a printk format starting with <N>, where N is a single digit. NUL-terminate after strncpy. Use memcpy, not strncpy, since we know the string we're copying fits in the destination buffer and contains no NUL byte. Signed-off-by: Jim Meyering <meyering@redhat.com>	2012-05-30 10:23:31 -04:00
Daniel J Blueman	2eec6c8102	Fix minor type issues Address some minor type issues identified by sparse checker. Signed-off-by: Daniel J Blueman <daniel@quora.org>	2012-05-30 10:23:30 -04:00
Sergei Trofimovich	0d2450abfa	btrfs: allow changing 'thread_pool' size at remount time Changing 'mount -oremount,thread_pool=2 /' didn't make any effect: maximum amount of worker threads is specified in 2 places: - in 'strict btrfs_fs_info::thread_pool_size' - in each worker struct: 'struct btrfs_workers::max_workers' 'mount -oremount' updated only 'btrfs_fs_info::thread_pool_size'. Fix it by pushing new maximum value to all created worker structures as well. Cc: Josef Bacik <josef@redhat.com> Cc: Chris Mason <chris.mason@oracle.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>	2012-05-30 10:23:30 -04:00
Josef Bacik	0885ef5b56	Btrfs: do not do filemap_write_and_wait_range in fsync We already do the btrfs_wait_ordered_range which will do this for us, so just remove this call so we don't call it twice. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:29 -04:00
Josef Bacik	551ebb2d34	Btrfs: remove useless waiting and extra filemap work In btrfs_wait_ordered_range we have been calling filemap_fdata_write() twice because compression does strange things and then waiting. Then we look up ordered extents and if we find any we will always schedule_timeout(); once and then loop back around and do it all again. We will even check to see if there is delalloc pages on this range and loop again. So this patch gets rid of the multipe fdata_write() calls and just does filemap_write_and_wait(). In the case of compression we will still find the ordered extents and start those individually if we need to so that is ok, but in the normal buffered case we avoid all this weird overhead. Then in the case of the schedule_timeout(1), we don't need it. All callers either 1) don't care, they just want to make sure what they just wrote maeks it to disk or 2) are doing the lock()->lookup ordered->unlock->flush thing in which case it will lock and check for ordered extents _anyway_ so get back to them as quickly as possible. The delaloc check is simply not needed, this only catches the case where we write to the file again since doing the filemap_write_and_wait() and if the caller truly cares about that it will take care of everything itself. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:28 -04:00
Josef Bacik	d7dbe9e7f6	Btrfs: fix compile warnings in extent_io.c These warnings are bogus since we will always have at least one page in an eb, but to make the compiler happy just set ret = 0 in these two cases. Thanks, Btrfs: fix compile warnings in extent_io.c These warnings are bogus since we will always have at least one page in an eb, but to make the compiler happy just set ret = 0 in these two cases. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:28 -04:00
Josef Bacik	30f8fe3e47	Btrfs: cache no acl on new inodes When running compilebench I noticed we were spending some time looking up acls on new inodes, which shouldn't be happening since there were no acls. This is because when we init acls on the inode after creating them we don't cache the fact there are no acls if there aren't any. Doing this adds a little bit of a bump to my compilebench runs. Thanks, Btrfs: cache no acl on new inodes Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:27 -04:00
Josef Bacik	0c4d2d95d0	Btrfs: use i_version instead of our own sequence We've been keeping around the inode sequence number in hopes that somebody would use it, but nobody uses it and people actually use i_version which serves the same purpose, so use i_version where we used the incore inode's sequence number and that way the sequence is updated properly across the board, and not just in file write. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-05-30 10:23:27 -04:00
Jan Schmidt	20b297d620	Btrfs: tree mod log sanity checks in join_transaction When a fresh transaction begins, the tree mod log must be clean. Users of the tree modification log must ensure they never span across transaction boundaries. We reset the sequence to 0 in this safe situation to make absolutely sure overflow can't happen. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-30 15:17:36 +02:00
Jan Schmidt	19ae4e8133	Btrfs: fs_info variable for join_transaction Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-30 15:17:35 +02:00
Jan Schmidt	8445f61cad	Btrfs: use the tree modification log for backref resolving This enables backref resolving on life trees while they are changing. This is a prerequisite for quota groups and just nice to have for everything else. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-30 15:17:34 +02:00
Jan Schmidt	5d9e75c41d	Btrfs: add btrfs_search_old_slot The tree modification log together with the current state of the tree gives a consistent, old version of the tree. btrfs_search_old_slot is used to search through this old version and return old (dummy!) extent buffers. Naturally, this function cannot do any tree modifications. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-30 15:17:33 +02:00
Jan Schmidt	f3ea38da3e	Btrfs: add del_ptr and insert_ptr modifications to the tree mod log Record all relevant modifications to block pointers in the tree mod log so that we can rewind them later on for backref walking. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-30 15:17:32 +02:00
Jan Schmidt	f230475e62	Btrfs: put all block modifications into the tree mod log When running functions that can make changes to the internal trees (e.g. btrfs_search_slot), we check if somebody may be interested in the block we're currently modifying. If so, we record our modification to be able to rewind it later on. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-30 15:17:29 +02:00
Jan Schmidt	bd989ba359	Btrfs: add tree modification log functions The tree mod log will log modifications made fs-tree nodes. Most modifications are done by autobalance of the tree. Such changes are recorded as long as a block entry exists. When released, the log is cleaned. With the tree modification log, it's possible to reconstruct a consistent old state of the tree. This is required to do backref walking on a busy file system. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-30 15:17:01 +02:00
Al Viro	528c032764	btrfs: trivial endianness annotations Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-05-29 23:28:35 -04:00
Al Viro	b0b0382bb4	->encode_fh() API change pass inode + parent's inode or NULL instead of dentry + bool saying whether we want the parent or not. NOTE: that needs ceph fix folded in. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-05-29 23:28:33 -04:00
Linus Torvalds	90324cc1b1	avoid iput() from flusher thread -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP +AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV 4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv 5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63 BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF /9c4zxxekjHZnn2oooEa =oLXf -----END PGP SIGNATURE----- Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally	2012-05-28 09:54:45 -07:00
Jan Schmidt	f29021b29a	Btrfs: add tree mod log to fs_info Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:54 +02:00
Jan Schmidt	815a51c74a	Btrfs: dummy extent buffers for tree mod log The tree modification log needs two ways to create dummy extent buffers, once by allocating a fresh one (to rebuild an old root) and once by cloning an existing one (to make private rewind modifications) to it. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:54 +02:00
Jan Schmidt	64947ec0d1	Btrfs: move struct seq_list to ctree.h Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:53 +02:00
Jan Schmidt	5581a51a59	Btrfs: don't set for_cow parameter for tree block functions Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed parameter for_cow = 1. In fact, these two functions should never mark their tree modification operations as for_cow, because they can change the number of blocks referenced by a tree. Hence, we remove the extra for_cow parameter from these functions and make them pass a zero down. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:53 +02:00
Jan Schmidt	976b1908d9	Btrfs: look into the extent during find_all_leafs Before this patch we called find_all_leafs for a data extent, then called find_all_roots and then looked into the extent to grab the information we were seeking. This was done without holding the leaves locked to avoid deadlocks. However, this can obviouly race with concurrent tree modifications. Instead, we now look into the extent while we're holding the lock during find_all_leafs and store this information together with the leaf list. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:52 +02:00
Jan Schmidt	d5c88b735f	Btrfs: bugfix: ignore the wrong key for indirect tree block backrefs The key we store with a tree block backref is only a hint. It is set when the ref is created and can remain correct for a long time. As the tree is rebalanced, however, eventually the key no longer points to the correct destination. With this patch, we change find_parent_nodes to no longer add keys unless it knows for sure they're correct (e.g. because they're for an extent data backref). Then when we later encounter a backref ref with no parent and no key set, we grab the block and take the first key from the block itself. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:51 +02:00
Jan Schmidt	dadcaf78b5	Btrfs: bugfix in btrfs_find_parent_nodes That one has been around since the addition of backref.c. Due to the way we calculate our slot numbers, after adding inline refs we're missing one keyed ref unless it's located at the beginning of a new leaf. Reported-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:51 +02:00
Jan Schmidt	cd1b413c5c	Btrfs: ulist realloc bugfix ulist_next gets the pointer to the previously returned element to find the next element from there. However, when we call ulist_add while iteration with ulist_next is in progress (ulist explicitly supports this), we can realloc the ulist internal memory, which makes the pointer to the previous element useless. Instead, we now use an iterator parameter that's independent from the internal pointers. Reported-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-05-26 12:17:49 +02:00
Linus Torvalds	e8650a0823	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial updates from Jiri Kosina: "As usual, it's mostly typo fixes, redundant code elimination and some documentation updates." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits) edac, mips: don't change code that has been removed in edac/mips tree xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer lib: Change mail address of Oskar Schirmer net: Change mail address of Oskar Schirmer arm/m68k: Change mail address of Sebastian Hess i2c: Change mail address of Oskar Schirmer net: Fix tcp_build_and_update_options comment in struct tcp_sock atomic64_32.h: fix parameter naming mismatch Kconfig: replace "--- help ---" with "---help---" c2port: fix bogus Kconfig "default no" edac: Fix spelling errors. qla1280: Remove redundant NULL check before release_firmware() call remoteproc: remove redundant NULL check before release_firmware() qla2xxx: Remove redundant NULL check before release_firmware() call. aic94xx: Get rid of redundant NULL check before release_firmware() call tehuti: delete redundant NULL check before release_firmware() qlogic: get rid of a redundant test for NULL before call to release_firmware() bna: remove redundant NULL test before release_firmware() tg3: remove redundant NULL test before release_firmware() call typhoon: get rid of redundant conditional before all to release_firmware() ...	2012-05-22 19:22:50 -07:00
Dan Carpenter	a25c75d5ad	Btrfs: cleanup: use consistent lock naming It confuses Smatch that we use two names for the same lock. Plus the shorter name is nicer. This doesn't change how the code works, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-05-11 10:56:41 -04:00
Stefan Behrens	e06baab418	Btrfs: change integrity checker to support big blocks The integrity checker used to be coded for nodesize == leafsize == sectorsize == PAGE_CACHE_SIZE. This is now changed to support sizes for nodesize and leafsize which are N * PAGE_CACHE_SIZE. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-05-11 10:56:40 -04:00
Wang Sheng-Hui	fd5e62a37c	Btrfs: remove the useless assignment to entry in function tree_insert of file extent_io.c In tree_insert, var entry is used in the loop only, and is useless out of the loop. Remove the useless assignment after the loop. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>	2012-05-11 10:56:40 -04:00
Wang Sheng-Hui	477d7eafa9	Btrfs: fix the comment for find_first_extent_bit The return value of find_first_extent_bit is 1 or 0, no < 0. And if found something, return 0; if nothing was found, return 1. Fix the comment. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>	2012-05-11 10:56:39 -04:00
Wang Sheng-Hui	39bab87ba6	Btrfs: fix btrfs_release_extent_buffer_page with the right usage of num_extent_pages num_extent_pages returns the number of pages in the specific range, not the index of the last page in the eb range. btrfs_release_extent_buffer_page is called with start_idx set 0 in current codes, so it's not a problem yet. But the logic is indeed wrong. Fix it here. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>	2012-05-11 10:56:38 -04:00
Wang Sheng-Hui	1b303fc054	Btrfs: cleanup the comment for clear_state_bit in extent_io.c No 'delete' arg is used for clear_state_bit. Cleanup the comment. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>	2012-05-11 10:56:38 -04:00
Wang Sheng-Hui	f775738f6f	btrfs/ctree.c: remove the unnecessary 'return -1;' at the end of bin_search The code path should not reach there. Remove it. Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>	2012-05-11 10:56:37 -04:00
Linus Torvalds	271fd5d728	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "The big ones here are a memory leak we introduced in rc1, and a scheduling while atomic if the transid on disk doesn't match the transid we expected. This happens for corrupt blocks, or out of date disks. It also fixes up the ioctl definition for our ioctl to resolve logical inode numbers. The __u32 was a merging error and doesn't match what we ship in the progs." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: avoid sleeping in verify_parent_transid while atomic Btrfs: fix crash in scrub repair code when device is missing btrfs: Fix mismatching struct members in ioctl.h Btrfs: fix page leak when allocing extent buffers Btrfs: Add properly locking around add_root_to_dirty_list	2012-05-06 10:20:07 -07:00
Chris Mason	b9fab919b7	Btrfs: avoid sleeping in verify_parent_transid while atomic verify_parent_transid needs to lock the extent range to make sure no IO is underway, and so it can safely clear the uptodate bits if our checks fail. But, a few callers are using it with spinlocks held. Most of the time, the generation numbers are going to match, and we don't want to switch to a blocking lock just for the error case. This adds an atomic flag to verify_parent_transid, and changes it to return EAGAIN if it needs to block to properly verifiy things. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-05-06 07:23:47 -04:00
Jan Kara	dbd5768f87	vfs: Rename end_writeback() to clear_inode() After we moved inode_sync_wait() from end_writeback() it doesn't make sense to call the function end_writeback() anymore. Rename it to clear_inode() which well says what the function really does - set I_CLEAR flag. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>	2012-05-06 13:43:41 +08:00
Stefan Behrens	ea9947b439	Btrfs: fix crash in scrub repair code when device is missing Fix that when scrub tries to repair an I/O or checksum error and one of the devices containing the mirror is missing, it crashes in bio_add_page because the bdev is a NULL pointer for missing devices. Reported-by: Marco L. Crociani <marco.crociani@gmail.com> Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-05-04 15:16:07 -04:00
Alexander Block	d04b1debc9	btrfs: Fix mismatching struct members in ioctl.h Fix the size members of btrfs_ioctl_ino_path_args and btrfs_ioctl_logical_ino_args. The user space btrfs-progs utilities used __u64 and the kernel headers used __u32 before. Signed-off-by: Alexander Block <ablock84@googlemail.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-05-04 15:16:06 -04:00
Josef Bacik	17de39ac17	Btrfs: fix page leak when allocing extent buffers If we happen to alloc a extent buffer and then alloc a page and notice that page is already attached to an extent buffer, we will only unlock it and free our existing eb. Any pages currently attached to that eb will be properly freed, but we don't do the page_cache_release() on the page where we noticed the other extent buffer which can cause us to leak pages and I hope cause the weird issues we've been seeing in this area. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-05-04 15:16:06 -04:00
Chris Mason	e5846fc665	Btrfs: Add properly locking around add_root_to_dirty_list add_root_to_dirty_list happens once at the very beginning of the transaction, but it is still racey. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-05-04 15:14:11 -04:00
Linus Torvalds	f7b0069317	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This has our collection of bug fixes. I missed the last rc because I thought our patches were making NFS crash during my xfs test runs. Turns out it was an NFS client bug fixed by someone else while I tried to bisect it. All of these fixes are small, but some are fairly high impact. The biggest are fixes for our mount -o remount handling, a deadlock due to GFP_KERNEL allocations in readdir, and a RAID10 error handling bug. This was tested against both 3.3 and Linus' master as of this morning." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (26 commits) Btrfs: reduce lock contention during extent insertion Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir Btrfs: Fix space checking during fs resize Btrfs: fix block_rsv and space_info lock ordering Btrfs: Prevent root_list corruption Btrfs: fix repair code for RAID10 Btrfs: do not start delalloc inodes during sync Btrfs: fix that check_int_data mount option was ignored Btrfs: don't count CRC or header errors twice while scrubbing Btrfs: fix btrfs_ioctl_dev_info() crash on missing device btrfs: don't return EINTR Btrfs: double unlock bug in error handling Btrfs: always store the mirror we read the eb from fs/btrfs/volumes.c: add missing free_fs_devices btrfs: fix early abort in 'remount' Btrfs: fix max chunk size check in chunk allocator Btrfs: add missing read locks in backref.c Btrfs: don't call free_extent_buffer twice in iterate_irefs Btrfs: Make free_ipath() deal gracefully with NULL pointers Btrfs: avoid possible use-after-free in clear_extent_bit() ...	2012-04-28 09:30:07 -07:00
Chris Mason	dc7fdde39e	Btrfs: reduce lock contention during extent insertion We're spending huge amounts of time on lock contention during end_io processing because we unconditionally assume we are overwriting an existing extent in the file for each IO. This checks to see if we are outside i_size, and if so, it uses a less expensive readonly search of the btree to look for existing extents. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-27 14:51:05 -04:00
Chris Mason	fede766f28	Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir Btrfs has an optimization where it will preallocate dentries during readdir to fill in enough information to open the inode without an extra lookup. But, we're calling d_alloc, which is doing GFP_KERNEL allocations, and that leads to deadlocks because our readdir code has tree locks held. For now, disable this optimization. We'll fix the gfp mask in the next merge window. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-27 14:23:22 -04:00
Daniel J Blueman	7654b72417	Btrfs: Fix space checking during fs resize Fix out-of-space checking, addressing a warning and potential resource leak when resizing the filesystem down while allocating blocks. Signed-off-by: Daniel J Blueman <daniel@quora.org> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-27 13:55:14 -04:00
Stefan Behrens	1f699d38b6	Btrfs: fix block_rsv and space_info lock ordering may_commit_transaction() calls spin_lock(&space_info->lock); spin_lock(&delayed_rsv->lock); and update_global_block_rsv() calls spin_lock(&block_rsv->lock); spin_lock(&sinfo->lock); Lockdep complains about this at run time. Everywhere except in update_global_block_rsv(), the space_info lock is the outer lock, therefore the locking order in update_global_block_rsv() is changed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-27 13:55:14 -04:00
Daniel J Blueman	1daf3540fa	Btrfs: Prevent root_list corruption I was seeing root_list corruption on unmount during fs resize in 3.4-rc4; add correct locking to address this. Signed-off-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-27 13:55:13 -04:00
Jan Schmidt	3e74317ad7	Btrfs: fix repair code for RAID10 btrfs_map_block sets mirror_num, so that the repair code knows eventually which device gave us the read error. For RAID10, mirror_num must be 1 or 2. Before this fix mirror_num was incorrectly related to our stripe index. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-27 13:55:13 -04:00
Josef Bacik	996d282c7f	Btrfs: do not start delalloc inodes during sync btrfs_start_delalloc_inodes will just walk the list of delalloc inodes and start writing them out, but it doesn't splice the list or anything so as long as somebody is doing work on the box you could end up in this section _forever_. So just remove it, it's not needed anyway since sync will start writeback on all inodes anyway, all we need to do is wait for ordered extents and then we can commit the transaction. In my horrible torture test sync goes from taking 4 minutes to about 1.5 minutes. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2012-04-27 13:55:12 -04:00
Stefan Behrens	25cd999e1a	Btrfs: fix that check_int_data mount option was ignored The bitfield member mount_opt was too small by one bit to hold the mount option that enabled to include data extents in the integrity checker. Since the same issue happened when the BTRFS_MOUNT_PANIC_ON_FATAL_ERROR option was added (git rebase silently merges so that the increase of the size of the bitfield member is lost), the bit limit was removed entirely. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-04-18 19:22:38 +02:00
Stefan Behrens	5c84fc3c39	Btrfs: don't count CRC or header errors twice while scrubbing Each CRC or header error was counted twice, this is now fixed. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-04-18 19:22:36 +02:00
Stefan Behrens	99ba55ad69	Btrfs: fix btrfs_ioctl_dev_info() crash on missing device When a filesystem is mounted with the degraded option, it is possible that some of the devices are not there. btrfs_ioctl_dev_info() crashs in this case because the device name is a NULL pointer. This ioctl was only used for scrub. Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>	2012-04-18 19:22:35 +02:00
Arne Jansen	b9688bb845	btrfs: don't return EINTR It is basically a good thing if we are interruptible when waiting for free space, but the generality in which it is implemented currently leads to system calls being interruptible that are not documented this way. For example git can't handle interrupted unlink(), leading to corrupt repos under space pressure. Instead we raise the bar to only be interruptible by SIGKILL. Thanks to David Sterba for suggesting this. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-04-18 19:22:33 +02:00
Dan Carpenter	253beebd5a	Btrfs: double unlock bug in error handling The caller expects this function to return with the lock held and releases it immediately on error. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>	2012-04-18 19:22:31 +02:00
Josef Bacik	5cf1ab5613	Btrfs: always store the mirror we read the eb from A user reported a panic where we were trying to fix a bad mirror but the mirror number we were giving was 0, which is invalid. This is because we don't do the transid verification until after the read, so as far as the read code is concerned the read was a success. So instead store the mirror we read from so that if there is some failure post read we know which mirror to try next and which mirror needs to be fixed if we find a good copy of the block. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com>	2012-04-18 19:22:30 +02:00
Julia Lawall	48d282326b	fs/btrfs/volumes.c: add missing free_fs_devices Free fs_devices as done in the error-handling code just below. Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>	2012-04-18 19:22:28 +02:00
Sergei Trofimovich	8a3db1849e	btrfs: fix early abort in 'remount' Cc: Jeff Mahoney <jeffm@suse.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Josef Bacik <josef@redhat.com> Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>	2012-04-18 19:22:26 +02:00
Ilya Dryomov	37db63a400	Btrfs: fix max chunk size check in chunk allocator Fix a bug, where in case we need to adjust stripe_size so that the length of the resulting chunk is less than or equal to max_chunk_size, DUP chunks turn out to be only half as big as they could be. Cc: Arne Jansen <sensille@gmx.net> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2012-04-18 19:22:25 +02:00
Jan Schmidt	b916a59adf	Btrfs: add missing read locks in backref.c iref_to_path and iterate_irefs both increment the eb's refcount to use it after releasing the path. Both depend on consistent data remaining in the extent buffer and need a read lock to protect it. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-04-18 19:22:23 +02:00
Jan Schmidt	aefc1eb13e	Btrfs: don't call free_extent_buffer twice in iterate_irefs Avoid calling free_extent_buffer more than once when the iterator function returns non-zero. The only code that uses this is scrub repair for corrupted nodatasum blocks. Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>	2012-04-18 19:22:21 +02:00
Jesper Juhl	4735fb2828	Btrfs: Make free_ipath() deal gracefully with NULL pointers Make free_ipath() behave like most other freeing functions in the kernel and gracefully do nothing when passed a NULL pointer. Besides this making the bahaviour consistent with functions such as kfree(), vfree(), btrfs_free_path() etc etc, it also fixes a real NULL deref issue in fs/btrfs/ioctl.c::btrfs_ioctl_ino_to_path(). In that function we have this code: ... ipath = init_ipath(size, root, path); if (IS_ERR(ipath)) { ret = PTR_ERR(ipath); ipath = NULL; goto out; } ... out: btrfs_free_path(path); free_ipath(ipath); ... If we ever take the true branch of that 'if' statement we'll end up passing a NULL pointer to free_ipath() which will subsequently dereference it and we'll go "Boom" :-( This patch will avoid that. Signed-off-by: Jesper Juhl <jj@chaosbits.net>	2012-04-18 19:22:20 +02:00
Li Zefan	cdc6a39525	Btrfs: avoid possible use-after-free in clear_extent_bit() clear_extent_bit() { next_node = rb_next(&state->rb_node); ... clear_state_bit(state); <-- this may free next_node if (next_node) { state = rb_entry(next_node); ... } } clear_state_bit() calls merge_state() which may free the next node of the passing extent_state, so clear_extent_bit() may end up referencing freed memory. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2012-04-18 19:22:18 +02:00
Li Zefan	8e52acf704	Btrfs: retrurn void from clear_state_bit Currently it returns a set of bits that were cleared, but this return value is not used at all. Moreover it doesn't seem to be useful, because we may clear the bits of a few extent_states, but only the cleared bits of last one is returned. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>	2012-04-18 19:22:16 +02:00
David Sterba	871383be59	btrfs: add missing unlocks to transaction abort paths Added in commit `49b25e0540` ("btrfs: enhance transaction abort infrastructure") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David Sterba <dsterba@suse.cz>	2012-04-18 19:22:14 +02:00
Liu Bo	8d082fb727	Btrfs: do not mount when we have a sectorsize unequal to PAGE_SIZE Our code is not ready to cope with a sectorsize that's not equal to PAGE_SIZE. It will lead to hanging-on while writing something. Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>	2012-04-18 19:22:13 +02:00
Arne Jansen	207a232cca	btrfs: don't add both copies of DUP to reada extent tree Normally when there are 2 copies of a block, we add both to the reada extent tree and prefetch only the one that is easier to reach. This way we can better utilize multiple devices. In case of DUP this makes no sense as both copies reside on the same device. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-04-18 19:12:44 +02:00
Arne Jansen	8c9c2bf7a3	btrfs: fix race in reada When inserting into the radix tree returns EEXIST, get the existing entry without giving up the spinlock in between. There was a race for both the zones trees and the extent tree. Signed-off-by: Arne Jansen <sensille@gmx.net>	2012-04-18 19:12:44 +02:00

... 33 34 35 36 37 ...

5817 Commits