Commit Graph

3533 Commits

Author SHA1 Message Date
Joe Perches
8be04b9374 treewide: Add __GFP_NOWARN to k.alloc calls with v.alloc fallbacks
Don't emit OOM warnings when k.alloc calls fail when
there is a v.alloc immediately afterwards.

Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-08-20 13:06:40 +02:00
Zach Brown
db62efbbf8 btrfs: don't loop on large offsets in readdir
When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.

But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX.  It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.

So let's set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t.   Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.

Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:34:56 -04:00
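
The offset choice described in db62efbbf8 can be sketched in a few lines. This is a minimal model, not the kernel code: the function name `end_of_dir_offset` is hypothetical, and treating "overflowed" as `>= INT_MAX` is an assumption.

```python
INT_MAX = 2**31 - 1
LLONG_MAX = 2**63 - 1


def end_of_dir_offset(last_entry_offset: int) -> int:
    """Pick the sentinel offset stored once readdir() reaches the last entry.

    Hypothetical helper; the threshold comparison is an assumption.
    """
    # If the last entry already sits past the 32-bit range, INT_MAX would
    # point *before* it and readdir() would loop, so use LLONG_MAX instead.
    if last_entry_offset >= INT_MAX:
        return LLONG_MAX
    # Without large offsets the behaviour stays identical to before.
    return INT_MAX
```

With small offsets the sentinel stays at INT_MAX, so 32-bit applications see no change.
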
Josef Bacik
cfad392b22 Btrfs: check to see if root_list is empty before adding it to dead roots
A user reported a panic when running with autodefrag and deleting snapshots.
This is because we could end up trying to add the root to the dead roots list
twice.  To fix this check to see if we are empty before adding ourselves to the
dead roots list.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:23 -04:00
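
The guard described in cfad392b22 amounts to making the add idempotent. The sketch below is a toy model under stated assumptions: `Root`, `on_dead_list`, and `add_dead_root` are hypothetical names, and the boolean stands in for the kernel's list-membership test.

```python
class Root:
    def __init__(self, name):
        self.name = name
        # Stands in for "root_list is not empty" in the kernel structure.
        self.on_dead_list = False


def add_dead_root(dead_roots: list, root: Root) -> bool:
    """Add root to dead_roots unless it is already queued."""
    if root.on_dead_list:
        return False  # already queued; adding twice is what caused the panic
    dead_roots.append(root)
    root.on_dead_list = True
    return True
```
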
Josef Bacik
f3b15ccdbb Btrfs: release both paths before logging dir/changed extents
The ceph guys tripped over this bug where we were still holding onto the
original path that we used to copy the inode with when logging.  This is based
on Chris's fix which was reported to fix the problem.  We need to drop the paths
in two cases anyway so just move the drop up so that we don't have duplicate
code.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:16 -04:00
Josef Bacik
ee20a98314 Btrfs: allow splitting of hole em's when dropping extent cache
I noticed while running multi-threaded fsync tests that sometimes fsck would
complain about an improper gap.  This happens because we fail to add a hole
extent to the file, which was happening when we'd split a hole EM because
btrfs_drop_extent_cache was just discarding the whole em instead of splitting
it.  So this patch fixes this by allowing us to split a hole em properly, which
means that added holes actually get logged properly and we no longer see this
fsck error.  Thankfully we're tolerant of these sort of problems so a user would
not see any adverse effects of this bug, other than fsck complaining.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:09 -04:00
Josef Bacik
ed8c4913da Btrfs: make sure the backref walker catches all refs to our extent
Because we don't mess with the offset into the extent for compressed we will
properly find both extents for this case

[extent a][extent b][rest of extent a]

but because we already added a ref for the front half we won't add the inode
information for the second half.  This causes us to leak that memory and not
print out the other offset when we do logical-resolve.  So fix this by calling
ulist_add_merge and then add our eie to the existing entry if there is one.
With this patch we get both offsets out of logical-resolve.  With this and the
other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
set.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:30:03 -04:00
Josef Bacik
8ca15e05e6 Btrfs: fix backref walking when we hit a compressed extent
If you do btrfs inspect-internal logical-resolve on a compressed extent that has
been partly overwritten it won't find anything.  This is because we try and
match the extent offset we've searched for based on the extent offset in the
data extent entry.  However this doesn't work for compressed extents because the
offsets are for the uncompressed size, not the compressed size.  So instead only
do this check if we are not compressed, that way we can get an actual entry for
the physical offset rather than nothing for compressed.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:56 -04:00
Josef Bacik
b76bb70136 Btrfs: do not offset physical if we're compressed
xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
offsetting the physical based on the extent offset.  This is perfectly fine with
uncompressed extents, however the extent offset is into the uncompressed area,
not the compressed.  So we can return a physical value that isn't at all within
the area we have allocated on disk.  Fix this by returning the start of the
extent if it is compressed no matter what the offset.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:50 -04:00
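
The rule from b76bb70136 can be illustrated as a small function. This is a sketch only; `fiemap_physical` and its parameter names are hypothetical and do not mirror the kernel's fiemap code.

```python
def fiemap_physical(extent_start: int, extent_offset: int, compressed: bool) -> int:
    """Return the physical address to report for a mapped range."""
    if compressed:
        # The extent offset indexes the uncompressed data, so adding it
        # could point outside the area allocated on disk.
        return extent_start
    # For uncompressed extents the offset maps directly onto the disk.
    return extent_start + extent_offset
```
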
Liu Bo
b5b9b5b318 Btrfs: fix extent buffer leak after backref walking
commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks)
takes an extra increment on the reference of allocated dummy extent buffer, so now we
cannot free this dummy one, and end up with extent buffer leak.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:42 -04:00
Liu Bo
e68afa49ae Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents
For partial extents, snapshot-aware defrag does not work as expected,
since
a) we use the wrong logical offset to search for parents, which should be
   disk_bytenr + extent_offset, not just disk_bytenr,
b) 'offset' returned by the backref walking just refers to key.offset, not
   the 'offset' stored in btrfs_extent_data_ref which is
   (key.offset - extent_offset).

The reproducer:
$ mkfs.btrfs sda
$ mount sda /mnt
$ btrfs sub create /mnt/sub
$ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
$ btrfs sub snap /mnt/sub /mnt/snap1
$ btrfs sub snap /mnt/sub /mnt/snap2
$ sync; btrfs filesystem defrag /mnt/sub/foo;
$ umount /mnt
$ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-aware.)

This addresses the above two problems.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:29:17 -04:00
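
The two offsets named in points a) and b) of e68afa49ae are plain arithmetic. The helper names below are hypothetical; only the formulas come from the commit message.

```python
def parent_search_logical(disk_bytenr: int, extent_offset: int) -> int:
    """Logical offset to use when searching for parents of a partial extent."""
    return disk_bytenr + extent_offset


def data_ref_offset(key_offset: int, extent_offset: int) -> int:
    """The 'offset' stored in btrfs_extent_data_ref, per the commit message."""
    return key_offset - extent_offset
```

For a full extent `extent_offset` is 0, so both values collapse to `disk_bytenr` and `key.offset`, which is presumably why the bug only showed up on partial extents.
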
Jie Liu
7cddc19392 btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified
Create a small file and fallocate it to a big size with
FALLOC_FL_KEEP_SIZE option, then truncate it back to the
small size again, the disk free space is not changed back
in this case, i.e.,

total 4
-rw-r--r-- 1 root root 512 Jun 28 11:35 test

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G   56K  7.2G   1% /mnt

-rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G  5.1G  2.2G  70% /mnt

Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G  5.1G  2.2G  70% /mnt

With this fix, the truncated up space is back as:
Filesystem      Size  Used Avail Use% Mounted on
....
/dev/sdb1       8.0G   56K  7.2G   1% /mnt

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09 19:28:56 -04:00
Stefan Behrens
115930cb2d Btrfs: fix wrong write offset when replacing a device
Miao Xie reported the following issue:

The filesystem was corrupted after we did a device replace.

Steps to reproduce:
 # mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
 # mount <device0> <mnt>
 # btrfs replace start -rfB 1 <device4> <mnt>
 # umount <mnt>
 # btrfsck <device4>

The reason for the issue is that we changed the write offset by mistake,
introduced by commit 625f1c8dc.

We read the data from the source device at first, and then write the
data into the corresponding place of the new device. In order to
implement the "-r" option, the source location is remapped using
btrfs_map_block(). The read takes place on the mapped location, and
the write needs to take place on the unmapped location. Currently
the write is using the mapped location, and this commit changes it
back by undoing the change to the write address that the aforementioned
commit added by mistake.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19 15:07:26 -04:00
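
The read/write split described in 115930cb2d can be modelled as below. This is an assumption-laden sketch: `replace_copy_addresses` is a hypothetical name, and `map_block` is any callable standing in for the remapping step, not the real btrfs_map_block() signature.

```python
def replace_copy_addresses(physical: int, map_block):
    """Return (read_addr, write_addr) for one block copied during replace."""
    read_addr = map_block(physical)  # read from the remapped source location
    write_addr = physical            # write to the unmapped location
    return read_addr, write_addr
```

The bug being fixed corresponds to using `read_addr` for the write as well.
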
Josef Bacik
d29a9f629e Btrfs: re-add root to dead root list if we stop dropping it
If we stop dropping a root for whatever reason we need to add it back to the
dead root list so that we will re-start the dropping next transaction commit.
The other case this happens is if we recover a drop because we will add a root
without adding it to the fs radix tree, so we can leak its root and commit root
extent buffer, adding this to the dead root list makes this cleanup happen.
Thanks,

Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19 15:07:19 -04:00
Josef Bacik
fec386ac14 Btrfs: fix lock leak when resuming snapshot deletion
We aren't setting path->locks[level] when we resume a snapshot deletion which
means we won't unlock the buffer when we free the path.  This causes deadlocks
if we happen to re-allocate the block before we've evicted the extent buffer
from cache.  Thanks,

Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19 15:07:11 -04:00
Josef Bacik
3c8f242257 Btrfs: update drop progress before stopping snapshot dropping
Alex pointed out a problem and fix that exists in the drop one snapshot at a
time patch.  If we decide we need to exit for whatever reason (umount for
example) we will just exit the snapshot dropping without updating the drop
progress.  So the next time we go to resume we will BUG_ON() because we can't
find the extent we left off at because we never updated it.  This patch fixes
the problem.

Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-19 15:07:03 -04:00
Sachin Kamat
cd633972e1 Btrfs: volume: Replace PTR_RET with PTR_ERR_OR_ZERO
PTR_RET is now deprecated. Use PTR_ERR_OR_ZERO instead.

Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2013-07-16 16:06:02 +09:30
Rusty Russell
8c6ffba0ed PTR_RET is now PTR_ERR_OR_ZERO(): Replace most.
Sweep of the simple cases.

Cc: netdev@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-15 11:25:01 +09:30
Linus Torvalds
e3a0dd98e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "These are the usual mixture of bugs, cleanups and performance fixes.
  Miao has some really nice tuning of our crc code as well as our
  transaction commits.

  Josef is peeling off more and more problems related to early enospc,
  and has a number of important bug fixes in here too"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (81 commits)
  Btrfs: wait ordered range before doing direct io
  Btrfs: only do the tree_mod_log_free_eb if this is our last ref
  Btrfs: hold the tree mod lock in __tree_mod_log_rewind
  Btrfs: make backref walking code handle skinny metadata
  Btrfs: fix crash regarding to ulist_add_merge
  Btrfs: fix several potential problems in copy_nocow_pages_for_inode
  Btrfs: cleanup the code of copy_nocow_pages_for_inode()
  Btrfs: fix oops when recovering the file data by scrub function
  Btrfs: make the chunk allocator completely tree lockless
  Btrfs: cleanup orphaned root orphan item
  Btrfs: fix wrong mirror number tuning
  Btrfs: cleanup redundant code in btrfs_submit_direct()
  Btrfs: remove btrfs_sector_sum structure
  Btrfs: check if we can nocow if we don't have data space
  Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
  Btrfs: use a percpu to keep track of possibly pinned bytes
  Btrfs: check for actual acls rather than just xattrs when caching no acl
  Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
  Btrfs: optimize reada_for_balance
  Btrfs: optimize read_block_for_search
  ...
2013-07-09 12:33:09 -07:00
Linus Torvalds
80cc38b163 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  treewide: relase -> release
  Documentation/cgroups/memory.txt: fix stat file documentation
  sysctl/net.txt: delete reference to obsolete 2.4.x kernel
  spinlock_api_smp.h: fix preprocessor comments
  treewide: Fix typo in printk
  doc: device tree: clarify stuff in usage-model.txt.
  open firmware: "/aliasas" -> "/aliases"
  md: bcache: Fixed a typo with the word 'arithmetic'
  irq/generic-chip: fix a few kernel-doc entries
  frv: Convert use of typedef ctl_table to struct ctl_table
  sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
  doc: clk: Fix incorrect wording
  Documentation/arm/IXP4xx fix a typo
  Documentation/networking/ieee802154 fix a typo
  Documentation/DocBook/media/v4l fix a typo
  Documentation/video4linux/si476x.txt fix a typo
  Documentation/virtual/kvm/api.txt fix a typo
  Documentation/early-userspace/README fix a typo
  Documentation/video4linux/soc-camera.txt fix a typo
  lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
  ...
2013-07-04 11:40:58 -07:00
Linus Torvalds
790eac5640 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second set of VFS changes from Al Viro:
 "Assorted f_pos race fixes, making do_splice_direct() safe to call with
  i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
  ->d_hash/->d_compare calling conventions changes from Linus, misc
  stuff all over the place."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
  Document ->tmpfile()
  ext4: ->tmpfile() support
  vfs: export lseek_execute() to modules
  lseek_execute() doesn't need an inode passed to it
  block_dev: switch to fixed_size_llseek()
  cpqphp_sysfs: switch to fixed_size_llseek()
  tile-srom: switch to fixed_size_llseek()
  proc_powerpc: switch to fixed_size_llseek()
  ubi/cdev: switch to fixed_size_llseek()
  pci/proc: switch to fixed_size_llseek()
  isapnp: switch to fixed_size_llseek()
  lpfc: switch to fixed_size_llseek()
  locks: give the blocked_hash its own spinlock
  locks: add a new "lm_owner_key" lock operation
  locks: turn the blocked_list into a hashtable
  locks: convert fl_link to a hlist_node
  locks: avoid taking global lock if possible when waking up blocked waiters
  locks: protect most of the file_lock handling with i_lock
  locks: encapsulate the fl_link list handling
  locks: make "added" in __posix_lock_file a bool
  ...
2013-07-03 09:10:19 -07:00
Jie Liu
46a1c2c7ae vfs: export lseek_execute() to modules
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
similar things at ceph_llseek().

To reduce the duplication, this patch makes lseek_execute()
publicly accessible so that we can call it directly from the
underlying file systems.

Thanks Dave Chinner for this suggestion.

[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]

v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-07-03 16:23:27 +04:00
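
The behaviour attributed to lseek_execute() in 46a1c2c7ae, updating the file offset only if the requested one is valid, can be sketched as follows. The `File` class and `set_pos` are hypothetical, and the validity rule (0 <= offset <= maxsize) is an assumption rather than the kernel's exact check.

```python
import errno


class File:
    def __init__(self, pos: int = 0):
        self.pos = pos


def set_pos(f: File, offset: int, maxsize: int) -> int:
    """Move f.pos to offset if it is valid; return the offset or -EINVAL."""
    if offset < 0 or offset > maxsize:
        return -errno.EINVAL  # invalid request: leave the position untouched
    f.pos = offset
    return offset
```

A filesystem-specific SEEK_DATA/SEEK_HOLE handler would compute the target offset itself and then hand it to a helper like this one.
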
Linus Torvalds
9e239bb939 Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 update from Ted Ts'o:
 "Lots of bug fixes, cleanups and optimizations.  In the bug fixes
  category, of note is a fix for on-line resizing file systems where the
  block size is smaller than the page size (i.e., file systems with 1k blocks
  on x86, or more interestingly file systems with 4k blocks on Power or
  ia64 systems.)

  In the cleanup category, the ext4's punch hole implementation was
  significantly improved by Lukas Czerner, and now supports bigalloc
  file systems.  In addition, Jan Kara significantly cleaned up the
  write submission code path.  We also improved error checking and added
  a few sanity checks.

  In the optimizations category, two major optimizations deserve
  mention.  The first is that ext4_writepages() is now used for
  nodelalloc and ext3 compatibility mode.  This allows writes to be
  submitted much more efficiently as a single bio request, instead of
  being sent as individual 4k writes into the block layer (which then
  relied on the elevator code to coalesce the requests in the block
  queue).  Secondly, the extent cache shrink mechanism, which was
  introduced in 3.9, no longer has a scalability bottleneck caused by the
  i_es_lru spinlock.  Other optimizations include some changes to reduce
  CPU usage and to avoid issuing empty commits unnecessarily."

* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (86 commits)
  ext4: optimize starting extent in ext4_ext_rm_leaf()
  jbd2: invalidate handle if jbd2_journal_restart() fails
  ext4: translate flag bits to strings in tracepoints
  ext4: fix up error handling for mpage_map_and_submit_extent()
  jbd2: fix theoretical race in jbd2__journal_restart
  ext4: only zero partial blocks in ext4_zero_partial_blocks()
  ext4: check error return from ext4_write_inline_data_end()
  ext4: delete unnecessary C statements
  ext3,ext4: don't mess with dir_file->f_pos in htree_dirblock_to_tree()
  jbd2: move superblock checksum calculation to jbd2_write_superblock()
  ext4: pass inode pointer instead of file pointer to punch hole
  ext4: improve free space calculation for inline_data
  ext4: reduce object size when !CONFIG_PRINTK
  ext4: improve extent cache shrink mechanism to avoid to burn CPU time
  ext4: implement error handling of ext4_mb_new_preallocation()
  ext4: fix corruption when online resizing a fs with 1K block size
  ext4: delete unused variables
  ext4: return FIEMAP_EXTENT_UNKNOWN for delalloc extents
  jbd2: remove debug dependency on debug_fs and update Kconfig help text
  jbd2: use a single printk for jbd_debug()
  ...
2013-07-02 09:39:34 -07:00
Josef Bacik
0e267c44c3 Btrfs: wait ordered range before doing direct io
My recent truncate patch uncovered this bug, but I can reproduce it without the
truncate patch.  If you mount with -o compress-force, do a direct write to some
area, do a buffered write to some other area, and then do a direct read you will
get the wrong data for where you did the buffered write.  This is because the
generic direct io helpers only call filemap_write_and_wait once, and for
compression we need it twice.  So to be safe add the btrfs_wait_ordered_range to
the start of the direct io function to make sure any compressed writes have
truly been written.  This patch makes xfstests 130 pass when you mount with -o
compress-force=lzo.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:49 -04:00
Josef Bacik
7fb7d76f96 Btrfs: only do the tree_mod_log_free_eb if this is our last ref
There is another bug in the tree mod log stuff in that we're calling
tree_mod_log_free_eb every single time a block is cow'ed.  The problem with this
is that if this block is shared by multiple snapshots we will call this multiple
times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
in __tree_mod_log_rewind because we try to rewind a free twice.  We only want to
call tree_mod_log_free_eb if we are actually freeing the block.  With this patch
I no longer hit the panic in __tree_mod_log_rewind.  Thanks,

Cc: stable@vger.kernel.org
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:20 -04:00
Josef Bacik
f1ca7e98a6 Btrfs: hold the tree mod lock in __tree_mod_log_rewind
We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
forward in the tree mod entries, otherwise we'll end up with random entries and
trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
people were seeing when running

find /whatever -type f -exec btrfs fi defrag {} \;

Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:18 -04:00
Josef Bacik
261c84b662 Btrfs: make backref walking code handle skinny metadata
I missed fixing the backref stuff when I introduced the skinny metadata.  If you
try and do things like snapshot aware defrag with skinny metadata you are going
to see tons of warnings related to the backref count being less than 0.  This is
because the delayed refs will be found for stuff just fine, but it won't find
the skinny metadata extent refs.  With this patch I'm not seeing warnings
anymore.  Thanks,

Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:51:02 -04:00
Liu Bo
35f0399db6 Btrfs: fix crash regarding to ulist_add_merge
Several users reported this crash of NULL pointer or general protection,
the story is that we add a rbtree for speedup ulist iteration, and we
use krealloc() to address ulist growth, and krealloc() uses memcpy() to copy
old data to new memory area, so it's OK for an array as it doesn't use
pointers while it's not OK for a rbtree as it uses pointers.

So krealloc() will mess up our rbtree and it ends up with crash.

Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:59 -04:00
Miao Xie
edd1400be9 Btrfs: fix several potential problems in copy_nocow_pages_for_inode
- It makes no sense that we deal with an inode in the dead tree.
- fix the race between dio and page copy by waiting for the dio completion
- avoid the page copy vs truncate/punch hole
- check if the page is in the page cache or not

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:58 -04:00
Miao Xie
826aa0a82c Btrfs: cleanup the code of copy_nocow_pages_for_inode()
- It makes no sense that we continue to do something after the error
  happened, just go back with this patch.
- remove some check of copy_nocow_pages_for_inode(), such as page check
  after write, inode check in the end of the function, because we are
  sure they exist.
- remove the unnecessary goto in the return value check of the write

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:56 -04:00
Miao Xie
26b2589190 Btrfs: fix oops when recovering the file data by scrub function
We get oops while running btrfs replace start test,
------------[ cut here ]------------
kernel BUG at mm/filemap.c:608!
[SNIP]
Call Trace:
  [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs]
  [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
  [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
  [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs]
  [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs]
  [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0
  [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
  [<ffffffff8106f2f0>] kthread+0xc0/0xd0
  [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
  [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0
  [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
[SNIP]
 RIP  [<ffffffff8111f4c5>] unlock_page+0x35/0x40
  RSP <ffff88010316bb98>
 ---[ end trace 421e79ad0dd72c7d ]---

it is because we forgot to lock the page again after we read data to
the page. Fix it.

Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:55 -04:00
Josef Bacik
6df9a95e63 Btrfs: make the chunk allocator completely tree lockless
When adjusting the enospc rules for relocation I ran into a deadlock because we
were relocating the only system chunk and that forced us to try and allocate a
new system chunk while holding locks in the chunk tree, which caused us to
deadlock.  To fix this I've moved all of the dev extent addition and chunk
addition out to the delayed chunk completion stuff.  We still keep the in-memory
stuff which makes sure everything is consistent.

One change I had to make was to search the commit root of the device tree to
find a free dev extent, and hold onto any chunk em's that we allocated in that
transaction so we do not allocate the same dev extent twice.  This has the side
effect of fixing a bug with balance that has been there ever since balance
existed.  Basically you can free a block group and its dev extent and then
immediately allocate that dev extent for a new block group and write stuff to
that dev extent, all within the same transaction.  So if you happen to crash
during a balance you could come back to a completely broken file system.  This
patch should keep these sort of things from happening in the future since we
won't be able to allocate free'd dev extents until after the transaction
commits.  This has passed all of the xfstests and my super annoying stress test
followed by a balance.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:53 -04:00
Josef Bacik
68a7342c51 Btrfs: cleanup orphaned root orphan item
I hit a weird problem where my root item had been deleted but the orphan item had
not.  This isn't necessarily a problem, but it keeps the file system from being
mounted.  To fix this we just need to axe the orphan item if we can't find the
fs root when we're putting them all together.  With this patch I was able to
successfully mount my file system.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:52 -04:00
Miao Xie
a70c6172e7 Btrfs: fix wrong mirror number tuning
Now reading the data from the target device of the replace operation is allowed,
so the mirror number that is greater than the stripes number of a chunk is valid,
we will tune it when we find there is no target device later. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:50 -04:00
Miao Xie
e6da5d2ec9 Btrfs: cleanup redundant code in btrfs_submit_direct()
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:48 -04:00
Miao Xie
f51a4a1826 Btrfs: remove btrfs_sector_sum structure
Using the structure btrfs_sector_sum to keep the checksum value is
unnecessary, because the extents that btrfs_sector_sum points to are
continuous, we can find out the expected checksums by btrfs_ordered_sum's
bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
removing bytenr, there is only one member in the structure, so it makes
no sense to keep the structure, just remove it, and use a u32 array to
store the checksum value.

By this change, we don't use the while loop to get the checksums one by
one. Now, we can get several checksum values at one time, which improved the
performance by ~74% on my SSD (31MB/s -> 54MB/s).

test command:
 # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:47 -04:00
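
The lookup that f51a4a1826 describes, finding a checksum from the ordered sum's bytenr and an offset into a flat array, can be sketched like this. The function names and the 4096-byte sector size are assumptions for illustration; the last line simply re-derives the quoted ~74% from the two throughput figures.

```python
def checksum_index(bytenr: int, sums_bytenr: int, sectorsize: int = 4096) -> int:
    """Index of the checksum covering bytenr within a contiguous run."""
    # The run is contiguous, so the index is the sector distance from its start.
    return (bytenr - sums_bytenr) // sectorsize


def expected_checksum(sums, sums_bytenr: int, bytenr: int, sectorsize: int = 4096):
    """Look up the checksum for bytenr in a plain list of u32 values."""
    return sums[checksum_index(bytenr, sums_bytenr, sectorsize)]


# Throughput figures quoted in the commit message: 31MB/s -> 54MB/s.
speedup_percent = (54 - 31) / 31 * 100
```

Because consecutive checksums sit next to each other in the array, a slice such as `sums[i:j]` fetches several at once, which matches the "several checksum values at one time" remark.
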
Josef Bacik
7ee9e4405f Btrfs: check if we can nocow if we don't have data space
We always just try and reserve data space when we write, but if we are out of
space but have prealloc'ed extents we should still successfully write.  This
patch will try and see if we can write to prealloc'ed space and if we can go
ahead and allow the write to continue.  With this patch we now pass xfstests
generic/274.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:45 -04:00
Josef Bacik
925a6efb8f Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc
try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which
is completely fraking useless for us as we need to make sure pages are actually
written before we go and check if there are ordered extents.  So replace this
with an open coding of try_to_writeback_inodes_sb_nr minus the writeback
underway check so that we are sure to actually have flushed some dirty pages out
and will have ordered extents to use.  With this patch xfstests generic/273 now
passes.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:43 -04:00
Josef Bacik
b150a4f10d Btrfs: use a percpu to keep track of possibly pinned bytes
There are all of these checks in the ENOSPC code to see if committing the
transaction would free up enough space to make the allocation.  This is because
early on we just committed the transaction and hoped and prayed, which resulted
in cases where it took _forever_ to get an ENOSPC when we really were out of
space.  So we check space_info->bytes_pinned, except this isn't completely true
because it doesn't account for space we may free but are stuck in delayed refs.
So tests like xfstests 226 would fail because we wouldn't commit the transaction
to free up the data space.  So instead add a percpu counter that will be a
little fuzzier, it will add bytes as soon as we try to free up the space, and
remove any space it doesn't actually free up when we get around to doing the
actual free.  We then 0 out this counter every transaction period so we have a
better idea of how much space we will actually free up by committing this
transaction.  With this patch we now pass xfstests 226.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:42 -04:00
Josef Bacik
f23b5a5995 Btrfs: check for actual acls rather than just xattrs when caching no acl
We have an optimization that will go ahead and cache no acls on an inode if
there are no xattrs on the inode.  This saves us a lookup later to check the
acls for writes or any other access.  The problem is I use selinux so I always
have an xattr on inodes, so make this test a little smarter and check for the
actual acl hash on the key and if it isn't there then we still get to cache no
acl which makes everybody who uses selinux a little happier.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-02 11:50:40 -04:00
Josef Bacik
a71754fc68 Btrfs: move btrfs_truncate_page to btrfs_cont_expand instead of btrfs_truncate
This has plagued us forever and I'm so over working around it.  When we truncate
down to a non-page aligned offset we will call btrfs_truncate_page to zero out
the end of the page and write it back to disk, this will keep us from exposing
stale data if we truncate back up from that point.  The problem with this is it
requires data space to do this, and people don't really expect to get ENOSPC
from truncate() for these sort of things.  This also tends to bite the orphan
cleanup stuff too which keeps people from mounting.  To get around this we can
just move this into btrfs_cont_expand() to make sure if we are truncating up
from a non-page size aligned i_size we will zero out the rest of this page so
that we don't expose stale data.  This will give ENOSPC if you try to truncate()
up or if you try to write past the end of isize, which is much more reasonable.
This fixes xfstests generic/083 failing to mount because of the orphan cleanup
failing.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:33 -04:00
Josef Bacik
0b08851fda Btrfs: optimize reada_for_balance
This patch does two things.  First we no longer explicitly read in the blocks
we're trying to readahead.  For things like balance_level we may never actually
use the blocks so this just adds unneeded latency, and balance_level and
split_node will both read in the blocks they care about explicitly so if the
blocks need to be waited on it will be done there.  Secondly we no longer drop
the path if we do readahead, we just set the path blocking before we call
reada_for_balance() and then we're good to go.  Hopefully this will cut down on
the number of re-searches.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:32 -04:00
Josef Bacik
bdf7c00e8f Btrfs: optimize read_block_for_search
This patch does two things, first it only does one call to
btrfs_buffer_uptodate() with the gen specified instead of once with 0 and then
again with gen specified.  The other thing is to call btrfs_read_buffer() on the
buffer we've found instead of dropping it and then calling read_tree_block().
This will keep us from doing yet another radix tree lookup for a buffer we've
already found.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:32 -04:00
Josef Bacik
fdf8e2ea3c Btrfs: unlock extent range on enospc in compressed submit
A user reported a deadlock where the async submit thread was blocked on the
lock_extent() lock, and then everybody behind him was locked on the page lock
for the page he was holding.  Looking at the code I noticed we do not unlock the
extent range when we get ENOSPC and goto retry.  This is bad because we
immediately try to lock that range again to do the cow, which will cause a
deadlock.  Fix this by unlocking the range.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:31 -04:00
Wang Sheng-Hui
90b6d2830a Btrfs: fix the comment typo for btrfs_attach_transaction_barrier
The comment is for btrfs_attach_transaction_barrier, not for
btrfs_attach_transaction. Fix the typo.

Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:30 -04:00
Josef Bacik
aee68ee5f5 Btrfs: fix not being able to find skinny extents during relocate
We unconditionally search for the EXTENT_ITEM_KEY for metadata during balance,
and then check the key that we found to see if it is actually a
METADATA_ITEM_KEY, but this doesn't work right because METADATA is a higher key
value, so if what we are looking for happens to be the first item in the leaf
the search will dump us out at the previous leaf, and we won't find our item.
So instead do what we do everywhere else, search for the skinny extent first and
if we don't find it go back and re-search for the extent item.  This patch fixes
the panic I was hitting when balancing a large file system with skinny extents.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:30 -04:00
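The ordering problem described in aee68ee5f5 can be illustrated with a small key comparison. This is a sketch under assumptions: `key_sketch` and `key_cmp` are hypothetical names, and the numeric type values (168 and 169) are assumed from the btrfs on-disk format rather than taken from the commit message, which only says METADATA is the higher key value.

```c
#include <stdint.h>

/* Assumed key type values; the commit only states METADATA is higher. */
#define EXTENT_ITEM_KEY   168
#define METADATA_ITEM_KEY 169

/* Hypothetical simplified key: compared by objectid, then type, then offset. */
struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Returns <0, 0 or >0 if a sorts before, equal to, or after b. */
static int key_cmp(const struct key_sketch *a, const struct key_sketch *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}
```

With the same objectid, an EXTENT_ITEM_KEY search key always sorts before the METADATA_ITEM_KEY item, so the search lands just before the item. If the item is the first one in its leaf, "just before" is the end of the previous leaf, which is the failure the commit describes.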
Josef Bacik
da61d31a78 Btrfs: cleanup backref search commit root flag stuff
Looking into this backref problem I noticed we're using a macro to do what turns
out to essentially be a NULL check to see if we need to search the commit root.
I'm killing this, let's just do what everybody else does and check if trans ==
NULL.  I've also made it so we pass in the path to __resolve_indirect_refs which
will have the search_commit_root flag set properly already and that way we can
avoid allocating another path when we have a perfectly good one to use.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:29 -04:00
Josef Bacik
d88d46c6e0 Btrfs: free csums when we're done scrubbing an extent
A user reported scrub taking up an unreasonable amount of ram as it ran.  This
is because we lookup the csums for the extent we're scrubbing but don't free it
up until after we're done with the scrub, which means we can take up a whole lot
of ram.  This patch fixes this by dropping the csums once we're done with the
extent we've scrubbed.  The user reported this to fix their problem.  Thanks,

Reported-and-tested-by: Remco Hosman <remco@hosman.xs4all.nl>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:28 -04:00
Josef Bacik
1be41b78bc Btrfs: fix transaction throttling for delayed refs
Dave has this fs_mark script that can make btrfs abort with sufficient amount of
ram.  This is because with more ram we can keep more dirty metadata in cache
which in a roundabout way makes for many more pending delayed refs.  What
happens is we end up not throttling the transaction enough so when we go to
commit the transaction when we've completely filled the file system we'll
abort() because we use all of the space in the global reserve and we still have
delayed refs to run.  To fix this we need to make the delayed ref flushing and
the transaction throttling dependent upon the number of delayed refs that we
have instead of how much reserved space is left in the global reserve.  With
this patch we not only stop aborting transactions but we also get a smoother run
speed with fs_mark and it makes us about 10% faster.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:28 -04:00
Josef Bacik
501407aab8 Btrfs: stop waiting on current trans if we aborted
I hit a hang when run_delayed_refs returned an error in the beginning of
btrfs_commit_transaction.  If we decide we need to commit the transaction in
btrfs_end_transaction we'll set BLOCKED and start to commit, but if we get an
error this early on we'll just exit without committing.  This is fine, except
that anybody else who tried to start a transaction will sit in
wait_current_trans() since we're set to BLOCKED and we never set it to something
else and woke people up.  To fix this we want to check for trans->aborted
everywhere we wait for the transaction state to change, and make
btrfs_abort_transaction() wake up any waiters there may be.  All the callers
will notice that the transaction has aborted and exit out properly.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:27 -04:00
Josef Bacik
f971fe29b1 Btrfs: wake up delayed ref flushing waiters on abort
I hit a deadlock because we aborted when flushing delayed refs but didn't wake
any of the other flushers up and so everybody was just sleeping forever.  This
should fix the problem.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:26 -04:00
Jie Liu
3fb4037599 btrfs: fix the code comments for LZO compression workspace
Fix the code comments for lzo compression workspace.
The buf item is used to store the decompressed data
and cbuf is used to store the compressed data.

Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:26 -04:00
Miao Xie
5bc7247ac4 Btrfs: fix broken nocow after balance
Balance will create reloc_root for each fs root, and it's going to
record last_snapshot to filter shared blocks.  The side effect of
setting last_snapshot is to break nocow attributes of files.

Since the extents are not shared by the relocation tree after the balance,
we can safely restore the old last_snapshot if no one snapshotted the
source tree. This fixes the problem above.

Reported-by: Kyle Gates <kylegates@hotmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-07-01 08:52:25 -04:00
Al Viro
6d0379ec49 btrfs: more open-coded file_inode()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:24 +04:00
Al Viro
9cdda8d31f [readdir] convert btrfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:57:00 +04:00
Josef Bacik
8c2a1a3028 Btrfs: exclude logged extents before replying when we are mixed
With non-mixed block groups we replay the logs before we're allowed to do any
writes, so we get away with not pinning/removing the data extents until right
when we replay them.  However with mixed block groups we allocate out of the
same pool, so we could easily allocate a metadata block that was logged in our
tree log.  To deal with this we just need to notice that we have mixed block
groups and do the normal excluding/removal dance during the pin stage of the log
replay and that way we don't allocate metadata blocks from areas we have logged
data extents.  With this patch we now pass xfstests generic/311 with mixed
block groups turned on.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:17 -04:00
Josef Bacik
01cd33674e Btrfs: put our inode if orphan cleanup fails
When we cross into a different subvol when doing a lookup we will run the orphan
cleanup.  If this fails however we do not drop the ref to the inode we were
looking up before we return an error, which leads to busy inodes on umount.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:16 -04:00
Josef Bacik
c69b26b011 Btrfs: add some missing iput()'s in btrfs_orphan_cleanup
There are some error cases that we don't do an iput() on our inode, fix this.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:15 -04:00
Josef Bacik
e78417d192 Btrfs: do not pin while under spin lock
When testing a corrupted fs I noticed I was getting sleep while atomic errors
when the transaction aborted.  This is because btrfs_pin_extent may need to
allocate memory and we are calling this under the spin lock.  Fix this by moving
it out and doing the pin after dropping the spin lock but before dropping the
mutex, the same way it works when delayed refs run normally.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:13 -04:00
Thomas Meyer
a5959bc0a1 Btrfs: Cocci spatch "memdup.spatch"
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:12 -04:00
Thomas Meyer
97a184fe81 Btrfs: Cocci spatch "ptr_ret.spatch"
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:11 -04:00
Jan Schmidt
b382a324b6 Btrfs: fix qgroup rescan resume on mount
When called during mount, we cannot start the rescan worker thread until
open_ctree is done. This commit restructures the qgroup rescan internals to
enable a clean deferral of the rescan resume operation.

First of all, the struct qgroup_rescan is removed, saving us a malloc and
some initialization synchronization problems. Its only element (the worker
struct) now lives within fs_info just like the rest of the rescan code.

Then setting up a rescan worker is split into several reusable stages.
Currently we have three different rescan startup scenarios:
	(A) rescan ioctl
	(B) rescan resume by mount
	(C) rescan by quota enable

Each case needs its own combination of the four following steps:
	(1) set the progress [A, C: zero; B: state of umount]
	(2) commit the transaction [A]
	(3) set the counters [A, C: zero; B: state of umount]
	(4) start worker [A, B, C]

qgroup_rescan_init does step (1). There's no extra function added to commit
a transaction, we've got that already. qgroup_rescan_zero_tracking does
step (3). Step (4) is nothing more than a call to the generic
btrfs_queue_worker.

We also get rid of a double check for the rescan progress during
btrfs_qgroup_account_ref, which is no longer required due to having step 2
from the list above.

As a side effect, this commit prepares to move the rescan start code from
btrfs_run_qgroups (which is run during commit) to a less time critical
section.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:10 -04:00
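The scenario/step matrix in b382a324b6 can be written down as a small table. This is only a sketch of the combinations listed in the commit message; the enum names, `rescan_steps` and the two helpers are hypothetical and do not appear in the kernel.

```c
#include <stdbool.h>

/* The three startup scenarios (A), (B), (C) from the commit message. */
enum rescan_scenario {
	RESCAN_BY_IOCTL,        /* (A) */
	RESCAN_RESUME_ON_MOUNT, /* (B) */
	RESCAN_BY_QUOTA_ENABLE, /* (C) */
	RESCAN_NR_SCENARIOS,
};

/* The four steps (1)..(4), one bit each. */
enum rescan_step {
	STEP_SET_PROGRESS = 1 << 0, /* (1) */
	STEP_COMMIT_TRANS = 1 << 1, /* (2) */
	STEP_SET_COUNTERS = 1 << 2, /* (3) */
	STEP_START_WORKER = 1 << 3, /* (4) */
};

/* Which steps each scenario needs, as listed in the commit message. */
static const unsigned int rescan_steps[RESCAN_NR_SCENARIOS] = {
	[RESCAN_BY_IOCTL]        = STEP_SET_PROGRESS | STEP_COMMIT_TRANS |
				   STEP_SET_COUNTERS | STEP_START_WORKER,
	[RESCAN_RESUME_ON_MOUNT] = STEP_SET_PROGRESS | STEP_SET_COUNTERS |
				   STEP_START_WORKER,
	[RESCAN_BY_QUOTA_ENABLE] = STEP_SET_PROGRESS | STEP_SET_COUNTERS |
				   STEP_START_WORKER,
};

static bool needs_step(enum rescan_scenario s, enum rescan_step step)
{
	return (rescan_steps[s] & step) != 0;
}

/* (A) and (C) start progress and counters from zero; (B) restores the
 * state saved at umount. */
static bool starts_from_zero(enum rescan_scenario s)
{
	return s != RESCAN_RESUME_ON_MOUNT;
}
```

Only the ioctl path needs the transaction commit, which matches step (2) being marked [A] alone.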
Jan Schmidt
eb1716af88 Btrfs: avoid double free of fs_info->qgroup_ulist
When btrfs_read_qgroup_config or btrfs_quota_enable return non-zero, we've
already freed the fs_info->qgroup_ulist. The final btrfs_free_qgroup_config
called from quota_disable makes another ulist_free(fs_info->qgroup_ulist)
call.

We set fs_info->qgroup_ulist to NULL on the mentioned error paths, turning
the ulist_free in btrfs_free_qgroup_config into a noop.

Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:08 -04:00
Jan Schmidt
4373519db4 Btrfs: fix memory patcher through fs_info->qgroup_ulist
Commit 5b7c665e introduced fs_info->qgroup_ulist, that is allocated during
btrfs_read_qgroup_config and meant to be used later by the qgroup accounting
code. However, it is always freed before btrfs_read_qgroup_config returns,
because the commit mentioned above adds a check for (ret), where a check
for (ret < 0) would have been the right choice. This commit fixes the check.

Cc: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:07 -04:00
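The difference between checking (ret) and (ret < 0), together with the NULL assignment from eb1716af88, can be sketched like this. It is a simplified user-space illustration: `fs_info_sketch` and `read_config_sketch` are hypothetical, and it assumes a scan helper that returns a positive value for a non-error condition such as "no more items".

```c
#include <stdlib.h>

/* Hypothetical stand-in for the part of fs_info that owns the ulist. */
struct fs_info_sketch {
	int *qgroup_ulist;
};

/* scan_result models the helper's return value (an assumption):
 *   <0 error, 0 success, >0 a non-error condition. */
static int read_config_sketch(struct fs_info_sketch *fs, int scan_result)
{
	int ret = scan_result;

	fs->qgroup_ulist = malloc(sizeof(*fs->qgroup_ulist));
	if (!fs->qgroup_ulist)
		return -12; /* -ENOMEM */

	/* The buggy form was "if (ret)", which also freed the list when
	 * ret was positive, so the list never survived a successful read. */
	if (ret < 0) {
		free(fs->qgroup_ulist);
		/* Clear the pointer so a later free is a no-op. */
		fs->qgroup_ulist = NULL;
	}
	return ret < 0 ? ret : 0;
}
```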
Josef Bacik
d52be818e6 Btrfs: simplify unlink reservations
Dave pointed out a problem where if you filled up a file system as much as
possible you couldn't remove any files.  The whole unlink reservation thing is
convoluted because it tries to guess if it's going to add space to unlink
something or not, and has all these odd uncommented cases where it simply does
not try.  So to fix this I've added a way to conditionally steal from the global
reserve if we can't make our normal reservation.  If we have more than half the
space in the global reserve free we will go ahead and steal from the global
reserve.  With this patch Dave's reproducer now works and I can rm all the files
on the file system.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:06 -04:00
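One reading of the "more than half free" rule in d52be818e6, sketched in plain C. The names `block_rsv_sketch` and `cond_steal_from_global` are hypothetical, and the exact threshold handling is an assumption: here a steal is allowed only if at least half of the reserve's size remains afterwards.

```c
#include <stdint.h>

/* Hypothetical stand-in for the global reserve. */
struct block_rsv_sketch {
	uint64_t size;     /* target size of the reserve */
	uint64_t reserved; /* bytes currently held in the reserve */
};

/* Take num_bytes from the global reserve only if at least half of its
 * size is still held afterwards. Returns 0 on success, -1 otherwise. */
static int cond_steal_from_global(struct block_rsv_sketch *global,
				  uint64_t num_bytes)
{
	uint64_t min_bytes = global->size / 2;

	if (global->reserved < min_bytes + num_bytes)
		return -1; /* would dip below half: report no space */
	global->reserved -= num_bytes;
	return 0;
}
```

The point of the threshold is that unlink can make progress on a full file system without draining the reserve that the commit path itself depends on.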
Miao Xie
c6adc9cc08 Btrfs: merge pending IO for tree log write back
Before applying this patch, we flushed the log tree of the fs/file
tree first, and then flushed the log root tree. This is inefficient,
especially on hard disks. This patch addresses the problem by wrapping
both flushes in the same blk_plug.

In testing, sync write performance went up ~60% (2.9MB/s -> 4.6MB/s)
on my SCSI disk with its disk buffer enabled.

Test step:
 # mkfs.btrfs -f -m single <disk>
 # mount <disk> <mnt>
 # dd if=/dev/zero of=<mnt>/file0 bs=32K count=1024 oflag=sync

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:05 -04:00
Liu Bo
a96fbc7288 Btrfs: allow file data clone within a file
We did not allow file data clone within the same file because of
deadlock issues.

However, we now use nested lock to avoid deadlock between the
parent directory and the child file.

So it's safe to do file clone within the same file when the two
ranges do not overlap.

Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:03 -04:00
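The non-overlap condition in a96fbc7288 amounts to a standard interval check. A minimal sketch, assuming both ranges have the same length and that the sums do not overflow; `ranges_overlap` is a hypothetical helper, not the kernel's function.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if [src, src + len) and [dst, dst + len) share at least one byte.
 * Assumes src + len and dst + len do not overflow. */
static bool ranges_overlap(uint64_t src, uint64_t dst, uint64_t len)
{
	return src < dst + len && dst < src + len;
}
```

A same-file clone would be refused when this returns true and allowed otherwise; ranges that merely touch end to start do not count as overlapping.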
Liu Bo
b7394eb91c Btrfs: remove unused code in btrfs_del_root
'leaf' and 'ri' are not used.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:02 -04:00
Liu Bo
2da1c669f0 Btrfs: kill replicate code in replay_one_buffer
EXTREF is treated same as REF, so we can make the code tidy.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:30:01 -04:00
Liu Bo
33157e05db Btrfs: check if leaf's parent exists before pushing items around
When splitting a leaf, pushing items around to hopefully get some space only
works when we have a parent, i.e. we have at least one sibling leaf.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:58 -04:00
Liu Bo
fdd99c7294 Btrfs: dont do log_removal in insert_new_root
As for splitting a leaf, root is just the leaf, and tree mod log does not apply
on leaf, so in this case, we don't do log_removal.

As for splitting a node, the old root is kept as a normal node and we have nicely
put records in tree mod log for moving keys and items, so in this case we don't do
that either.

As above, insert_new_root can get rid of log_removal.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:57 -04:00
Wei Yongjun
4b286cd1f5 Btrfs: return error code in btrfs_check_trunc_cache_free_space()
Return the error code instead of always returning 0 from
btrfs_check_trunc_cache_free_space().
Introduced by commit 7b61cd9224
(Btrfs: don't use global block reservation for inode cache truncation)

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:56 -04:00
Josef Bacik
139f807a1e Btrfs: fix estale with btrfs send
This fixes bugzilla 57491.  If we take a snapshot of a fs with an unlink ongoing
and then try to send that root we will run into problems.  When comparing with a
parent root we will search the parent's and the send root's commit_root, which if
we've just created the snapshot will include the file that needs to be evicted
by the orphan cleanup.  So when we find a changed extent we will try and copy
that info into the send stream, but when we lookup the inode we use the normal
root, which no longer has the inode because the orphan cleanup deleted it.  The
best solution I have for this is to check our otransid with the generation of
the commit root and if they match just commit the transaction again, that way we
get the changes from the orphan cleanup.  With this patch the reproducer I made
for this bugzilla no longer returns ESTALE when trying to do the send.  Thanks,

Cc: stable@vger.kernel.org
Reported-by: Chris Wilson <jakdaw@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:55 -04:00
Anand Jain
183860f6a0 btrfs: device delete to get errors from the kernel
When the user runs btrfs dev del, any RAID requirement error goes to
/var/log/messages. It is not a good idea to clutter the system messages
with these user (knowledge) errors, and the user should not have to review
the system messages to learn what went wrong with the CLI; the error should
be returned to the user as part of the CLI result.

To provide this, add a set of BTRFS_ERROR_DEV* error codes and their
error strings.

I expect other errors that we might want to communicate to userland to be
added to this enum.

v3:
moved the code within the file, no logical change

v1->v2:
introduce error codes for the device mgmt usage

v1:
adds a parameter in the ioctl arg struct to carry the error string

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:53 -04:00
Josef Bacik
c73e293678 Btrfs: do delay iput in sync_fs
We get lock inversion with umount if we allow iputs from sync_fs, so use the
delay iput flag to keep this from happening.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:52 -04:00
Miao Xie
4a9d8bdee3 Btrfs: make the state of the transaction more readable
We used 3 variables to track the state of the transaction, which was complex
and wasted memory. Besides that, it was hard to understand which types of
transaction handles should be blocked in each transaction state, so
developers often made mistakes.

This patch addresses the above problems. In this patch, we define 6 states
for the transaction,
  enum btrfs_trans_state {
	TRANS_STATE_RUNNING		= 0,
	TRANS_STATE_BLOCKED		= 1,
	TRANS_STATE_COMMIT_START	= 2,
	TRANS_STATE_COMMIT_DOING	= 3,
	TRANS_STATE_UNBLOCKED		= 4,
	TRANS_STATE_COMPLETED		= 5,
	TRANS_STATE_MAX			= 6,
  };
and use just 1 variable to track the state.

To make the handle types blocked in each state clearer,
we introduce an array:
  unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
	[TRANS_STATE_RUNNING]		= 0U,
	[TRANS_STATE_BLOCKED]		= (__TRANS_USERSPACE |
					   __TRANS_START),
	[TRANS_STATE_COMMIT_START]	= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH),
	[TRANS_STATE_COMMIT_DOING]	= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN),
	[TRANS_STATE_UNBLOCKED]		= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN |
					   __TRANS_JOIN_NOLOCK),
	[TRANS_STATE_COMPLETED]		= (__TRANS_USERSPACE |
					   __TRANS_START |
					   __TRANS_ATTACH |
					   __TRANS_JOIN |
					   __TRANS_JOIN_NOLOCK),
  };
It is very intuitive.

Besides that, because we remove ->in_commit from the transaction structure,
the lock ->commit_lock that was used to protect it is unnecessary, so remove
->commit_lock.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:51 -04:00
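How the table in 4a9d8bdee3 might be consulted, as a self-contained sketch. The state enum and the table contents follow the commit message; the flag bit values are assumptions (the message does not give them), the flags are written without the leading underscores to stay clear of reserved identifiers, and `handle_type_blocked` is a hypothetical helper.

```c
#include <stdbool.h>

/* Assumed bit values for the handle types; not given in the commit. */
#define TRANS_USERSPACE   (1U << 8)
#define TRANS_START       (1U << 9)
#define TRANS_ATTACH      (1U << 10)
#define TRANS_JOIN        (1U << 11)
#define TRANS_JOIN_NOLOCK (1U << 12)

enum btrfs_trans_state {
	TRANS_STATE_RUNNING      = 0,
	TRANS_STATE_BLOCKED      = 1,
	TRANS_STATE_COMMIT_START = 2,
	TRANS_STATE_COMMIT_DOING = 3,
	TRANS_STATE_UNBLOCKED    = 4,
	TRANS_STATE_COMPLETED    = 5,
	TRANS_STATE_MAX          = 6,
};

/* Handle types that may not join a transaction in each state. */
static const unsigned int blocked_trans_types[TRANS_STATE_MAX] = {
	[TRANS_STATE_RUNNING]      = 0U,
	[TRANS_STATE_BLOCKED]      = TRANS_USERSPACE | TRANS_START,
	[TRANS_STATE_COMMIT_START] = TRANS_USERSPACE | TRANS_START |
				     TRANS_ATTACH,
	[TRANS_STATE_COMMIT_DOING] = TRANS_USERSPACE | TRANS_START |
				     TRANS_ATTACH | TRANS_JOIN,
	[TRANS_STATE_UNBLOCKED]    = TRANS_USERSPACE | TRANS_START |
				     TRANS_ATTACH | TRANS_JOIN |
				     TRANS_JOIN_NOLOCK,
	[TRANS_STATE_COMPLETED]    = TRANS_USERSPACE | TRANS_START |
				     TRANS_ATTACH | TRANS_JOIN |
				     TRANS_JOIN_NOLOCK,
};

/* True if a handle of the given type is blocked in the given state. */
static bool handle_type_blocked(enum btrfs_trans_state state,
				unsigned int type)
{
	return (blocked_trans_types[state] & type) != 0;
}
```

A single table lookup replaces reasoning about several separate flags, which is the readability gain the commit is after.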
Miao Xie
581227d0d2 Btrfs: remove the time check in btrfs_commit_transaction()
We checked the commit time to avoid committing the transaction
frequently, but it is unnecessary because:
- It made the transaction commit spend more time, and delayed the
  operation of the external writers (TRANS_START/TRANS_USERSPACE).
- Except in the cases where we have to commit the transaction, such as
  snapshot creation, btrfs doesn't commit the transaction on its
  own initiative.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:50 -04:00
Miao Xie
3f1e3fa65c Btrfs: remove unnecessary varient ->num_joined in btrfs_transaction structure
We used ->num_joined to track whether any writers joined the current
transaction while the committer was sleeping. If some writers joined the current
transaction, we had to continue the while loop to do some necessary stuff, such
as flushing the ordered operations. But that is unnecessary because we will do it
after the while loop.

Besides that, tracking ->num_joined would keep the committer in the while
loop when there are lots of internal writers (TRANS_JOIN).

So we remove ->num_joined and no longer track whether any writers join
the current transaction while the committer is sleeping.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:48 -04:00
Miao Xie
824366177a Btrfs: don't flush the delalloc inodes in the while loop if flushoncommit is set
It is unnecessary to flush the delalloc inodes again and again because
we don't care about the dirty pages that are introduced after the flush;
they will be flushed in the transaction commit.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:47 -04:00
Miao Xie
0860adfdb2 Btrfs: don't wait for all the writers circularly during the transaction commit
btrfs_commit_transaction has the following loop before we commit the
transaction.

do {
    // attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
	 (should_grow && cur_trans->num_joined != joined));

This is used to prevent TRANS_START from getting in the way of a
committing transaction. But it does not prevent TRANS_JOIN, that
is, we would stay in this loop for a long time if some writers JOIN the
current transaction endlessly.

Because we need to join the current transaction to do some useful stuff,
we cannot block TRANS_JOIN here. So we introduce an external writer
counter, which is used to count the TRANS_USERSPACE/TRANS_START writers.
If the external writer counter is zero, we can break out of the above loop.

To make the code clearer, we don't use an enum
to define the type of the transaction handle; we use a bitmask instead.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:46 -04:00
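The external writer counter from 0860adfdb2 can be sketched as follows. This is a single-threaded illustration only: the struct, the helper names and the flag bit values are hypothetical, and the real code would need atomic counters.

```c
#include <stdbool.h>

/* Assumed bit values for the handle types; not given in the commit. */
#define TRANS_USERSPACE  (1U << 8)
#define TRANS_START      (1U << 9)
#define TRANS_JOIN       (1U << 11)
/* Writers that come from outside the commit path. */
#define TRANS_EXTWRITERS (TRANS_USERSPACE | TRANS_START)

/* Hypothetical transaction with two writer counters. */
struct trans_sketch {
	int num_writers;    /* every writer, internal or external */
	int num_extwriters; /* only TRANS_USERSPACE/TRANS_START writers */
};

static void writer_join(struct trans_sketch *t, unsigned int type)
{
	t->num_writers++;
	if (type & TRANS_EXTWRITERS)
		t->num_extwriters++;
}

static void writer_leave(struct trans_sketch *t, unsigned int type)
{
	t->num_writers--;
	if (type & TRANS_EXTWRITERS)
		t->num_extwriters--;
}

/* The committer keeps looping only while external writers remain;
 * endless TRANS_JOIN traffic no longer holds it in the loop. */
static bool committer_must_wait(const struct trans_sketch *t)
{
	return t->num_extwriters > 0;
}
```

Using a bitmask for the handle type is what makes the `type & TRANS_EXTWRITERS` test a single operation.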
Miao Xie
25d8c284c7 Btrfs: remove the code for the impossible case in cleanup_transaction()
If the transaction is removed from the transaction list, it means the
transaction has been committed successfully. So it is impossible to
call cleanup_transaction(), otherwise there is something wrong with
the code logic. Thus, we use BUG_ON() instead of the original handling.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:45 -04:00
Miao Xie
ac6738792f Btrfs: cleanup unnecessary assignment when cleaning up all the residual transaction
When we umount a fs with serious errors, we will invoke btrfs_cleanup_transactions()
to clean up the residual transaction. At this time, it is impossible to start a new
transaction, so we needn't assign trans_no_join to 1, and also needn't clear running
transaction every time we destroy a residual transaction.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:44 -04:00
Miao Xie
6a03843df4 Btrfs: just flush the delalloc inodes in the source tree before snapshot creation
Before applying this patch, we needed to flush all the delalloc inodes in
the fs when we wanted to create a snapshot. This wasted time and blocked
the transaction commit for a long time, which means other
user operations would also be blocked for a long time.

This patch improves this: we just flush the delalloc inodes
in the source trees before snapshot creation, so the transaction commit
will complete quickly.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:42 -04:00
Miao Xie
199c2a9c3d Btrfs: introduce per-subvolume ordered extent list
The reason we introduce per-subvolume ordered extent list is the same
as the per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:41 -04:00
Miao Xie
eb73c1b7ce Btrfs: introduce per-subvolume delalloc inode list
When we create a snapshot, we do not need to flush all the delalloc inodes
in the fs; just flushing the inodes in the source tree is OK. So we introduce
a per-subvolume delalloc inode list.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:40 -04:00
Miao Xie
b0feb9d96e Btrfs: introduce grab/put functions for the root of the fs/file tree
The grab/put functions will be used in the next patch, which needs to grab
the root object and ensure it is not freed. We use a reference counter
instead of the srcu lock to avoid blocking the memory reclaim task,
which invokes synchronize_srcu().

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:38 -04:00
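A grab/put pair built on a reference counter might look like the sketch below. It uses C11 atomics in user space; `root_sketch`, `grab_root` and `put_root` are hypothetical names, and the rule that a grab fails once the count has reached zero is an assumption, not something the commit message states.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical root object that carries only its reference count. */
struct root_sketch {
	atomic_int refs;
};

/* Take a reference unless the count has already dropped to zero.
 * Returns the root on success, NULL if it is being torn down. */
static struct root_sketch *grab_root(struct root_sketch *root)
{
	int old = atomic_load(&root->refs);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&root->refs, &old, old + 1))
			return root;
	}
	return NULL;
}

/* Drop a reference. Returns true when the caller released the last
 * one and is responsible for freeing the root. */
static bool put_root(struct root_sketch *root)
{
	return atomic_fetch_sub(&root->refs, 1) == 1;
}
```

Holding a counted reference never requires the reclaim path to wait, which is the stated reason for preferring it over the srcu lock here.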
Miao Xie
cb517eabba Btrfs: cleanup the similar code of the fs root read
There are several functions whose code is similar, such as
  btrfs_find_last_root()
  btrfs_read_fs_root_no_radix()

Besides that, some functions are invoked twice, which is unnecessary.
For example, we are sure that all roots which are found in
  btrfs_find_orphan_roots()
have their orphan items, so it is unnecessary to check the orphan
item again.

So clean it up.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:37 -04:00
Miao Xie
babbf170c7 Btrfs: make the snap/subv deletion end more early when the fs is R/O
Snapshot/subvolume deletion might take a lot of time, which would make
the remount task wait for a long time. This patch improves this:
we break out of the deletion if the fs is remounted R/O. It will make
the users happy.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:36 -04:00
Miao Xie
dc7f370c05 Btrfs: move the R/O check out of btrfs_clean_one_deleted_snapshot()
If the fs is remounted to be R/O, it is unnecessary to call
btrfs_clean_one_deleted_snapshot(), so move the R/O check out of
this function. Besides that, it makes the check logic in the
caller clearer.

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:34 -04:00
Miao Xie
05323cd135 Btrfs: make the cleaner complete early when the fs is going to be umounted
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:33 -04:00
Miao Xie
d027824564 Btrfs: remove unnecessary ->s_umount in cleaner_kthread()
To prevent a R/O remount, we acquired the ->s_umount lock while
we deleted the dead snapshots and subvolumes. But it is unnecessary,
because we have cleaner_mutex.

We use cleaner_mutex to protect the process of the dead snapshots/subvolumes
deletion. And when we remount the fs to be R/O, we also acquire this mutex to
do cleanup after we change the status of the fs. That is, this lock serializes
the above operations: the cleaner can be aware of the status of the fs, and if
the cleaner is deleting the dead snapshots/subvolumes, the remount task will
wait for it. So it is safe to remove ->s_umount in cleaner_kthread().

Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:32 -04:00
Stefan Behrens
3c64a1aba7 Btrfs: cleanup: don't check the same thing twice
btrfs_read_fs_root_no_name() already checks if btrfs_root_refs()
is zero and returns ENOENT in this case. There is no need to do
it again in six places.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:30 -04:00
Stefan Behrens
b1b195969f Btrfs: cleanup, btrfs_read_fs_root_no_name() doesn't return NULL
No need to check for NULL in send.c and disk-io.c.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:29 -04:00
Stefan Behrens
78a1068b28 Btrfs: delete unused function
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:28 -04:00
Liu Bo
5798b92d2b Btrfs: remove useless copy in quota_ctl
We don't need to copy it back to user side as it remains unchanged.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:27 -04:00
Andreas Philipp
1c89cdd1ce Minor format cleanup.
Clean up the format of the definitions of BTRFS_BLOCK_GROUP_RAID5 and
BTRFS_BLOCK_GROUP_RAID6.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:25 -04:00
Tsutomu Itoh
924794c936 Btrfs: cleanup unused arguments in send.c
sctx is removed from the argument of the function that
doesn't use sctx.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:24 -04:00
Stefan Behrens
8f69dbd236 Btrfs: fix a comment
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:23 -04:00
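
The difference between the two size parameters described in 8f69dbd236 above is easy to get wrong, so here is a small sketch of the two conventions. The struct and function names are hypothetical and only model the semantics stated in the commit message (a delta for extend, a final size for truncate); they are not the btrfs functions themselves.

```c
#include <stddef.h>

/* Hypothetical item that only tracks its size in bytes. */
struct item {
    size_t size;
};

/* Extend: 'data_size' is the number of bytes to ADD to the item. */
void item_extend(struct item *it, size_t data_size)
{
    it->size += data_size;
}

/*
 * Truncate: 'new_size' is the TOTAL size of the item afterwards,
 * not the number of bytes to take away.
 * Returns 0 on success, -1 if new_size would grow the item.
 */
int item_truncate(struct item *it, size_t new_size)
{
    if (new_size > it->size)
        return -1;
    it->size = new_size;
    return 0;
}
```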
Jan Schmidt
57254b6ebc Btrfs: add ioctl to wait for qgroup rescan completion
btrfs_qgroup_wait_for_completion waits until the currently running qgroup
operation completes. It returns immediately when no rescan process is in
progress. This is useful to automate things around the rescan process (e.g.
testing).

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:22 -04:00
Wang Shilong
1e8f915868 Btrfs: introduce qgroup_ulist to avoid frequently allocating/freeing ulist
When doing qgroup accounting, we call ulist_alloc()/ulist_free() every time
when we want to walk qgroup tree.

By introducing 'qgroup_ulist', we only need to call ulist_alloc()/ulist_free()
once. This reduces the sys time spent allocating memory; see the measurements below.

fsstress -p 4 -n 10000 -d $dir

With this patch:

real    0m50.153s
user    0m0.081s
sys     0m6.294s

real    0m51.113s
user    0m0.092s
sys     0m6.220s

real    0m52.610s
user    0m0.096s
sys     0m6.125s	avg 6.213
-----------------------------------------------------
Without the patch:

real    0m54.825s
user    0m0.061s
sys     0m10.665s

real    1m6.401s
user    0m0.089s
sys     0m11.218s

real    1m13.768s
user    0m0.087s
sys     0m10.665s       avg 10.849

We can see the sys time is reduced by ~43%.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:21 -04:00
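
The ~43% figure in 1e8f915868 above follows from the two averages: (10.849 - 6.213) / 10.849 is roughly 0.427. The underlying idea, allocating a scratch list once and resetting it before each walk instead of allocating and freeing it every time, can be sketched as below. The scratch_list type and its helpers are hypothetical; they are not the kernel's ulist API, and the allocation counter exists only to make the effect visible.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-capacity scratch list, reused across walks. */
struct scratch_list {
    unsigned long *vals;
    size_t len;
    size_t cap;
};

/* Counts how many times backing storage was allocated (illustration only). */
static unsigned long nr_allocs;

/* Allocate the backing storage once. Returns 0 on success, -1 on failure. */
int scratch_init(struct scratch_list *l, size_t cap)
{
    l->vals = malloc(cap * sizeof(*l->vals));
    if (!l->vals)
        return -1;
    nr_allocs++;
    l->len = 0;
    l->cap = cap;
    return 0;
}

/* Forget the contents but keep the storage for the next walk. */
void scratch_reinit(struct scratch_list *l)
{
    l->len = 0;
}

/* Append a value. Returns 0 on success, -1 if the list is full. */
int scratch_add(struct scratch_list *l, unsigned long v)
{
    if (l->len == l->cap)
        return -1;
    l->vals[l->len++] = v;
    return 0;
}

/* Release the storage when the list is no longer needed at all. */
void scratch_free(struct scratch_list *l)
{
    free(l->vals);
    l->vals = NULL;
    l->len = l->cap = 0;
}
```

Each walk then costs a scratch_reinit() call instead of a trip through the allocator.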
David Sterba
85965600f5 btrfs: show compiled-in config features at module load time
We want to know if there are debugging features compiled in, this may
affect performance. The message is printed before the sanity checks.
Also kill version.h file that serves no purpose, we don't use any
version tag for kernel module.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:19 -04:00
David Sterba
e6d2960582 btrfs: move ifdef around sanity checks out of init_btrfs_fs
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:18 -04:00
David Sterba
905d0f564e btrfs: add prefix to sanity tests messages
And change the message level to KERN_INFO.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:17 -04:00
David Sterba
8d599ae1bf btrfs: add debug check for extent_io range alignment
The 'end' value must exactly cover the end of the interval, which means
one byte less than the expected block alignment, or in case of a file
smaller than one block, one byte less than the inode size.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:15 -04:00
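
One way to read the rule stated in 8d599ae1bf above is as a predicate on the inclusive 'end' offset. The sketch below is an interpretation of the commit message, not the check the patch actually adds; range_end_ok() and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check an inclusive range end against the rule in the commit message.
 *   end       - offset of the last byte covered by the range
 *   blocksize - block size in bytes (must be non-zero)
 *   isize     - inode size in bytes
 */
bool range_end_ok(uint64_t end, uint64_t blocksize, uint64_t isize)
{
    /* Normal case: 'end' is the last byte of a block. */
    if ((end + 1) % blocksize == 0)
        return true;

    /* File smaller than one block: 'end' is the last byte of the file. */
    if (isize > 0 && isize < blocksize && end == isize - 1)
        return true;

    return false;
}
```

With a 4096-byte block, an end of 4095 passes while 4096 does not; the latter is the typical result of passing an exclusive end by mistake.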
Henrik Nordvik
15b0a89d71 Btrfs: fix check on same raid type flag twice
Code checked for raid 5 flag in two else-if branches, so code would never be reached. Probably a copy-paste bug.

Signed-off-by: Henrik Nordvik <henrikno@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-14 11:29:14 -04:00
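
The class of bug fixed by 15b0a89d71 above looks like the following. The flag values, function names and return values are made up for illustration; only the shape of the mistake (the same flag tested in two else-if branches) comes from the commit message.

```c
/* Hypothetical flag bits, not the real BTRFS_BLOCK_GROUP_* values. */
#define FLAG_RAID5 (1u << 0)
#define FLAG_RAID6 (1u << 1)

/* Buggy variant: the second branch repeats the RAID5 test. */
int parity_stripes_buggy(unsigned int flags)
{
    if (flags & FLAG_RAID5)
        return 1;
    else if (flags & FLAG_RAID5)    /* copy-paste error: never reached */
        return 2;
    return 0;
}

/* Fixed variant: the second branch tests the RAID6 flag. */
int parity_stripes_fixed(unsigned int flags)
{
    if (flags & FLAG_RAID5)
        return 1;
    else if (flags & FLAG_RAID6)
        return 2;
    return 0;
}
```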
Linus Torvalds
a2648ebb7e Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "This is an assortment of crash fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: stop all workers before cleaning up roots
  Btrfs: fix use-after-free bug during umount
  Btrfs: init relocate extent_io_tree with a mapping
  btrfs: Drop inode if inode root is NULL
  Btrfs: don't delete fs_roots until after we cleanup the transaction
2013-06-13 22:34:14 -07:00
Josef Bacik
13e6c37b98 Btrfs: stop all workers before cleaning up roots
Dave reported a panic because the extent_root->commit_root was NULL in the
caching kthread.  That is because we just unset it in free_root_pointers, which
is not the correct thing to do, we have to either wait for the caching kthread
to complete or hold the extent_commit_sem lock so we know the thread has exited.
This patch makes the kthreads all stop first and then we do our cleanup.  This
should fix the race.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-06-08 15:11:35 -04:00
Liu Bo
2932505abe Btrfs: fix use-after-free bug during umount
Commit be283b2e67 (Btrfs: use helper to cleanup tree roots) introduced the
following bug,

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000034
 IP: [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
[...]
 Pid: 2463, comm: btrfs-cache-1 Tainted: G           O 3.9.0+ #4 innotek GmbH VirtualBox/VirtualBox
 RIP: 0010:[<ffffffffa039368c>]  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]
 Process btrfs-cache-1 (pid: 2463, threadinfo ffff880112d60000, task ffff880117679730)
[...]
 Call Trace:
  [<ffffffffa0398a99>] btrfs_search_slot+0x104/0x64d [btrfs]
  [<ffffffffa039aea4>] btrfs_next_old_leaf+0xa7/0x334 [btrfs]
  [<ffffffffa039b141>] btrfs_next_leaf+0x10/0x12 [btrfs]
  [<ffffffffa039ea13>] caching_thread+0x1a3/0x2e0 [btrfs]
  [<ffffffffa03d8811>] worker_loop+0x14b/0x48e [btrfs]
  [<ffffffffa03d86c6>] ? btrfs_queue_worker+0x25c/0x25c [btrfs]
  [<ffffffff81068d3d>] kthread+0x8d/0x95
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
  [<ffffffff8151e5ac>] ret_from_fork+0x7c/0xb0
  [<ffffffff81068cb0>] ? kthread_freezable_should_stop+0x43/0x43
RIP  [<ffffffffa039368c>] extent_buffer_get+0x4/0xa [btrfs]

We've freed commit_root before actually getting to free the block groups, where
the caching thread needs a valid extent_root->commit_root.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:10:01 -04:00
Josef Bacik
a9995eece3 Btrfs: init relocate extent_io_tree with a mapping
Dave reported a NULL pointer deref.  This is caused because he thought he'd be
smart and add sanity checks to the extent_io bit operations, but he didn't
expect a tree to have a NULL mapping.  To fix this we just need to init the
relocation's processed_blocks with the btree_inode->i_mapping.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Naohiro Aota
6379ef9fb2 btrfs: Drop inode if inode root is NULL
There is a path where btrfs_drop_inode() is called while its inode's root
is NULL: in btrfs_new_inode(), when btrfs_set_inode_index() fails,
iput() is called. We should handle this case before taking a look at the
root->root_item.

Signed-off-by: Naohiro Aota <naota@elisp.net>
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Josef Bacik
7b5ff90ed0 Btrfs: don't delete fs_roots until after we cleanup the transaction
We get a use after free if we had a transaction to cleanup since there could be
delayed inodes which refer to their respective fs_root.  Thanks

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-06-08 15:07:53 -04:00
Masanari Iida
8b513d0cf6 treewide: Fix typo in printk
Correct spelling typos in various parts of drivers

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-05-28 12:02:13 +02:00
Stefan Behrens
7e21f14d17 btrfs: fix btrfs_extend_item() comment
The size parameter to btrfs_extend_item() is the number of bytes
to add to the item, not the size of the item after the operation
(like it is for btrfs_truncate_item(), there the size parameter
is not the number of bytes to take away, but the total size of
the item after truncation).
Fix it in the comment.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-05-28 12:02:12 +02:00
Lukas Czerner
d47992f86b mm: change invalidatepage prototype to accept length
Currently there is no way to truncate partial page where the end
truncate point is not at the end of the page. This is because it was not
needed and the functionality was enough for file system truncate
operation to work properly. However more file systems now support punch
hole feature and it can benefit from mm supporting truncating page just
up to the certain point.

Specifically, with this functionality truncate_inode_pages_range() can
be changed so it supports truncating partial page at the end of the
range (currently it will BUG_ON() if 'end' is not at the end of the
page).

This commit changes the invalidatepage() address space operation
prototype to accept range to be invalidated and update all the instances
for it.

We also change the block_invalidatepage() in the same way and actually
make a use of the new length argument implementing range invalidation.

Actual file system implementations will follow except the file systems
where the changes are really simple and should not change the behaviour
in any way. Implementation for truncate_page_range() which will be able
to accept page unaligned ranges will follow as well.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
2013-05-21 23:17:23 -04:00
Linus Torvalds
130901ba33 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "Miao Xie has been very busy, fixing races and enospc problems and many
  other small but important pieces.

  Alexandre Oliva discovered some problems with how our error handling
  was interacting with the block layer and for now has disabled our
  partial handling of sub-page writes.  The real sub-page work is in a
  series of patches from IBM that we still need to integrate and test.
  The code Alexandre has turned off was really incomplete.

  Josef has more error handling fixes and an important fix for the new
  skinny extent format.

  This also has my fix for the tracepoint crash from late in 3.9.  It's
  the first stage in a larger clean up to get rid of btrfs_bio and make
  a proper bioset for all the items we need to tack into the bio.  For
  now the bioset only holds our mirror_num and stripe_index, but for the
  next merge window I'll shuffle more in."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (25 commits)
  Btrfs: use a btrfs bioset instead of abusing bio internals
  Btrfs: make sure roots are assigned before freeing their nodes
  Btrfs: explicitly use global_block_rsv for quota_tree
  btrfs: do away with non-whole_page extent I/O
  Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
  Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
  Btrfs: pause the space balance when remounting to R/O
  Btrfs: fix unprotected root node of the subvolume's inode rb-tree
  Btrfs: fix accessing a freed tree root
  Btrfs: return errno if possible when we fail to allocate memory
  Btrfs: update the global reserve if it is empty
  Btrfs: don't steal the reserved space from the global reserve if their space type is different
  Btrfs: optimize the error handle of use_block_rsv()
  Btrfs: don't use global block reservation for inode cache truncation
  Btrfs: don't abort the current transaction if there is no enough space for inode cache
  Correct allowed raid levels on balance.
  Btrfs: fix possible memory leak in replace_path()
  Btrfs: fix possible memory leak in the find_parent_nodes()
  Btrfs: don't allow device replace on RAID5/RAID6
  Btrfs: handle running extent ops with skinny metadata
  ...
2013-05-18 11:35:28 -07:00
Chris Mason
c5cb6a0573 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next
2013-05-17 21:53:17 -04:00
Chris Mason
9be3395bcd Btrfs: use a btrfs bioset instead of abusing bio internals
Btrfs has been pointer tagging bi_private and using bi_bdev
to store the stripe index and mirror number of failed IOs.

As bios bubble back up through the call chain, we use these
to decide if and how to retry our IOs.  They are also used
to count IO failures on a per device basis.

A recently added bio tracepoint led to crashes because
we were abusing bi_bdev.

This commit adds a btrfs bioset, and creates explicit fields
for the mirror number and stripe index.  The plan is to
extend this structure for all of the fields currently in
struct btrfs_bio, which will mean one less kmalloc in
our IO path.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Tejun Heo <tj@kernel.org>
2013-05-17 21:52:52 -04:00
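
The move from tagged pointers to explicit fields described in 9be3395bcd above can be illustrated with a wrapper structure that embeds the generic object. This is a userspace sketch under assumptions: struct bio here is a one-field stand-in, struct fs_io_bio and to_fs_io_bio() are hypothetical names, and container_of is defined locally in the usual offsetof form.

```c
#include <stddef.h>

/* Minimal stand-in for the generic block-layer object. */
struct bio {
    void *bi_private;
};

/*
 * Hypothetical wrapper: filesystem-specific fields live next to the
 * generic bio instead of being smuggled into its pointer fields.
 */
struct fs_io_bio {
    unsigned int mirror_num;
    unsigned int stripe_index;
    struct bio bio;    /* embedded generic object handed to lower layers */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a generic bio back to the wrapper it is embedded in. */
struct fs_io_bio *to_fs_io_bio(struct bio *b)
{
    return container_of(b, struct fs_io_bio, bio);
}
```

Lower layers only ever see the struct bio pointer, so its fields keep their generic meaning, while completion code can still reach mirror_num and stripe_index through the wrapper. This is only valid for bios that really were allocated inside such a wrapper.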
Josef Bacik
655b09fe54 Btrfs: make sure roots are assigned before freeing their nodes
If we fail to load the chunk tree we'll call free_root_pointers, except we may
not have assigned the roots for the dev_root/extent_root/csum_root yet, so we
could NULL pointer deref at this point.  Just add checks to make sure these
roots are set to keep us from panicking.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:38 -04:00
Stefan Behrens
3a6cad9009 Btrfs: explicitly use global_block_rsv for quota_tree
The quota_tree was previously set up to use the empty_block_rsv,
which would be problematic when the filesystem is filled up
and ENOSPC happens during internal operations while the quota
tree is updated and COWed (when the btrfs_qgroup_info_item
items are written). In fact, use_block_rsv(), which is used
in btrfs_cow_block(), falls back to the global_block_rsv in
this case. But just to make it clearer what is
happening, change it to explicitly use the global_block_rsv.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:36 -04:00
Alexandre Oliva
17a5adccf3 btrfs: do away with non-whole_page extent I/O
end_bio_extent_readpage computes whole_page based on bv_offset and
bv_len, without taking into account that blk_update_request may modify
them when some of the blocks to be read into a page produce a read
error.  This would cause the read to unlock only part of the file
range associated with the page, which would in turn leave the entire
page locked, which would not only keep the process blocked instead of
returning -EIO to it, but also prevent any further access to the file.

It turns out that btrfs always issues whole-page reads and writes.
The special handling of non-whole_page appears to be a mistake or a
left-over from a time when this wasn't the case.  Indeed,
end_bio_extent_writepage distinguished between whole_page and
non-whole_page writes but behaved identically in both cases!

I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes.  The
warnings should probably just go away some time.

Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:35 -04:00
Miao Xie
b216cbfb52 Btrfs: don't invoke btrfs_invalidate_inodes() in the spin lock context
btrfs_invalidate_inodes() may sleep, so we should not invoke it in the
spin lock context. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:34 -04:00
Miao Xie
314297c2a3 Btrfs: remove BUG_ON() in btrfs_read_fs_tree_no_radix()
We have checked if ->node is NULL or not, so it is unnecessary to
use BUG_ON() to check again. Remove it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:32 -04:00
Miao Xie
061594ef17 Btrfs: pause the space balance when remounting to R/O
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:31 -04:00
Miao Xie
e1409cef85 Btrfs: fix unprotected root node of the subvolume's inode rb-tree
The root node of the rb-tree may be changed, so we should get it under
the lock. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:30 -04:00
Miao Xie
89042e5ad2 Btrfs: fix accessing a freed tree root
inode_tree_del() will move the tree root into the dead root list, and
then the tree will be destroyed by the cleaner. So if we remove the
delayed node which is cached in the inode after inode_tree_del(),
we may access a freed tree root. Fix it.

Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:29 -04:00
Liu Bo
b9aa55bed1 Btrfs: return errno if possible when we fail to allocate memory
We need to set return value explicitly, otherwise we'll lose the error
value.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:27 -04:00
Miao Xie
d88033dbf4 Btrfs: update the global reserve if it is empty
Before this patch, when we found the global reserve empty we reserved
space for it by the minimum unit. That was unreasonable and
inefficient: if the global reserve was depleted, it implied
that the size of the global reserve was too small. In this case, we should
update the global reserve and fill it.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:26 -04:00
Miao Xie
5881cfc924 Btrfs: don't steal the reserved space from the global reserve if their space type is different
If the type of the space we need is different from that of the global reserve, we
cannot steal the space from the global reserve, because we cannot allocate
the space from the free space cache that the global reserve points to.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:25 -04:00
Miao Xie
b586b32374 Btrfs: optimize the error handle of use_block_rsv()
cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:24 -04:00
Miao Xie
7b61cd9224 Btrfs: don't use global block reservation for inode cache truncation
It is very likely that there are lots of subvolumes/snapshots in the filesystem,
so if we use global block reservation to do inode cache truncation, we may hog
all the free space that is reserved in global rsv. So it is better that we do
the free space reservation for inode cache truncation by ourselves.

Cc: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:22 -04:00
Miao Xie
7cfa9e51d2 Btrfs: don't abort the current transaction if there is no enough space for inode cache
The filesystem with inode cache was forced to be read-only when we umounted it.

Steps to reproduce:
 # mkfs.btrfs -f ${DEV}
 # mount -o inode_cache ${DEV} ${MNT}
 # dd if=/dev/zero of=${MNT}/file1 bs=1M count=8192
 # btrfs fi syn ${MNT}
 # dd if=${MNT}/file1 of=/dev/null bs=1M
 # rm -f ${MNT}/file1
 # btrfs fi syn ${MNT}
 # umount ${MNT}

It is because there was not enough space to do inode cache truncation, and then
we aborted the current transaction.

But a no-space error is not a serious problem when we write out the inode cache,
and it is safe to just skip this step if we hit it. So we need
not abort the current transaction.

Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:21 -04:00
Andreas Philipp
8250dabedb Correct allowed raid levels on balance.
Raid5 with 3 devices is well defined while the old logic allowed
raid5 only with a minimum of 4 devices when converting the block group
profile via btrfs balance. Creating a raid5 with just three devices
using mkfs.btrfs worked always as expected. This is now fixed and the
whole logic is rewritten.

Signed-off-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:20 -04:00
Stefan Behrens
379cde741b Btrfs: fix possible memory leak in replace_path()
In replace_path(), if read_tree_block() fails, we cannot return
directly, we should free some allocated memory otherwise memory
leak happens.

Similar to Wang's "Btrfs: fix possible memory leak in the
find_parent_nodes()" patch, the current commit fixes an issue that
is related to the "Btrfs: fix all callers of read_tree_block"
commit.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:19 -04:00
Wang Shilong
c16c2e2e51 Btrfs: fix possible memory leak in the find_parent_nodes()
In the find_parent_nodes(), if read_tree_block() fails, we can
not return directly, we should free some allocated memory otherwise
memory leak happens.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:17 -04:00
Stefan Behrens
4968810752 Btrfs: don't allow device replace on RAID5/RAID6
This is not yet supported and causes crashes. One sad user reported
that it destroyed his filesystem.

One failure is in __btrfs_map_block+0xc1f calling kmalloc(0).

0x5f21f is in __btrfs_map_block (fs/btrfs/volumes.c:4923).
4918                            num_stripes = map->num_stripes;
4919                            max_errors = nr_parity_stripes(map);
4920
4921                            raid_map = kmalloc(sizeof(u64) * num_stripes,
4922                                               GFP_NOFS);
4923                            if (!raid_map) {
4924                                    ret = -ENOMEM;
4925                                    goto out;
4926                            }
4927

There might be more issues. Until this is really tested, don't allow
users to start the procedure on RAID5/RAID6 filesystems.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:16 -04:00
Josef Bacik
b1c79e0947 Btrfs: handle running extent ops with skinny metadata
Chris hit a bug where we weren't finding extent records when running extent ops.
This is because we use the delayed_ref_head when running the extent op, which
means we can't use the ->type checks to see if we are metadata.  We also lose
the level of the metadata we are working on.  So to fix this we can just check
the ->is_data section of the extent_op, and we can store the level of the buffer
we were modifying in the extent_op.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:15 -04:00
Josef Bacik
73e1e61fb8 Btrfs: remove warn on in free space cache writeout
This catches block groups that are too large to properly cache.  We deal with
this case fine, so the warning just confuses users.  Remove the warning.
Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:13 -04:00
Josef Bacik
69a85bd87c Btrfs: don't null pointer deref on abort
I'm sorry, there's no excuse for this sort of work.  We need to use
root->leafsize since eb may be NULL.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:12 -04:00
Gabriel de Perthuis
03b71c6ca6 btrfs: don't stop searching after encountering the wrong item
The search ioctl skips items that are too large for a result buffer, but
inline items of a certain size occurring before any search result is
found would trigger an overflow and stop the search entirely.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=57641

Cc: stable@vger.kernel.org
Signed-off-by: Gabriel de Perthuis <g2p.code+btrfs@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 21:40:10 -04:00
Liu Bo
a52f4cd2b1 Btrfs: fix off-by-one in fiemap
lock_extent/unlock_extent expect an exclusive end.

Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 16:27:26 -04:00
David Sterba
60b62978bc btrfs: annotate quota tree for lockdep
Quota tree has been missing from lockdep annotations, though no warning
has been seen in the wild.

There's currently one entry that does not belong there,
BTRFS_ORPHAN_OBJECTID.  No such tree exists, it's probably a copy &
paste mistake, the id is defined among tree ids.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 16:27:25 -04:00
Linus Torvalds
983a5f84a4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs update from Chris Mason:
 "These are mostly fixes.  The biggest exceptions are Josef's skinny
  extents and Jan Schmidt's code to rebuild our quota indexes if they
  get out of sync (or you enable quotas on an existing filesystem).

  The skinny extents are off by default because they are a new variation
  on the extent allocation tree format.  btrfstune -x enables them, and
  the new format makes the extent allocation tree about 30% smaller.

  I rebased this a few days ago to rework Dave Sterba's crc checks on
  the super block, but almost all of these go back to rc6, since I
  thought 3.9 was due any minute.

  The biggest missing fix is the tracepoint bug that was hit late in
  3.9.  I ran into problems with that in overnight testing and I'm still
  tracking it down.  I'll definitely have that fixed for rc2."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (101 commits)
  Btrfs: allow superblock mismatch from older mkfs
  btrfs: enhance superblock checks
  btrfs: fix misleading variable name for flags
  btrfs: use unsigned long type for extent state bits
  Btrfs: improve the loop of scrub_stripe
  btrfs: read entire device info under lock
  btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
  btrfs: handle errors returned from get_tree_block_key
  btrfs: make static code static & remove dead code
  Btrfs: deal with errors in write_dev_supers
  Btrfs: remove almost all of the BUG()'s from tree-log.c
  Btrfs: deal with free space cache errors while replaying log
  Btrfs: automatic rescan after "quota enable" command
  Btrfs: rescan for qgroups
  Btrfs: split btrfs_qgroup_account_ref into four functions
  Btrfs: allocate new chunks if the space is not enough for global rsv
  Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
  btrfs: move leak debug code to functions
  Btrfs: return free space in cow error path
  Btrfs: set UUID in root_item for created trees
  ...
2013-05-09 13:07:40 -07:00
Linus Torvalds
4de13d7aa8 Merge branch 'for-3.10/core' of git://git.kernel.dk/linux-block
Pull block core updates from Jens Axboe:

 - Major bit is Kent's prep work for immutable bio vecs.

 - Stable candidate fix for a scheduling-while-atomic in the queue
   bypass operation.

 - Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
   discard bios.

 - Tejun's changes to convert the writeback thread pool to the generic
   workqueue mechanism.

 - Runtime PM framework; SCSI patches exist on top of these in James'
   tree.

 - A few random fixes.

* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
  relay: move remove_buf_file inside relay_close_buf
  partitions/efi.c: replace useless kzalloc's by kmalloc's
  fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
  block: fix max discard sectors limit
  blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
  Documentation: cfq-iosched: update documentation help for cfq tunables
  writeback: expose the bdi_wq workqueue
  writeback: replace custom worker pool implementation with unbound workqueue
  writeback: remove unused bdi_pending_list
  aoe: Fix unitialized var usage
  bio-integrity: Add explicit field for owner of bip_buf
  block: Add an explicit bio flag for bios that own their bvec
  block: Add bio_alloc_pages()
  block: Convert some code to bio_for_each_segment_all()
  block: Add bio_for_each_segment_all()
  bounce: Refactor __blk_queue_bounce to not use bi_io_vec
  raid1: use bio_copy_data()
  pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
  pktcdvd: use bio_copy_data()
  block: Add bio_copy_data()
  ...
2013-05-08 10:13:35 -07:00
Kent Overstreet
a27bb332c0 aio: don't include aio.h in sched.h
Faster kernel compiles by way of fewer unnecessary includes.

[akpm@linux-foundation.org: fix fallout]
[akpm@linux-foundation.org: fix build]
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-05-07 20:16:25 -07:00
Chris Mason
667e7d94a1 Btrfs: allow superblock mismatch from older mkfs
We've added new checks to make sure the super block crc is correct
during mount.  A fresh filesystem from an older mkfs won't have the
crc set.  This adds a warning when it finds a newly created filesystem
but doesn't fail the mount.

Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-05-07 11:00:13 -04:00
David Sterba
1104a88551 btrfs: enhance superblock checks
The superblock checksum is not verified upon mount. <awkward silence>

Add that check and also reorder existing checks to a more logical
order.

Current mkfs.btrfs does not calculate the correct checksum of
super_block and thus a freshly created filesystem will fail to mount when
this patch is applied.

First transaction commit calculates correct superblock checksum and
saves it to disk.

Reproducer:
$ mkfs.btrfs /dev/sda
$ mount /dev/sda /mnt
$ btrfs scrub start /mnt
$ sleep 5
$ btrfs scrub status /mnt
... super:2 ...

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-05-07 10:50:27 -04:00
David Sterba
b6919a58f0 btrfs: fix misleading variable name for flags
The variable was named 'data' in btrfs_reserve_extent and that's the
only function that actually uses it to let btrfs_get_alloc_profile know
what profile we want. Then it's passed down as u64 flags.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:27 -04:00
David Sterba
410748882a btrfs: use unsigned long type for extent state bits
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:27 -04:00
Liu Bo
625f1c8dc6 Btrfs: improve the loop of scrub_stripe
1) Right now scrub_stripe() is looping in some unnecessary cases:
* when the found extent item's objectid is already out of the dev extent's range
  but we haven't finished scanning the whole range within the dev extent
* when all the items have been processed but we haven't finished scanning the
  whole range within the dev extent

In both cases, we can just finish the loop to save costs.

2) Besides, when the found extent item's length is larger than the stripe
len(64k), we don't have to release the path and search again as it'll get at the
same key used in the last loop, we can instead increase the logical cursor in
place till all space of the extent is scanned.

3) And we use 0 as the key's offset to search btree, then get to previous item
to find a smaller item, and again have to move to the next one to get the right
item.  Setting offset=-1 and previous_item() is the correct way.

4) As we won't find any checksum at offset unless this 'offset' is in a data
extent, we can just find checksum when we're really going to scrub an extent.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:26 -04:00
David Sterba
55793c0d03 btrfs: read entire device info under lock
There's a theoretical possibility of reading stale (or even more
theoretically, freed) data from DEV_INFO ioctl when the device would
disappear between an early mutex unlock and data being copied from the
device structure.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:25 -04:00
David Sterba
f7a52a40ca btrfs: remove unused gfp mask parameter from release_extent_buffer callchain
It's unused since 0b32f4bbb4.

Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:24 -04:00
David Sterba
34c2b29079 btrfs: handle errors returned from get_tree_block_key
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:24 -04:00
Eric Sandeen
48a3b6366f btrfs: make static code static & remove dead code
Big patch, but all it does is add statics to functions which
are in fact static, then remove the associated dead-code fallout.

removed functions:

btrfs_iref_to_path()
__btrfs_lookup_delayed_deletion_item()
__btrfs_search_delayed_insertion_item()
__btrfs_search_delayed_deletion_item()
find_eb_for_page()
btrfs_find_block_group()
range_straddles_pages()
extent_range_uptodate()
btrfs_file_extent_length()
btrfs_scrub_cancel_devid()
btrfs_start_transaction_lflush()

btrfs_print_tree() is left because it is used for debugging.
btrfs_start_transaction_lflush() and btrfs_reada_detach() are
left for symmetry.

ulist.c functions are left, another patch will take care of those.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:23 -04:00
Josef Bacik
634554dc0a Btrfs: deal with errors in write_dev_supers
If you try to mount -o loop a restored file system it will panic if the file
ends up being smaller than the original disk.  This is because we go to try and
get a block for a super that may be past the EOF which makes __getblk return
NULL for a buffer head when we aren't expecting it to.  Fix this by dealing with
this case and just jacking up the errors count.  With this patch we no longer
panic when mounting a restored file system loopback.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:22 -04:00
Josef Bacik
3650860b90 Btrfs: remove almost all of the BUG()'s from tree-log.c
There were a whole bunch and I was doing it for other things.  I haven't tested
these error paths but at the very least this is better than panicking.  I've only
left 2 BUG_ON()'s since they are logic errors and I want to replace them with an
ASSERT framework that we can compile out for production users.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:21 -04:00
Josef Bacik
b50c6e250e Btrfs: deal with free space cache errors while replaying log
So everybody who got hit by my fsync bug will still continue to hit this
BUG_ON() in the free space cache, which is pretty heavy handed.  So I took a
file system that had this bug and fixed up all the BUG_ON()'s and leaks that
popped up when I tried to mount a broken file system like this.  With this patch
we just fail to mount instead of panicking.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:20 -04:00
Jan Schmidt
3d7b5a2882 Btrfs: automatic rescan after "quota enable" command
When qgroup tracking is enabled, we do an automatic cycle of the new rescan
mechanism.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:20 -04:00
Jan Schmidt
2f2320360b Btrfs: rescan for qgroups
If qgroup tracking is out of sync, a rescan operation can be started. It
iterates the complete extent tree and recalculates all qgroup tracking data.
This is an expensive operation and should not be used unless required.

A filesystem under rescan can still be umounted. The rescan continues on the
next mount.  Status information is provided with a separate ioctl while a
rescan operation is in progress.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:19 -04:00
Jan Schmidt
46b665ceb1 Btrfs: split btrfs_qgroup_account_ref into four functions
The function is separated into a preparation part and the three accounting
steps mentioned in the qgroups documentation. The goal is to make steps two
and three usable by the rescan functionality. A side effect is that the
function is restructured into readable subunits.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:18 -04:00
Miao Xie
3c76cd84e0 Btrfs: allocate new chunks if the space is not enough for global rsv
When running xfstests test 208, the fs returned ENOSPC even though
there was lots of free space on the disk.

Bisecting showed that it was introduced by commit 96f1bb5777.
That commit makes the space check for the global reservation in
can_overcommit() inconsistent with should_alloc_chunk().
can_overcommit() requires that the free space be 2 times the size
of the global reservation, or we can't overcommit. In that case
we need to reclaim some reserved space, and if we still don't have
enough free space, we need to allocate a new chunk. Unfortunately,
should_alloc_chunk() only requires that the free space be 1 time
the size of the global reservation, so we would not try to
allocate a new chunk if the free space falls between these two
requirements, and would just return ENOSPC. Fix it.
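
The gap between the two thresholds can be modelled in a few lines of
userspace C.  This is only a sketch: the function names, the exact
comparisons and the 'factor' parameter are assumptions made for
illustration, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model: overcommit is allowed only with 2x the reserve free. */
static bool can_overcommit_model(uint64_t free_space, uint64_t global_rsv)
{
	assert(global_rsv <= UINT64_MAX / 2); /* keep 2 * global_rsv from wrapping */
	return free_space >= 2 * global_rsv;
}

/* Simplified model: a new chunk is allocated when free space drops below
 * factor * reserve.  factor == 1 mimics the inconsistent check. */
static bool should_alloc_chunk_model(uint64_t free_space, uint64_t global_rsv,
				     unsigned int factor)
{
	return free_space < (uint64_t)factor * global_rsv;
}

/* ENOSPC is returned when we can neither overcommit nor allocate a chunk. */
static bool returns_enospc(uint64_t free_space, uint64_t global_rsv,
			   unsigned int factor)
{
	return !can_overcommit_model(free_space, global_rsv) &&
	       !should_alloc_chunk_model(free_space, global_rsv, factor);
}
```

With a reserve of 100 and 150 free, factor 1 falls into the gap and
reports ENOSPC, while using the same factor of 2 in both checks does not.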

Cc: Jim Schutt <jaschut@sandia.gov>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:17 -04:00
Jan Schmidt
fc36ed7e0b Btrfs: separate sequence numbers for delayed ref tracking and tree mod log
Sequence numbers for delayed refs have been introduced in the first version
of the qgroup patch set. To solve the problem of find_all_roots on a busy
file system, the tree mod log was introduced. The sequence numbers for that
were simply shared between those two users.

However, at one point in qgroup's quota accounting, there's a statement
accessing the previous sequence number, that's still just doing (seq - 1)
just as it would have to in the very first version.

To satisfy that requirement, this patch makes the sequence number counter 64
bit and splits it into a major part (used for qgroup sequence number
counting) and a minor part (incremented for each tree modification in the
log). This enables us to go exactly one major step backwards, as required
for qgroups, while still incrementing the sequence counter for tree mod log
insertions to keep track of their order. Keeping them in a single variable
means there's no need to change all the code dealing with comparisons of two
sequence numbers.

The sequence number is reset to 0 on commit (not new in this patch), which
ensures we won't overflow the two 32 bit counters.

Without this fix, the qgroup tracking can occasionally go wrong and WARN_ONs
from the tree mod log code may happen.
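
The split counter described above might look roughly like the sketch
below.  The helper names are hypothetical, and reading "one major step
backwards" as subtracting one from the major part is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: upper 32 bits are the major (qgroup) part,
 * lower 32 bits are the minor (tree mod log) part. */
#define SEQ_MAJOR_MASK 0xffffffff00000000ULL
#define SEQ_MAJOR_STEP (1ULL << 32)

/* Advance the major part and reset the minor part to zero. */
static uint64_t seq_inc_major(uint64_t seq)
{
	return (seq & SEQ_MAJOR_MASK) + SEQ_MAJOR_STEP;
}

/* Advance the minor part for one tree modification. */
static uint64_t seq_inc_minor(uint64_t seq)
{
	return seq + 1;
}

/* Step back exactly one major step, ignoring the minor part. */
static uint64_t seq_prev_major(uint64_t seq)
{
	assert(seq >= SEQ_MAJOR_STEP); /* there must be a previous major step */
	return (seq & SEQ_MAJOR_MASK) - SEQ_MAJOR_STEP;
}
```

Because both parts live in one 64 bit value, a plain integer comparison
still orders any two sequence numbers correctly.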

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:17 -04:00
Eric Sandeen
6d49ba1b47 btrfs: move leak debug code to functions
Clean up the leak debugging in extent_io.c by moving
the debug code into functions.  This also removes the
list_heads used for debugging from the extent_buffer
and extent_state structures when debug is not enabled.

Since we need a global debug config to do that last
part, implement CONFIG_BTRFS_DEBUG to accommodate.

Thanks to Dave Sterba for the Kconfig bit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:16 -04:00
Liu Bo
ace68bac61 Btrfs: return free space in cow error path
Replace some BUG_ONs with proper handling and take allocated space back to
free space cache for later use.

We don't have to worry about extent maps since they'd be freed in releasepage
path.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:15 -04:00
Stefan Behrens
6463fe58ea Btrfs: set UUID in root_item for created trees
It is a rare exception that a new tree is created, like the qgroups
tree. So far these new trees have an all-zero UUID in their root
items. All trees that mkfs.btrfs has created get a UUID during the
first mount when btrfs_read_root_item() rewrites the root_item to
the v2 structure style. These UUIDs have not been used so far, but
since it is better to have this uniform for all trees, this
commit adds some lines that generate and write a UUID for newly
created trees.

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:14 -04:00
Stefan Behrens
5fbf83c10c Btrfs: delete unused parameter to btrfs_read_root_item()
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:14 -04:00
Tsutomu Itoh
ecc7ada77b Btrfs: fix error handling in btrfs_ioctl_send()
fget() returns NULL on error, so we should check for NULL.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:13 -04:00
Tsutomu Itoh
ba1eeaac99 Btrfs: remove unused variable in __process_changed_new_xattr()
Variable 'p' is not used any more. So, remove it.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:12 -04:00
Josef Bacik
54067ae95e Btrfs: various abort cleanups
I have a broken file system that when it aborts leaves all sorts of accounting
things wrong and gives you lots of WARN_ON()'s other than the abort.  This is
because we're not cleaning up various parts of the file system when we abort.
The first chunks are specific to mount failures, we weren't cleaning up the
block group cached inodes and we weren't cleaning up any transactions that had
been aborted, which leaves a bunch of things lying around.

The second half of this is related to the cleanup parts.  First we don't need
to release space for the dirty pages from the trans_block_rsv, that's all
handled by the trans handles so this is just plain wrong.  The other thing is we
need to pin down extents that were set ->must_insert_reserved for delayed refs.
This isn't so much for the pinning but more for the cleaning up the
cache->reserved counter since we are no longer going to use those reserved
bytes.  With this patch I no longer see a bunch of WARN_ON()'s when I try to
mount this broken file system, just the initial one from the abort.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:11 -04:00
Josef Bacik
fd8b2b6115 Btrfs: cleanup destroy_marked_extents
We can just look up the extent_buffers for the range and free stuff that way.
This makes the cleanup a bit cleaner and we can make sure to evict the
extent_buffers pretty quickly by marking them as stale.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:11 -04:00
Josef Bacik
abefa55ac1 Btrfs: check return value of commit when recovering log
We need to check the return value of the commit in case something goes wrong,
otherwise we could end up going down the line and doing more stuff (like orphan
cleanup) before we notice we should have errored out.  We need to do this before
we free up the log_tree_root since the caller will handle all of that.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:10 -04:00
Josef Bacik
32b0253803 Btrfs: don't panic if we're trying to drop too many refs
This is just obnoxious.  Just print a message, abort the transaction, and return
an error.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:09 -04:00
Josef Bacik
171f6537ab Btrfs: cleanup fs roots if we fail to mount
We can run the tree logging recovery or the orphan cleanup on mount, so we'll
end up looking up a random fs tree in the meantime.  So we need to clean this up
so we don't leave extent buffers hanging around on the cache.  With this patch
we no longer leak extent buffers on failure to mount.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:08 -04:00
Josef Bacik
eb384b55ae Btrfs: fix extent logging with O_DIRECT into prealloc
This is the same as the fix from commit

Btrfs: fix bad extent logging

but for O_DIRECT.  I missed this when I fixed the problem originally, we were
still using the em for the orig_start and orig_block_len, which would be the
merged extent.  We need to use the actual extent from the on disk file extent
item, which we have to lookup to make sure it's ok to nocow anyway so just pass
in some pointers to hold this info.  Thanks,

Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:07 -04:00
Josef Bacik
416bc6580b Btrfs: fix all callers of read_tree_block
We kept leaking extent buffers when mounting a broken file system and it turns
out it's because not everybody uses read_tree_block properly.  You need to check
and make sure the extent_buffer is uptodate before you use it.  This patch fixes
everybody who calls read_tree_block directly to make sure they check that it is
uptodate and free it and return an error if it is not.  With this we no longer
leak EB's when things go horribly wrong.  Thanks,
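
The calling pattern described above can be sketched in userspace as
follows.  All types and helpers here are mock stand-ins invented for the
example; they are not the kernel's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Mock extent buffer: only tracks whether the read succeeded. */
struct eb_mock {
	bool uptodate;
};

static int live_ebs; /* number of buffers currently held */

/* Mock reader: always hands back a buffer, uptodate only if the read worked. */
static struct eb_mock *read_block_mock(bool read_ok)
{
	struct eb_mock *eb = malloc(sizeof(*eb));

	if (!eb)
		return NULL;
	eb->uptodate = read_ok;
	live_ebs++;
	return eb;
}

static void free_eb_mock(struct eb_mock *eb)
{
	assert(live_ebs > 0);
	live_ebs--;
	free(eb);
}

/* Caller pattern: check uptodate, and free the buffer on the error path. */
static int use_block(bool read_ok)
{
	struct eb_mock *eb = read_block_mock(read_ok);

	if (!eb)
		return -ENOMEM;
	if (!eb->uptodate) {
		free_eb_mock(eb); /* without this the buffer would leak */
		return -EIO;
	}
	/* ... the caller would use the buffer contents here ... */
	free_eb_mock(eb);
	return 0;
}
```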

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:07 -04:00
Josef Bacik
51bf5f0bc4 Btrfs: only exclude supers in the range of our block group
If we fail to load block groups halfway through we can leave extent_state's on
the excluded tree.  This is because we just lookup the supers and add them to
the excluded tree regardless of which block group we are looking at currently.
This is a problem because we remove the excluded extents for the range of the
block group only, so if we don't ever load a block group for one of the excluded
extents we won't ever free it.  This fixes the problem by only adding an excluded
extent if it falls in the block group range we care about.  With this patch
we're no longer leaking space when we fail to read all of the block groups.
Thanks,
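
Restricting an excluded range to the current block group amounts to a
range intersection.  A minimal sketch, assuming half-open byte ranges
and a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Clip [*start, *start + *len) to the block group [bg_start, bg_start + bg_len).
 * Returns false when the range lies entirely outside the block group. */
static bool clip_to_block_group(uint64_t bg_start, uint64_t bg_len,
				uint64_t *start, uint64_t *len)
{
	uint64_t bg_end, end;

	assert(start && len);
	bg_end = bg_start + bg_len;
	end = *start + *len;

	if (end <= bg_start || *start >= bg_end)
		return false; /* nothing to exclude for this block group */
	if (*start < bg_start)
		*start = bg_start;
	if (end > bg_end)
		end = bg_end;
	*len = end - *start;
	return true;
}
```

Only ranges for which this returns true would be added, so every
excluded extent is covered by the per block group cleanup.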

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:06 -04:00
Josef Bacik
1c24c3ce6a Btrfs: add tree block level sanity check
With a user's corrupted fs I was getting weird behavior and panics and it turns
out it was because one of his tree blocks had a bogus header level.  So add this
to the sanity checks in the endio handler for tree blocks.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:05 -04:00
Josef Bacik
5ec8dca761 Btrfs: don't try and free ebs twice in log replay
This work is done by btrfs_free_path() anyway so there's no need for this
duplicate work.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:04 -04:00
Josef Bacik
fb7669b5a0 Btrfs: don't BUG_ON() in btrfs_num_copies
A user sent me a btrfs-image that was panicking because of some corruption.  This
is because we pass in a bogus value to btrfs_num_copies, and it panics.  Instead
just return 1.  We only call btrfs_num_copies to see if there are other copies
to try and read for things, so if we just return 1 it will make the callers exit
out with an appropriate error value.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:04 -04:00
Josef Bacik
79fb65a1f6 Btrfs: don't call readahead hook until we have read the entire eb
Martin Steigerwald reported a BUG_ON() where we were given a bogus bytenr to
map.  Turns out he is using > PAGESIZE leafsizes.  The readahead stuff is called
every time we do a completion, but we may not have finished reading in all the
pages, so the bytenr we read off the node could be completely bogus.  Fix this
by only calling the readahead hook once all pages have been read in.  Thanks,

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:03 -04:00
Josef Bacik
9bb91873e3 Btrfs: deal with bad mappings in btrfs_map_block
Martin Steigerwald reported a BUG_ON() in btrfs_map_block where we didn't find
a chunk for a particular block we were trying to map.  This happened because the
block was bogus.  We shouldn't be BUG_ON()'ing in this case, just print a
message and return an error.  This came from reada_add_block and it appears to
deal with an error fine so we should be good there.  Thanks,

Reported-by: Martin Steigerwald <Martin@lichtvoll.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:02 -04:00
Josef Bacik
d4c7ca86b5 Btrfs: use REQ_META for all metadata IO
We need to tag metadata io with REQ_META to avoid priority inversion when using
io throttling cgroups.  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:01 -04:00
Josef Bacik
0a3896d0f5 Btrfs: fix possible infinite loop in slow caching
So I noticed there is an infinite loop in the slow caching code.  We return 1
when we hit the end of the tree, so we could end up caching the last block group
the slow way and suddenly we're looping forever because we just keep
re-searching and trying again.  Fix this by only doing btrfs_next_leaf() if we
don't need_resched().  Thanks,

Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:01 -04:00
Josef Bacik
62dbd7176e Btrfs: fix lockdep warning
The locking order for stuff is

__sb_start_write
ordered_mutex

but with sync() we don't do __sb_start_write for some strange reason, which
means that our iput in wait_ordered_extents could start a transaction which does
the __sb_start_write while we're holding the ordered_mutex.  Fix this by using
delayed iput in sync.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:55:00 -04:00
Wang Shilong
534e6623b7 Btrfs: add all ioctl checks before user change for quota operations
Since all the quota configuration is loaded in memory, we can perform
the ioctl checks before operating on the disk. It is safe to do so
because qgroup_ioctl_lock is held around these operations.

Even without these extra checks up front, user changes for quota
operations would still work. For example:

if we want to add a qgroup that already exists, we will do:
	->add_qgroup_item()
		->add_qgroup_rb()

add_qgroup_item() will return -EEXIST to us; however, since the qgroups
are all in memory, why not check them in memory first?

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:59 -04:00
Wang Shilong
3c97185c65 Btrfs: fix missing check about ulist_add() in qgroup.c
ulist_add() may return -ENOMEM; add the missing check of its
return value.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:58 -04:00
Stefan Behrens
70023da276 Btrfs: clear received_uuid field for new writable snapshots
For created snapshots, the full root_item is copied from the source
root and afterwards selectively modified. The current code forgets
to clear the field received_uuid. The only problem is that it is
confusing when you look at it with 'btrfs subv list', since for
writable snapshots, the contents of the snapshot can be completely
unrelated to the previously received snapshot.
The receiver ignores such snapshots anyway because it also checks
the field stransid in the root_item and that value used to be reset
to zero for all created snapshots.

This commit changes two things:
- clear the received_uuid field for new writable snapshots.
- don't clear the send/receive related information like the stransid
  for read-only snapshots (which makes them useable as a parent for
  the automatic selection of parents in the receive code).

Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:58 -04:00
Josef Bacik
b8d7f3ac10 Btrfs: don't force pages under writeback to finish when aborting
Dave reported a BUG_ON() that happened in end_page_writeback() after an abort.
This happened because we unconditionally call end_page_writeback() in the endio
case, which is right.  However when we abort the transaction we will call
end_page_writeback() on any writeback pages we find, which is wrong.  We need to
lock the page and wait for writeback to complete if the page is under
writeback.  There is nothing unsafe about this since we are discarding the
transaction anyway.  Thanks,

Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:57 -04:00
Wang Shilong
ccf7f29d1a Btrfs: remove unused variable in the iterate_extent_inodes()
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:56 -04:00
Liu Bo
0abd5b1724 Btrfs: return error when we specify wrong start to defrag
We need such a sanity check for wrong start when we defrag a file, otherwise,
even with a wrong start that's larger than file size, we can end up changing
not only inode's force compress flag but also FS's incompat flags.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:55 -04:00
Vincent
3c59ccd32a Btrfs: fix reada debug code compilation
This fixes the following errors:

  fs/btrfs/reada.c: In function ‘btrfs_reada_wait’:
  fs/btrfs/reada.c:958:42: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
  fs/btrfs/reada.c:961:41: error: invalid operands to binary < (have ‘atomic_t’ and ‘int’)
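
The message does not show the fix.  A common way to resolve this class
of error is to read the counter through an accessor rather than compare
the struct directly.  The sketch below uses userspace stand-ins for
atomic_t and atomic_read(), and a made-up helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in: like the kernel type, this is a struct, not an int. */
typedef struct {
	int counter;
} atomic_t;

/* Stand-in accessor returning the plain integer value. */
static int atomic_read(const atomic_t *v)
{
	assert(v != NULL);
	return v->counter;
}

static bool below_threshold(const atomic_t *elems, int threshold)
{
	/* '*elems < threshold' would not compile: atomic_t is a struct. */
	return atomic_read(elems) < threshold;
}
```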

Signed-off-by: Vincent Stehlé <vincent.stehle@laposte.net>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:55 -04:00
Tsutomu Itoh
fd279faefa Btrfs: cleanup of function where btrfs_extend_item() is called
Argument 'trans' is no longer needed in setup_inline_extent_backref(),
which calls btrfs_extend_item().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:54 -04:00
Tsutomu Itoh
4b90c68015 Btrfs: remove unused argument of btrfs_extend_item()
Argument 'trans' is not used in btrfs_extend_item().

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:53 -04:00
Tsutomu Itoh
afe5fea72b Btrfs: cleanup of function where fixup_low_keys() is called
Remove the 'trans' argument from the functions that call
fixup_low_keys() where it is no longer needed.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:52 -04:00
Tsutomu Itoh
d6a0a12684 Btrfs: remove unused argument of fixup_low_keys()
Argument 'trans' is not used in fixup_low_keys(). So, remove it.

Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:52 -04:00
Wang Shilong
b4fcd6be6b Btrfs: fix confusing edquot happening case
Steps to reproduce:
	mkfs.btrfs <disk>
	mount <disk> <mnt>
	dd if=/dev/zero of=/<mnt>/data bs=1M count=10
	sync
	btrfs quota enable <mnt>
	btrfs qgroup create 0/5 <mnt>
	btrfs qgroup limit 5M 0/5 <mnt>
	rm -f /<mnt>/data
	sync
	btrfs qgroup show <mnt>
	dd if=/dev/zero of=data bs=1M count=1

From the user's perspective, the qgroup's referenced or exclusive
count is negative, yet the user cannot continue to write data. A
workaround is to cast u64 to s64 when doing the qgroup reservation.
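
Why the cast helps can be seen in a small userspace sketch.  The
function names and the shape of the limit check are assumptions for
illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Unsigned check: a counter that wrapped below zero looks enormous. */
static bool over_limit_unsigned(uint64_t referenced, uint64_t num_bytes,
				uint64_t limit)
{
	return referenced + num_bytes > limit;
}

/* Signed check: the same bits are read as a negative count.  The
 * conversion relies on two's complement, as mainstream compilers use. */
static bool over_limit_signed(uint64_t referenced, uint64_t num_bytes,
			      uint64_t limit)
{
	assert(limit <= INT64_MAX); /* the limit must fit in s64 */
	return (int64_t)referenced + (int64_t)num_bytes > (int64_t)limit;
}
```

With a counter that has wrapped to "minus 4M", a 1M write against a 5M
limit is rejected by the unsigned check but accepted by the signed one.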

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:51 -04:00
Wang Shilong
e36902d4cc Btrfs: do not continue if out of memory happens
If out of memory happens, we should return -ENOMEM directly to the caller
rather than continue the work.

Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:50 -04:00
Nathaniel Yazdani
9c931c5ab2 btrfs: fix minor typo in comment
In the comment describing the sync_writers field of the btrfs_inode
struct, "fsyncing" was misspelled "fsycing."

Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:49 -04:00
Wang Shilong
98ad43be0a Btrfs: cleanup to remove reduplicate code in transaction.c
Signed-off-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:49 -04:00
Jan Schmidt
47fb091fb7 Btrfs: fix unlock after free on rewinded tree blocks
When tree_mod_log_rewind decides to make a copy of the current tree buffer
for its modifications, it subsequently freed the buffer before unlocking it.
Obviously, those operations are required in reverse order.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:48 -04:00
Jan Schmidt
30b0463a93 Btrfs: fix accessing the root pointer in tree mod log functions
The tree mod log functions were accessing root->node->... directly, without
use of btrfs_root_node() or explicit rcu locking. This could lead to an
extent buffer reference being leaked and another reference being freed too
early when preemption was enabled.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:47 -04:00
Jan Schmidt
90f8d62ebb Btrfs: fix tree mod log regression on root split operations
Commit d9abbf1c changed tree mod log locking around ROOT_REPLACE operations.
When a tree root is split, however, we were logging removal of all elements
from the root node before logging removal of half of the elements for the
split operation. This leads to a BUG_ON when rewinding.

This commit removes the erroneous logging of removal of all elements.

Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-06 15:54:47 -04:00