The pdflush thread is long gone, so this patch removes references to pdflush
from btrfs comments.
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The '->write_super' superblock method is gone, and this patch removes all the
references to 'write_super' from btrfs.
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.
Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."
Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
We convert btrfs_file_aio_write() to use new freeze check. We also add proper
freeze protection to btrfs_page_mkwrite(). We also add freeze protection to
the transaction mechanism to avoid starting transactions on frozen filesystem.
At minimum this is necessary to stop iput() of unlinked file to change frozen
filesystem during truncation.
Checks in cleaner_kthread() and transaction_kthread() can be safely removed
since btrfs_freeze() will lock the mutexes and thus block the threads (and they
shouldn't have anything to do anyway).
CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Use the generic printk_get_level() to search a message for a kern_level.
Add __printf to verify format and arguments. Fix a few messages that
had mismatches in format and arguments. Add #ifdef CONFIG_PRINTK blocks
to shrink the object size a bit when not using printk.
[akpm@linux-foundation.org: whitespace tweak]
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.
CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On powerpc, we don't get the implicit vmalloc.h include, and as a result
the build fails noisily:
fs/btrfs/send.c: In function 'fs_path_free':
fs/btrfs/send.c:185:4: error: implicit declaration of function 'vfree' [-Werror=implicit-function-declaration]
fs/btrfs/send.c: In function 'fs_path_ensure_buf':
fs/btrfs/send.c:215:4: error: implicit declaration of function 'vmalloc' [-Werror=implicit-function-declaration]
fs/btrfs/send.c:215:12: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:225:12: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:233:13: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c: In function 'iterate_dir_item':
fs/btrfs/send.c:900:10: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:909:11: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c: In function 'btrfs_ioctl_send':
fs/btrfs/send.c:4463:17: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:4469:17: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:4475:2: error: implicit declaration of function 'vzalloc' [-Werror=implicit-function-declaration]
fs/btrfs/send.c:4475:20: warning: assignment makes pointer from integer without a cast [enabled by default]
fs/btrfs/send.c:4483:21: warning: assignment makes pointer from integer without a cast [enabled by default]
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull large btrfs update from Chris Mason:
"This pull request is very large, and the two main features in here
have been under testing/devel for quite a while.
We have subvolume quotas from the strato developers. This enables
full tracking of how many blocks are allocated to each subvolume (and
all snapshots) and you can set limits on a per-subvolume basis. You
can also create quota groups and toss multiple subvolumes into a big
group. It's everything you need to be a web hosting company and give
each user their own subvolume.
The userland side of the quotas is being refreshed, they'll send out
details on where to grab it soon.
Next is the kernel side of btrfs send/receive from Alexander Block.
This leverages the same infrastructure as the quota code to figure out
relationships between blocks and their owners. It can then compute
the difference between two snapshots and sends the diffs in a neutral
format into userland.
The basic model:
create a snapshot
send that snapshot as the initial backup
make changes
create a second snapshot
send the incremental as a backup
delete the first snapshot
(use the second snapshot for the next incremental)
The receive portion is all in userland, and in the 'next' branch of my
btrfs-progs repo.
There's still some work to do in terms of optimizing the send side
from kernel to userland. The really important part is figuring out
how two snapshots are different, and this is where we are
concentrating right now. The initial send of a dataset is a little
slower than tar, but the incremental sends are dramatically faster
than what rsync can do.
On top of all of that, we have a nice queue of fixes, cleanups and
optimizations."
Fix up trivial modify/del conflict in fs/btrfs/ioctl.c
Also fix up semantic conflict in fs/btrfs/send.c: the interface to
dentry_open() changed in commit 765927b2d5 ("switch dentry_open() to
struct path, make it grab references itself"), and since it now grabs
whatever references it needs, we should no longer do the mntget() on the
mnt (and we need to dput() the dentry reference we took).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (65 commits)
Btrfs: uninit variable fixes in send/receive
Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive
Btrfs: add btrfs_compare_trees function
Btrfs: introduce subvol uuids and times
Btrfs: make iref_to_path non static
Btrfs: add a barrier before a waitqueue_active check
Btrfs: call the ordered free operation without any locks held
Btrfs: Check INCOMPAT flags on remount and add helper function
Btrfs: add helper for tree enumeration
btrfs: allow cross-subvolume file clone
Btrfs: improve multi-thread buffer read
Btrfs: make btrfs's allocation smoothly with preallocation
Btrfs: lock the transition from dirty to writeback for an eb
Btrfs: fix potential race in extent buffer freeing
Btrfs: don't return true in releasepage unless we actually freed the eb
Btrfs: suppress printk() if all device I/O stats are zero
Btrfs: remove unwanted printk() for btrfs device I/O stats
Btrfs: rewrite BTRFS_SETGET_FUNCS
Btrfs: zero unused bytes in inode item
Btrfs: kill free_space pointer from inode structure
...
Conflicts:
fs/btrfs/ioctl.c
This is the kernel portion of btrfs send/receive
Conflicts:
fs/btrfs/Makefile
fs/btrfs/backref.h
fs/btrfs/ctree.c
fs/btrfs/ioctl.c
fs/btrfs/ioctl.h
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch introduces the BTRFS_IOC_SEND ioctl that is
required for send. It allows btrfs-progs to implement
full and incremental sends. Patches for btrfs-progs will
follow.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
This function is used to find the differences between
two trees. The tree compare skips whole subtrees if it
detects shared tree blocks and thus is pretty fast.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
This patch introduces uuids for subvolumes. Each
subvolume has it's own uuid. In case it was snapshotted,
it also contains parent_uuid. In case it was received,
it also contains received_uuid.
It also introduces subvolume ctime/otime/stime/rtime. The
first two are comparable to the times found in inodes. otime
is the origin/creation time and ctime is the change time.
stime/rtime are only valid on received subvolumes.
stime is the time of the subvolume when it was
sent. rtime is the time of the subvolume when it was
received.
Additionally to the times, we have a transid for each
time. They are updated at the same place as the times.
btrfs receive uses stransid and rtransid to find out
if a received subvolume changed in the meantime.
If an older kernel mounts a filesystem with the
extented fields, all fields become invalid. The next
mount with a new kernel will detect this and reset the
fields.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Reviewed-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
Each ordered operation has a free callback, and this was called with the
worker spinlock held. Josef made the free callback also call iput,
which we can't do with the spinlock.
This drops the spinlock for the free operation and grabs it again before
moving through the rest of the list. We'll circle back around to this
and find a cleaner way that doesn't bounce the lock around so much.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
cc: stable@kernel.org
In support of the recently added capability to remount with lzo
compression, provide a helper function to check the compression
INCOMPAT flags when remounting with lzo compression, and set
the flags if necessary.
Also, implement the new helper function when defragmenting with
explicit lzo compression and when setting the default subvolume.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Lift the EXDEV condition and allow different root trees for files being
cloned, then pass source inode's root when searching for extents.
Cloning is not allowed to cross vfsmounts, ie. when two subvolumes from
one filesystem are mounted separately.
Signed-off-by: David Sterba <dsterba@suse.cz>
Pull trivial tree from Jiri Kosina:
"Trivial updates all over the place as usual."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
Fix typo in include/linux/clk.h .
pci: hotplug: Fix typo in pci
iommu: Fix typo in iommu
video: Fix typo in drivers/video
Documentation: Add newline at end-of-file to files lacking one
arm,unicore32: Remove obsolete "select MISC_DEVICES"
module.c: spelling s/postition/position/g
cpufreq: Fix typo in cpufreq driver
trivial: typo in comment in mksysmap
mach-omap2: Fix typo in debug message and comment
scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
Change email address for Steve Glendinning
Btrfs: fix typo in convert_extent_bit
via: Remove bogus if check
netprio_cgroup.c: fix comment typo
backlight: fix memory leak on obscure error path
Documentation: asus-laptop.txt references an obsolete Kconfig item
Documentation: ManagementStyle: fixed typo
mm/vmscan: cleanup comment error in balance_pgdat
mm: cleanup on the comments of zone_reclaim_stat
...
While testing with my buffer read fio jobs[1], I find that btrfs does not
perform well enough.
Here is a scenario in fio jobs:
We have 4 threads, "t1 t2 t3 t4", starting to buffer read a same file,
and all of them will race on add_to_page_cache_lru(), and if one thread
successfully puts its page into the page cache, it takes the responsibility
to read the page's data.
And what's more, reading a page needs a period of time to finish, in which
other threads can slide in and process rest pages:
t1 t2 t3 t4
add Page1
read Page1 add Page2
| read Page2 add Page3
| | read Page3 add Page4
| | | read Page4
-----|------------|-----------|-----------|--------
v v v v
bio bio bio bio
Now we have four bios, each of which holds only one page since we need to
maintain consecutive pages in bio. Thus, we can end up with far more bios
than we need.
Here we're going to
a) delay the real read-page section and
b) try to put more pages into page cache.
With that said, we can make each bio hold more pages and reduce the number
of bios we need.
Here is some numbers taken from fio results:
w/o patch w patch
------------- -------- ---------------
READ: 745MB/s +25% 934MB/s
[1]:
[global]
group_reporting
thread
numjobs=4
bs=32k
rw=read
ioengine=sync
directory=/mnt/btrfs/
[READ]
filename=foobar
size=2000M
invalidate=1
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For backref walking, we've introduce delayed ref's sequence. However,
it changes our preallocation behavior.
The story is that when we preallocate an extent and then mark it written
piece by piece, the ideal case should be that we don't need to COW the
extent, which is why we use 'preallocate'.
But we may not make use of preallocation, since when we check for cross refs on
the extent, we may have two ref entries which have the same content except
the sequence value, and we recognize them as cross refs and do COW to allocate
another extent.
So we end up with several pieces of space instead of an whole extent.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is a small window where an eb can have no IO bits set on it, which
could potentially result in extent_buffer_under_io() returning false when we
want it to return true, which could result in not fun things happening. So
in order to protect this case we need to hold the refs_lock when we make
this transition to make sure we get reliable results out of
extent_buffer_udner_io(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This sounds sort of impossible but it is the only thing I can think of and
at the very least it is theoretically possible so here it goes.
If we are in try_release_extent_buffer we will check that the ref count on
the extent buffer is 1 and not under IO, and then go down and clear the tree
ref. If between this check and clearing the tree ref somebody else comes in
and grabs a ref on the eb and the marks it dirty before
try_release_extent_buffer() does it's tree ref clear we can end up with a
dirty eb that will be freed while it is still dirty which will result in a
panic. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed while looking at an extent_buffer race that we will
unconditionally return 1 if we get down to release_extent_buffer after
clearing the tree ref. However we can easily race in here and get a ref on
the eb and not actually free the eb. So make release_extent_buffer return 1
if it free'd the eb and 0 if not so we can be a little kinder to the vm.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Code is added to suppress the I/O stats printing at mount time if all
statistic values are zero.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
People complained about the annoying kernel log message
"btrfs: no dev_stats entry found ... (OK on first mount after mkfs)"
everytime a filesystem is mounted for the first time after running
mkfs. Since the distribution of the btrfs-progs is not synchronized
to the kernel version, mkfs like it is now will be used also in the
future. Then this message is not useful to find errors, it is just
annoying. This commit removes the printk().
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
BTRFS_SETGET_FUNCS macro is used to generate btrfs_set_foo() and
btrfs_foo() functions, which read and write specific fields in the
extent buffer.
The total number of set/get functions is ~200, but in fact we only
need 8 functions: 2 for u8 field, 2 for u16, 2 for u32 and 2 for u64.
It results in redunction of ~37K bytes.
text data bss dec hex filename
629661 12489 216 642366 9cd3e fs/btrfs/btrfs.o.orig
592637 12489 216 605342 93c9e fs/btrfs/btrfs.o
Signed-off-by: Li Zefan <lizefan@huawei.com>
The otime field is not zeroed, so users will see random otime in an old
filesystem with a new kernel which has otime support in the future.
The reserved bytes are also not zeroed, and we'll have compatibility
issue if we make use of those bytes.
Signed-off-by: Li Zefan <lizefan@huawei.com>
Inodes always allocate free space with BTRFS_BLOCK_GROUP_DATA type,
which means every inode has the same BTRFS_I(inode)->free_space pointer.
This shrinks struct btrfs_inode by 4 bytes (or 8 bytes on 64 bits).
Signed-off-by: Li Zefan <lizefan@huawei.com>
When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
rare case of using the deadlock avoidance code needed for the tree mod log.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If a block group is ro, do not count its entries in when we dump space info.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Block group has ro attributes, make dump_space_info show it.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Here is the whole story:
1)
A free space cache consists of two parts:
o free space cache inode, which is special becase it's stored in root tree.
o free space info, which is stored as the above inode's file data.
But we only build up another new inode and does not flush its free space info
onto disk when we _clear and setup_ free space cache, and this ends up with
that the block group cache's cache_state remains DC_SETUP instead of DC_WRITTEN.
And holding DC_SETUP means that we will not truncate this free space cache inode,
which means the disk offset of its file extent will remain _unchanged_ at least
until next transaction finishes committing itself.
2)
We can set a block group readonly when we relocate the block group.
However,
if the readonly block group covers the disk offset where our free space cache
inode is going to write, it will force the free space cache inode into
cow_file_range() and it'll end up hitting a BUG_ON.
3)
Due to the above analysis, we fix this bug by adding the missing dirty flag.
4)
However, it's not over, there is still another case, nospace_cache.
With nospace_cache, we do not want to set dirty flag, instead we just truncate
free space cache inode and bail out with setting cache state DC_WRITTEN.
We can benifit from it since it saves us another 'pre-allocation' part which
usually costs a lot.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
During disk balance, we prealloc new file extent for file data relocation,
but we may fail in 'no available space' case, and it leads to flipping btrfs
into readonly.
It is not necessary to bail out and abort transaction since we do have several
ways to rescue ourselves from ENOSPC case.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since root can be fetched via BTRFS_I macro directly, we can save an args
for btrfs_is_free_space_inode().
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For btree inode, its root is also 'tree root', so btree inode can be
misunderstood as a free space inode.
We should add one more check for btree inode.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
From btree_read_extent_buffer_pages(), currently repair_io_failure()
can be called with mirror_num being zero when submit_one_bio() returned
an error before. This used to cause a BUG_ON(!mirror_num) in
repair_io_failure() and indeed this is not a case that needs the I/O
repair code to rewrite disk blocks.
This commit prevents calling repair_io_failure() in this case and thus
avoids the BUG_ON() and malfunction.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So shrink_delalloc has grown all sorts of cruft over the years thanks to
many reworkings of how we track enospc. What happens now as we fill up the
disk is we will loop for freaking ever hoping to reclaim a arbitrary amount
of space of metadata, this was from when everybody flushed at the same time.
Now we only have people flushing one at a time. So instead of trying to
reclaim a huge amount of space, just try to flush a decent chunk of space,
and stop looping as soon as we have enough free space to satisfy our
reservation. This makes xfstests 224 go much faster. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
$ mkfs.btrfs /dev/sdb7
$ btrfstune -S1 /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
mount: block device /dev/sdb7 is write-protected, mounting read-only
$ btrfs dev add /dev/sdb8 /mnt/btrfs/
Now we get a btrfs in which mnt flags has readonly but sb flags does
not. So for those ioctls that only check sb flags with MS_RDONLY, it
is going to be a problem.
Setting subvolume flags is such an ioctl, we should use mnt_want_write_file()
to check RO flags.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
mnt_want_write() and mnt_want_write_file() will check sb->s_flags with
MS_RDONLY, and we don't need to do it ourselves.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Move check of write access to mount into upper functions so that we can
use mnt_want_write_file instead, which is faster than mnt_want_write.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
There is weird logic I had to put in place to make sure that when we were
adding csums that we'd used the delalloc block rsv instead of the global
block rsv. Part of this meant that we had to free up our transaction
reservation before we ran the delayed refs since csum deletion happens
during the delayed ref work. The problem with this is that when we release
a reservation we will add it to the global reserve if it is not full in
order to keep us going along longer before we have to force a transaction
commit. By releasing our reservation before we run delayed refs we don't
get the opportunity to drain down the global reserve for the work we did, so
we won't refill it as often. This isn't a problem per-se, it just results
in us possibly committing transactions more and more often, and in rare
cases could cause those WARN_ON()'s to pop in use_block_rsv because we ran
out of space in our block rsv.
This also helps us by holding onto space while the delayed refs run so we
don't end up with as many people trying to do things at the same time, which
again will help us not force commits or hit the use_block_rsv warnings.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We didn't check error of btrfs_update_inode(), but that error looks
easy to bubble back up.
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We dereferenced "node" in the error message after freeing it. Also
btrfs_panic() can return so we should return an error code instead of
continuing.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
There used to be a BUG_ON(ret) there before EH patch (79787eaa) went in.
Bail out with EINVAL.
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This will be used in conjunction with btrfs device ready <dev>. This is
needed for initrd's to have a nice and lightweight way to tell if all of the
devices needed for a file system are in the cache currently. This keeps
them from having to do mount+sleep loops waiting for devices to show up.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Those crazy gentoo guys have been complaining about ENOSPC errors on their
portage volumes. This is because doing things like untar tends to create
lots of new files which will soak up all the reservation space in the
delayed inodes. Usually this gets papered over by the fact that we will try
and commit the transaction, however if this happens in the wrong spot or we
choose not to commit the transaction you will be screwed. So add the
ability to expclitly flush delayed inodes to free up space. Please test
this out guys to make sure it works since as usual I cannot reproduce.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit c11d2c236c (Btrfs: add ioctl to get and reset the device
stats) introduced two ioctls doing almost the same thing distinguished
by just the ioctl number which encodes "do reset after read". I have
suggested
http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg16604.html
to implement it via the ioctl args. This hasn't happen, and I think we
should use a more clean way to pass flags and should not waste ioctl
numbers.
CC: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: David Sterba <dsterba@suse.cz>
Rebased on btrfs-next and retested.
Inform should_defrag_range if BTRFS_DEFRAG_RANGE_COMPRESS is set. If so, skip
checks for adjacent extents and extent size when deciding whether to defrag,
as these can prevent an uncompressed and unfragmented file from being
compressed as requested.
Signed-off-by: Andrew Mahone <andrew.mahone@gmail.com>
"root->fs_info" and "fs_info" are the same, but "fs_info" is prefered
because it is shorter and that's what is used in the rest of the
function.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Before the update_time inode operation was indroduced, it was
not possible to prevent updates of atime on RO subvolumes. VFS
was only able to check for RO on the mount, but did not know
anything about btrfs subvolumes.
btrfs_update_time does now check if the root is RO and skip
updating of times.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Btrfs allows to turn on compression on a mounted and used filesystem
by issuing mount -o remount,compress=lzo.
This patch allows to turn compression off again
while the filesystem is mounted. As suggested by David Sterba
if the compress-force option was set, it is implicitly cleared
if compression is turned off.
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
We do all of our inode updating when we change it, and now that we do
->update_time we don't need ->dirty_inode for atime updates anymore, so just
remove it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The btrfs locks were unconditionally calling wake_up as the
locks were released. This lead to extra thrashing on the waitqueue,
especially for locks that were dominated by readers.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Waiting on spindles improves performance, but ssds want all the
IO as quickly as we can push it down.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called. They could also be passed to the
compare function.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When creating a subvolume or snapshot, it is necessary
to initialize the qgroup account with a copy of some
other (tracking) qgroup. This patch adds parameters
to the ioctls to pass the information from which qgroup
to inherit.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Like block reserves, reserve a small piece of space on each
transaction start and for delalloc. These are the hooks that
can actually return EDQUOT to the user.
The amount of space reserved is tracked in the transaction
handle.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Hooks into qgroup code to record refs and into transaction commit.
This is the main entry point for qgroup. Basically every change in
extent backrefs got accounted to the appropriate qgroups.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Init the quota tree along with the others on open_ctree
and close_ctree. Add the quota tree to the list of well
known trees in btrfs_read_fs_root_no_name.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Normally delayed refs get processed in ascending bytenr order. This
correlates in most cases to the order added. To expose dependencies
on this order, we start to process the tree in the middle instead of
the beginning.
This code is only effective when SCRAMBLE_DELAYED_REFS is defined.
Signed-off-by: Arne Jansen <sensille@gmx.net>
This patch only add a consistancy check to validate that the
same root is passed to start_transaction and end_transaction.
Subvolume quota depends on this.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Often no exact match is wanted but just the next lower or
higher item. There's a lot of duplicated code throughout
btrfs to deal with the corner cases. This patch adds a
helper function that can facilitate searching.
Signed-off-by: Arne Jansen <sensille@gmx.net>
We've got two mechanisms both required for reliable backref resolving (tree
mod log and holding back delayed refs). You cannot make use of one without
the other. So instead of requiring the user of this mechanism to setup both
correctly, we join them into a single interface.
Additionally, we stop inserting non-blockers into fs_info->tree_mod_seq_list
as we did before, which was of no value.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
When calling btrfs_next_old_leaf, we were leaking an extent buffer in the
rare case of using the deadlock avoidance code needed for the tree mod log.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull btrfs updates from Chris Mason:
"I held off on my rc5 pull because I hit an oops during log recovery
after a crash. I wanted to make sure it wasn't a regression because
we have some logging fixes in here.
It turns out that a commit during the merge window just made it much
more likely to trigger directory logging instead of full commits,
which exposed an old bug.
The new backref walking code got some additional fixes. This should
be the final set of them.
Josef fixed up a corner where our O_DIRECT writes and buffered reads
could expose old file contents (not stale, just not the most recent).
He and Liu Bo fixed crashes during tree log recover as well.
Ilya fixed errors while we resume disk balancing operations on
readonly mounts."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: run delayed directory updates during log replay
Btrfs: hold a ref on the inode during writepages
Btrfs: fix tree log remove space corner case
Btrfs: fix wrong check during log recovery
Btrfs: use _IOR for BTRFS_IOC_SUBVOL_GETFLAGS
Btrfs: resume balance on rw (re)mounts properly
Btrfs: restore restriper state on all mounts
Btrfs: fix dio write vs buffered read race
Btrfs: don't count I/O statistic read errors for missing devices
Btrfs: resolve tree mod log locking issue in btrfs_next_leaf
Btrfs: fix tree mod log rewind of ADD operations
Btrfs: leave critical region in btrfs_find_all_roots as soon as possible
Btrfs: always put insert_ptr modifications into the tree mod log
Btrfs: fix tree mod log for root replacements at leaf level
Btrfs: support root level changes in __resolve_indirect_ref
Btrfs: avoid waiting for delayed refs when we must not
While we are resolving directory modifications in the
tree log, we are triggering delayed metadata updates to
the filesystem btrees.
This commit forces the delayed updates to run so the
replay code can find any modifications done. It stops
us from crashing because the directory deleltion replay
expects items to be removed immediately from the tree.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
cc: stable@kernel.org
We can race with unlink and not actually be able to do our igrab in
btrfs_add_ordered_extent. This will result in all sorts of problems.
Instead of doing the complicated work to try and handle returning an error
properly from btrfs_add_ordered_extent, just hold a ref to the inode during
writepages. If we cannot grab a ref we know we're freeing this inode anyway
and can just drop the dirty pages on the floor, because screw them we're
going to invalidate them anyway. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The tree log stuff can have allocated space that we end up having split
across a bitmap and a real extent. The free space code does not deal with
this, it assumes that if it finds an extent or bitmap entry that the entire
range must fall within the entry it finds. This isn't necessarily the case,
so rework the remove function so it can handle this case properly. This
fixed two panics the user hit, first in the case where the space was
initially in a bitmap and then in an extent entry, and then the reverse
case. Thanks,
Reported-and-tested-by: Shaun Reich <sreich@kde.org>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we're evicting an inode during log recovery, we need to ensure that the inode
is not in orphan state any more, which means inode's run_time flags has _no_
BTRFS_INODE_HAS_ORPHAN_ITEM. Thus, the BUG_ON was triggered because of a wrong
check for the flags.
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We used the wrong ioctl macro for the getflags ioctl before.
As we don't have the set/getflags ioctls in the user space ioctl.h
at the moment, it's safe to fix it now.
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Alexander Block <ablock84@googlemail.com>
This introduces btrfs_resume_balance_async(), which, given that
restriper state was recovered earlier by btrfs_recover_balance(),
resumes balance in btrfs-balance kthread.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix a bug that triggered asserts in btrfs_balance() in both normal and
resume modes -- restriper state was not properly restored on read-only
mounts. This factors out resuming code from btrfs_restore_balance(),
which is now also called earlier in the mount sequence to avoid the
problem of some early writes getting the old profile.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Miao pointed out there's a problem with mixing dio writes and buffered
reads. If the read happens between us invalidating the page range and
actually locking the extent we can bring in pages into page cache. Then
once the write finishes if somebody tries to read again it will just find
uptodate pages and we'll read stale data. So we need to lock the extent and
check for uptodate bits in the range. If there are uptodate bits we need to
unlock and invalidate again. This will keep this race from happening since
we will hold the extent locked until we create the ordered extent, and then
teh read side always waits for ordered extents. There was also a race in
how we updated i_size, previously we were relying on the generic DIO stuff
to adjust the i_size after the DIO had completed, but this happens outside
of the extent lock which means reads could come in and not see the updated
i_size. So instead move this work into where we create the extents, and
then this way the update ordered i_size stuff works properly in the endio
handlers. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
It is normal behaviour of the low level btrfs function btrfs_map_bio()
to complete a bio with -EIO if the device is missing, instead of just
preventing the bio creation in an earlier step.
This used to cause I/O statistic read error increments and annoying
printk_ratelimited messages. This commit fixes the issue.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reported-by: Carey Underwood <cwillu@cwillu.com>
With the tree mod log, we may end up with two roots (the current root and a
rewinded version of it) both pointing to two leaves, l1 and l2, of which l2
had already been cow-ed in the current transaction. If we don't rewind any
tree blocks, we cannot have two roots both pointing to an already cowed tree
block.
Now there is btrfs_next_leaf, which has a leaf locked and wants a lock on
the next (right) leaf. And there is push_leaf_left, which has a (cowed!)
leaf locked and wants a lock on the previous (left) leaf.
In order to solve this dead lock situation, we use try_lock in
btrfs_next_leaf (only in case it's called with a tree mod log time_seq
paramter) and if we fail to get a lock on the next leaf, we give up our lock
on the current leaf and retry from the very beginning.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
When a MOD_LOG_KEY_ADD operation is rewinded, we remove the key from the
tree block. If its not the last key, removal involves a move operation.
This move operation was explicitly done before this commit.
However, at insertion time, there's a move operation before the actual
addition to make room for the new key, which is recorded in the tree mod
log as well. This means, we must drop the move operation when rewinding the
add operation, because the next operation we'll be rewinding will be the
corresponding MOD_LOG_MOVE_KEYS operation.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
When delayed refs exist, btrfs_find_all_roots used to hold the delayed ref
mutex way longer than actually required. We ought to drop it immediately
after we're done collecting all the delayed refs.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Several callers of insert_ptr set the tree_mod_log parameter to 0 to avoid
addition to the tree mod log. In fact, we need all of those operations. This
commit simply removes the additional parameter and makes addition to the
tree mod log unconditional.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
For the tree mod log, we don't log any operations at leaf level. If the root
is at the leaf level (i.e. the tree consists only of the root), then
__tree_mod_log_oldest_root will find a ROOT_REPLACE operation in the log
(because we always log that one no matter which level), but no other
operations.
With this patch __tree_mod_log_oldest_root exits cleanly instead of
BUGging in this situation. get_old_root checks if its really a root at leaf
level in case we don't have any operations and WARNs if this assumption
breaks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
With the tree mod log, we can have a tree that's two levels high, but
btrfs_search_old_slot may still return a path with the tree root at level
one instead. __resolve_indirect_ref must care for this and accept parents in
a lower level than expected.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
We track two conditions to decide if we should sleep while waiting for more
delayed refs, the number of delayed refs (num_refs) and the first entry in
the list of blockers (first_seq).
When we suspect staleness, we save num_refs and do one more cycle. If
nothing changes, we then save first_seq for later comparison and do
wait_event. We ought to save first_seq the very same moment we're saving
num_refs. Otherwise we cannot be sure that nothing has changed and we might
start waiting when we shouldn't, which could lead to starvation.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull btrfs fixes from Chris Mason:
"This is a small pull with btrfs fixes. The biggest of the bunch is
another fix for the new backref walking code.
We're still hammering out one btrfs dio vs buffered reads problem, but
that one will have to wait for the next rc."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: delay iput with async extents
Btrfs: add a missing spin_lock
Btrfs: don't assume to be on the correct extent in add_all_parents
Btrfs: introduce btrfs_next_old_item
There is some concern that these iput()'s could be the final iputs and could
induce lockups on people waiting on writeback. This would happen in the
rare case that we don't create ordered extents because of an error, but it
is theoretically possible and we already have a mechanism to deal with this
so just make them delayed iputs to negate any worry.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When fixing up the locking in the delayed ref destruction work I accidently
broke the locking myself ;(. Add back a spin_lock that should be there and
we are now all set. Thanks,
Btrfs: add a missing spin_lock
When fixing up the locking in the delayed ref destruction work I accidently
broke the locking myself ;(. Add back a spin_lock that should be there and
we are now all set. Thanks,
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
add_all_parents did assume that path is already at a correct extent data
item, which may not be true in case of data extents that were partly
rewritten and splitted.
We need to check if we're on a matching extent for every item and only
for the ones after the first. The loop is changed to do this now.
This patch also fixes a bug introduced with commit 3b127fd8 "Btrfs:
remove obsolete btrfs_next_leaf call from __resolve_indirect_ref".
The removal of next_leaf did sometimes result in slot==nritems when
the above described case happens, and thus resulting in invalid values
(e.g. wanted_obejctid) in add_all_parents (leading to missed backrefs
or even crashes).
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We introduce btrfs_next_old_item that uses btrfs_next_old_leaf instead
of btrfs_next_leaf.
btrfs_next_item is also changed to simply call btrfs_next_old_item with
time_seq being 0.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs compile warning fixes from Chris Mason.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: cast devid to unsigned long long for printk %llu
Btrfs: init old_generation in get_old_root
gcc was giving an uninit variable warning here. Strictly
speaking we don't need to init it, but this will make things
much less error prone.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"The dates look like I had to rebase this morning because there was a
compiler warning for a printk arg that I had missed earlier.
These are all fixes, including one to prevent using stale pointers for
device names, and lots of fixes around transaction abort cleanups
(Josef, Liu Bo).
Jan Schmidt also sent in a number of fixes for the new reference
number tracking code.
Liu Bo beat me to updating the MAINTAINERS file. Since he thought to
also fix the git url, I kept his commit."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: update MAINTAINERS info for BTRFS FILE SYSTEM
Btrfs: destroy the items of the delayed inodes in error handling routine
Btrfs: make sure that we've made everything in pinned tree clean
Btrfs: avoid memory leak of extent state in error handling routine
Btrfs: do not resize a seeding device
Btrfs: fix missing inherited flag in rename
Btrfs: fix incompat flags setting
Btrfs: fix defrag regression
Btrfs: call filemap_fdatawrite twice for compression
Btrfs: keep inode pinned when compressing writes
Btrfs: implement ->show_devname
Btrfs: use rcu to protect device->name
Btrfs: unlock everything properly in the error case for nocow
Btrfs: fix btrfs_destroy_marked_extents
Btrfs: abort the transaction if the commit fails
Btrfs: wake up transaction waiters when aborting a transaction
Btrfs: fix locking in btrfs_destroy_delayed_refs
Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error
Btrfs: fix race in tree mod log addition
Btrfs: add btrfs_next_old_leaf
...
the items of the delayed inodes were forgotten to be freed, this patch
fixes it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we have two trees for recording pinned extents, we need to go through
both of them to make sure that we've done everything clean.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We've forgotten to clear extent states in pinned tree, which will results in
space counter mismatch and memory leak:
WARNING: at fs/btrfs/extent-tree.c:7537 btrfs_free_block_groups+0x1f3/0x2e0 [btrfs]()
...
space_info 2 has 8380416 free, is not full
space_info total=12582912, used=4096, pinned=4096, reserved=0, may_use=0, readonly=4194304
btrfs state leak: start 29364224 end 29376511 state 1 in tree ffff880075f20090 refs 1
...
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Seeding devices are not supposed to change any more.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we move a file into a directory with compression flag, we need to
inherite BTRFS_INODE_COMPRESS and clear BTRFS_INODE_NOCOMPRESS as well.
But if we move a file into a directory without compression flag, we need
to clear both of them.
It is the way how our setflags deals with compression flag, so keep
the same behaviour here.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If a file has 3 small extents:
| ext1 | ext2 | ext3 |
Running "btrfs fi defrag" will only defrag the last two extents, if those
extent mappings hasn't been read into memory from disk.
This bug was introduced by commit 17ce6ef8d7
("Btrfs: add a check to decide if we should defrag the range")
The cause is, that commit looked into previous and next extents using
lookup_extent_mapping() only.
While at it, remove the code that checks the previous extent, since
it's sufficient to check the next extent.
Signed-off-by: Li Zefan <lizefan@huawei.com>
I removed this in an earlier commit and I was wrong. Because compression
can return from filemap_fdatawrite() without having actually set any of it's
pages as writeback() it can make filemap_fdatawait() do essentially nothing,
and then we won't find any ordered extents because they may not have been
created yet. So not only does this make fsync() completely useless, but it
will also screw up if you truncate on a non-page aligned offset since we
zero out the end and then wait on ordered extents and then call drop caches.
We can drop the cache before the io completes and then we try to unpin the
extent we just wrote we won't find it and everything goes sideways. So fix
this by putting it back and put a giant comment there to keep me from trying
to remove it in the future. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported lots of problems using compression on the new code and it
turns out part of the problem was that igrab() was failing when we added a
new ordered extent. This is because when writing out an inode under
compression we immediately return without actually doing anything to the
pages, and then in another thread at some point down the line actually do
the ordered dance. The problem is between the point that we start writeback
and we actually add the ordered extent we could be trying to reclaim the
inode, which makes igrab() return NULL. So we need to do an igrab() when we
create the async extent and then drop it when we are done with it. This
makes sure we stay pinned in memory until the ordered extent can get a
reference on it and we are good to go. With this patch we no longer panic
in btrfs_finish_ordered_io(). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Because btrfs can remove the device that was mounted we need to have a
->show_devname so that in this case we can print out some other device in
the file system to /proc/mount. So if there are multiple devices in a btrfs
file system we will just print the device with the lowest devid that we can
find. This will make everything consistent and deal with device removal
properly. The drawback is if you mount with a device that is higher than
the lowest devicd it won't show up as the mounted device in /proc/mounts,
but this is a small price to pay. This was inspired by Miao Xie's patch.
Thanks,
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Al pointed out that we can just toss out the old name on a device and add a
new one arbitrarily, so anybody who uses device->name in printk could
possibly use free'd memory. Instead of adding locking around all of this he
suggested doing it with RCU, so I've introduced a struct rcu_string that
does just that and have gone through and protected all accesses to
device->name that aren't under the uuid_mutex with rcu_read_lock(). This
protects us and I will use it for dealing with removing the device that we
used to mount the file system in a later patch. Thanks,
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
I was getting hung on umount when a transaction was aborted because a range
of one of the free space inodes was still locked. This is because the nocow
stuff doesn't unlock anything on error. This fixed the problem and I
verified that is what was happening. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
So we're forcing the eb's to have their ref count set to 1 so invalidatepage
works but this breaks lots of things, for example root nodes, and is just
plain wrong, we don't need to just evict all of this stuff. Also drop the
invalidatepage altogether and add a page_cache_release(). With this patch
we no longer hang when trying to access the root nodes after an aborted
transaction and we no longer leak memory. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If a transaction commit fails we don't abort it so we don't set an error on
the file system. This patch fixes that by actually calling the abort stuff
and then adding a check for a fs error in the transaction start stuff to
make sure it is caught properly. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I was getting lots of hung tasks and a NULL pointer dereference because we
are not cleaning up the transaction properly when it aborts. First we need
to reset the running_transaction to NULL so we don't get a bad dereference
for any start_transaction callers after this. Also we cannot rely on
waitqueue_active() since it's just a list_empty(), so just call wake_up()
directly since that will do the barrier for us and such. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The transaction abort stuff was throwing warnings from the list debugging
code because we do a list_del_init outside of the delayed_refs spin lock.
The delayed refs locking makes baby Jesus cry so it's not hard to get wrong,
but we need to take the ref head mutex to make sure it's not being processed
currently, and so if it is we need to drop the spin lock and then take and
drop the mutex and do the search again. If we can take the mutex then we
can safely remove the head from the list and carry on. Now when the
transaction aborts I don't get the list debugging warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
While doing my enospc work I got a transaction abortion that resulted in a
panic when we tried to unlock_page() an already unlocked page. This is
because we aren't calling extent_clear_unlock_delalloc with the locked page
so it was unlocking all the pages in the range. This is wrong since
__extent_writepage expects to have the page locked still unless we return
*page_started as 1. This should keep us from panicing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When adding to the tree modification log, we grab two locks at different
stages. We must not drop the outer lock until we're done with section
protected by the inner lock. This moves the unlock call for the outer lock
to the appropriate position.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
To make sense of the tree mod log, the backref walker not only needs
btrfs_search_old_slot, but it also called btrfs_next_leaf, which in turn was
calling btrfs_search_slot. This obviously didn't give the correct result.
This commit adds btrfs_next_old_leaf, a drop-in replacement for
btrfs_next_leaf with a time_seq parameter. If it is zero, it behaves exactly
like btrfs_next_leaf. If it is non-zero, it will use btrfs_search_old_slot
with this time_seq parameter.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In __tree_mod_log_oldest_root() we must return the found operation even if
it's not a ROOT_REPLACE operation. Otherwise, the caller assumes that there
are no operations to be rewinded and returns immediately.
The code in the caller is modified to improve readability.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
get_old_root could race with root node updates because we weren't locking
the node early enough. Use btrfs_read_lock_root_node to grab the root locked
in the very beginning and release the lock as soon as possible (just like
btrfs_search_slot does).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
When resolving indirect refs, we used to call btrfs_next_leaf in case we
didn't find an exact match. While we should find exact matches most of the
time, in case we don't, we must continue searching. Treating those matches
differently depending on the level we're searching doesn't make sense.
Even worse, we might end up searching for a key larger than the largest, in
which case there is no next_leaf and subsequent jobs would fail. This commit
drops the bogous lines.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This is a leftover from cleanup patch 559af821. Before the cleanup,
btrfs_header_nritems was called inside an if condition. As it has no side
effects we need to preserve here, it should simply be dropped.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull vfs changes from Al Viro.
"A lot of misc stuff. The obvious groups:
* Miklos' atomic_open series; kills the damn abuse of
->d_revalidate() by NFS, which was the major stumbling block for
all work in that area.
* ripping security_file_mmap() and dealing with deadlocks in the
area; sanitizing the neighborhood of vm_mmap()/vm_munmap() in
general.
* ->encode_fh() switched to saner API; insane fake dentry in
mm/cleancache.c gone.
* assorted annotations in fs (endianness, __user)
* parts of Artem's ->s_dirty work (jff2 and reiserfs parts)
* ->update_time() work from Josef.
* other bits and pieces all over the place.
Normally it would've been in two or three pull requests, but
signal.git stuff had eaten a lot of time during this cycle ;-/"
Fix up trivial conflicts in Documentation/filesystems/vfs.txt (the
'truncate_range' inode method was removed by the VM changes, the VFS
update adds an 'update_time()' method), and in fs/btrfs/ulist.[ch] (due
to sparse fix added twice, with other changes nearby).
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (95 commits)
nfs: don't open in ->d_revalidate
vfs: retry last component if opening stale dentry
vfs: nameidata_to_filp(): don't throw away file on error
vfs: nameidata_to_filp(): inline __dentry_open()
vfs: do_dentry_open(): don't put filp
vfs: split __dentry_open()
vfs: do_last() common post lookup
vfs: do_last(): add audit_inode before open
vfs: do_last(): only return EISDIR for O_CREAT
vfs: do_last(): check LOOKUP_DIRECTORY
vfs: do_last(): make ENOENT exit RCU safe
vfs: make follow_link check RCU safe
vfs: do_last(): use inode variable
vfs: do_last(): inline walk_component()
vfs: do_last(): make exit RCU safe
vfs: split do_lookup()
Btrfs: move over to use ->update_time
fs: introduce inode operation ->update_time
reiserfs: get rid of resierfs_sync_super
reiserfs: mark the superblock as dirty a bit later
...
Btrfs had been doing it's own file_update_time so we could catch ENOSPC
properly, so just update our btrfs_update_time to work with the new stuff and
then we'll be fancy later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Pull btrfs updates from Chris Mason:
"This includes a fairly large change from Josef around data writeback
completion. Before, the writeback wasn't completed until the metadata
insertions for the extent were done, and this made for fairly large
latency spikes on the last page of each ordered extent.
We already had a separate mechanism for tracking pending metadata
insertions, so Josef just needed to tweak things a little to end
writeback earlier on the page. Overall it makes us much friendly to
memory reclaim and lowers latencies quite a lot for synchronous IO.
Jan Schmidt has finished some background work required to track btree
blocks as they go through changes in ownership. It's the missing
piece he needed for both btrfs send/receive and subvolume quotas.
Neither of those are ready yet, but the new tracking code is included
here. Most of the time, the new code is off. It is only used by
scrub and other backref walkers.
Stefan Behrens has added io failure tracking. This includes counters
for which drives are causing the most trouble so the admin (or an
automated tool) can choose to kick them out. We're tracking IO
errors, crc errors, and generation checks we do on each metadata
block.
RAID5/6 did miss the cut this time because I'm having trouble with
corruptions. I'll nail it down next week and post as a beta testing
before 3.6"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (58 commits)
Btrfs: fix tree mod log rewinded level and rewinding of moved keys
Btrfs: fix tree mod log del_ptr
Btrfs: add tree_mod_dont_log helper
Btrfs: add missing spin_lock for insertion into tree mod log
Btrfs: add inodes before dropping the extent lock in find_all_leafs
Btrfs: use delayed ref sequence numbers for all fs-tree updates
Btrfs: fix false positive in check-integrity on unmount
Btrfs: fix runtime warning in check-integrity check data mode
Btrfs: set ioprio of scrub readahead to idle
Btrfs: fix return code in drop_objectid_items
Btrfs: check to see if the inode is in the log before fsyncing
Btrfs: return value of btrfs_read_buffer is checked correctly
Btrfs: read device stats on mount, write modified ones during commit
Btrfs: add ioctl to get and reset the device stats
Btrfs: add device counters for detected IO and checksum errors
btrfs: Drop unused function btrfs_abort_devices()
Btrfs: fix the same inode id problem when doing auto defragment
Btrfs: fall back to non-inline if we don't have enough space
Btrfs: fix how we deal with the orphan block rsv
Btrfs: convert the inode bit field to use the actual bit operations
...
When we rewind REMOVE_WHILE_FREEING operations, there's code that allocates
a fresh buffer instead of cloning the old one. Setting that buffer's level
correctly was missing in this case.
When rewinding a MOVE_KEYS operation, btrfs_node_key_ptr_offset(slot) was
missing for memmove_extent_buffer()'s arguments.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Logging for del_ptr when we're not deleting the last pointer was wrong. This
fixes both, duplicate log entries and log sequence.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
We must build up the inode list with the extent lock held after following
indirect refs.
This also requires an extension to ulists, which allows to modify the stored
aux value in case a key already exists in the list.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The sequence number for delayed refs is needed to postpone certain delayed
refs for a very short period while walking backrefs. Before the tree
modification log, we thought we'd only have to hold back those references
that don't have a counter operation.
While now we've the tree mod log, we're rewinding fs tree blocks to a
defined consistent state. We cannot know in advance for which tree block
we'll be doing rewind operations later. Therefore, we must postpone all the
delayed refs for fs-tree blocks, even those having a counter operation.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
During unmount, it could happen that the integrity checker printed a
warning message "attempt to free ... on umount which is not yet iodone"
which turned out to be a false positive.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
If a file_extent_item was located at the very end of a leaf and there was
not enough space to hold a full item, but there was enough space to hold
one of type BTRFS_FILE_EXTENT_INLINE or PREALLOC, and it was only such a
short item, a warning was printed anyway. This check is now fixed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reduce ioprio class of scrub readahead threads to idle priority.
This setting is fixed. This priority has shown the best performance
during all measurements.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
So dpkg fsync()'s the file and the directory containing the file whenever it
writes to a file which is really slow in btrfs. This is partly because
fsync()'ing a directory _always_ committed the transaction instead of just
going to the tree log. This is because drop_objectid_items() would return 1
since it does a btrfs_search_slot() which returns 1. In tree-log jargon
this means that we have to commit the transaction to be safe. So just check
if ret is greater than 0 and set it to 0 if it does. With this patch we now
use the tree-log instead of committing the entire transaction, which is
twice as fast on my box. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have this check down in the actual logging code, but this is after we
start a transaction and all that good stuff. So move the helper
inode_in_log() out so we can call it in fsync() and avoid starting a
transaction altogether and just exit if we've already fsync()'ed this file
recently. You would notice this issue if you fsync()'ed a file over and
over again until the transaction committed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
btrfs_read_buffer() has the possibility of returning the error.
Therefore, I add the code in which the return value of btrfs_read_buffer()
is checked.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
The device statistics are written into the device tree with each
transaction commit. Only modified statistics are written.
When a filesystem is mounted, the device statistics for each involved
device are read from the device tree and used to initialize the
counters.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
An ioctl interface is added to get the device statistic counters.
A second ioctl is added to atomically get and reset these counters.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
The goal is to detect when drives start to get an increased error rate,
when drives should be replaced soon. Therefore statistic counters are
added that count IO errors (read, write and flush). Additionally, the
software detected errors like checksum errors and corrupted blocks are
counted.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
1) This function is not used anywhere.
2) Using the blk_abort_queue() to abort the queue seems not correct.
blk_abort_queue() is used for timeout handling (block/blk-timeout.c).
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Asias He <asias@redhat.com>
Two files in the different subvolumes may have the same inode id, so
The rb-tree which is used to manage the defragment object must take it
into account. This patch fix this problem.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
If cow_file_range_inline fails with ENOSPC we abort the transaction which
isn't very nice. This really shouldn't be happening anyways but there's no
sense in making it a horrible error when we can easily just go allocate
normal data space for this stuff. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations. So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation. I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph. Thanks,
Btrfs: fix how we deal with the orphan block rsv
Ceph was hitting this race where we would remove an inode from the per-root
orphan list before we would release the space we had reserved for the inode.
We actually don't need a list or anything, we just need to make sure the
root doesn't try to free up the orphan reserve until after the inodes have
released their reservations. So use an atomic counter instead of a list on
the root and only decrement the counter after we've released our
reservation. I've tested this as well as several others and we no longer
see the warnings that you would see while running ceph. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Miao pointed this out while I was working on an orphan problem that messing
with a bitfield where different ranges are protected by different locks
doesn't work out right. Turns out we've been doing this forever where we
have different parts of the bit field protected by either no lock at all or
different locks which could cause all sorts of weird problems including the
issue I was hitting. So instead make a runtime_flags thing that we use the
normal bit operations on that are all atomic so we can keep having our
no/different locking for the different flags and then make force_compress
it's own thing so it can be treated normally. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When we write out the free space cache we will write out everything that is
in our in memory tree, and then we will just walk the pinned extents tree
and write anything we see there. The problem with this is that during
normal operations the pinned extents will be merged back into the free space
tree normally, and then we can allocate space from the merged areas and
commit them to the tree log. If we crash and replay the tree log we will
crash again because the tree log will try to free up space from what looks
like 2 seperate but contiguous entries, since one entry is from the original
free space cache and the other was a pinned extent that was merged back. To
fix this we just need to walk the free space tree after we load it and merge
contiguous entries back together. This will keep the tree log stuff from
breaking and it will make the allocator behave more nicely. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In normal cases, we would not be allowed to do balance in RO mode.
However, when we're using a seeding device and adding another device to sprout,
things will change:
$ mkfs.btrfs /dev/sdb7
$ btrfstune -S 1 /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs -o ro
$ btrfs fi bal /mnt/btrfs -----------------------> fail.
$ btrfs dev add /dev/sdb8 /mnt/btrfs
$ btrfs fi bal /mnt/btrfs -----------------------> works!
It should not be designed as an exception, and we'd better add another check for
mnt flags.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Fully utilize our extent state's new helper functions to use
fastpath as much as possible.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Reproduce:
$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs -o ro
$ btrfs dev add /dev/sdb8 /mnt/btrfs
ERROR: error adding the device '/dev/sdb8' - Invalid argument
Since we mount with readonly options, and /dev/sdb7 is not a seeding one,
a readonly notification is preferred.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
We noticed that the ordered extent completion doesn't really rely on having
a page and that it could be done independantly of ending the writeback on a
page. This patch makes us not do the threaded endio stuff for normal
buffered writes and direct writes so we can end page writeback as soon as
possible (in irq context) and only start threads to do the ordered work when
it is actually done. Compression needs to be reworked some to take
advantage of this as well, but atm it has to do a find_get_page in its endio
handler so it must be done in its own thread. This makes direct writes
quite a bit faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We are checking delalloc to see if it is ok to update the i_size. There are
2 cases it stops us from updating
1) If there is delalloc between our current disk_i_size and this ordered
extent
2) If there is delalloc between our current ordered extent and the next
ordered extent
These tests are racy however since we can set delalloc for these ranges at
any time. Also for the first case if we notice there is delalloc between
disk_i_size and our ordered extent we will not update disk_i_size and assume
that when that delalloc bit gets written out it will update everything
properly. However if we crash before that we will have file extents outside
of our i_size, which is not good, so this test is dangerous as well as racy.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
There is an off-by-one error: allocating room for a maximal result
string but without room for a trailing NUL. That, can lead to
returning a transformed string that is not NUL-terminated, and
then to a caller reading beyond end of the malloc'd buffer.
Rewrite to s/kzalloc/kmalloc/, remove unwarranted use of strncpy
(the result is guaranteed to fit), remove dead strlen at end, and
change a few variable names and comments.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Jim Meyering <meyering@redhat.com>
A device with name of length BTRFS_DEVICE_PATH_NAME_MAX or longer
would not be NUL-terminated in the DEV_INFO ioctl result buffer.
Signed-off-by: Jim Meyering <meyering@redhat.com>
The buffer read-overrun would be triggered by a printk format
starting with <N>, where N is a single digit. NUL-terminate
after strncpy. Use memcpy, not strncpy, since we know the
string we're copying fits in the destination buffer and
contains no NUL byte.
Signed-off-by: Jim Meyering <meyering@redhat.com>
Changing 'mount -oremount,thread_pool=2 /' didn't make any effect:
maximum amount of worker threads is specified in 2 places:
- in 'strict btrfs_fs_info::thread_pool_size'
- in each worker struct: 'struct btrfs_workers::max_workers'
'mount -oremount' updated only 'btrfs_fs_info::thread_pool_size'.
Fix it by pushing new maximum value to all created worker structures
as well.
Cc: Josef Bacik <josef@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
We already do the btrfs_wait_ordered_range which will do this for us, so
just remove this call so we don't call it twice. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In btrfs_wait_ordered_range we have been calling filemap_fdata_write() twice
because compression does strange things and then waiting. Then we look up
ordered extents and if we find any we will always schedule_timeout(); once
and then loop back around and do it all again. We will even check to see if
there is delalloc pages on this range and loop again. So this patch gets
rid of the multipe fdata_write() calls and just does
filemap_write_and_wait(). In the case of compression we will still find the
ordered extents and start those individually if we need to so that is ok,
but in the normal buffered case we avoid all this weird overhead.
Then in the case of the schedule_timeout(1), we don't need it. All callers
either 1) don't care, they just want to make sure what they just wrote maeks
it to disk or 2) are doing the lock()->lookup ordered->unlock->flush thing
in which case it will lock and check for ordered extents _anyway_ so get
back to them as quickly as possible. The delaloc check is simply not
needed, this only catches the case where we write to the file again since
doing the filemap_write_and_wait() and if the caller truly cares about that
it will take care of everything itself. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
These warnings are bogus since we will always have at least one page in an
eb, but to make the compiler happy just set ret = 0 in these two cases.
Thanks,
Btrfs: fix compile warnings in extent_io.c
These warnings are bogus since we will always have at least one page in an
eb, but to make the compiler happy just set ret = 0 in these two cases.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When running compilebench I noticed we were spending some time looking up
acls on new inodes, which shouldn't be happening since there were no acls.
This is because when we init acls on the inode after creating them we don't
cache the fact there are no acls if there aren't any. Doing this adds a
little bit of a bump to my compilebench runs. Thanks,
Btrfs: cache no acl on new inodes
Signed-off-by: Josef Bacik <josef@redhat.com>
We've been keeping around the inode sequence number in hopes that somebody
would use it, but nobody uses it and people actually use i_version which
serves the same purpose, so use i_version where we used the incore inode's
sequence number and that way the sequence is updated properly across the
board, and not just in file write. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When a fresh transaction begins, the tree mod log must be clean. Users of
the tree modification log must ensure they never span across transaction
boundaries.
We reset the sequence to 0 in this safe situation to make absolutely sure
overflow can't happen.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This enables backref resolving on life trees while they are changing. This
is a prerequisite for quota groups and just nice to have for everything
else.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The tree modification log together with the current state of the tree gives
a consistent, old version of the tree. btrfs_search_old_slot is used to
search through this old version and return old (dummy!) extent buffers.
Naturally, this function cannot do any tree modifications.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Record all relevant modifications to block pointers in the tree mod log so
that we can rewind them later on for backref walking.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
When running functions that can make changes to the internal trees
(e.g. btrfs_search_slot), we check if somebody may be interested in the
block we're currently modifying. If so, we record our modification to be
able to rewind it later on.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The tree mod log will log modifications made fs-tree nodes. Most
modifications are done by autobalance of the tree. Such changes are recorded
as long as a block entry exists. When released, the log is cleaned.
With the tree modification log, it's possible to reconstruct a consistent
old state of the tree. This is required to do backref walking on a busy
file system.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
pass inode + parent's inode or NULL instead of dentry + bool saying
whether we want the parent or not.
NOTE: that needs ceph fix folded in.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPw2J/AAoJECvKgwp+S8Ja5jkP/3uMxkhf8XQpXCI3O1QVfaQr
uZFfM8sINqIPDVm1dtFjFj7f8Bw9mhE2KAnnJ1rKT8tQwqq9yAse1QPlhCG1ZqoP
+AnMDDXHtx7WmQZXhBvS9b+unpZ7Jr6r6pO5XrmTL2kRL3YJPUhZ2+xbTT5belTB
KoAu4WqORZRxfXoC76S7U8K+D4NcAGhAOxCClsIjmY+oocCiCag4FZOyzYIFViqc
ghUN/+rLQ3fqGGv2yO7Ylx1gUM7sxIwkZQ/h962jFAtxz9czImr2NmRoMliOaOkS
tvcnIf+E3u0n/zIjzFvzhxKgHJPP8PkcPMk60d3jKmFngBkqFTzNUeVTP8md7HrV
4DlXisWr+z7YVyWUCFaNcJLmjiWSwQ8DV/clRLobeBf9EJKan5F1PjFgl6PLJM5F
Qr1+LHMNaetdulBwMRTyveZTzYqw9RmDnD9dWMo4mX/kTpvtC4jTPVV7hkRD+Qlv
5vTRR+VXL3Q50yClLf0AQMSKTnH2gBuepM/b+7cShLGfsMln8DtUjmbigv+niL63
BibcCIbIlP2uWGnl37VhsC34AT+RKt3lggrBOpn/7XJMq/wKR7IRP/7V9TfYgaUN
NBa+wtnLDa1pZEn/X7izdcQP62PzDtmB+ObvYT0Yb40A4+2ud3qF/lB53c1A1ewF
/9c4zxxekjHZnn2oooEa
=oLXf
-----END PGP SIGNATURE-----
Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux
Pull writeback tree from Wu Fengguang:
"Mainly from Jan Kara to avoid iput() in the flusher threads."
* tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Avoid iput() from flusher thread
vfs: Rename end_writeback() to clear_inode()
vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
writeback: Refactor writeback_single_inode()
writeback: Remove wb->list_lock from writeback_single_inode()
writeback: Separate inode requeueing after writeback
writeback: Move I_DIRTY_PAGES handling
writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
writeback: Move clearing of I_SYNC into inode_sync_complete()
writeback: initialize global_dirty_limit
fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds
mm: page-writeback.c: local functions should not be exposed globally
The tree modification log needs two ways to create dummy extent buffers,
once by allocating a fresh one (to rebuild an old root) and once by
cloning an existing one (to make private rewind modifications) to it.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Three callers of btrfs_free_tree_block or btrfs_alloc_tree_block passed
parameter for_cow = 1. In fact, these two functions should never mark
their tree modification operations as for_cow, because they can change
the number of blocks referenced by a tree.
Hence, we remove the extra for_cow parameter from these functions and
make them pass a zero down.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Before this patch we called find_all_leafs for a data extent, then called
find_all_roots and then looked into the extent to grab the information
we were seeking. This was done without holding the leaves locked to avoid
deadlocks. However, this can obviouly race with concurrent tree
modifications.
Instead, we now look into the extent while we're holding the lock during
find_all_leafs and store this information together with the leaf list.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The key we store with a tree block backref is only a hint. It is set when
the ref is created and can remain correct for a long time. As the tree is
rebalanced, however, eventually the key no longer points to the correct
destination.
With this patch, we change find_parent_nodes to no longer add keys unless it
knows for sure they're correct (e.g. because they're for an extent data
backref). Then when we later encounter a backref ref with no parent and no
key set, we grab the block and take the first key from the block itself.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
That one has been around since the addition of backref.c. Due to the way we
calculate our slot numbers, after adding inline refs we're missing one keyed
ref unless it's located at the beginning of a new leaf.
Reported-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
ulist_next gets the pointer to the previously returned element to find the
next element from there. However, when we call ulist_add while iteration
with ulist_next is in progress (ulist explicitly supports this), we can
realloc the ulist internal memory, which makes the pointer to the previous
element useless.
Instead, we now use an iterator parameter that's independent from the
internal pointers.
Reported-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull trivial updates from Jiri Kosina:
"As usual, it's mostly typo fixes, redundant code elimination and some
documentation updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
edac, mips: don't change code that has been removed in edac/mips tree
xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
lib: Change mail address of Oskar Schirmer
net: Change mail address of Oskar Schirmer
arm/m68k: Change mail address of Sebastian Hess
i2c: Change mail address of Oskar Schirmer
net: Fix tcp_build_and_update_options comment in struct tcp_sock
atomic64_32.h: fix parameter naming mismatch
Kconfig: replace "--- help ---" with "---help---"
c2port: fix bogus Kconfig "default no"
edac: Fix spelling errors.
qla1280: Remove redundant NULL check before release_firmware() call
remoteproc: remove redundant NULL check before release_firmware()
qla2xxx: Remove redundant NULL check before release_firmware() call.
aic94xx: Get rid of redundant NULL check before release_firmware() call
tehuti: delete redundant NULL check before release_firmware()
qlogic: get rid of a redundant test for NULL before call to release_firmware()
bna: remove redundant NULL test before release_firmware()
tg3: remove redundant NULL test before release_firmware() call
typhoon: get rid of redundant conditional before all to release_firmware()
...
It confuses Smatch that we use two names for the same lock. Plus the
shorter name is nicer. This doesn't change how the code works, it's
just a cleanup.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
The integrity checker used to be coded for nodesize == leafsize ==
sectorsize == PAGE_CACHE_SIZE.
This is now changed to support sizes for nodesize and leafsize which are
N * PAGE_CACHE_SIZE.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
In tree_insert, var *entry is used in the loop only, and is useless
out of the loop. Remove the useless assignment after the loop.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
The return value of find_first_extent_bit is 1 or 0, no < 0.
And if found something, return 0; if nothing was found, return 1.
Fix the comment.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
num_extent_pages returns the number of pages in the specific range, not
the index of the last page in the eb range.
btrfs_release_extent_buffer_page is called with start_idx set 0 in current
codes, so it's not a problem yet. But the logic is indeed wrong.
Fix it here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Pull btrfs fixes from Chris Mason:
"The big ones here are a memory leak we introduced in rc1, and a
scheduling while atomic if the transid on disk doesn't match the
transid we expected. This happens for corrupt blocks, or out of date
disks.
It also fixes up the ioctl definition for our ioctl to resolve logical
inode numbers. The __u32 was a merging error and doesn't match what
we ship in the progs."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: avoid sleeping in verify_parent_transid while atomic
Btrfs: fix crash in scrub repair code when device is missing
btrfs: Fix mismatching struct members in ioctl.h
Btrfs: fix page leak when allocing extent buffers
Btrfs: Add properly locking around add_root_to_dirty_list
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.
But, a few callers are using it with spinlocks held. Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case. This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After we moved inode_sync_wait() from end_writeback() it doesn't make sense
to call the function end_writeback() anymore. Rename it to clear_inode()
which well says what the function really does - set I_CLEAR flag.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Fix that when scrub tries to repair an I/O or checksum error and one of
the devices containing the mirror is missing, it crashes in bio_add_page
because the bdev is a NULL pointer for missing devices.
Reported-by: Marco L. Crociani <marco.crociani@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix the size members of btrfs_ioctl_ino_path_args and
btrfs_ioctl_logical_ino_args. The user space btrfs-progs utilities used
__u64 and the kernel headers used __u32 before.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we happen to alloc a extent buffer and then alloc a page and notice that
page is already attached to an extent buffer, we will only unlock it and
free our existing eb. Any pages currently attached to that eb will be
properly freed, but we don't do the page_cache_release() on the page where
we noticed the other extent buffer which can cause us to leak pages and I
hope cause the weird issues we've been seeing in this area. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
add_root_to_dirty_list happens once at the very beginning of the
transaction, but it is still racey.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pull btrfs fixes from Chris Mason:
"This has our collection of bug fixes. I missed the last rc because I
thought our patches were making NFS crash during my xfs test runs.
Turns out it was an NFS client bug fixed by someone else while I tried
to bisect it.
All of these fixes are small, but some are fairly high impact. The
biggest are fixes for our mount -o remount handling, a deadlock due to
GFP_KERNEL allocations in readdir, and a RAID10 error handling bug.
This was tested against both 3.3 and Linus' master as of this morning."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (26 commits)
Btrfs: reduce lock contention during extent insertion
Btrfs: avoid deadlocks from GFP_KERNEL allocations during btrfs_real_readdir
Btrfs: Fix space checking during fs resize
Btrfs: fix block_rsv and space_info lock ordering
Btrfs: Prevent root_list corruption
Btrfs: fix repair code for RAID10
Btrfs: do not start delalloc inodes during sync
Btrfs: fix that check_int_data mount option was ignored
Btrfs: don't count CRC or header errors twice while scrubbing
Btrfs: fix btrfs_ioctl_dev_info() crash on missing device
btrfs: don't return EINTR
Btrfs: double unlock bug in error handling
Btrfs: always store the mirror we read the eb from
fs/btrfs/volumes.c: add missing free_fs_devices
btrfs: fix early abort in 'remount'
Btrfs: fix max chunk size check in chunk allocator
Btrfs: add missing read locks in backref.c
Btrfs: don't call free_extent_buffer twice in iterate_irefs
Btrfs: Make free_ipath() deal gracefully with NULL pointers
Btrfs: avoid possible use-after-free in clear_extent_bit()
...
We're spending huge amounts of time on lock contention during
end_io processing because we unconditionally assume we are overwriting
an existing extent in the file for each IO.
This checks to see if we are outside i_size, and if so, it uses a
less expensive readonly search of the btree to look for existing
extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs has an optimization where it will preallocate dentries during
readdir to fill in enough information to open the inode without an extra
lookup.
But, we're calling d_alloc, which is doing GFP_KERNEL allocations, and
that leads to deadlocks because our readdir code has tree locks held.
For now, disable this optimization. We'll fix the gfp mask in the next
merge window.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix out-of-space checking, addressing a warning and potential resource
leak when resizing the filesystem down while allocating blocks.
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
may_commit_transaction() calls
spin_lock(&space_info->lock);
spin_lock(&delayed_rsv->lock);
and update_global_block_rsv() calls
spin_lock(&block_rsv->lock);
spin_lock(&sinfo->lock);
Lockdep complains about this at run time.
Everywhere except in update_global_block_rsv(), the space_info lock is
the outer lock, therefore the locking order in update_global_block_rsv()
is changed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I was seeing root_list corruption on unmount during fs resize in 3.4-rc4; add
correct locking to address this.
Signed-off-by: Daniel J Blueman <daniel@quora.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_map_block sets mirror_num, so that the repair code knows eventually
which device gave us the read error. For RAID10, mirror_num must be 1 or 2.
Before this fix mirror_num was incorrectly related to our stripe index.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_start_delalloc_inodes will just walk the list of delalloc inodes and
start writing them out, but it doesn't splice the list or anything so as
long as somebody is doing work on the box you could end up in this section
_forever_. So just remove it, it's not needed anyway since sync will start
writeback on all inodes anyway, all we need to do is wait for ordered
extents and then we can commit the transaction. In my horrible torture test
sync goes from taking 4 minutes to about 1.5 minutes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The bitfield member mount_opt was too small by one bit to hold the mount
option that enabled to include data extents in the integrity checker.
Since the same issue happened when the BTRFS_MOUNT_PANIC_ON_FATAL_ERROR
option was added (git rebase silently merges so that the increase of the
size of the bitfield member is lost), the bit limit was removed entirely.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
When a filesystem is mounted with the degraded option, it is
possible that some of the devices are not there.
btrfs_ioctl_dev_info() crashs in this case because the device
name is a NULL pointer. This ioctl was only used for scrub.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
It is basically a good thing if we are interruptible when waiting for
free space, but the generality in which it is implemented currently
leads to system calls being interruptible that are not documented this
way. For example git can't handle interrupted unlink(), leading to
corrupt repos under space pressure.
Instead we raise the bar to only be interruptible by SIGKILL.
Thanks to David Sterba for suggesting this.
Signed-off-by: Arne Jansen <sensille@gmx.net>
The caller expects this function to return with the lock held and
releases it immediately on error.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
A user reported a panic where we were trying to fix a bad mirror but the
mirror number we were giving was 0, which is invalid. This is because we
don't do the transid verification until after the read, so as far as the
read code is concerned the read was a success. So instead store the mirror
we read from so that if there is some failure post read we know which mirror
to try next and which mirror needs to be fixed if we find a good copy of the
block. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Josef Bacik <josef@redhat.com>
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Fix a bug, where in case we need to adjust stripe_size so that the
length of the resulting chunk is less than or equal to max_chunk_size,
DUP chunks turn out to be only half as big as they could be.
Cc: Arne Jansen <sensille@gmx.net>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
iref_to_path and iterate_irefs both increment the eb's refcount to use it
after releasing the path. Both depend on consistent data remaining in the
extent buffer and need a read lock to protect it.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Avoid calling free_extent_buffer more than once when the iterator function
returns non-zero. The only code that uses this is scrub repair for corrupted
nodatasum blocks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Make free_ipath() behave like most other freeing functions in the
kernel and gracefully do nothing when passed a NULL pointer.
Besides this making the bahaviour consistent with functions such as
kfree(), vfree(), btrfs_free_path() etc etc, it also fixes a real NULL
deref issue in fs/btrfs/ioctl.c::btrfs_ioctl_ino_to_path(). In that
function we have this code:
...
ipath = init_ipath(size, root, path);
if (IS_ERR(ipath)) {
ret = PTR_ERR(ipath);
ipath = NULL;
goto out;
}
...
out:
btrfs_free_path(path);
free_ipath(ipath);
...
If we ever take the true branch of that 'if' statement we'll end up
passing a NULL pointer to free_ipath() which will subsequently
dereference it and we'll go "Boom" :-(
This patch will avoid that.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
clear_extent_bit()
{
next_node = rb_next(&state->rb_node);
...
clear_state_bit(state); <-- this may free next_node
if (next_node) {
state = rb_entry(next_node);
...
}
}
clear_state_bit() calls merge_state() which may free the next node
of the passing extent_state, so clear_extent_bit() may end up
referencing freed memory.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Currently it returns a set of bits that were cleared, but this return
value is not used at all.
Moreover it doesn't seem to be useful, because we may clear the bits
of a few extent_states, but only the cleared bits of last one is
returned.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Our code is not ready to cope with a sectorsize that's not equal to PAGE_SIZE.
It will lead to hanging-on while writing something.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Normally when there are 2 copies of a block, we add both to the
reada extent tree and prefetch only the one that is easier to reach.
This way we can better utilize multiple devices.
In case of DUP this makes no sense as both copies reside on the
same device.
Signed-off-by: Arne Jansen <sensille@gmx.net>
When inserting into the radix tree returns EEXIST, get the existing
entry without giving up the spinlock in between.
There was a race for both the zones trees and the extent tree.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Follow those instructions, and you'll trigger a warning in the
beginning of d_set_d_op():
# mkfs.btrfs /dev/loop3
# mount /dev/loop3 /mnt
# btrfs sub create /mnt/sub
# btrfs sub snap /mnt /mnt/snap
# touch /mnt/snap/sub
touch: cannot touch `tmp': Permission denied
__d_alloc() set d_op to sb->s_d_op (btrfs_dentry_operations), and
then simple_lookup() reset it to simple_dentry_operations, which
triggered the warning.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Pull vfs fixes from Al Viro:
"A bunch of endianness fixes and a couple of nfsd error value fixes.
Speaking of endianness stuff, I'm rather tempted to slap
ccflags-y += -D__CHECK_ENDIAN__
in fs/Makefile, if not making it default for the entire tree; nfsd
regressions I've caught make one hell of a pile and we'd obviously
benefit from having that kind of stuff caught earlier..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
lockd: fix the endianness bug
ocfs2: ->e_leaf_clusters endianness breakage
ocfs2: ->rl_count endianness breakage
ocfs: ->rl_used breakage on big-endian
ocfs2: ->l_next_free_req breakage on big-endian
btrfs: btrfs_root_readonly() broken on big-endian
ext4: fix endianness breakage in ext4_split_extent_at()
nfsd: fix compose_entry_fh() failure exits
nfsd: fix error value on allocation failure in nfsd4_decode_test_stateid()
nfsd: fix endianness breakage in TEST_STATEID handling
nfsd: fix error values returned by nfsd4_lockt() when nfsd_open() fails
nfsd: fix b0rken error value for setattr on read-only mount
Pull the minimal btrfs branch from Chris Mason:
"We have a use-after-free in there, along with errors when mount -o
discard is enabled, and a BUG_ON(we should compile with UP more
often)."
* 'for-linus-min' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: use commit root when loading free space cache
Btrfs: fix use-after-free in __btrfs_end_transaction
Btrfs: check return value of bio_alloc() properly
Btrfs: remove lock assert from get_restripe_target()
Btrfs: fix eof while discarding extents
Btrfs: fix uninit variable in repair_eb_io_failure
Revert "Btrfs: increase the global block reserve estimates"
->root_flags is __le64 and all accesses to it go through the helpers
that do proper conversions. Except for btrfs_root_readonly(), which
checks bit 0 as in host-endian...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A user reported that booting his box up with btrfs root on 3.4 was way
slower than on 3.3 because I removed the ideal caching code. It turns out
that we don't load the free space cache if we're in a commit for deadlock
reasons, but since we're reading the cache and it hasn't changed yet we are
safe reading the inode and free space item from the commit root, so do that
and remove all of the deadlock checks so we don't unnecessarily skip loading
the free space cache. The user reported this fixed the slowness. Thanks,
Tested-by: Calvin Walton <calvin.walton@kepstin.ca>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
49b25e0540 introduced a use-after-free bug
that caused spurious -EIO's to be returned.
Do the check before we free the transaction.
Cc: David Sterba <dsterba@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
bio_alloc() has the possibility of returning NULL.
So, it is necessary to check the return value.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a regression introduced by fc67c450. spin_is_locked() always
returns 0 on UP kernels, which caused assert in get_restripe_target() to
be fired on every call from btrfs_reduce_alloc_profile() on UP systems.
Remove it completely for now, it's not clear if it's going to be needed
in future.
Reported-by: Bobby Powers <bobbypowers@gmail.com>
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Tested-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We miscalculate the length of extents we're discarding, and it leads to
an eof of device.
Reported-by: Daniel Blueman <daniel@quora.org>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We'd have to be passing bogus extent buffers for this uninit variable to
actually be used, but set it to zero just in case.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reverts commit 5500cdbe14.
We've had a number of complaints of early enospc that bisect down
to this patch. We'll hae to fix the reservations differently.
CC: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Merge with latest Linus' tree, as I have incoming patches
that fix code that is newer than current HEAD of for-next.
Conflicts:
drivers/net/ethernet/realtek/r8169.c
Pull btrfs fixes and features from Chris Mason:
"We've merged in the error handling patches from SuSE. These are
already shipping in the sles kernel, and they give btrfs the ability
to abort transactions and go readonly on errors. It involves a lot of
churn as they clarify BUG_ONs, and remove the ones we now properly
deal with.
Josef reworked the way our metadata interacts with the page cache.
page->private now points to the btrfs extent_buffer object, which
makes everything faster. He changed it so we write an whole extent
buffer at a time instead of allowing individual pages to go down,,
which will be important for the raid5/6 code (for the 3.5 merge
window ;)
Josef also made us more aggressive about dropping pages for metadata
blocks that were freed due to COW. Overall, our metadata caching is
much faster now.
We've integrated my patch for metadata bigger than the page size.
This allows metadata blocks up to 64KB in size. In practice 16K and
32K seem to work best. For workloads with lots of metadata, this cuts
down the size of the extent allocation tree dramatically and fragments
much less.
Scrub was updated to support the larger block sizes, which ended up
being a fairly large change (thanks Stefan Behrens).
We also have an assortment of fixes and updates, especially to the
balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and
the defragging code (Liu Bo)."
Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to
removal of the second argument to k[un]map_atomic() in commit
7ac687d9e0.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits)
Btrfs: update the checks for mixed block groups with big metadata blocks
Btrfs: update to the right index of defragment
Btrfs: do not bother to defrag an extent if it is a big real extent
Btrfs: add a check to decide if we should defrag the range
Btrfs: fix recursive defragment with autodefrag option
Btrfs: fix the mismatch of page->mapping
Btrfs: fix race between direct io and autodefrag
Btrfs: fix deadlock during allocating chunks
Btrfs: show useful info in space reservation tracepoint
Btrfs: don't use crc items bigger than 4KB
Btrfs: flush out and clean up any block device pages during mount
btrfs: disallow unequal data/metadata blocksize for mixed block groups
Btrfs: enhance superblock sanity checks
Btrfs: change scrub to support big blocks
Btrfs: minor cleanup in scrub
Btrfs: introduce common define for max number of mirrors
Btrfs: fix infinite loop in btrfs_shrink_device()
Btrfs: fix memory leak in resolver code
Btrfs: allow dup for data chunks in mixed mode
Btrfs: validate target profiles only if we are going to use them
...
Dave Sterba had put in patches to look for mixed data/metadata groups
with metadata bigger than 4KB. But these ended up in the wrong place
and it wasn't testing the feature flag correctly.
This updates the tests to make sure our sizes are matching
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we use autodefrag, we forget to update the index which indicates
the last page we've dirty. And we'll set dirty flags on a same set of
pages again and again.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs/ -oautodefrag
$ dd if=/dev/zero of=/mnt/btrfs/foobar bs=4k count=10 oflag=direct 2>/dev/null
$ filefrag -v /mnt/btrfs/foobar
Filesystem type is: 9123683e
File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 3072 10 eof
/mnt/btrfs/foobar: 1 extent found
Now we have a big real extent [0, 40960), but autodefrag will still defrag it.
$ sync
$ filefrag -v /mnt/btrfs/foobar
Filesystem type is: 9123683e
File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 3082 10 eof
/mnt/btrfs/foobar: 1 extent found
So if we already find a big real extent, we're ok about that, just skip it.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If our file's layout is as follows:
| hole | data1 | hole | data2 |
we do not need to defrag this file, because this file has holes and
cannot be merged into one extent.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
$ mkfs.btrfs disk
$ mount disk /mnt -o autodefrag
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync
$ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \
seek=$i conv=notrunc 2> /dev/null; done && sync
then we'll get to defrag "foobar" again and again.
So does option "-o autodefrag,compress".
Reasons:
When the cleaner kthread gets to fetch inodes from the defrag tree and defrag
them, it will dirty pages and submit them, this will comes to another DATA COW
where the processing inode will be inserted to the defrag tree again.
This patch sets a rule for COW code, i.e. insert an inode when we're really
going to make some defragments.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit 600a45e1d5
(Btrfs: fix deadlock on page lock when doing auto-defragment)
fixes the deadlock on page, but it also introduces another bug.
A page may have been truncated after unlock & lock.
So we need to find it again to get the right one.
And since we've held i_mutex lock, inode size remains unchanged and
we can drop isize overflow checks.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The bug is from running xfstests 209 with autodefrag.
The race is as follows:
t1 t2(autodefrag)
direct IO
invalidate pagecache
dio(old data) add_inode_defrag
invalidate pagecache
endio
direct IO
invalidate pagecache
run_defrag
readpage(old data)
set page dirty (old data)
dio(new data, rewrite)
invalidate pagecache (*)
endio
t2(autodefrag) will get old data into pagecache via readpage and set
pagecache dirty. Meanwhile, invalidate pagecache(*) will fail due to
dirty flags in pages. So the old data may be flushed into disk by
flush thread, which will lead to data loss.
And so does the case of user defragment progs.
The patch fixes this race by holding i_mutex when we readpage and set page dirty.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This deadlock comes from xfstests 251.
We'll hold the chunk_mutex throughout the whole of a chunk allocation.
But if we find that we've used up system chunk space, we need to allocate a
new system chunk, but this will lead to a recursion of chunk allocation and end
up with a deadlock on chunk_mutex.
So instead we need to allocate the system chunk first if we find we're in ENOSPC.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
o For space info, the type of space info is useful for debug.
o For transaction handle, its transid is useful.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the big metadata blocks, we can have crc items
that are much bigger than a page. There are a few
places that we try to kmalloc memory to hold the
items during a split.
Items bigger than 4KB don't really have a huge benefit
in efficiency, but they do trigger larger order allocations.
This commits changes the csums to make sure they stay under
4KB. This is not a format change, just a #define to limit
huge items.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs puts the filesystem metadata into its own address space, and
somehow the block device address space isn't getting onto disk properly
before a mount. The end result is that a loop of mkfs and mounting the
filesystem will sometimes find stale or incorrect data.
This commit should fix it by sprinkling fdatawrites and invalidate_bdev
calls around. This is a short term measure to make sure it is fixed.
The block devices really should be flushed and cleaned up higher in the
stack.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With support for bigger metadata blocks, we must avoid mounting a
filesystem with different block size for mixed block groups, this causes
corruption (found by xfstests/083).
Signed-off-by: David Sterba <dsterba@suse.cz>
Scrub used to be coded for nodesize == leafsize == sectorsize == PAGE_SIZE.
This is now changed to support sizes for nodesize and leafsize which are
N * PAGE_SIZE.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Just a minor cleanup commit in preparation for the big block changes.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Readahead already has a define for the max number of mirrors. Scrub
needs such a define now, the rest of the code will need something
like this soon. Therefore the define was added to ctree.h and removed
from the readahead code.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If relocate of block group 0 fails with ENOSPC we end up infinitely
looping because key.offset -= 1 statement in that case brings us back to
where we started.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
init_ipath() allocates btrfs_data_container which is never freed. Free
it in free_ipath() and nuke the comment for init_data_container() - we
can safely free it with kfree().
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Generally we don't allow dup for data, but mixed chunks are special and
people seem to think this has its use cases.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Do not run sanity checks on all target profiles unless they all will be
used. This came up because alloc_profile_is_valid() is now more strict
than it used to be.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently if we don't have enough space allocated we go ahead and loop
though devices in the hopes of finding enough space for a chunk of the
*same* type as the one we are trying to relocate. The problem with that
is that if we are trying to restripe the chunk its target type can be
more relaxed than the current one (eg require less devices or less
space). So, when restriping, run checks against the target profile
instead of the current one.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add __get_block_group_index() helper to be able to derive block group
index from an arbitary set of flags. Implement get_block_group_index()
in terms of it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Header file is not a good place to define functions. This also moves a
call to alloc_profile_is_valid() down the stack and removes a redundant
check from __btrfs_alloc_chunk() - alloc_profile_is_valid() takes it
into account.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
"0" is a valid value for an on-disk chunk profile, but it is not a valid
extended profile. (We have a separate bit for single chunks in extended
case)
Also rename it to alloc_profile_is_valid() for clarity.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add functions to abstract the conversion between chunk and extended
allocation profile formats and switch everybody to use them.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This has been causing a lot of confusion for quite a while now and a lot
of users were surprised by this (some of them were even stuck in a
ENOSPC situation which they couldn't easily get out of). The addition
of restriper gives users a clear choice between raid0 and drive concat
setup so there's absolutely no excuse for us to keep doing this.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In commit 4692cf58 we introduced new backref walking code for btrfs. This
assumes we're searching live roots, which requires a transaction context.
While scrubbing, however, we must not join a transaction because this could
deadlock with the commit path. Additionally, what scrub really wants to do
is resolving a logical address in the commit root it's currently checking.
This patch adds support for logical to path resolving on commit roots and
makes scrub use that.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The two helper functions commit_cowonly_roots() and
create_pending_snapshot() failed to check the return value from
btrfs_cow_block(), which could at least in theory fail with -ENOSPC from
btrfs_alloc_free_block(). This commit adds the missing checks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
btrfs_init_lockdep only makes our lockdep class names look prettier, thus
it did never hurt we forgot to actually call it. This turns our lockdep
identifier strings from lockdep auto-set #[id] into really pretty
"btrfs-fs-01" or "btrfs-csum-03".
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Since we need to read and write extent buffers in their entirety we can't use
the normal bio_readpage_error stuff since it only works on a per page basis. So
instead make it so that if we see an io error in endio we just mark the eb as
having an IO error and then in btree_read_extent_buffer_pages we will manually
try other mirrors and then overwrite the bad mirror if we find a good copy.
This works with larger than page size blocks. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The metadata write IO completion code is now simple enough that we
don't need the threaded helpers anymore.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_search_slot sometimes needs write locks on high levels of
the tree. It remembers the highest level that needs a write lock
and will use that for all future searches through the tree in a given
call.
But, very often we'll just cow the top level or the level below and we
won't really need write locks on the root again after that. This patch
changes things to adjust the write lock requirement as it unlocks
levels.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch simplifies how we track our extent buffers. Previously we could exit
writepages with only having written half of an extent buffer, which meant we had
to track the state of the pages and the state of the extent buffers differently.
Now we only read in entire extent buffers and write out entire extent buffers,
this allows us to simply set bits in our bflags to indicate the state of the eb
and we no longer have to do things like track uptodate with our iotree. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because an eb can have multiple pages we need to make sure that all pages within
the eb are markes as accessed, since releasepage can be called against any page
in the eb. This will keep us from possibly evicting hot eb's when we're doing
larger than pagesize eb's. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Because btrfs cow's we can end up with extent buffers that are no longer
necessary just sitting around in memory. So instead of evicting these pages, we
could end up evicting things we actually care about. Thus we have
free_extent_buffer_stale for use when we are freeing tree blocks. This will
make it so that the ref for the eb being in the radix tree is dropped as soon as
possible and then is freed when the refcount hits 0 instead of waiting to be
released by releasepage. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We can run into a problem where we find an eb for our existing page already on
the radix tree but it has a ref count of 0. It hasn't yet been removed by RCU
yet so this can cause issues where we will use the EB after free. So do
atomic_inc_not_zero on the exists->refs and if it is zero just do
synchronize_rcu() and try again. We won't have to worry about new allocators
coming in since they will block on the page lock at this point. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We spend a lot of time looking up extent buffers from pages when we could just
store the pointer to the eb the page is associated with in page->private. This
patch does just that, and it makes things a little simpler and reduces a bit of
CPU overhead involved with doing metadata IO. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A few years ago the btrfs code to support blocks lager than
the page size was disabled to fix a few corner cases in the
page cache handling. This fixes the code to properly support
large metadata blocks again.
Since current kernels will crash early and often with larger
metadata blocks, this adds an incompat bit so that older kernels
can't mount it.
This also does away with different blocksizes for nodes and leaves.
You get a single block size for all tree blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have been passing nothing but (u64)-1 to find_free_extent for search_end in
all of the callers, so it's completely useless, and we've always been passing 0
in as search_start, so just remove them as function arguments and move
search_start into find_free_extent. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This is a relic from before we had the disk space cache and it was to make
bootup times when you had btrfs as root not be so damned slow. Now that we have
the disk space cache this isn't a problem anymore and really having this code
casues uneeded fragmentation and complexity, so just remove it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When a filesystem got aborted due do error, transaction_kthread() will
busyloop. Fix it by going to sleep in that case as well. Maybe we should
just stop transaction_kthread() when filesystem is aborted but that would be
more complex.
Signed-off-by: Jan Kara <jack@suse.cz>
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.
This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
btrfs currently handles most errors with BUG_ON. This patch is a work-in-
progress but aims to handle most errors other than internal logic
errors and ENOMEM more gracefully.
This iteration prevents most crashes but can run into lockups with
the page lock on occasion when the timing "works out."
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
btrfs_alloc_chunk() unconditionally BUGs on any error returned from
__finish_chunk_alloc() so there's no need for two BUG_ON lines. Remove the
one from __finish_chunk_alloc().
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
We BUG_ON() error from add_extent_mapping(), but that error looks pretty
easy to bubble back up - as far as I can tell there have not been any
permanent modifications to fs state at that point.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
The only caller of btrfs_alloc_dev_extent() is __btrfs_alloc_chunk() which
already bugs on any error returned. We can remove the BUG_ON's in
btrfs_alloc_dev_extent() then since __btrfs_alloc_chunk() will "catch" them
anyway.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
balace_level() seems to deal with missing tree nodes by BUG_ON(). Instead,
we can easily just set the file system readonly and bubble -EROFS back up
the stack.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
__btrfs_cow_block(), the only caller of update_ref_for_cow() will BUG_ON()
any error return. Instead, we can go read-only fs as update_ref_for_cow()
manipulates disk data in a way which doesn't look like it's easily rolled
back.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
update_ref_for_cow() will BUG_ON() after it's call to
btrfs_lookup_extent_info() if no existing references are found. Since refs
are computed directly from disk, this should be treated as a corruption
instead of a logic error.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
All callers of __finish_chunk_alloc() BUG_ON() return value, so it's trivial
for us to always bubble up any errors caught in __finish_chunk_alloc() to be
caught there.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Unfortunately it isn't enough to just exit here - the kzalloc() happens in a
loop and the allocated items are added to a linked list whose head is passed
in from the caller.
To fix the BUG_ON() and also provide the semantic that the list passed in is
only modified on success, I create function-local temporary list that we add
items too. If no error is met, that list is spliced to the callers at the
end of the function. Otherwise the list will be walked and all items freed
before the error value is returned.
I did a simple test on this patch by forcing an error at the kzalloc() point
and verifying that when this hits (git clone seemed to exercise this), the
function throws the proper error. Unfortunately but predictably, we later
hit a BUG_ON(ret) type line that still hasn't been fixed up ;)
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The only caller of update_ref_for_cow() is __btrfs_cow_block() which was
originally ignoring any return values. update_ref_for_cow() however doesn't
look like a candidate to become a void function - there are a few places
where errors can occur.
So instead I changed update_ref_for_cow() to bubble all errors up (instead
of BUG_ON). __btrfs_cow_block() was then updated to catch and BUG_ON() any
errors from update_ref_for_cow(). The end effect is that we have no change
in behavior, but about 8 different places where a BUG_ON(ret) was removed.
Obviously a future patch will have to address the BUG_ON() in
__btrfs_cow_block().
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
This is called from only one place - create_subvol() which passes errors
safely back out to it's caller, btrfs_mksubvol where they are handled.
Additionally, btrfs_create_subvol_root() itself bug's needlessly from error
return of btrfs_update_inode(). Since create_subvol() was fixed to catch
errors we can bubble this one up too.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Commit cb1b69f4 (Btrfs: forced readonly when btrfs_drop_snapshot() fails)
made btrfs_drop_snapshot return void because there were no callers checking
the return value. That is the wrong order to handle error propogation since
the caller will have no idea that an error has occured and continue on
as if nothing went wrong.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
set_extent_bit can do exclusive locking but only when called by lock_extent*,
Drop the exclusive bits argument except when called by lock_extent.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
lock_extent and unlock_extent are always called with GFP_NOFS, drop the
argument and use GFP_NOFS consistently.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
This patch pushes kmalloc errors up to the caller and BUGs in the caller.
The BUG_ON for duplicate reloc tree root insertion is replaced with a
panic explaining the issue.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
This pushes failures from the submit_bio_hook callbacks,
btrfs_submit_bio_hook and btree_submit_bio_hook into the callers, including
callers of submit_one_bio where it catches the failures with BUG_ON.
It also pushes up through the ->readpage_io_failed_hook to
end_bio_extent_writepage where the error is already caught with BUG_ON.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
In submit_extent_page, there's a visually noisy if statement that, in
the midst of other conditions, does the tree dependency for tree->ops
and tree->ops->merge_bio_hook before calling it, and then another
condition afterwards. If an error is returned from merge_bio_hook,
there's no way to catch it. It's considered a routine "1" return
value instead of a failure.
This patch factors out the dependency check into a new local merge_bio
routine and BUG's on an error. The if statement is less noisy as a side-
effect.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
btrfs_submit_bio_hook currently calls btrfs_bio_wq_end_io in either case
of an if statement that determines one of the arguments.
This patch moves the function call outside of the if statement and uses it
to only determine the different argument. This allows us to catch an
error in one place in a more visually obvious way.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
btrfs_update_root BUG's when it can't alloc a path, yet it can recover
from a search error. This patch returns -ENOMEM instead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
find_and_setup_root BUGs when it encounters an error from
btrfs_find_last_root, which can occur if a path can't be allocated.
This patch pushes it up to its callers where it is already handled.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
There is only one caller of clear_extent_bit that checks the return value
and it only checks if it's negative. Since there are no users of the
returned bits functionality of clear_extent_bit, stop returning it
and avoid complicating error handling.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The only error condition in clean_tree_block is an accounting bug.
Returning without modifying dirty_metadata_bytes and as if the cleaning
as been performed may cause problems later so it should panic instead.
It should probably be a BUG_ON but we have btrfs_panic now.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Correctness fix: The kfree calls in the add_delayed_* functions free
the node that's passed into it, but the node is a member of another
structure. It works because it's always the first member of the
containing structure, but it should really be using the containing
structure itself.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The ordered data and relocation trees have BUG_ONs to protect against
bad tree operations.
This patch replaces them with a panic that will report the problem.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The *_state functions can only return 0 or -EEXIST. This patch addresses
the cases where those functions returning -EEXIST represent a locking
failure. It handles them by panicking with an appropriate error message.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
As part of the effort to eliminate BUG_ON as an error handling
technique, we need to determine which errors are actual logic errors,
which are on-disk corruption, and which are normal runtime errors
e.g. -ENOMEM.
Annotating these error cases is helpful to understand and report them.
This patch adds a btrfs_panic() routine that will either panic
or BUG depending on the new -ofatal_errors={panic,bug} mount option.
Since there are still so many BUG_ONs, it defaults to BUG for now but I
expect that to change once the error handling effort has made
significant progress.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Pull vfs pile 1 from Al Viro:
"This is _not_ all; in particular, Miklos' and Jan's stuff is not there
yet."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (64 commits)
ext4: initialization of ext4_li_mtx needs to be done earlier
debugfs-related mode_t whack-a-mole
hfsplus: add an ioctl to bless files
hfsplus: change finder_info to u32
hfsplus: initialise userflags
qnx4: new helper - try_extent()
qnx4: get rid of qnx4_bread/qnx4_getblk
take removal of PF_FORKNOEXEC to flush_old_exec()
trim includes in inode.c
um: uml_dup_mmap() relies on ->mmap_sem being held, but activate_mm() doesn't hold it
um: embed ->stub_pages[] into mmu_context
gadgetfs: list_for_each_safe() misuse
ocfs2: fix leaks on failure exits in module_init
ecryptfs: make register_filesystem() the last potential failure exit
ntfs: forgets to unregister sysctls on register_filesystem() failure
logfs: missing cleanup on register_filesystem() failure
jfs: mising cleanup on register_filesystem() failure
make configfs_pin_fs() return root dentry on success
configfs: configfs_create_dir() has parent dentry in dentry->d_parent
configfs: sanitize configfs_create()
...
Pull kmap_atomic cleanup from Cong Wang.
It's been in -next for a long time, and it gets rid of the (no longer
used) second argument to k[un]map_atomic().
Fix up a few trivial conflicts in various drivers, and do an "evil
merge" to catch some new uses that have come in since Cong's tree.
* 'kmap_atomic' of git://github.com/congwang/linux: (59 commits)
feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal
highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename]
drbd: remove the second argument of k[un]map_atomic()
zcache: remove the second argument of k[un]map_atomic()
gma500: remove the second argument of k[un]map_atomic()
dm: remove the second argument of k[un]map_atomic()
tomoyo: remove the second argument of k[un]map_atomic()
sunrpc: remove the second argument of k[un]map_atomic()
rds: remove the second argument of k[un]map_atomic()
net: remove the second argument of k[un]map_atomic()
mm: remove the second argument of k[un]map_atomic()
lib: remove the second argument of k[un]map_atomic()
power: remove the second argument of k[un]map_atomic()
kdb: remove the second argument of k[un]map_atomic()
udf: remove the second argument of k[un]map_atomic()
ubifs: remove the second argument of k[un]map_atomic()
squashfs: remove the second argument of k[un]map_atomic()
reiserfs: remove the second argument of k[un]map_atomic()
ocfs2: remove the second argument of k[un]map_atomic()
ntfs: remove the second argument of k[un]map_atomic()
...
Pull trivial tree from Jiri Kosina:
"It's indeed trivial -- mostly documentation updates and a bunch of
typo fixes from Masanari.
There are also several linux/version.h include removals from Jesper."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits)
kcore: fix spelling in read_kcore() comment
constify struct pci_dev * in obvious cases
Revert "char: Fix typo in viotape.c"
init: fix wording error in mm_init comment
usb: gadget: Kconfig: fix typo for 'different'
Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c"
writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header
writeback: fix typo in the writeback_control comment
Documentation: Fix multiple typo in Documentation
tpm_tis: fix tis_lock with respect to RCU
Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c"
Doc: Update numastat.txt
qla4xxx: Add missing spaces to error messages
compiler.h: Fix typo
security: struct security_operations kerneldoc fix
Documentation: broken URL in libata.tmpl
Documentation: broken URL in filesystems.tmpl
mtd: simplify return logic in do_map_probe()
mm: fix comment typo of truncate_inode_pages_range
power: bq27x00: Fix typos in comment
...
Pull btrfs updates from Chris Mason:
"I have two additional and btrfs fixes in my for-linus branch. One is
a casting error that leads to memory corruption on i386 during scrub,
and the other fixes a corner case in the backref walking code (also
triggered by scrub)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix casting error in scrub reada code
btrfs: fix locking issues in find_parent_nodes()
The reada code from scrub was casting down a u64 to
an unsigned long so it could insert it into a radix tree.
What it really wanted to do was cast down the result of a shift, instead
of casting down the u64. The bug resulted in trying to insert our
reada struct into the wrong place, which caused soft lockups and other
problems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- We might unlock head->mutex while it was not locked
- We might leave the function without unlocking delayed_refs->lock
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Quoth Chris:
"This is later than I wanted because I got backed up running through
btrfs bugs from the Oracle QA teams. But they are all bug fixes that
we've queued and tested since rc1.
Nothing in particular stands out, this just reflects bug fixing and QA
done in parallel by all the btrfs developers. The most user visible
of these is:
Btrfs: clear the extent uptodate bits during parent transid failures
Because that helps deal with out of date drives (say an iscsi disk
that has gone away and come back). The old code wasn't always
properly retrying the other mirror for this type of failure."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits)
Btrfs: fix compiler warnings on 32 bit systems
Btrfs: increase the global block reserve estimates
Btrfs: clear the extent uptodate bits during parent transid failures
Btrfs: add extra sanity checks on the path names in btrfs_mksubvol
Btrfs: make sure we update latest_bdev
Btrfs: improve error handling for btrfs_insert_dir_item callers
Btrfs: be less strict on finding next node in clear_extent_bit
Btrfs: fix a bug on overcommit stuff
Btrfs: kick out redundant stuff in convert_extent_bit
Btrfs: skip states when they does not contain bits to clear
Btrfs: check return value of lookup_extent_mapping() correctly
Btrfs: fix deadlock on page lock when doing auto-defragment
Btrfs: fix return value check of extent_io_ops
btrfs: honor umask when creating subvol root
btrfs: silence warning in raid array setup
btrfs: fix structs where bitfields and spinlock/atomic share 8B word
btrfs: delalloc for page dirtied out-of-band in fixup worker
Btrfs: fix memory leak in load_free_space_cache()
btrfs: don't check DUP chunks twice
Btrfs: fix trim 0 bytes after a device delete
...
When doing IO with large amounts of data fragmentation, the global block
reserve calulations are too low. This increases them to avoid
ENOSPC crashes.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs reads a block and finds a parent transid mismatch, it clears
the uptodate flags on the extent buffer, and the pages inside it. But
we only clear the uptodate bits in the state tree if the block straddles
more than one page.
This is from an old optimization from to reduce contention on the extent
state tree. But it is buggy because the code that retries a read from
a different copy of the block is going to find the uptodate state bits
set and skip the IO.
The end result of the bug is that we'll never actually read the good
copy (if there is one).
The fix here is to always clear the uptodate state bits, which is safe
because this code is only called when the parent transid fails.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we are setting up the mount, we close all the
devices that were not actually part of the metadata we found.
But, we don't make sure that one of those devices wasn't
fs_devices->latest_bdev, which means we can do a use after free
on the one we closed.
This updates latest_bdev as it goes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows us to gracefully continue if we aren't able to insert
directory items, both for normal files/dirs and snapshots.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Clearing a range's bits is different with setting them, since we don't
need to touch them when states do not contain bits we want.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
When I ran xfstests circularly on a auto-defragment btrfs, the deadlock
happened.
Steps to reproduce:
[tty0]
# export MOUNT_OPTIONS="-o autodefrag"
# export TEST_DEV=<partition1>
# export TEST_DIR=<mountpoint1>
# export SCRATCH_DEV=<partition2>
# export SCRATCH_MNT=<mountpoint2>
# while [ 1 ]
> do
> ./check 091 127 263
> sleep 1
> done
[tty1]
# while [ 1 ]
> do
> echo 3 > /proc/sys/vm/drop_caches
> done
Several hours later, the test processes will hang on, and the deadlock will
happen on page lock.
The reason is that:
Auto defrag task Flush thread Test task
btrfs_writepages()
add ordered extent
(including page 1, 2)
set page 1 writeback
set page 2 writeback
endio_fn()
end page 2 writeback
release page 2
lock page 1
alloc and lock page 2
page 2 is not uptodate
btrfs_readpage()
start ordered extent()
btrfs_writepages()
try to lock page 1
so deadlock happens.
Fix this bug by unlocking the page which is in writeback, and re-locking it
after the writeback end.
Signed-off-by: Miao Xie <miax@cn.fujitsu.com>
Raid array setup code creates an extent buffer in an usual way. When the
PAGE_CACHE_SIZE is > super block size, the extent pages are not marked
up-to-date, which triggers a WARN_ON in the following
write_extent_buffer call. Add an explicit up-to-date call to silence the
warning.
Signed-off-by: David Sterba <dsterba@suse.cz>
On ia64, powerpc64 and sparc64 the bitfield is modified through a RMW cycle and current
gcc rewrites the adjacent 4B word, which in case of a spinlock or atomic has
disaterous effect.
https://lkml.org/lkml/2012/2/1/220
Signed-off-by: David Sterba <dsterba@suse.cz>
We encountered an issue that was easily observable on s/390 systems but
could really happen anywhere. The timing just seemed to hit reliably
on s/390 with limited memory.
The gist is that when an unexpected set_page_dirty() happened, we'd
run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
properly set up for delalloc.
This patch does the following:
- Performs the missing delalloc in the fixup worker
- Allow the start hook to return -EBUSY which informs __extent_writepage
that it should mark the page skipped and not to redirty it. This is
required since the fixup worker can fail with -ENOSPC and the page
will have already been redirtied. That causes an Oops in
drop_outstanding_extents later. Retrying the fixup worker could
lead to an infinite loop. Deferring the page redirty also saves us
some cycles since the page would be stuck in a resubmit-redirty loop
until the fixup worker completes. It's not harmful, just wasteful.
- If the fixup worker fails, we mark the page and mapping as errored,
and end the writeback, similar to what we would do had the page
actually been submitted to writeback.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Because scrub enumerates the dev extent tree to find the chunks to scrub,
it currently finds each DUP chunk twice and also scrubs it twice. This
patch makes sure that scrub_chunk only checks that part of the chunk the
dev extent has been found for. This only changes the behaviour for DUP
chunks.
Reported-and-tested-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Arne Jansen <sensille@gmx.net>
A user reported a bug of btrfs's trim, that is we will trim 0 bytes
after a device delete.
The reproducer:
$ mkfs.btrfs disk1
$ mkfs.btrfs disk2
$ mount disk1 /mnt
$ fstrim -v /mnt
$ btrfs device add disk2 /mnt
$ btrfs device del disk1 /mnt
$ fstrim -v /mnt
This is because after we delete the device, the block group may start from
a non-zero place, which will confuse trim to discard nothing.
Reported-by: Lutz Euler <lutz.euler@freenet.de>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Given that ENXIO only means "offset beyond EOF" for either SEEK_DATA or SEEK_HOLE inquiry
in a desired file range, so we should return the internal error unchanged if btrfs_get_extent_fiemap()
call failed, rather than ENXIO.
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
inode_ref_info() returns 1 when the element wasn't found and < 0 on error,
just like btrfs_search_slot(). In iref_to_path() it's an error when the
inode ref can't be found, thus we return ERR_PTR(ret) in that case. In order
to avoid ERR_PTR(1), we now set ret to -ENOENT in that case.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Gracefully fail when trying to mount a BTRFS file system that has a
sectorsize smaller than PAGE_SIZE.
On PPC it is possible to build a FS while using a 4k PAGE_SIZE kernel
then boot into a 64K PAGE_SIZE kernel. Presently open_ctree fails in an
endless loop and hangs the machine in this situation.
My debugging has show this Sector size < Page size to be a non trivial
situation and a graceful exit from the situation would be nice for the
time being.
Signed-off-by: Keith Mannthey <kmannth@us.ibm.com>
btrfs_fallocate tries to allocate space only if ranges in the file don't
already exist. But the enospc checks it does are not allowed with
extents locked.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix reservations in btrfs_page_mkwrite
Btrfs: advance window_start if we're using a bitmap
btrfs: mask out gfp flags in releasepage
Btrfs: fix enospc error caused by wrong checks of the chunk
Btrfs: do not defrag a file partially
Btrfs: fix warning for 32-bit build of fs/btrfs/check-integrity.c
Btrfs: use cluster->window_start when allocating from a cluster bitmap
Btrfs: Check for NULL page in extent_range_uptodate
btrfs: Fix busyloops in transaction waiting code
Btrfs: make sure a bitmap has enough bytes
Btrfs: fix uninit warning in backref.c
Josef fixed btrfs_page_mkwrite to properly release reserved
extents if there was an error. But if we fail to get a reservation
and we fail to dirty the inode (for ENOSPC reasons), we'll end up
trying to release a reservation we never had.
This makes sure we only release if we were able to reserve.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we span a long area in a bitmap we could end up taking a lot of time
searching to the next free area if we're searching from the original
window_start, so advance window_start in order to make sure we don't do any
superficial searching. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btree_releasepage is a callback and can be passed unknown gfp flags and then
they may end up in kmem_cache_alloc called from alloc_extent_state, slab
allocator will BUG_ON when there is HIGHMEM or DMA32 flag set.
This may happen when btrfs is mounted from a loop device, which masks out
__GFP_IO flag. The check in try_release_extent_state
3399 if ((mask & GFP_NOFS) == GFP_NOFS)
3400 mask = GFP_NOFS;
will not work and passes unfiltered flags further resulting in crash at
mm/slab.c:2963
[<000000000024ae4c>] cache_alloc_refill+0x3b4/0x5c8
[<000000000024c810>] kmem_cache_alloc+0x204/0x294
[<00000000001fd3c2>] mempool_alloc+0x52/0x170
[<000003c000ced0b0>] alloc_extent_state+0x40/0xd4 [btrfs]
[<000003c000cee5ae>] __clear_extent_bit+0x38a/0x4cc [btrfs]
[<000003c000cee78c>] try_release_extent_state+0x9c/0xd4 [btrfs]
[<000003c000cc4c66>] btree_releasepage+0x7e/0xd0 [btrfs]
[<0000000000210d84>] shrink_page_list+0x6a0/0x724
[<0000000000211394>] shrink_inactive_list+0x230/0x578
[<0000000000211bb8>] shrink_list+0x6c/0x120
[<0000000000211e4e>] shrink_zone+0x1e2/0x228
[<0000000000211f24>] shrink_zones+0x90/0x254
[<0000000000213410>] do_try_to_free_pages+0xac/0x420
[<0000000000213ae0>] try_to_free_pages+0x13c/0x1b0
[<0000000000204e6c>] __alloc_pages_nodemask+0x5b4/0x9a8
[<00000000001fb04a>] grab_cache_page_write_begin+0x7e/0xe8
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we did sysbench test for inline files, enospc error happened easily though
there was lots of free disk space which could be allocated for new chunks.
Reproduce steps:
# mkfs.btrfs -b $((2 * 1024 * 1024 * 1024)) <test partition>
# mount <test partition> /mnt
# ulimit -n 102400
# cd /mnt
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr prepare
# sysbench --num-threads=1 --test=fileio --file-num=81920 \
> --file-total-size=80M --file-block-size=1K --file-io-mode=sync \
> --file-test-mode=seqwr run
<soon later, BUG_ON() was triggered by enospc error>
The reason of this bug is:
Now, we can reserve space which is larger than the free space in the chunks if
we have enough free disk space which can be used for new chunks. By this way,
the space allocator should allocate a new chunk by force if there is no free
space in the free space cache. But there are two wrong checks which break this
operation.
One is
if (ret == -ENOSPC && num_bytes > min_alloc_size)
in btrfs_reserve_extent(), it is wrong, we should try to allocate a new chunk
even we fail to allocate free space by minimum allocable size.
The other is
if (space_info->force_alloc)
force = space_info->force_alloc;
in do_chunk_alloc(). It makes the allocator ignore CHUNK_ALLOC_FORCE If someone
sets ->force_alloc to CHUNK_ALLOC_LIMITED, and makes the enospc error happen.
Fix these two wrong checks. Especially the second one, we fix it by changing
the value of CHUNK_ALLOC_LIMITED and CHUNK_ALLOC_FORCE, and make
CHUNK_ALLOC_FORCE greater than CHUNK_ALLOC_LIMITED since CHUNK_ALLOC_FORCE has
higher priority. And if the value which is passed in by the caller is greater
than ->force_alloc, use the passed value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
xfstests 218 complains that btrfs defrags a file partially:
After: 1
Write backwards sync, but contiguous - should defrag to 1 extent
Before: 10
-After: 1
+After: 2
To fix this, we need to set max_to_defrag count properly.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There have been 4 warnings on 32-bit build, they are herewith fixed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We specifically set window_start in the cluster struct to indicate where the
cluster starts in a bitmap, but we've been using min_start to indicate where
we're searching from. This is usually the start of the blockgroup, so
essentially means we're constantly searching from the start of any bitmap we
find, which completely negates all the trouble we go to in order to setup a
cluster. So start using window_start to make sure we actually use the area we
found. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user has encountered a NULL pointer kernel oops in btrfs when
encountering media errors. The problem has been identified
as an unhandled NULL pointer returned from find_get_page().
This modification simply checks for a NULL page, and returns
with an error if found (the extent_range_uptodate() function
returns 1 on errors).
After testing this patch, the user reported that the error with
the NULL pointer oops was solved. However, there is still a
remaining problem with a thread becoming stuck in
wait_on_page_locked(page) in the read_extent_buffer_pages(...)
function in extent_io.c
for (i = start_i; i < num_pages; i++) {
page = extent_buffer_page(eb, i);
wait_on_page_locked(page);
if (!PageUptodate(page))
ret = -EIO;
}
This patch leaves the issue with the locked page yet to be resolved.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
wait_log_commit() and wait_for_writer() were using slightly different
conditions for deciding whether they should call schedule() and whether they
should continue in the wait loop. Thus it could happen that we busylooped when
the first condition was not true while the second one was. That is burning CPU
cycles needlessly and is deadly on UP machines...
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have only been checking for min_bytes available in bitmap entries, but we
won't successfully setup a bitmap cluster unless it has at least bytes in the
bitmap, so in the common case min_bytes is 4k and we want something like 2MB, so
if there are a bunch of bitmap entries with less than 2mb's in them, we'll
search all them anyway, which is suboptimal. Fix this check. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Added initialization with the declaration of ret. It isn't set later on the
switch-default branch (which should never be taken).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'btrfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
btrfs: take allocation of ->tree_root into open_ctree()
btrfs: let ->s_fs_info point to fs_info, not root...
btrfs: consolidate failure exits in btrfs_mount() a bit
btrfs: make free_fs_info() call ->kill_sb() unconditional
btrfs: merge free_fs_info() calls on fill_super failures
btrfs: kill pointless reassignment of ->s_fs_info in btrfs_fill_super()
btrfs: make open_ctree() return int
btrfs: sanitizing ->fs_info, part 5
btrfs: sanitizing ->fs_info, part 4
btrfs: sanitizing ->fs_info, part 3
btrfs: sanitizing ->fs_info, part 2
btrfs: sanitizing ->fs_info, part 1
btrfs: fix a deadlock in btrfs_scan_one_device()
btrfs: fix mount/umount race
btrfs: get ->kill_sb() of its own
btrfs: preparation to fixing mount/umount race
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (62 commits)
Btrfs: use larger system chunks
Btrfs: add a delalloc mutex to inodes for delalloc reservations
Btrfs: space leak tracepoints
Btrfs: protect orphan block rsv with spin_lock
Btrfs: add allocator tracepoints
Btrfs: don't call btrfs_throttle in file write
Btrfs: release space on error in page_mkwrite
Btrfs: fix btrfsck error 400 when truncating a compressed
Btrfs: do not use btrfs_end_transaction_throttle everywhere
Btrfs: add balance progress reporting
Btrfs: allow for resuming restriper after it was paused
Btrfs: allow for canceling restriper
Btrfs: allow for pausing restriper
Btrfs: add skip_balance mount option
Btrfs: recover balance on mount
Btrfs: save balance parameters to disk
Btrfs: soft profile changing mode (aka soft convert)
Btrfs: implement online profile changing
Btrfs: do not reduce profile in do_chunk_alloc()
Btrfs: virtual address space subset filter
...
Fix up trivial conflict in fs/btrfs/ioctl.c due to the use of the new
mnt_drop_write_file() helper.
system chunks by default are very small. This makes them slightly
larger and also fixes the conditional checks to make sure we don't
allocate a billion of them at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
that and theres no real way to get rid of those, so just stop using i_mutex to
protect delalloc metadata reservations and use a delalloc mutex instead. This
shouldn't be contended often at all, only if you are writing and mmap writing to
the file at the same time. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This in addition to a script in my btrfs-tracing tree will help track down space
leaks when we're getting space left over in block groups on umount. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We've been seeing warnings coming out of the orphan commit stuff forever from
ceph. Turns out it's because we're racing with checking if the orphan block
reserve is set, because we clear it outside of the spin_lock. So leave the
normal fastpath checks where they are, but take the spin_lock and _recheck_ to
make sure we haven't had an orphan block rsv added in the meantime. Then clear
the root's orphan block rsv and release the lock. With this patch a user said
the warnings went away and they usually showed up pretty soon after he started
ceph. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I used these tracepoints when figuring out what the cluster stuff was doing, so
add them to mainline in case we need to profile this stuff again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Btrfs_throttle will make us wait if there is a currently committing transaction
until we can open new transactions, which is ridiculous since we don't actually
start any transactions within the file write path anyway, so all this does is
introduce big latencies if we have a sync/fsync heavy workload going on while
somebody else is trying to do work. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
which is a problem since we make our reservation right before trying to update
the inode, so fix the out label so that we actually free our reservation.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reproduce steps:
# mkfs.btrfs /dev/sdb5
# mount /dev/sdb5 -o compress=lzo /mnt
# dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
# sync
# truncate -s 64K /mnt/tmpfile
root 5 inode 257 errors 400
This is because of the wrong if condition, which is used to check if we should
subtract the bytes of the dropped range from i_blocks/i_bytes of i-node or not.
When we truncate a compressed extent, btrfs substracts the bytes of the whole
extent, it's wrong. We should substract the real size that we truncate, no
matter it is a compressed extent or not. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported a problem where things like open with O_CREAT would take up to
30 seconds when he had nfs activity on the same mount. This is because all of
our quick metadata operations, like create, symlink etc all do
btrfs_end_transaction_throttle, which if the transaction is blocked will wait
for the commit to complete before it returns. This adds a ridiculous amount of
latency and isn't really needed. The normal btrfs_end_transaction will mark the
transaction as blocked and wake the transaction kthread up if it thinks the
transaction needs to end (this being in the running out of global reserve space
scenario), and this is all that is really needed since we've already done
everything we're going to do, we just need to return. This should help people
with the latency they were seeing when using synchronous heavy workloads.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Recognize BTRFS_BALANCE_RESUME flag passed from userspace. We use the
same heuristics used when recovering balance after a crash to try to
start where we left off last time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement an ioctl for canceling restriper. Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit. Balance item is deleted and no memory
about the interrupted balance is kept.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement an ioctl for pausing restriper. This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc. If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.
Add a hook to close_ctree() to pause restriper and free its data
structures on unmount. (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since restriper kthread starts involuntarily on mount and can suck cpu
and memory bandwidth add a mount option to forcefully skip it. The
restriper in that case hangs around in paused state and can be resumed
from userspace when it's convenient.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
On mount, if balance item is found, resume balance in a separate
kernel thread.
Try to be smart to continue roughly where previous balance (or convert)
was interrupted. For chunk types that were being converted to some
profile we turn on soft convert, in case of a simple balance we turn on
usage filter and relocate only less-than-90%-full chunks of that type.
These are just heuristics but they help quite a bit, and can be improved
in future.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce a new btree objectid for storing balance item. The reason is
to be able to resume restriper after a crash with the same parameters.
Balance item has a very high objectid and goes into tree of tree roots.
The key for the new item is as follows:
[ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]
Older kernels simply ignore it so it's safe to mount with an older
kernel and then go back to the newer one.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When doing convert from one profile to another if soft mode is on
restriper won't touch chunks that already have the profile we are
converting to. This is useful if e.g. half of the FS was converted
earlier.
The soft mode switch is (like every other filter) per-type. This means
that we can convert for example meta chunks the "hard" way while
converting data chunks selectively with soft switch.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Profile changing is done by launching a balance with
BTRFS_BALANCE_CONVERT bits set and target fields of respective
btrfs_balance_args structs initialized. Profile reducing code in this
case will pick restriper's target profile if it's available instead of
doing a blind reduce. If target profile is not yet available it goes
back to a plain reduce.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Every caller of do_chunk_alloc() feeds it the reduced allocation
profile, so stop trying to reduce it one more time. Instead check the
validity of the passed profile.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Select chunks which have at least one byte located inside a given
[vstart, vend) virtual address space range.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Select chunks which have at least one byte of at least one stripe
located on a device with devid X in a given [pstart,pend) physical
address range.
This filter only works when devid filter is turned on.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This allows to have a separate set of filters for each chunk type
(data,meta,sys). The code however is generic and switch on chunk type
is only done once.
This commit also adds a type filter: it allows to balance for example
meta and system chunks w/o touching data ones.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc. The semantics of the old balancing
ioctl are fully preserved.
Explicitly disallow any volume operations when balance is in progress.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently when new chunks are created respective avail_alloc_bits field
is updated to reflect profiles of all chunks present in the system.
However when chunks are removed profile bits are never cleared.
This patch clears profile bit of respective avail_alloc_bits field when
the last chunk with that profile is removed. Restriper needs this to
properly operate when "downgrading".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for
avail_{data,metadata,system}_alloc_bits fields, which gather info about
available allocation profiles in the FS. When chunk is created or read
from disk, its profile is OR'ed with the corresponding avail_alloc_bits
field. Since SINGLE is denoted by 0 in the on-disk format, currently
there is no way to tell when such chunks become avaialble. Restriper
needs that information, so add a separate bit for SINGLE profile.
This bit is going to be in-memory only, it should never be written out
to disk, so it's not a disk format change. However to avoid remappings
in future, reserve corresponding on-disk bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Chunk's type and profile are encoded in u64 flags field. Introduce
masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_*
constants, it should be ULL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;
if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex,
but when we mount a filesystem which has backing seed devices, we have
this lock chain:
open_ctree()
lock(chunk_mutex);
read_chunk_tree();
read_one_dev();
open_seed_devices();
lock(uuid_mutex);
and then we hit a lockdep splat.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
A bug was triggered while using seed device:
# mkfs.btrfs /dev/loop1
# btrfstune -S 1 /dev/loop1
# mount -o /dev/loop1 /mnt
# btrfs dev add /dev/loop2 /mnt
btrfs: block rsv returned -28
------------[ cut here ]------------
WARNING: at fs/btrfs/extent-tree.c:5969 btrfs_alloc_free_block+0x166/0x396 [btrfs]()
...
Call Trace:
...
[<f7b7c31c>] btrfs_cow_block+0x101/0x147 [btrfs]
[<f7b7eaa6>] btrfs_search_slot+0x1b8/0x55f [btrfs]
[<f7b7f844>] btrfs_insert_empty_items+0x42/0x7f [btrfs]
[<f7b7f8c1>] btrfs_insert_item+0x40/0x7e [btrfs]
[<f7b8ac02>] btrfs_make_block_group+0x243/0x2aa [btrfs]
[<f7bb3f53>] __btrfs_alloc_chunk+0x672/0x70e [btrfs]
[<f7bb41ff>] init_first_rw_device+0x77/0x13c [btrfs]
[<f7bb5a62>] btrfs_init_new_device+0x664/0x9fd [btrfs]
[<f7bbb65a>] btrfs_ioctl+0x694/0xdbe [btrfs]
[<c04f55f7>] do_vfs_ioctl+0x496/0x4cc
[<c04f5660>] sys_ioctl+0x33/0x4f
[<c07b9edf>] sysenter_do_call+0x12/0x38
---[ end trace 906adac595facc7d ]---
Since seed device is readonly, there's no usable space in the filesystem.
Afterwards we add a sprout device to it, and the kernel creates a METADATA
block group and a SYSTEM block group where comes free space we can reserve,
but we still get revervation failure because the global block_rsv hasn't
been updated accordingly.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
For btrfs raid, while discarding a range of space, we'll need to know
the start offset and length to discard for each device, and it's done
in btrfs_map_block().
However the calculation is a bit complex for raid0 and raid10, so I
reimplement it based on a fact that:
dev1 dev2 dev3 (raid0)
-----------------------------------
s0 s3 s6 s1 s4 s7 s2 s5
Each device has (total_stripes / nr_dev) stripes, or plus one.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We pre-allocate a btrfs bio with fixed size, and then may re-allocate
memory if we find stripes are bigger than the fixed size. But this
pre-allocation is not necessary.
Also we don't have to calcuate the stripe number twice.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
If we run into some failure path in io_ctl_prepare_pages(),
io_ctl->pages[] array may have some NULL pointers.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
I got this while running xfstests:
[24256.836098] block group 317849600 has an wrong amount of free space
[24256.836100] btrfs: failed to load free space cache for block group 317849600
We should clamp the extent returned by find_first_extent_bit(),
so the start of the extent won't smaller than the start of the
block group.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Tell the page allocator that pages allocated for a buffered write are
expected to become dirty soon.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
the latter can be obtained from the former (by looking as ->tree_root)
just as cheaply as we currently are doing the other way round.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and don't bother with it after btrfs_fill_super() failure -
->kill_sb() (unlike ->put_super()) will be called even if we
have not got non-NULL ->s_root.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We do not (fortunately) modify ->s_fs_info of superblock on the fly in
btrfs_fill_super(); apparent assignment is a no-op.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It returns either ERR_PTR(-ve) or sb->s_fs_info. The latter can
be found by caller just as well, TYVM, no need to return it. Just
return -ve or 0...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
close_ctree() uses a weird mix of accesses to root->fs_info and
its value at the beginning of function stored in local variable.
Since ->fs_info *never* changes, let's just use the local variable
to avoid confusion.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A new helper: btrfs_alloc_root(fs_info); allocates btrfs_root
and sets ->fs_info. All places allocating the suckers converted
to it. At that point we *never* reassign ->fs_info of btrfs_root;
it's set before anyone sees the address of newly allocated
struct btrfs_root and never assigned anywhere else.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
move assignments to ->fs_info in open_ctree() up, to the place
just after the original allocations. Assignment for tree_root
becomes a no-op - we'd obtained fs_info from tree_root->fs_info
in the first place.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pathname resolution under a global mutex, taken on some paths in ->mount()
is a Bad Idea(tm) - think what happens if said pathname resolution triggers
automount of some btrfs instance and walks into attempt to grab the same
mutex. Deadlock - we are waiting for daemon to finish walking the path,
daemon is waiting for us to release the mutex...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Do *NOT* skip doomed superblocks in btrfs_test_super(); we want
sget() to wait for their shutdown to complete. Since we don't
mutilate ->s_fs_info in ->put_super() anymore (or free what it
used to point to until the superblock is past being findable
by sget()), we can just DTRT there and report a match.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and move free_fs_info() to that, out of ->put_super(). Do NOT
set ->s_fs_info to NULL in the latter; we need it for sget() to
be able to see and wait for fs in the middle of umount if we get a
mount/umount race.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We need fs_info and root to live until the moment when the victim
superblock leaves the list, so we need to postpone free_fs_info()
until after ->put_super(). The call is buried in close_ctree(),
though, so we need to lift it into the callers (including
btrfs_put_super()) first.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
Kconfig: acpi: Fix typo in comment.
misc latin1 to utf8 conversions
devres: Fix a typo in devm_kfree comment
btrfs: free-space-cache.c: remove extra semicolon.
fat: Spelling s/obsolate/obsolete/g
SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
tools/power turbostat: update fields in manpage
mac80211: drop spelling fix
types.h: fix comment spelling for 'architectures'
typo fixes: aera -> area, exntension -> extension
devices.txt: Fix typo of 'VMware'.
sis900: Fix enum typo 'sis900_rx_bufer_status'
decompress_bunzip2: remove invalid vi modeline
treewide: Fix comment and string typo 'bufer'
hyper-v: Update MAINTAINERS
treewide: Fix typos in various parts of the kernel, and fix some comments.
clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
leds: Kconfig: Fix typo 'D2NET_V2'
sound: Kconfig: drop unknown symbol ARCH_CLPS7500
...
Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
kconfig additions, close to removed commented-out old ones)
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
PM / Hibernate: Implement compat_ioctl for /dev/snapshot
PM / Freezer: fix return value of freezable_schedule_timeout_killable()
PM / shmobile: Allow the A4R domain to be turned off at run time
PM / input / touchscreen: Make st1232 use device PM QoS constraints
PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
PM / shmobile: Remove the stay_on flag from SH7372's PM domains
PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
PM: Drop generic_subsys_pm_ops
PM / Sleep: Remove forward-only callbacks from AMBA bus type
PM / Sleep: Remove forward-only callbacks from platform bus type
PM: Run the driver callback directly if the subsystem one is not there
PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
PM / Sleep: Merge internal functions in generic_ops.c
PM / Sleep: Simplify generic system suspend callbacks
PM / Hibernate: Remove deprecated hibernation snapshot ioctls
PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
ARM: S3C64XX: Implement basic power domain support
PM / shmobile: Use common always on power domain governor
...
Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
XBT_FORCE_SLEEP bit
Parameterize clusters on minimum total size, minimum chunk size and
minimum contiguous size for at least one chunk, without limits on
cluster, window or gap sizes. Don't tolerate any fragmentation for
SSD_SPREAD; accept it for metadata, but try to keep data dense.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We store the allocation start and length twice in ins, once right
after the other, but with intervening calls that may prevent the
duplicate from being optimized out by the compiler. Remove one of the
assignments.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.
I've move the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The chunk allocation code has tried to keep a pretty tight lid on creating new
metadata chunks. This is partially because in the past the reservation
code didn't give us an accurate idea of how much space was being used.
The new code is much more accurate, so we're able to get rid of some of these
checks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs tries to batch extent allocation tree changes to improve performance
and reduce metadata trashing. But it doesn't allocate new metadata chunks
while it is doing allocations for the extent allocation tree.
This commit changes the delayed refence code to do chunk allocations if we're
getting low on room. It prevents crashes and improves performance.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There's code in btrfs_get_extent that should never be used. This patch turns
a WARN_ON(1) into a BUG(), hoping we can remove the transaction code from
btrfs_get_extent soon.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The old backref iteration code could only safely be used on commit roots.
Besides this limitation, it had bugs in finding the roots for these
references. This commit replaces large parts of it by btrfs_find_all_roots()
which a) really finds all roots and the correct roots, b) works correctly
under heavy file system load, c) considers delayed refs.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This function gets a byte number (a data extent), collects all the leafs
pointing to it and walks up the trees to find all fs roots pointing to those
leafs. It also returns the list of all leafs pointing to that extent.
It does proper locking for the involved trees, can be used on busy file
systems and honors delayed refs.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Now that we may be holding back delayed refs for a limited period, we
might end up having no runnable delayed refs. Without this commit, we'd
do busy waiting in that thread until another (runnable) ref arives.
Instead, we're detecting this situation and use a waitqueue, such that
we only try to run more refs after
a) another runnable ref was added or
b) delayed refs are no longer held back
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
When processing a delayed ref, first check if there are still old refs in
the process of being added. If so, put this ref back to the tree. To avoid
looping on this ref, choose a newer one in the next loop.
btrfs_find_ref_cluster has to take care of that.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Sequence numbers are needed to reconstruct the backrefs of a given extent to
a certain point in time. The total set of backrefs consist of the set of
backrefs recorded on disk plus the enqueued delayed refs for it that existed
at that moment.
This patch also adds a list that records all delayed refs which are
currently in the process of being added.
When walking all refs of an extent in btrfs_find_all_roots(), we freeze the
current state of delayed refs, honor anythinh up to this point and prevent
processing newer delayed refs to assert consistency.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This patch adds the possibilty to read-lock an extent even if it is already
write-locked from the same thread. btrfs_find_all_roots() needs this
capability.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's not needed anymore; we used to, back when we had to do
mount_subtree() by hand, complete with put_mnt_ns() in it.
No more... Apparmor didn't need it since the __d_path() fix.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pm-sleep: (51 commits)
PM: Drop generic_subsys_pm_ops
PM / Sleep: Remove forward-only callbacks from AMBA bus type
PM / Sleep: Remove forward-only callbacks from platform bus type
PM: Run the driver callback directly if the subsystem one is not there
PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
PM / Sleep: Merge internal functions in generic_ops.c
PM / Sleep: Simplify generic system suspend callbacks
PM / Hibernate: Remove deprecated hibernation snapshot ioctls
PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly
PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep()
PM / Sleep: Make [un]lock_system_sleep() generic
PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs
PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count()
PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else'
Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer
PM / Sleep: Unify diagnostic messages from device suspend/resume
ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR
PM / Hibernate: Remove deprecated hibernation test modes
PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path
...
Conflicts:
kernel/kmod.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: call d_instantiate after all ops are setup
Btrfs: fix worker lock misuse in find_worker
This closes races where btrfs is calling d_instantiate too soon during
inode creation. All of the callers of btrfs_add_nondir are updated to
instantiate after the inode is fully setup in memory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Dan Carpenter noticed that we were doing a double unlock on the worker
lock, and sometimes picking a worker thread without the lock held.
This fixes both errors.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
For consistent backref walking and (later) qgroup calculation the
information to which root a delayed ref belongs is useful even for shared
refs.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.
Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.
Also pass in the fs_info for later use.
btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
btrfs_next_item() makes the btrfs path point to the next item, crossing leaf
boundaries if needed.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
ulist is a generic data structures to hold a collection of unique u64
values. The only operations it supports is adding to the list and
enumerating it.
It is possible to store an auxiliary value along with the key. The
implementation is preliminary and can probably be sped up significantly.
It is used by btrfs_find_all_roots() quota to translate recursions into
iterative loops.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
* master: (848 commits)
SELinux: Fix RCU deref check warning in sel_netport_insert()
binary_sysctl(): fix memory leak
mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
ipmi_watchdog: restore settings when BMC reset
oom: fix integer overflow of points in oom_badness
memcg: keep root group unchanged if creation fails
nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
nilfs2: unbreak compat ioctl
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
evm: prevent racing during tfm allocation
evm: key must be set once during initialization
mmc: vub300: fix type of firmware_rom_wait_states module parameter
Revert "mmc: enable runtime PM by default"
mmc: sdhci: remove "state" argument from sdhci_suspend_host
x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
IB/qib: Correct sense on freectxts increment and decrement
RDMA/cma: Verify private data length
cgroups: fix a css_set not found bug in cgroup_attach_proc
oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
...
Conflicts:
kernel/cgroup_freezer.c
This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.
Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
If the btrfs integrity check is enabled, the files required to
implement the checks are included in the build.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
The two files added in this patch contain all the code that is
required to implement the integrity checks.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
When doing 1KB sequential writes to the same page,
balance_dirty_pages_ratelimited_nr() should be called once instead of 4
times, the latter makes the dirtier tasks be throttled much too heavy.
Fix it with proper de-accounting on clear_page_dirty_for_io().
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: unplug every once and a while
Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code
Btrfs: only set cache_generation if we setup the block group
Btrfs: don't panic if orphan item already exists
Btrfs: fix leaked space in truncate
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Btrfs: deal with enospc from dirtying inodes properly
Btrfs: fix num_workers_starting bug and other bugs in async thread
BTRFS: Establish i_ops before calling d_instantiate
Btrfs: add a cond_resched() into the worker loop
Btrfs: fix ctime update of on-disk inode
btrfs: keep orphans for subvolume deletion
Btrfs: fix inaccurate available space on raid0 profile
Btrfs: fix wrong disk space information of the files
Btrfs: fix wrong i_size when truncating a file to a larger size
Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror
* 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: lower the dirty balance poll interval
Tests show that the original large intervals can easily make the dirty
limit exceeded on 100 concurrent dd's. So adapt to as large as the
next check point selected by the dirty throttling algorithm.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs io submission threads can build up massive plug lists. This
keeps things more reasonable so we don't hand over huge dumps of IO at
once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported a problem booting into a new kernel with the old format inodes.
He was panicing in cow_file_range while writing out the inode cache. This is
because if the block group is not cached we'll just skip writing out the cache,
however if it gets dirtied again in the same transaction and it finished caching
we'd go ahead and write it out, but since we set cache_generation to the transid
we think we've already truncated it and will just carry on, running into
cow_file_range and blowing up. We need to make sure we only set
cache_generation if we've done the truncate. The user tested this patch and
verified that the panic no longer occured. Thanks,
Reported-and-Tested-by: Klaus Bitto <klaus.bitto@gmail.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
a loop. This is because we will add an orphan item, do the truncate, the
truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
left with an orphan item still in the fs. Then we come back later to do another
truncate and it blows up because we already have an orphan item. This is ok so
just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We were occasionaly leaking space when running xfstest 269. This is because if
we failed to start the transaction in the truncate loop we'd just goto out, but
we need to break so that the inode is removed from the orphan list and the space
is properly freed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Now that we're properly keeping track of delayed inode space we've been getting
a lot of warnings out of btrfs_dirty_inode() when running xfstest 83. This is
because a bunch of people call mark_inode_dirty, which is void so we can't
return ENOSPC. This needs to be fixed in a few areas
1) file_update_time - this updates the mtime and such when writing to a file,
which will call mark_inode_dirty. So copy file_update_time into btrfs so we can
call btrfs_dirty_inode directly and return an error if we get one appropriately.
2) fix symlinks to use btrfs_setattr for ->setattr. For some reason we weren't
setting ->setattr for symlinks, even though we should have been. This catches
one of the cases where we were getting errors in mark_inode_dirty.
3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
instead of mark_inode_dirty. This lets us return errors properly for truncate
and chown/anything related to setattr.
4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
print an error if we have one. The only remaining user we can't control for
this is touch_atime(), but we don't really want to keep people from walking
down the tree if we don't have space to save the atime update, so just complain
but don't worry about it.
With this patch xfstests 83 complains a handful of times instead of hundreds of
times. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Al pointed out we have some random problems with the way we account for
num_workers_starting in the async thread stuff. First of all we need to make
sure to decrement num_workers_starting if we fail to start the worker, so make
__btrfs_start_workers do this. Also fix __btrfs_start_workers so that it
doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we
failed to create a worker. Also check_pending_worker_creates needs to call
__btrfs_start_work in it's work function since it already increments
num_workers_starting.
People only start one worker at a time, so get rid of the num_workers argument
everywhere, and make btrfs_queue_worker a void since it will always succeed.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The Smack LSM hook for security_d_instantiate checks
the inode's i_op->getxattr value to determine if the
containing filesystem supports extended attributes.
The BTRFS filesystem sets the inode's i_op value only
after it has instantiated the inode. This results in
Smack incorrectly giving new BTRFS inodes attributes
from the filesystem defaults on the assumption that
values can't be stored on the filesystem. This patch
moves the assignment of inode operation vectors ahead
of the calls to d_instantiate, letting Smack know that
the filesystem supports extended attributes. There
should be no impact on the performance or behavior of
BTRFS.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we have a constant stream of end_io completions or crc work,
we can hit softlockup messages from the async helper threads. This
adds a cond_resched() into the loop to avoid them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since we have the free space caches, btrfs_orphan_cleanup also runs for
the tree_root. Unfortunately this also cleans up the orphans used to mark
subvol deletions in progress.
Currently if a subvol deletion gets interrupted twice by umount/mount, the
deletion will not be continued and the space permanently lost, though it
would be possible to write a tool to recover those lost subvol deletions.
This patch checks if the orphan belongs to a subvol (dead root) and skips
the deletion.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we use raid0 as the data profile, df command may show us a very
inaccurate value of the available space, which may be much less than the
real one. It may make the users puzzled. Fix it by changing the calculation
of the available space, and making it be more similar to a fake chunk
allocation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfsck report errors after the 83th case of xfstests was run, The error
number is 400, it means the used disk space of the file is wrong.
The reason of this bug is that:
The file truncation may fail when the space of the file system is not enough,
and leave some file extents, whose offset are beyond the end of the files.
When we want to expand those files, we will drop those file extents, and
put in dummy file extents, and then we should update the i-node. But btrfs
forgets to do it.
This patch adds the forgotten i-node update.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfsck report error 100 after the 83th case of xfstests was run, it means
the i_size of the file is wrong.
The reason of this bug is that:
Btrfs increased i_size of the file at the beginning, but it failed to expand
the file, and failed to update the i_size to the old size because there is no
enough space in the file system, so we found a wrong i_size.
This patch fixes this bug by updating the i_size just when we pass the file
expanding and get enough space to update i-node.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The patch below removes an extra semicolon.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
btrfs_end_bio checks the number of errors on a bio against the max
number of errors allowed before sending any EIOs up to the higher
levels.
If we got enough copies of the bio done for a given raid level, it is
supposed to clear the bio error flag and return success.
We have pointers to the original bio sent down by the higher layers and
pointers to any cloned bios we made for raid purposes. If the original
bio happens to be the one that got an io error, but not the last one to
finish, it might not have the BIO_UPTODATE bit set.
Then, when the last bio does finish, we'll call bio_end_io on the
original bio. It won't have the uptodate bit set and we'll end up
sending EIO to the higher layers.
We already had a check for this, it just was conditional on getting the
IO error on the very last bio. Make the check unconditional so we eat
the EIOs properly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: drop spin lock when memory alloc fails
Btrfs: check if the to-be-added device is writable
Btrfs: try cluster but don't advance in search list
Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE
Drop spin lock in convert_extent_bit() when memory alloc fails,
otherwise, it will be a deadlock.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we call ioctl(BTRFS_IOC_ADD_DEV) directly, we'll succeed in adding
a readonly device to a btrfs filesystem, and btrfs will write to
that device, emitting kernel errors:
[ 3109.833692] lost page write due to I/O error on loop2
[ 3109.833720] lost page write due to I/O error on loop2
...
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we find an existing cluster, we switch to its block group as the
current block group, possibly skipping multiple blocks in the process.
Furthermore, under heavy contention, multiple threads may fail to
allocate from a cluster and then release just-created clusters just to
proceed to create new ones in a different block group.
This patch tries to allocate from an existing cluster regardless of its
block group, and doesn't switch to that group, instead proceeding to
try to allocate a cluster from the group it was iterating before the
attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we reach LOOP_NO_EMPTY_SIZE, we won't even try to use a cluster that
others might have set up. Odds are that there won't be one, but if
someone else succeeded in setting it up, we might as well use it, even
if we don't try to set up a cluster again.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The below patch fixes some typos in various parts of the kernel, as well as fixes some comments.
Please let me know if I missed anything, and I will try to get it changed and resent.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix meta data raid-repair merge problem
Btrfs: skip allocation attempt from empty cluster
Btrfs: skip block groups without enough space for a cluster
Btrfs: start search for new cluster at the beginning
Btrfs: reset cluster's max_size when creating bitmap
Btrfs: initialize new bitmaps' list
Btrfs: fix oops when calling statfs on readonly device
Btrfs: Don't error on resizing FS to same size
Btrfs: fix deadlock on metadata reservation when evicting a inode
Fix URL of btrfs-progs git repository in docs
btrfs scrub: handle -ENOMEM from init_ipath()
Commit 4a54c8c16 introduced raid-repair, killing the individual
readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
disk-io.c.
The raid-repair commit had logic to disable raid-repair, if
readpage_io_failed_hook is set. Thus, the readahead commit effectively
disabled raid-repair for meta data.
This commit changes the logic to always attempt raid-repair when needed and
call the readpage_io_failed_hook in case raid-repair fails. This is much
more straight forward and should have been like that from the beginning.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we don't have a cluster, don't bother trying to allocate from it,
jumping right away to the attempt to allocate a new cluster.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We test whether a block group has enough free space to hold the
requested block, but when we're doing clustered allocation, we can
save some cycles by testing whether it has enough room for the cluster
upfront, otherwise we end up attempting to set up a cluster and
failing. Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
allocation, and by then we'll have zeroed the cluster size, so this
patch won't stop us from using the block group as a last resort.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Instead of starting at zero (offset is always zero), request a cluster
starting at search_start, that denotes the beginning of the current
block group.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The field that indicates the size of the largest contiguous chunk of
free space in the cluster is not initialized when setting up bitmaps,
it's only increased when we find a larger contiguous chunk. We end up
retaining a larger value than appropriate for highly-fragmented
clusters, which may cause pointless searches for large contiguous
groups, and even cause clusters that do not meet the density
requirements to be set up.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.
Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation. For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.
To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.
To reproduce this bug:
# dd if=/dev/zero of=img bs=1M count=256
# mkfs.btrfs img
# losetup -r /dev/loop1 img
# mount /dev/loop1 /mnt
OOPS!!
It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().
To fix this, instead of checking write-only devices, we check all open
deivces:
# df -h /dev/loop1
Filesystem Size Used Avail Use% Mounted on
/dev/loop1 250M 28K 238M 1% /mnt
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
It seems overly harsh to fail a resize of a btrfs file system to the
same size when a shrink or grow would succeed. User app GParted trips
over this error. Allow it by bypassing the shrink or grow operation.
Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
When I ran the xfstests, I found the test tasks was blocked on meta-data
reservation.
By debugging, I found the reason of this bug:
start transaction
|
v
reserve meta-data space
|
v
flush delay allocation -> iput inode -> evict inode
^ |
| v
wait for delay allocation flush <- reserve meta-data space
And besides that, the flush on evicting inode will block the thread, which
is reclaiming the memory, and make oom happen easily.
Fix this bug by skipping the flush step when evicting inode.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
* 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits)
freezer: fix wait_event_freezable/__thaw_task races
freezer: kill unused set_freezable_with_signal()
dmatest: don't use set_freezable_with_signal()
usb_storage: don't use set_freezable_with_signal()
freezer: remove unused @sig_only from freeze_task()
freezer: use lock_task_sighand() in fake_signal_wake_up()
freezer: restructure __refrigerator()
freezer: fix set_freezable[_with_signal]() race
freezer: remove should_send_signal() and update frozen()
freezer: remove now unused TIF_FREEZE
freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE
cgroup_freezer: prepare for removal of TIF_FREEZE
freezer: clean up freeze_processes() failure path
freezer: kill PF_FREEZING
freezer: test freezable conditions while holding freezer_lock
freezer: make freezing indicate freeze condition in effect
freezer: use dedicated lock instead of task_lock() + memory barrier
freezer: don't distinguish nosig tasks on thaw
freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks
freezer: rename thaw_process() to __thaw_task() and simplify the implementation
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: remove free-space-cache.c WARN during log replay
Btrfs: sectorsize align offsets in fiemap
Btrfs: clear pages dirty for io and set them extent mapped
Btrfs: wait on caching if we're loading the free space cache
Btrfs: prefix resize related printks with btrfs:
btrfs: fix stat blocks accounting
Btrfs: avoid unnecessary bitmap search for cluster setup
Btrfs: fix to search one more bitmap for cluster setup
btrfs: mirror_num should be int, not u64
btrfs: Fix up 32/64-bit compatibility for new ioctls
Btrfs: fix barrier flushes
Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
There is no reason to export two functions for entering the
refrigerator. Calling refrigerator() instead of try_to_freeze()
doesn't save anything noticeable or removes any race condition.
* Rename refrigerator() to __refrigerator() and make it return bool
indicating whether it scheduled out for freezing.
* Update try_to_freeze() to return bool and relay the return value of
__refrigerator() if freezing().
* Convert all refrigerator() users to try_to_freeze().
* Update documentation accordingly.
* While at it, add might_sleep() to try_to_freeze().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
The log replay code only partially loads block groups, since
the block group caching code is able to detect and deal with
extents the logging code has pinned down.
While the logging code is pinning down block groups, there is
a bogus WARN_ON we're hitting if the code wasn't able to find
an extent in the cache. This commit removes the warning because
it can happen any time there isn't a valid free space cache
for that block group.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
else that calls btrfs_get_extent while running xfstests 13 in a loop. This is
because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
which will end up adding mappings that are not sectorsize aligned, which will
cause problems in some cases for subsequent calls to btrfs_get_extent for
similar areas that are sectorsize aligned. With this patch I ran xfstests 13 in
a loop for a couple of hours and didn't hit the problem that I could previously
hit in at most 20 minutes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When doing the io_ctl helpers to clean up the free space cache stuff I stopped
using our normal prepare_pages stuff, which means I of course forgot to do
things like set the pages extent mapped, which will cause us all sorts of
wonderful propblems. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We've been hitting panics when running xfstest 13 in a loop for long periods of
time. And actually this problem has always existed so we've been hitting these
things randomly for a while. Basically what happens is we get a thread coming
into the allocator and reading the space cache off of disk and adding the
entries to the free space cache as we go. Then we get another thread that comes
in and tries to allocate from that block group. Since block_group->cached !=
BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because
if we're doing the old slow way of caching we don't want to hold people up and
wait for everything to finish. The problem with this is we could end up
discarding the space cache at some arbitrary point in the future, which means we
could very well end up allocating space that is either bad, or when the real
caching happens it could end up thinking the space isn't in use when it really
is and cause all sorts of other problems.
The solution is to add a new flag to indicate we are loading the free space
cache from disk, and always try to cache the block group if cache->cached !=
BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else
who tries to allocate from the block group will have to wait until it's finished
to make sure it completes successfully. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
For the user it is confusing to find something like:
[10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
in kernel log, because it doesn't point directly to btrfs.
This patch prefixes those messages with "btrfs:" like other btrfs
related printks.
Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Round inode bytes and delalloc bytes up to real blocksize before
converting to sector size. Otherwise eg. files smaller than 512
are reported with zero blocks due to incorrect rounding.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
setup_cluster_no_bitmap() searches all the extents and bitmaps starting
from offset. Therefore if it returns -ENOSPC, all the bitmaps starting
from offset are in the bitmaps list, so it's sufficient to search from
this list in setup_cluser_bitmap().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Suppose there are two bitmaps [0, 256], [256, 512] and one extent
[100, 120] in the free space cache, and we want to setup a cluster
with offset=100, bytes=50.
In this case, there will be only one bitmap [256, 512] in the temporary
bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].
The cause is, the list is constructed in setup_cluster_no_bitmap(),
and only bitmaps with bitmap_entry->offset >= offset will be added
into the list, and the very bitmap that convers offset has
bitmap_entry->offset <= offset.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch casts to unsigned long before casting to a pointer and fixes
the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs is writing the super blocks, it send barrier flushes to make
sure writeback caching drives get all the metadata on disk in the
right order.
But, we have two bugs in the way these are sent down. When doing
full commits (not via the tree log), we are sending the barrier down
before the last super when it should be going down before the first.
In multi-device setups, we should be waiting for the barriers to
complete on all devices before writing any of the supers.
Both of these bugs can cause corruptions on power failures. We fix it
with some new code to send down empty barriers to all devices before
writing the first super.
Alexandre Oliva found the multi-device bug. Arne Jansen did the async
barrier loop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
takes vfsmount and relative path, does lookup within that vfsmount
(possibly triggering automounts) and returns the result as root
of subtree suitable for return by ->mount() (i.e. a reference to
dentry and an active reference to its superblock grabbed, superblock
locked exclusive).
btrfs and nfs switched to it instead of open-coding the sucker.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Life is much saner if create_mnt_ns(mnt) drops mnt in case of error...
Switch it to such calling conventions, switch callers, fix double mntput() in
fs/nfs/super.c one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.
But there are two cases to lead to tree corruptions:
1) multi-thread snapshots can commit serveral snapshots in a transaction,
and this may change the src root when processing the following pending
snapshots, which lead to the former snapshots corruptions;
2) the free inode cache was changing the roots when it root the cache,
which lead to corruptions.
This fixes things by making sure we force COW the block after we create a
snapshot during commiting a transaction, then any changes to the roots
will result in COW, and we get all the fs roots and snapshot roots to be
consistent.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: rename the option to nospace_cache
Btrfs: handle bio_add_page failure gracefully in scrub
Btrfs: fix deadlock caused by the race between relocation
Btrfs: only map pages if we know we need them when reading the space cache
Btrfs: fix orphan backref nodes
Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
Btrfs: fix unreleased path in btrfs_orphan_cleanup()
Btrfs: fix no reserved space for writing out inode cache
Btrfs: fix nocow when deleting the item
Btrfs: tweak the delayed inode reservations again
Btrfs: rework error handling in btrfs_mount()
Btrfs: close devices on all error paths in open_ctree()
Btrfs: avoid null dereference and leaks when bailing from open_ctree()
Btrfs: fix subvol_name leak on error in btrfs_mount()
Btrfs: fix memory leak in btrfs_parse_early_options()
Btrfs: fix our reservations for updating an inode when completing io
Btrfs: fix oops on NULL trans handle in btrfs_truncate
btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
Rename no_space_cache option to nospace_cache to be more consistent with
the rest, where the simple prefix 'no' is used to negate an option.
The option has been introduced during the -rc1 cycle and there are has not been
widely used, so it's safe.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately
dm based targets accept only one page per bio, thus making scrub always
fails. This patch just submits the current bio when an error is encountered
and starts a new one.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can not do flushable reservation for the relocation when we create snapshot,
because it may make the transaction commit task and the flush task wait for
each other and the deadlock happens.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
People have been running into a warning when loading space cache because the
page is already mapped when trying to read in a bitmap. The way we read in
entries and pages is kind of convoluted, so fix it so that io_ctl_read_entry
maps the entries if it needs to, and if it hits the end of the page it simply
unmaps the page. That way we can unconditionally unmap the io_ctl before
reading in the bitmap and we should stop hitting these warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If the root node of a fs/file tree is in the block group that is
being relocated, but the others are not in the other block groups.
when we create a snapshot for this tree between the relocation tree
creation ends and ->create_reloc_tree is set to 0, Btrfs will create
some backref nodes that are the lowest nodes of the backrefs cache.
But we forget to add them into ->leaves list of the backref cache
and deal with them, and at last, they will triggered BUG_ON().
kernel BUG at fs/btrfs/relocation.c:239!
This patch fixes it by adding them into ->leaves list of backref cache.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_block_rsv_add{, _noflush}() have similar code, so abstract that code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we did stress test for the space relocation, the deadlock happened.
By debugging, We found it was caused by the carelessness that we forgot
to unlock the read lock of the extent buffers in btrfs_orphan_cleanup()
before we end the transaction handle, so the transaction commit task waited
the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer,
but that task waited the commit task to end the transaction commit, and
the deadlock happened. Fix it.
Signed-ff-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I-node cache forgets to reserve the space when writing out it. And when
we do some stress test, such as synctest, it will trigger WARN_ON() in
use_block_rsv().
WARNING: at fs/btrfs/extent-tree.c:5718 btrfs_alloc_free_block+0xbf/0x281 [btrfs]()
...
Call Trace:
[<ffffffff8104df86>] warn_slowpath_common+0x80/0x98
[<ffffffff8104dfb3>] warn_slowpath_null+0x15/0x17
[<ffffffffa0369c60>] btrfs_alloc_free_block+0xbf/0x281 [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa035c040>] __btrfs_cow_block+0x118/0x3b5 [btrfs]
[<ffffffffa035c7ba>] btrfs_cow_block+0x103/0x14e [btrfs]
[<ffffffffa035e4c4>] btrfs_search_slot+0x249/0x6a4 [btrfs]
[<ffffffffa036d086>] btrfs_lookup_inode+0x2a/0x8a [btrfs]
[<ffffffffa03788b7>] btrfs_update_inode+0xaa/0x141 [btrfs]
[<ffffffffa036d7ec>] btrfs_save_ino_cache+0xea/0x202 [btrfs]
[<ffffffffa03a761e>] ? btrfs_update_reloc_root+0x17e/0x197 [btrfs]
[<ffffffffa0373867>] commit_fs_roots+0xaa/0x158 [btrfs]
[<ffffffffa03746a6>] btrfs_commit_transaction+0x405/0x731 [btrfs]
[<ffffffff810690df>] ? wake_up_bit+0x25/0x25
[<ffffffffa039d652>] ? btrfs_log_dentry_safe+0x43/0x51 [btrfs]
[<ffffffffa0381c5f>] btrfs_sync_file+0x16a/0x198 [btrfs]
[<ffffffff81122806>] ? mntput+0x21/0x23
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
Sometimes it causes BUG_ON() in the reservation code of the delayed inode
is triggered.
So we must reserve enough space for inode cache.
Note: If we can not reserve the enough space for inode cache, we will
give up writing out it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_previous_item() just search the b+ tree, do not COW the nodes or leaves,
if we modify the result of it, the meta-data will be broken. fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef sent along an incremental to the inode reservation
code to make sure we try and fall back to directly updating
the inode item if things go horribly wrong.
This reworks that patch slightly, adding a fallback function
that will always try to update the inode item directly without
going through the delayed_inode code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commits 6c41761f and 45ea6095 introduced the possibility of NULL pointer
dereference on error paths, also we would leave all devices busy and
leak fs_info with all sub-structures on error when trying to mount an
already mounted fs to a different directory.
Fix this by doing all allocations before trying to open any of the
devices, adjust error path for mount-already-mounted-fs case.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix a bug introduced by 7e662854 where we would leave devices busy on
certain error paths in open_ctree(). fs_info is guaranteed to be
non-NULL now so it's safe to dereference it on all error paths.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix bugs introduced by 6c41761f. Firstly, after failing to allocate any
of the tree roots (first 'goto fail' in open_ctree()) we would
dereference a NULL fs_info pointer in free_fs_info(). Secondly, after
failures from init_srcu_struct(), setup_bdi() and new_inode() we would
leak all earlier allocated roots: fs_info fields haven't been
initialized yet so free_fs_info() is rendered useless.
Fix this by initializing fs_info pointer and fs_info fields before any
allocations happen.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
btrfs_parse_early_options() can fail due to error while scanning devices
(-o device= option), but still strdup() subvol_name string:
mount -o subvol=SUBV,device=BAD_DEVICE <dev> <mnt>
So free subvol_name string on error.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Don't leak subvol_name string in case multiple subvol= options are
given. "The lastest option is effective" behavior (consistent with
subvolid= and subvolrootid= options) is preserved.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
People have been reporting ENOSPC crashes in finish_ordered_io. This is because
we try to steal from the delalloc block rsv to satisfy a reservation to update
the inode. The problem with this is we don't explicitly save space for updating
the inode when doing delalloc. This is kind of a problem and we've gotten away
with this because way back when we just stole from the delalloc reserve without
any questions, and this worked out fine because generally speaking the leaf had
been modified either by the mtime update when we did the original write or
because we just updated the leaf when we inserted the file extent item, only on
rare occasions had the leaf not actually been modified, and that was still ok
because we'd just use a block or two out of the over-reservation that is
delalloc.
Then came the delayed inode stuff. This is amazing, except it wants a full
reservation for updating the inode since it may do it at some point down the
road after we've written the blocks and we have to recow everything again. This
worked out because the delayed inode stuff just stole from the global reserve,
that is until recently when I changed that because it caused other problems.
So here we are, we're doing everything right and being screwed for it. So take
an extra reservation for the inode at delalloc reservation time and carry it
through the life of the delalloc reservation. If we need it we can steal it in
the delayed inode stuff. If we have already stolen it try and do a normal
metadata reservation. If that fails try to steal from the delalloc reservation.
If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
solve this and in the meantime we'll steal from the global reserve.
With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
any problems.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we fail to reserve space in the transaction during truncate, we can
error out with a NULL trans handle. The cleanup code needs an extra
check to make sure we aren't trying to use the bad handle.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On error path 'tree_root' is treed in 'free_fs_info()'.
No need to free it explicitely. Noticed by SLUB in debug mode:
Complete reproducer under usermode linux (discovered on real
machine):
bdev=/dev/ubda
btr_root=/btr
/mkfs.btrfs $bdev
mount $bdev $btr_root
mkdir $btr_root/subvols/
cd $btr_root/subvols/
/btrfs su cr foo
/btrfs su cr bar
mount $bdev -osubvol=subvols/foo $btr_root/subvols/bar
umount $btr_root/subvols/bar
which gives
device fsid 4d55aa28-45b1-474b-b4ec-da912322195e devid 1 transid 7 /dev/ubda
=============================================================================
BUG kmalloc-2048: Object already free
-----------------------------------------------------------------------------
INFO: Allocated in btrfs_mount+0x389/0x7f0 age=0 cpu=0 pid=277
INFO: Freed in btrfs_mount+0x51c/0x7f0 age=0 cpu=0 pid=277
INFO: Slab 0x0000000062886200 objects=15 used=9 fp=0x0000000070b4d2d0 flags=0x4081
INFO: Object 0x0000000070b4d2d0 @offset=21200 fp=0x0000000070b4a968
...
Call Trace:
70b31948: [<6008c522>] print_trailer+0xe2/0x130
70b31978: [<6008c5aa>] object_err+0x3a/0x50
70b319a8: [<6008e242>] free_debug_processing+0x142/0x2a0
70b319e0: [<600ebf6f>] btrfs_mount+0x55f/0x7f0
70b319f8: [<6008e5c1>] __slab_free+0x221/0x2d0
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Arne Jansen <sensille@gmx.net>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
Btrfs: check for a null fs root when writing to the backup root log
Btrfs: fix race during transaction joins
Btrfs: fix a potential btrfs_bio leak on scrub fixups
Btrfs: rename btrfs_bio multi -> bbio for consistency
Btrfs: stop leaking btrfs_bios on readahead
Btrfs: stop the readahead threads on failed mount
Btrfs: fix extent_buffer leak in the metadata IO error handling
Btrfs: fix the new inspection ioctls for 32 bit compat
Btrfs: fix delayed insertion reservation
Btrfs: ClearPageError during writepage and clean_tree_block
Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
Btrfs: make a delayed_block_rsv for the delayed item insertion
Btrfs: add a log of past tree roots
btrfs: separate superblock items out of fs_info
Btrfs: use the global reserve when truncating the free space cache inode
Btrfs: release metadata from global reserve if we have to fallback for unlink
Btrfs: make sure to flush queued bios if write_cache_pages waits
Btrfs: fix extent pinning bugs in the tree log
Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
Btrfs: don't wait as long for more batches during SSD log commit
...
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Add a 'reason' to wb_writeback_work
writeback: send work item to queue_io, move_expired_inodes
writeback: trace event balance_dirty_pages
writeback: trace event bdi_dirty_ratelimit
writeback: fix ppc compile warnings on do_div(long long, unsigned long)
writeback: per-bdi background threshold
writeback: dirty position control - bdi reserve area
writeback: control dirty pause time
writeback: limit max dirty pause time
writeback: IO-less balance_dirty_pages()
writeback: per task dirty rate limit
writeback: stabilize bdi->dirty_ratelimit
writeback: dirty rate control
writeback: add bg_threshold parameter to __bdi_update_bandwidth()
writeback: dirty position control
writeback: account per-bdi accumulated dirtied pages
During log replay, can commit the transaction before the fs_root
pointers are setup, so we have to make sure they are not null before
trying to use them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While we're allocating ram for a new transaction, we drop our spinlock.
When we get the lock back, we do check to see if a transaction started
while we slept, but we don't check to make sure it isn't blocked
because a commit has already started.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In case we were able to map less than we wanted (length < PAGE_SIZE
clause is true) btrfs_bio is still allocated and we have to free it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The scrub readahead branch brought in a new error handling hook,
but it was leaking extent_buffer references.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The new ioctls to follow backrefs are not clean for 32/64 bit
compat. This reworks them for u64s everywhere. They are brand new, so
there are no problems with changing the interface now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid. It's
not the delayed insertion stuffs fault, it's all just stupid. When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space. This is stupid because
we're going to have to have space reserve to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter. Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation. So if
trans->bytes_reserved is 0 then try to do a normal reservation. If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.
The other stupid thing we do is not reserve space for the inode when writing to
the thing. Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter. But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway. This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case. So again the delayed insertion stuff bites us in the ass. So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve. Hopefully this won't bite us in the ass, but I've said that before.
With this patch stress.sh no longer spits out those stupid warnings (famous last
words). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Failure testing was tripping up over stale PageError bits in
metadata pages. If we have an io error on a block, and later on
end up reusing it, nobody ever clears PageError on those pages.
During commit, we'll find PageError and think we had trouble writing
the block, which will lead to aborts and other problems.
This changes clean_tree_block and the btrfs writepage code to
clear the PageError bit. In both cases we're either completely
done with the page or the page has good stuff and the error bit
is no longer valid.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because of the overcommit stuff I had to make it so that we committed the
transaction all the time in reserve_metadata_bytes in case we had overcommitted
because of delayed items. This was because previously we had no way of knowing
how much space was reserved for delayed items. Now that we have the
delayed_block_rsv we can check it to see if committing the transaction would get
us anywhere. This patch breaks out the committing logic into a helper function
that will check to see if committing the transaction would free enough space for
us to get anything done. With this patch xfstests 83 goes from taking 445
seconds to taking 28 seconds on my box. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I've been hitting warnings in use_block_rsv when running the delayed insertion
stuff. It's because we will readjust global block rsv based on what is in use,
which means we could end up discarding reservations that are for the delayed
insertion stuff. So instead create a seperate block rsv for the delayed
insertion stuff. This will also make it easier to debug problems with the
delayed insertion reservations since we will know that only the delayed
insertion code touches this block_rsv. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This takes some of the free space in the btrfs super block
to record information about most of the roots in the last four
commits.
It also adds a -o recovery to use the root history log when
we're not able to read the tree of tree roots, the extent
tree root, the device tree root or the csum root.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)
Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.
Signed-off-by: David Sterba <dsterba@suse.cz>
We no longer use the orphan block rsv for holding the reservation for truncating
the inode, so instead use the global block rsv and check to make sure it has
enough space for us to truncate the space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I fixed a problem where we weren't reserving space for an orphan item when we
had to fallback to using the global reserve for an unlink, but I introduced
another problem. I was migrating the bytes from the transaction reserve to the
global reserve and then releasing from the global reserve in
btrfs_end_transaction(). The problem with this is that a migrate will jack up
the size for the destination, but leave the size alone for the source, with the
idea that you can do a release normally on the source and it all washes out, and
then you can do a release again on the destination and it works out right. My
way was skipping the release on the trans_block_rsv which still had the jacked
up size from our original reservation. So instead release manually from the
global reserve if this transaction was using it, and then set the
trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction
cleans everything up properly. With this patch xfstest 83 doesn't emit warnings
about leaking space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.
Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The tree log had two important bugs that could cause corruptions after a
crash. Sometimes we were allowing tree log blocks to be reused after
the tree log was committed but before the transaction commit was done.
This allowed a future metadata write to overwrite the tree log data. It
is fixed by adding a new variant of freeing reserved extents that always
pins them. Credit goes to Stefan Behrens and Arne Jansen for many many
hours spent tracking this bug down.
During tree log replay, we do a pass through the tree log and pin all
the extents we find. This makes sure the replay code won't go in and
use any of those blocks for new allocations during replay. The problem
is the free space cache isn't honoring these pinned extents. So the
allocator can end up handing them out, leading to all kinds of problems
during replay.
The fix here is to force any free space cache to load while we pin the
extents, and then to make sure we remove the pinned extents from the
free space rbtree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
btrfs_remove_free_space needs to make sure to set ret back to a
valid return value after setting it to EAGAIN, otherwise we return
it to the callers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we're doing log commits, we try to wait for more writers to come in
and make the commit bigger. This helps improve performance on rotating
disks, but on SSDs it adds latencies.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity. A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.
The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.
And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
leases: fix write-open/read-lease race
nfs: drop unnecessary locking in llseek
ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
vfs: add generic_file_llseek_size
vfs: do (nearly) lockless generic_file_llseek
direct-io: merge direct_io_walker into __blockdev_direct_IO
direct-io: inline the complete submission path
direct-io: separate map_bh from dio
direct-io: use a slab cache for struct dio
direct-io: rearrange fields in dio/dio_submit to avoid holes
direct-io: fix a wrong comment
direct-io: separate fields only used in the submission path from struct dio
vfs: fix spinning prevention in prune_icache_sb
vfs: add a comment to inode_permission()
vfs: pass all mask flags check_acl and posix_acl_permission
vfs: add hex format for MAY_* flag values
vfs: indicate that the permission functions take all the MAY_* flags
compat: sync compat_stats with statfs.
vfs: add "device" tag to /proc/self/mountstats
cleanup: vfs: small comment fix for block_invalidatepage
...
Fix up trivial conflict in fs/gfs2/file.c (llseek changes)
The i_mutex lock use of generic _file_llseek hurts. Independent processes
accessing the same file synchronize over a single lock, even though
they have no need for synchronization at all.
Under high utilization this can cause llseek to scale very poorly on larger
systems.
This patch does some rethinking of the llseek locking model:
First the 64bit f_pos is not necessarily atomic without locks
on 32bit systems. This can already cause races with read() today.
This was discussed on linux-kernel in the past and deemed acceptable.
The patch does not change that.
Let's look at the different seek variants:
SEEK_SET: Doesn't really need any locking.
If there's a race one writer wins, the other loses.
For 32bit the non atomic update races against read()
stay the same. Without a lock they can also happen
against write() now. The read() race was deemed
acceptable in past discussions, and I think if it's
ok for read it's ok for write too.
=> Don't need a lock.
SEEK_END: This behaves like SEEK_SET plus it reads
the maximum size too. Reading the maximum size would have the
32bit atomic problem. But luckily we already have a way to read
the maximum size without locking (i_size_read), so we
can just use that instead.
Without i_mutex there is no synchronization with write() anymore,
however since the write() update is atomic on 64bit it just behaves
like another racy SEEK_SET. On non atomic 32bit it's the same
as SEEK_SET.
=> Don't need a lock, but need to use i_size_read()
SEEK_CUR: This has a read-modify-write race window
on the same file. One could argue that any application
doing unsynchronized seeks on the same file is already broken.
But for the sake of not adding a regression here I'm
using the file->f_lock to synchronize this. Using this
lock is much better than the inode mutex because it doesn't
synchronize between processes.
=> So still need a lock, but can use a f_lock.
This patch implements this new scheme in generic_file_llseek.
I dropped generic_file_llseek_unlocked and changed all callers.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
* 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
TOMOYO: Fix incomplete read after seek.
Smack: allow to access /smack/access as normal user
TOMOYO: Fix unused kernel config option.
Smack: fix: invalid length set for the result of /smack/access
Smack: compilation fix
Smack: fix for /smack/access output, use string instead of byte
Smack: domain transition protections (v3)
Smack: Provide information for UDS getsockopt(SO_PEERCRED)
Smack: Clean up comments
Smack: Repair processing of fcntl
Smack: Rule list lookup performance
Smack: check permissions from user space (v2)
TOMOYO: Fix quota and garbage collector.
TOMOYO: Remove redundant tasklist_lock.
TOMOYO: Fix domain transition failure warning.
TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
TOMOYO: Simplify garbage collector.
TOMOYO: Fix make namespacecheck warnings.
target: check hex2bin result
encrypted-keys: check hex2bin result
...
The WARN_ON under some circumstances heavily polute log and slow down
the machine. This is just a safety, as the warning should be fixed by
another patch, nevertheless, it still pops up during testing.
Signed-off-by: David Sterba <dsterba@suse.cz>
There's a missing test whether the path passed to subvol=path option
during mount is a real subvolume, allowing any directory located in
default subovlume to be passed and accepted for mount.
(current btrfs progs prevent this early)
$ btrfs subvol snapshot . p1-snap
ERROR: '.' is not a subvolume
(with "is subvolume?" test bypassed)
$ btrfs subvol snapshot . p1-snap
Create a snapshot of '.' in './p1-snap'
$ btrfs subvol list -p .
ID 258 parent 5 top level 5 path subvol
ID 259 parent 5 top level 5 path subvol1
ID 260 parent 5 top level 5 path default-subvol1
ID 262 parent 5 top level 5 path p1/p1-snapshot
ID 263 parent 259 top level 5 path subvol1/subvol1-snap
The problem I see is that this makes a false impression of snapshotting the
given subvolume but in fact snapshots the default one: a user expects outcome
like ID 263 but in fact gets ID 262 .
This patch makes mount fail with EINVAL with a message in syslog.
Signed-off-by: David Sterba <dsterba@suse.cz>
Fix a bug introduced by 20b45077. We have to return EINVAL on mount
failure, but doing that too early in the sequence leaves all of the
devices opened exclusively. This also fixes an issue where under some
scenarios only a second mount -o degraded <devices> command would
succeed.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Initialize fs_info->bdev_holder a bit earlier to be able to pass a
correct holder id to blkdev_get() when opening seed devices with O_EXCL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If lookup_extent_backref fails, path->nodes[0] reasonably could be
null along with other callers of btrfs_print_leaf, so ensure we have a
valid extent buffer before dereferencing.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
In btrfs_get_acl(), when the second __btrfs_getxattr() call fails,
acl is not correctly set.
Therefore, a wrong value might return to the caller.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Free space items are located in tree of tree roots, not in the extent
tree. It didn't pop up because lookup_free_space_inode() grabs the
inode all the time instead of actually searching the tree.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
To reproduce the bug:
# mount -o nodatacow /dev/sda7 /mnt/
# dd if=/dev/zero of=/mnt/tmp bs=4K count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s
# dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct
dd: writing `/mnt/tmp': Input/output error
1+0 records in
0+0 records out
btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write()
mistakenly takes it as an error.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
It's not a big deal if we fail to allocate the array, and instead of
panic we can just give up compressing.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We should retirn EINVAL if the start is beyond the end of the file
system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate
check for it.
Also in the btrfs_trim_fs() it is possible that len+start might overflow
if big values are passed. Fix it by decrementing the len so that start+len
is equal to the file system size in the worst case.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
We won't defrag an extent, if it's bigger than the threshold we
specified and there's no small extent before it, but actually
the code doesn't work this way.
There are three bugs:
- When should_defrag_range() decides we should keep on defragmenting
an extent, last_len is not incremented. (old bug)
- The length that passes to should_defrag_range() is not the length
we're going to defrag. (new bug)
- We always defrag 256K bytes data, and a big extent can be part of
this range. (new bug)
For a file with 4 extents:
| 4K | 4K | 256K | 256K |
The result of defrag with (the default) 256K extent thresh should be:
| 264K | 256K |
but with those bugs, we'll get:
| 520K |
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
It's off-by-one, and thus we may skip the last page while defragmenting.
An example case:
# create /mnt/file with 2 4K file extents
# btrfs fi defrag /mnt/file
# sync
# filefrag /mnt/file
/mnt/file: 2 extents found
So it's not defragmented.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Don't use inode->i_size directly, since we're not holding i_mutex.
This also fixes another bug, that i_size can change after it's checked
against 0 and then (i_size - 1) can be negative.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Offset field in data extent backref can underflow if clone range ioctl
is used. We can reliably detect the underflow because max file size is
limited to 2^63 and max data extent size is limited by block group size.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
I noticed we had a little bit of latency when writing out the space cache
inodes. It's because we flush it before we write anything in case we have dirty
pages already there. This doesn't matter though since we're just going to
overwrite the space, and there really shouldn't be any dirty pages anyway. This
makes some of my tests run a little bit faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Mitch kept hitting a panic because he was getting ENOSPC. One of my previous
patches makes it so we are much better at not allocating new metadata chunks.
Unfortunately coupled with the overcommit patch this works us into a bit of a
problem if we are removing a bunch of space and end up chewing up all of our
space with pinned extents. We can allocate chunks fine and overflow is ok, but
the only way to reclaim this space is to commit the transaction. So if we go to
overcommit, first check and see how much pinned space we have. If we have more
than 80% of the free space chewed up with pinned extents, just commit the
transaction, this will free up enough space for our reservation and we won't
have this problem anymore. With this patch Mitch's test doesn't blow up
anymore. Thanks,
Reported-and-tested-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently btrfs_block_rsv_check does 2 things, it will either refill a block
reserve like in the truncate or refill case, or it will check to see if there is
enough space in the global reserve and possibly refill it. However because of
overcommit we could be well overcommitting ourselves just to try and refill the
global reserve, when really we should just be committing the transaction. So
breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill
will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
it will only tell you if the factor of the total space is still reserved.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In __unlink_start_trans() if we don't have enough room for a reservation we will
check to see if the unlink will free up space. If it does that's great, but we
will still could add an orphan item, so we need to reserve enough space to add
the orphan item. Do this and migrate the space the global reserve so it all
works out right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We started setting trans->block_rsv = NULL to allow the delayed refs flushing
stuff to use the right block_rsv and then just made
btrfs_trans_release_metadata() unconditionally use the trans block rsv. The
problem with this is we need to reserve some space in the transaction and then
migrate it to the global block rsv, so we need to be able to free that out
properly. So instead just move btrfs_trans_release_metadata() before the
delayed ref flushing and use trans->block_rsv for the freeing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we only allow a maximum of 2 megabytes of pages to be flushed at a
time. This was ok before, but now we have overcommit which will screw us in a
heartbeat if we are quickly filling the disk. So instead pick either 2
megabytes or the number of pages we need to reclaim to be safe again, which ever
is larger. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The only way we actually reclaim delalloc space is waiting for the IO to
completely finish. Usually we kick off a bunch of IO and wait for a little bit
and hope we can make our reservation, and usually this works out pretty well.
With overcommit however we can get seriously underwater if we're filling up the
disk quickly, so we need to be able to force the delalloc shrinker to wait for
the ordered IO to finish to give us a better chance of actually reclaiming
enough space to get our reservation. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Before the only reason to commit the transaction to recover space in
reserve_metadata_bytes() was if there were enough pinned_bytes to satisfy our
reservation. But now we have the delayed inode stuff which will hold it's
reservations until we commit the transaction. So say we max out our reservation
by creating a bunch of files but don't have any pinned bytes we will ENOSPC out
early even though we could commit the transaction and get that space back. So
now just unconditionally commit the transaction since currently there is no way
to know how much metadata space is being reserved by delayed inode stuff.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Recently I changed the xattr stuff to unconditionally set the xattr first in
case the xattr didn't exist yet. This has introduced a regression when setting
an xattr that already exists with a large value. If we find the key we are
looking for split_leaf will assume that we're extending that item. The problem
is the size we pass down to btrfs_search_slot includes the size of the item
already, so if we have the largest xattr we can possibly have plus the size of
the xattr item plus the xattr item that btrfs_search_slot we'd overflow the
leaf. Thankfully this is not what we're doing, but split_leaf doesn't know this
so it just returns EOVERFLOW. So in the xattr code we need to check and see if
we got back EOVERFLOW and treat it like EEXIST since that's really what
happened. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Our unlink reservations were a bit much, we were reserving 10 and I only count 8
possible items we're touching, so comment what we're reserving for and fix the
count value. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed recently that my overcommit patch was causing one of my enospc tests
to fail 25% of the time with early ENOSPC. This is because my overcommit patch
was letting us go way over board, but it wasn't waiting long enough to let the
delalloc shrinker do it's job. The problem is we just start writeback and wait
a little bit hoping we flush enough, but we only free up delalloc space by
having the writes complete all the way. We do this by waiting for ordered
extents, which we do but only if we already free'd enough for the reservation,
which isn't right, we should flush ordered extents if we didn't reclaim enough
in case that will push us over the edge. With this patch I've not seen a
failure in this enospc test after running it in a loop for an hour. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Yeah yeah I know this is how we used to do it and then I changed it, but damnit
I'm changing it back. The fact is that writing out checksums will modify
metadata, which could cause us to dirty a block group we've already written out,
so we have to truncate it and all of it's checksums and re-write it which will
write new checksums which could dirty a blockg roup that has already been
written and you see where I'm going with this? This can cause unmount or really
anything that depends on a transaction to commit to take it's sweet damned time
to happen. So go back to the way it was, only this time we're specifically
setting NODATACOW because we can't go through the COW pathway anyway and we're
doing our own built-in cow'ing by truncating the free space cache. The other
new thing is once we truncate the old cache and preallocate the new space, we
don't need to do that song and dance at all for the rest of the transaction, we
can just overwrite the existing space with the new cache if the block group
changes for whatever reason, and the NODATACOW will let us do this fine. So
keep track of which transaction we last cleared our cache in and if we cleared
it in this transaction just say we're all setup and carry on. This survives
xfstests and stress.sh.
The inode cache will continue to use the normal csum infrastructure since it
only gets written once and there will be no more modifications to the fs tree in
a transaction commit.
Signed-off-by: Josef Bacik <josef@redhat.com>
My overcommit stuff can be a little racy when we're filling up the disk with
fs_mark and we overcommit into things that quickly get used up for data. So use
num_bytes to see if we have enough available space so we're less likely to
overcommit ourselves out of the ability to make reservations. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We need to check the return value of filemap_write_and_wait in the space cache
writeout code. Also don't set the inode's generation until we're sure nothing
else is going to fail. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In writing and reading the space cache we have one big loop that keeps track of
which page we are on and then a bunch of sizeable loops underneath this big loop
to try and read/write out properly. Especially in the write case this makes
things hugely complicated and hard to follow, and makes our error checking and
recovery equally as complex. So add a io_ctl struct with a bunch of helpers to
keep track of the pages we have, where we are, if we have enough space etc.
This unifies how we deal with the pages we're writing and keeps all the messy
tracking internal. This allows us to kill the big loops in both the read and
write case and makes reviewing and chaning the write and read paths much
simpler. I've run xfstests and stress.sh on this code and it survives. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed a slight bug where we will not bother writing out the block group
cache's space cache if it's space tree is empty. Since it could have a cluster
or pinned extents that need to be written out this is just not a valid test.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Some users have requested this and I've found I needed a way to disable cache
loading without actually clearing the cache, so introduce the no_space_cache
option. Before we check the super blocks cache generation field and if it was
populated we always turned space caching on. Now we check this and set the
space cache option on, and then parse the mount options so that if we want it
off it get's turned off. Then we check the mount option all the places we do
the caching work instead of checking the super's cache generation. This makes
things more consistent and lets us turn space caching off. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
weren't supposed to. There isn't any specific documentation on this so I'm
taking the test as the standard of how things work, and having S_APPEND set on a
directory doesn't mean that S_APPEND gets inherited by its children according to
this test. So only inherit btrfs specific things. This will let us set
compress/nocompress on specific directories and everything in the directories
will inherit this flag, same with nodatacow. With this patch test 79 passes.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases. There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve. However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit. So this
patch adds chunk free space accounting so we always know how much unallocated
space we have. Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit. In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space. This makes this fio test
[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test
go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system. This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed while running xfstests 83 that if we didn't have enough space to
delete our inode the orphan cleanup would just loop. This is because it keeps
finding the same orphan item and keeps trying to kill it but can't because we
don't get an error back from iput for deleting the inode. So keep track of the
last guy we tried to kill, if it's the same as the one we're trying to kill
currently we know we are having problems and can just error out. I don't have a
way to test this so look hard and make sure it's right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
with the mixed block group stuff. Because of this we can run into a situation
where we don't have enough space to delete inodes, or even worse we can't free
the inodes when we next mount the fs which causes the orphan code to lose its
mind. So if we fail to make our reservation, steal from the global reserve.
The global reserve will end up taking up the entire rest of the free space on
the fs in this worst case so there really is no other option. With this patch
test 83 doesn't freak out. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
While looking for a performance regression a user was complaining about, I
noticed that we had a regression with the varmail test of filebench. This was
introduced by
0d10ee2e6d
which keeps us from calling writepages in writepage. This is a correct change,
however it happens to help the varmail test because we write out in larger
chunks. This is largly to do with how we write out dirty pages for each
transaction. If you run filebench with
load varmail
set $dir=/mnt/btrfs-test
run 60
prior to this patch you would get ~1420 ops/second, but with the patch you get
~1200 ops/second. This is a 16% decrease. So since we know the range of dirty
pages we want to write out, don't write out in one page chunks, write out in
ranges. So to do this we call filemap_fdatawrite_range() on the range of bytes.
Then we convert the DIRTY extents to NEED_WAIT extents. When we then call
btrfs_wait_marked_extents() we only have to filemap_fdatawait_range() on that
range and clear the NEED_WAIT extents. This doesn't get us back to our original
speeds, but I've been seeing ~1380 ops/second, which is a <5% regression as
opposed to a >15% regression. That is acceptable given that the original commit
greatly reduces our latency to begin with. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If I have a range where I know a certain bit is and I want to set it to another
bit the only option I have is to call set and then clear bit, which will result
in 2 tree searches. This is inefficient, so introduce convert_extent_bit which
will go through and set the bit I want and clear the old bit I don't want.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
There is a bug that may lead to early ENOSPC in our reservation code. We've
been checking against num_bytes which may be above and beyond what we want to
actually reserve, which could give us a false ENOSPC. Fix this by making sure
the unused space is above how much we want to reserve and not how much we're
trying to flush. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
code, since it expects to get a bad inode back. So fix it to deal with getting
-ESTALE back by deleting the orphan item manually and moving on. Thanks,
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Josef Bacik <josef@redhat.com>
Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter. Thanks,
Reported-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
I kept getting warnings from evict because we were calling
btrfs_start_transaction() with a transaction already started when doing a
balance. This is because we remove a block group which requires a transaction,
and the put the last reference on the cache inode. Instead of doing this we
need to delay the iput so it is done not within a transaction having started.
This gets rid of our warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Checksums are charged in 2 different ways. The first case is when we're writing
to the disk, we account for the new checksums with the delalloc block rsv. In
order for this to work we check if we're allocating a block for the csum root
and if trans->block_rsv == the delalloc block rsv. But when we're deleting the
csums because of cow, this is charged to the global block rsv, and is done when
we run the delayed refs. So we need to make sure that trans->block_rsv == NULL
when running the delayed refs. So set it to NULL and reset it in
should_end_transaction, and set it to NULL in commit_transaction. This got rid
of the ridiculous amount of warnings I was seeing when trying to do a balance.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The only thing that we need to have a trans handle for is in
reserve_metadata_bytes and thats to know how much flushing we can do. So
instead of passing it around, just check current->journal_info for a
trans_handle so we know if we can commit a transaction to try and free up space
or not. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since the durable block rsv stuff has been killed there is no need to get the
block_rsv in btrfs_free_tree_block anymore.
Signed-off-by: Josef Bacik <josef@redhat.com>
The alloc warnings everybody has been seeing is because we have been reserving
space for csums, but we weren't actually using that space. So make
get_block_rsv() return the trans->block_rsv if we're modifying the csum root.
Also set the trans->block_rsv to NULL so that if we modify the csum root when
running delayed ref's that comes out of the global reserve like it's supposed
to. With this patch I'm not seeing those alloc warnings anymore. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since free space inodes now use normal checksumming we need to make sure to
account for their metadata use. So reserve metadata space, and then if we fail
to write out the metadata we can just release it, otherwise it will be freed up
when the io completes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In moving some enospc stuff around I noticed that when we unmount we are often
evicting the free space cache inodes before we do our last commit. This isn't
bad, but it makes us constantly have to re-read the inodes back. So instead
don't evict the cache until after we do our last commit, this will make things a
little less crappy and makes a future enospc change work properly. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
While debugging a different issue I noticed that we were always reserving space
when we tried to use our truncate block rsv's. This is because they didn't have
a ->size value, so use_block_rsv just assumes there is nothing reserved and it
does a reserve_metadata_bytes. This is because btrfs_check_block_rsv() doesn't
actually add to the size of the block rsv. That seems to be the right thing to
do so set ->size to the minimum truncate size we need, since we will always only
refill to that size anyway, and this way everything works out correctly.
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have to emergency reserve space we need to not increase the block_rsv
size, otherwise we'll leak space. Take for instance delalloc, say we reserve
4k, and we use that 4k, and then we have to emergency allocate another 4k, we
bump the size up to 8k, however we've only accounted for 4k in reservations in
all of our supporting logic, so we'll go to free the 4k and end up having a size
of 4k, which will cause us to later not free as much space. I saw this doing
testing where I wasn't reserving enough space for something but was still
leaking space, very frustrating. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When changing back to using a spin_lock to protect the extent counters I decided
that since we would only be dropping our original extent, it was ok to just drop
the extent and return. However since somebody else could have come in and done
a reservation, we need to do the normal song and dance to clear the reservation
out properly. So calculate how much space we need to free, and then subtract
what we just attempted to reserve. If it's more then we know we need to drop
those bytes from the delalloc block rsv. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We are setting ins_len to 1 even tho we are just modifying an item that should
be there already. This may cause the search stuff to split nodes on the way
down needelessly. Set this to 0 since we aren't inserting anything. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If you run xfstest 224 it you will get lots of messages about not being able to
delete inodes and that they will be cleaned up next mount. This is because
btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
flush, so if there was not enough space, it simply failed. But in truncate and
evict case we could easily flush space to try and get enough space to do our
work, so make btrfs_block_rsv_check take a flush argument to pass down to
reserve_metadata_bytes. Now xfstests 224 runs fine without all those
complaints. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With btrfs_truncate_inode_items we always return if we have to go to another
leaf, which makes us do our reservation again. This means we will only ever
modify one leaf at a time, so we only need 1 items worth of slack space. Also,
since we are deleting we will not be creating nodes as we go down, if anything
we'll be free'ing them as we merge them together, so make a different
calculation for truncate which will only have the worst case useage of COW'ing
the entire path down to the leaf. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Lukas found a problem where if he tries to fallocate over the same region twice
and the first fallocate took up all the space we would fail with ENOSPC. This
is because we reserve the total space we want to use for fallocate, regardless
of wether or not we will have to actually preallocate. So instead move the
check into the loop where we actually have to do the preallocate. Thanks,
Tested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we're starting and stopping a transaction for no real reason, so kill
that and just reserve enough space as if we can truncate all in one transaction.
Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
we may have to allocate for our slack space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot
because we have a transaction open it will return EAGAIN, so we do not need to
try and commit the transaction again.
Signed-off-by: Josef Bacik <josef@redhat.com>
The priority and refill_used flags are not used anymore, and neither is the
usage counter, so just remove them from btrfs_block_rsv.
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported getting spammed when moving to 3.0 by this message. Since we
switched to the normal checksumming infrastructure all old free space caches
will be wrong and need to be regenerated so people are likely to see this
message a lot, so ratelimit it so it doesn't fill up their logs and freak them
out. Thanks,
Reported-by: Andrew Lutomirski <luto@mit.edu>
Signed-off-by: Josef Bacik <josef@redhat.com>
I converted btrfs_truncate to do sane reservations for truncate, but didn't
convert btrfs_evict_inode. Basically we need to save the orphan_rsv for
deleting the orphan item, and do normal reservations for our truncate. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch kills off the calculation for the amount of space needed for the
orphan operations during a snapshot. The thing is we only do snapshots on
commit, so any space that is in the block_rsv->freed[] isn't going to be in the
new snapshot anyway, so there isn't any reason to require that space to be
reserved for the snapshot to occur. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have not been reserving enough space for checksums. We were just reserving
bytes for the checksum items themselves, we were not taking into account having
to cow the tree and such. This patch adds a csum_bytes counter to the inode for
keeping track of the number of bytes outstanding we have for checksums. Then we
calculate how many leaves would be required for the checksums we are given and
use that to reserve space. This adds a significant amount of bytes to our
reservations, but we will handle this later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check. So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have been using bytes_reserved for metadata reservations, which is wrong
since we use that to keep track of outstanding reservations from the allocator.
This resulted in us doing a lot of silly things to make sure we don't allocate a
bunch of metadata chunks since we never had a real view of how much space was
actually in use by metadata.
This passes Arne's enospc test and xfstests as well as my own enospc tests.
Hopefully this will get us moving in the right direction. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We've only been able to mount with subvol=<whatever> where whatever was a subvol
within whatever root we had as the default. This allows us to mount -o
subvol=path/to/subvol/you/want relative from the normal fs_tree root. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently what we do is just wrong. We either
1) Alloc a new "root" dentry with sb->s_root as it's parent which is just wrong
as we could walk into this subvol later on via another path and hilarity could
ensue. Also we don't check the return value of d_splice_alias which isn't good
either.
or
2) Do a d_find_alias() which we could have lost our dentry from cache at this
point and found nothing.
So use d_obtain_alias(). In the case that we already have the inode/dentry in
cache we will get the correct dentry. If not we will get a disconnected dentry
tree so if we walk into it later on everything will be connected up properly.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Moving things around to give us better packing in the btrfs_inode. This reduces
the size of our inode by 8 bytes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The btrfs file defrag code will loop through the extents and
force COW on them. But there is a concurrent truncate in the middle of
the defrag, it might end up defragging the same range over and over
again.
The problem is that writepage won't go through and do anything on pages
past i_size, so the cow won't happen, so the file will appear to still
be fragmented. defrag will end up hitting the same extents again and
again.
In the worst case, the truncate can actually live lock with the defrag
because the defrag keeps creating new ordered extents which the truncate
code keeps waiting on.
The fix here is to make defrag check for i_size inside the main loop,
instead of just once before the looping starts.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Follow those steps:
# mount -o autodefrag /dev/sda7 /mnt
# dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
# sync
# dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc
and then it'll go into a loop: writeback -> defrag -> writeback ...
It's because writeback writes [8K, 200K] and then writes [0, 8K].
I tried to make writeback know if the pages are dirtied by defrag,
but the patch was a bit intrusive. Here I simply set writeback_index
when we defrag a file.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Scrub uses a simple tree-enumeration to bring the relevant portions
of the extent- and csum-tree into the page cache before starting the
scrub-I/O. This is now replaced by using the new readahead-API.
During readahead the scrub is being accounted as paused, so it won't
hold off transaction commits.
This change raises the average disk bandwith utilisation on my test
volume from 70% to 90%. On another volume, the time for a test run
went down from 89s to 43s.
Changes v5:
- reada1/2 are now of type struct reada_control *
Signed-off-by: Arne Jansen <sensille@gmx.net>
This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.
Changes for v2:
- eliminate race condition
Signed-off-by: Arne Jansen <sensille@gmx.net>
This is the implementation for the generic read ahead framework.
To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its on read pointer and all disks will by utilized in parallel.
Also will no two disks read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.
Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had a too simple check to determine if a branch from
a node should be checked or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock
Changes v3:
- protect root->node by rcu
Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add
Signed-off-by: Arne Jansen <sensille@gmx.net>
Add state information for readahead to btrfs_fs_info and btrfs_device
Changes v2:
- don't wait in radix_trees
- add own set of workers for readahead
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Add a READAHEAD extent buffer flag.
Add a function to trigger a read with this flag set.
Changes v2:
- use extent buffer flags instead of extent state flags
Changes v5:
- adapt to changed read_extent_buffer_pages interface
- don't return eb from reada_tree_block_flagged if it has CORRUPT flag set
Signed-off-by: Arne Jansen <sensille@gmx.net>
read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.
Changes v5:
- merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
WAIT_PAGE_LOCK
Change v6:
- fix bug introduced in v5
Signed-off-by: Arne Jansen <sensille@gmx.net>
A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing. It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back at one page at a time and fault the
page in. The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else. This will force a readpage if we end up
doing a short copy. Alexandre could reproduce this easily with ceph and reports
it fixes his problem. I also wrote a reproducer that no longer hangs my box
with this patch. Thanks,
Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This ties nodatasum fixup in scrub together with raid repair patches. While
both series are working fine alone, scrub will report uncorrectable errors
if they occur in a nodatasum extent *and* the page is in the page cache.
Previously, we would have triggered readpage to find good data and do the
repair. However, readpage wouldn't read anything in the case where the page
is up to date in the cache. So, we simply take that good data we have and
call repair_io_failure directly (unless the page in the cache is dirty).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.
Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The error correction code wants to make sure that only the bad mirror is
rewritten. Thus, we need to know which mirror is the bad one. I did not
find a more apropriate field than bi_bdev. But I think using this is fine,
because it is modified by the block layer, anyway, and should not be read
after the bio returned.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The block layer modifies bio->bi_bdev and bio->bi_sector while working on
the bio, they do _not_ come back unmodified in the completion callback.
To call add_page, we need at least some bi_bdev set, which is why the code
was working, previously. With this patch, we use the latest_bdev from
fsinfo instead of the leftover in the bio. This gives us the possibility to
use the bi_bdev field for another purpose.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
these ioctls make use of the new functions initially added for scrub. they
return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and
all paths belonging to an inode (BTRFS_IOC_INO_PATHS).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This removes a FIXME comment and introduces the first part of nodatasum
fixup: It gets the corresponding inode for a logical address and triggers a
regular readpage for the corrupted sector.
Once we have on-the-fly error correction our error will be automatically
corrected. The correction code is expected to clear the newly introduced
EXTENT_DAMAGED flag, making scrub report that error as "corrected" instead
of "uncorrectable" eventually.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Fix the mirror_num determination in scrub_stripe. The rest of the scrub code
did not use mirror_num for anything important and that error went unnoticed.
The nodatasum fixup patch of this set depends on a correct mirror_num.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
While scrubbing, we may encounter various errors. Previously, a logical
address was printed to the log only. Now, all paths belonging to that
address are resolved and printed separately. That should work for hardlinks
as well as reflinks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In normal operation, scrub is reading data sequentially in large portions.
In case of an i/o error, we try to find the corrupted area(s) by issuing
page sized read requests. With this commit we increment the
unverified_errors counter if all of the small size requests succeed.
Userland patches carrying such conspicous events to the administrator should
already be around.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
These helper functions iterate back references and call a function for each
backref. There is also a function to resolve an inode to a path in the
file system.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We
need to reserve space for:
- adjusting the old extent (possibly splitting it)
- adding the new extent
- updating the inode
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can race with readdir and the RCU path walking stuff. This is because we
clear the need lookup flag before actually instantiating the inode. This will
lead the RCU path walk stuff to find a dentry it thinks is valid without a
d_inode attached. So instead unhash the dentry when we first start the lookup,
and then clear the flag after we've instantiated the dentry so we're garunteed
to either try the slow lookup, or have the d_inode set properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent reworking of btrfs' lseek lead to incorrect
values being returned. This adds checks for seeking
beyond EOF in SEEK_HOLE and makes sure the error
values come back correct.
Andi Kleen also sent in similar patches.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The dst file will have the same inode flags with dst file after
file clone, and I think it's unexpected.
For example, the dst file will suddenly become immutable after
getting some share of data with src file, if the src is immutable.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To reproduce the bug:
# mount /dev/sda7 /mnt
# dd if=/dev/zero of=/mnt/src bs=4K count=1
# umount /mnt
# mount -o nodatasum /dev/sda7 /mnt
# dd if=/dev/zero of=/mnt/dst bs=4K count=1
# clone_range -s 4K -l 4K /mnt/src /mnt/dst
# echo 3 > /proc/sys/vm/drop_caches
# cat /mnt/dst
# dmesg
...
btrfs no csum found for inode 258 start 0
btrfs csum failed ino 258 off 0 csum 2566472073 private 0
It's because part of the file is checksummed and the other part is not,
and then btrfs will complain checksum is not found when we read the file.
Disallow file clone if src and dst file have different checksum flag,
so we ensure a file is completely checksummed or unchecksummed.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It's a bug in commit f81c9cdc56
(Btrfs: truncate pages from clone ioctl target range)
We should pass the dest range to the truncate function, but not the
src range.
Also move the function before locking extent state.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since the d_off in the first dirent for "." (that originates from
the 4th argument "offset" of filldir() for the 2nd dirent for "..")
is wrongly assigned in btrfs_real_readdir(), telldir returns same
offset for different locations.
| # mkfs.btrfs /dev/sdb1
| # mount /dev/sdb1 fs0
| # cd fs0
| # touch file0 file1
| # ../test
| telldir: 0
| readdir: d_off = 2, d_name = "."
| telldir: 2
| readdir: d_off = 2, d_name = ".."
| telldir: 2
| readdir: d_off = 3, d_name = "file0"
| telldir: 3
| readdir: d_off = 2147483647, d_name = "file1"
| telldir: 2147483647
To fix this problem, pass filp->f_pos (which is loff_t) instead.
| # ../test
| telldir: 0
| readdir: d_off = 1, d_name = "."
| telldir: 1
| readdir: d_off = 2, d_name = ".."
| telldir: 2
| readdir: d_off = 3, d_name = "file0"
:
At the moment the "offset" for "." is unused because there is no
preceding dirent, however it is better to pass filp->f_pos to follow
grammatical usage.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://github.com/chrismason/linux:
Btrfs: add dummy extent if dst offset excceeds file end in
Btrfs: calc file extent num_bytes correctly in file clone
btrfs: xattr: fix attribute removal
Btrfs: fix wrong nbytes information of the inode
Btrfs: fix the file extent gap when doing direct IO
Btrfs: fix unclosed transaction handle in btrfs_cont_expand
Btrfs: fix misuse of trans block rsv
Btrfs: reset to appropriate block rsv after orphan operations
Btrfs: skip locking if searching the commit root in csum lookup
btrfs: fix warning in iput for bad-inode
Btrfs: fix an oops when deleting snapshots
You can see there's no file extent with range [0, 4096]. Check this by
btrfsck:
# btrfsck /dev/sda7
root 5 inode 258 errors 100
...
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
An attribute is not removed by 'setfattr -x attr file' and remains
visible in attr list. This makes xfstests/062 pass again.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we write some data into the data hole of the file(no preallocation for this
hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
the other element--disk_i_size needn't be updated. At this condition, we must
update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
return 1).
# mkfs.btrfs /dev/sdb1
# mount /dev/sdb1 /mnt
# touch /mnt/a
# truncate -s 856002 /mnt/a
# dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
# umount /mnt
# btrfsck /dev/sdb1
root 5 inode 257 errors 400
found 32768 bytes used err is 1
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we write some data to the place that is beyond the end of the file
in direct I/O mode, a data hole will be created. And Btrfs should insert
a file extent item that point to this hole into the fs tree. But unfortunately
Btrfs forgets doing it.
The following is a simple way to reproduce it:
# mkfs.btrfs /dev/sdc2
# mount /dev/sdc2 /test4
# touch /test4/a
# dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc
# umount /test4
# btrfsck /dev/sdc2
root 5 inode 257 errors 100
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The function - btrfs_cont_expand() forgot to close the transaction handle before
it jump out the while loop. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
At the beginning of create_pending_snapshot, trans->block_rsv is set
to pending->block_rsv and is used for snapshot things, however, when
it is done, we do not recover it as will.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While truncating free space cache, we forget to change trans->block_rsv
back to the original one, but leave it with the orphan_block_rsv, and
then with option inode_cache enable, it leads to countless warnings of
btrfs_alloc_free_block and btrfs_orphan_commit_root:
WARNING: at fs/btrfs/extent-tree.c:5711 btrfs_alloc_free_block+0x180/0x350 [btrfs]()
...
WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It's not enough to just search the commit root, since we could be cow'ing the
very block we need to search through, which would mean that its locked and we'll
still deadlock. So use path->skip_locking as well. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
iput() shouldn't be called for inodes in I_NEW state.
We need to mark inode as constructed first.
WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
Call Trace:
[<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
[<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
[<ffffffff810eaf0b>] iput+0x20b/0x210
[<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
[<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
[<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
[<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
[<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
[<ffffffff8105af86>] kthread+0x96/0xa0
[<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
[<ffffffff814336d0>] ? gs_change+0xb/0xb
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: David Sterba <dsterba@suse.cz>
CC: Josef Bacik <josef@redhat.com>
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can reproduce this oops via the following steps:
$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
$ for ((i=0; i<3; i++)); do btrfs sub snap /mnt/btrfs /mnt/btrfs/s_$i; done
$ rm -fr /mnt/btrfs/*
$ rm -fr /mnt/btrfs/*
then we'll get
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:2264!
[...]
Call Trace:
[<ffffffffa05578c7>] btrfs_rmdir+0xf7/0x1b0 [btrfs]
[<ffffffff81150b95>] vfs_rmdir+0xa5/0xf0
[<ffffffff81153cc3>] do_rmdir+0x123/0x140
[<ffffffff81145ac7>] ? fput+0x197/0x260
[<ffffffff810aecff>] ? audit_syscall_entry+0x1bf/0x1f0
[<ffffffff81153d0d>] sys_unlinkat+0x2d/0x40
[<ffffffff8147896b>] system_call_fastpath+0x16/0x1b
RIP [<ffffffffa054f7b9>] btrfs_orphan_add+0x179/0x1a0 [btrfs]
When it comes to btrfs_lookup_dentry, we may set a snapshot's inode->i_ino
to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID,
while the snapshot's location.objectid remains unchanged.
However, btrfs_ino() does not take this into account, and returns a wrong ino,
and causes the oops.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a regression introduced by commit cdcb725c05 ("Btrfs: check
if there is enough space for balancing smarter"). We can't do 64-bit
divides on 32-bit architectures.
In cases where we need to divide/multiply by 2 we should just left/right
shift respectively, and in cases where theres N number of devices use
do_div. Also make the counters u64 to match up with rw_devices.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Acked-and-tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xfstests exposed a problem with preallocate when it fallocates a range that
already has an extent. We don't set the new i_size properly because we see that
we already have an extent. This isn't right and we should update i_size if the
space already exists. With this patch we now pass xfstests 075. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There were some unlocks on error missing in a recent patch to
btrfs_file_llseek().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch tightens the read-only access checks in btrfs_permission to
match the constraints in inode_permission. Currently, even though the
device node itself will be unmodified, read-write access to device nodes
is denied to when the device node resides on a read-only subvolume or a
is a file that has been marked read-only by the btrfs conversion utility.
With this patch applied, the check only affects regular files,
directories, and symlinks. It also restructures the code a bit so that
we don't duplicate the MAY_WRITE check for both tests.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We need to truncate page cache pages for the clone ioctl target range or
else we'll confuse ourselves to no end. If the old data was cached, we
used to still see it (until remount). If the page was partially updated
we used to get a mix of old and new data.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
sync_pending is uninitialized before it be used, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs subtracted the size of the allocated space twice when it allocated
the space from the bitmap in the cluster, it broke the free space information
and led to oops finally.
And this patch also fixes the bug that ctl->free_space was subtracted
without lock.
Reported-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The filesystem turns readonly instead of returning the error to the
caller when detected error in btrfs_drop_snapshot().
and, because the caller doesn't check the error, the function type is
changed to 'void'.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When checking if there is enough space for balancing a block group,
since we do not take raid types into consideration, we do not account
corrent amounts of space that we needed. This makes us do some extra
work before we get ENOSPC.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs recovers from a crash, it may hit the oops below:
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:4580!
[...]
RIP: 0010:[<ffffffffa03df251>] [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
[...]
Call Trace:
[<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
[<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
[<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
[<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
[<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
[<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
[<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
[<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
[...]
This comes from that while replaying an inode ref item, we forget to
check those old conflicting DIR_ITEM and DIR_INDEX items in fs/file tree,
then we will come to conflict corners which lead to BUG_ON().
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Tested-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have a problem where if a user specifies discard but doesn't actually support
it we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem
because this gets called (in a fashion) from the tree log recovery code, which
has a nice little BUG_ON(ret) after it, which causes us to fail the tree log
replay. So instead detect wether our devices support discard when we're adding
them and then don't issue discards if we know that the device doesn't support
it. And just for good measure set ret = 0 in btrfs_issue_discard just in case
we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs does bio submissions from a worker thread, and each device
has a list of high priority bios and regular priority bios.
Synchronous writes go to the high priority thread while async writes
go to regular list. This commit brings back an explicit unplug
any time we switch from high to regular priority, which makes it
easier for the block layer to give us low latencies.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
Btrfs: don't call writepages from within write_full_page
Btrfs: Remove unused variable 'last_index' in file.c
Btrfs: clean up for find_first_extent_bit()
Btrfs: clean up for wait_extent_bit()
Btrfs: clean up for insert_state()
Btrfs: remove unused members from struct extent_state
Btrfs: clean up code for merging extent maps
Btrfs: clean up code for extent_map lookup
Btrfs: clean up search_extent_mapping()
Btrfs: remove redundant code for dir item lookup
Btrfs: make acl functions really no-op if acl is not enabled
Btrfs: remove remaining ref-cache code
Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
Btrfs: use wait_event()
Btrfs: check the nodatasum flag when writing compressed files
Btrfs: copy string correctly in INO_LOOKUP ioctl
Btrfs: don't print the leaf if we had an error
btrfs: make btrfs_set_root_node void
Btrfs: fix oops while writing data to SSD partitions
Btrfs: Protect the readonly flag of block group
...
Fix up trivial conflicts (due to acl and writeback cleanups) in
- fs/btrfs/acl.c
- fs/btrfs/ctree.h
- fs/btrfs/extent_io.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set
VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock
VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree()
VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree()
switch posix_acl_chmod() to umode_t
switch posix_acl_from_mode() to umode_t
switch posix_acl_equiv_mode() to umode_t *
switch posix_acl_create() to umode_t *
block: initialise bd_super in bdget()
vfs: avoid call to inode_lru_list_del() if possible
vfs: avoid taking inode_hash_lock on pipes and sockets
vfs: conditionally call inode_wb_list_del()
VFS: Fix automount for negative autofs dentries
Btrfs: load the key from the dir item in readdir into a fake dentry
devtmpfs: missing initialialization in never-hit case
hppfs: missing include
When doing a writepage we call writepages to try and write out any other dirty
pages in the area. This could cause problems where we commit a transaction and
then have somebody else dirtying metadata in the area as we could end up writing
out a lot more than we care about, which could cause latency on anybody who is
waiting for the transaction to completely finish committing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The variable 'last_index' is calculated in the __btrfs_buffered_write
function and passed as a parameter to the prepare_pages function,
but is not used anywhere in the prepare_pages function.
Remove instances of 'last_index' in these functions.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_first_extent_bit() and find_first_extent_bit_state() share
most of the code, and we can just make the former call the latter.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can just use cond_resched_lock().
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These members are not used at all.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
unpin_extent_cache() and add_extent_mapping() shares the same code
that merges extent maps.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
lookup_extent_map() and search_extent_map() can share most of code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
rb_node returned by __tree_search() can be a valid pointer or NULL,
but won't be some errno.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we search a dir item with a specific hash code, we can
just return NULL without further checking if btrfs_search_slot()
returns 1.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since commit f2a97a9dbd
("btrfs: remove all unused functions"), there's no extern functions
at all in ref-cache.c, so just remove the remaining dead code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use wait_event() when possible to avoid code duplication.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If mounting with nodatasum option, we won't csum file data for
general write or direct-io write, and this rule should also be
applied when writing compressed files.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Memory areas [ptr, ptr+total_len] and [name, name+total_len]
may overlap, so it's wrong to use memcpy().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In __btrfs_free_extent we will print the leaf if we fail to find the extent we
wanted, but the problem is if we get an error we won't have a leaf so often this
leads to a NULL pointer dereference and we lose the error that actually
occurred. So only print the leaf if ret > 0, which means we didn't find the
item we were looking for but we didn't error either. This way the error is
preserved.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is fairly trivial - btrfs_set_root_node() - always returns zero so we
can just make it void. All callers ignore the return code now anyway. I
also made sure to check that none of the functions that
btrfs_set_root_node() calls returns an error that we might have needed to
catch and pass back.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Here I have a two SSD-partitions btrfs, and they are defaultly set to
"data=raid0, metadata=raid1", then I try to fill my btrfs partition
till "No space left on device", via "dd if=/dev/zero of=/mnt/btrfs/tmp".
I get an oops panic from kernel BUG at fs/btrfs/extent-tree.c:5199!, which
refers to find_free_extent's
BUG_ON(index != get_block_group_index(block_group));
In SSD mode, in order to find enough space to alloc, we may check the
block_group cache which has been checked sometime before, but the index is not
updated, where it hits the BUG_ON.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Acked-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The access for ro in btrfs_block_group_cache should be protected
because of the racy lock in relocation.
Signed-off-by: Wu Bo <wu.bo@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The set/clear bit and the extent split/merge hooks only ever return 0.
Changing them to return void simplifies the error handling cases later.
This patch changes the hook prototypes, the single implementation of each,
and the functions that call them to return void instead.
Since all four of these hooks execute under a spinlock, they're necessarily
simple.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We passed the wrong value to btrfs_force_ra(). Fix this by changing
the argument of btrfs_force_ra() from last_index to nr_page.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs_unlink_inode() and btrfs_orphan_add() in btrfs_unlink()
are error, the error code is returned to the caller instead of
BUG_ON().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Don't need to check the return value of __btrfs_add_inode_defrag(),
since it will always return 0.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs we have 2 indexes for inodes. One is for readdir, it's in this nice
sequential order and works out brilliantly for readdir. However if you use ls,
it usually stat's each file it gets from readdir. This is where the second
index comes in, which is based on a hash of the name of the file. So then the
lookup has to lookup this index, and then lookup the inode. The index lookup is
going to be in random order (since its based on the name hash), which gives us
less than stellar performance. Since we know the inode location from the
readdir index, I create a dummy dentry and copy the location key into
dentry->d_fsdata. Then on lookup if we have d_fsdata we use that location to
lookup the inode, avoiding looking up the other directory index. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
Btrfs: use the commit_root for reading free_space_inode crcs
Btrfs: reduce extent_state lock contention for metadata
Btrfs: remove lockdep magic from btrfs_next_leaf
Btrfs: make a lockdep class for each root
Btrfs: switch the btrfs tree locks to reader/writer
Btrfs: fix deadlock when throttling transactions
Btrfs: stop using highmem for extent_buffers
Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
Btrfs: tag pages for writeback in sync
Btrfs: fix enospc problems with delalloc
Btrfs: don't flush delalloc arbitrarily
Btrfs: use find_or_create_page instead of grab_cache_page
Btrfs: use a worker thread to do caching
Btrfs: fix how we merge extent states and deal with cached states
Btrfs: use the normal checksumming infrastructure for free space cache
Btrfs: serialize flushers in reserve_metadata_bytes
Btrfs: do transaction space reservation before joining the transaction
Btrfs: try to only do one btrfs_search_slot in do_setxattr
The btrfs transaction code will return any errors that come from
reserve_metadata_bytes. We need to make sure we don't return funny
things like 1 or EAGAIN.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Now that we are using regular file crcs for the free space cache,
we can deadlock if we try to read the free_space_inode while we are
updating the crc tree.
This commit fixes things by using the commit_root to read the crcs. This is
safe because we the free space cache file would already be loaded if
that block group had been changed in the current transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
For metadata buffers that don't straddle pages (all of them), btrfs
can safely use the page uptodate bits and extent_buffer uptodate bit
instead of needing to use the extent_state tree.
This greatly reduces contention on the state tree lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before the reader/writer locks, btrfs_next_leaf needed to keep
the path blocking to avoid making lockdep upset.
Now that btrfs_next_leaf only takes read locks, this isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch was originally from Tejun Heo. lockdep complains about the btrfs
locking because we sometimes take btree locks from two different trees at the
same time. The current classes are based only on level in the btree, which
isn't enough information for lockdep to figure out if the lock is safe.
This patch makes a class for each type of tree, and lumps all the FS trees that
actually have files and directories into the same class.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs metadata btree is the source of significant
lock contention, especially in the root node. This
commit changes our locking to use a reader/writer
lock.
The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking. Atomics
count the number of blocking readers or writers at any
given time.
It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.
In read heavy workloads this is dramatically faster. In write
heavy workloads we're still faster because of less contention
on the root node lock.
We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hit this nice little deadlock. What happens is this
__btrfs_end_transaction with throttle set, --use_count so it equals 0
btrfs_commit_transaction
<somebody else actually manages to start the commit>
btrfs_end_transaction --use_count so now its -1 <== BAD
we just return and wait on the transaction
This is bad because we just return after our use_count is -1 and don't let go
of our num_writer count on the transaction, so the guy committing the
transaction just sits there forever. Fix this by inc'ing our use_count if we're
going to call commit_transaction so that if we call btrfs_end_transaction it's
valid. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.
The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.
This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we balanced the chunks across the devices, BUG_ON() in
__finish_chunk_alloc() was triggered.
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:2568!
[SNIP]
Call Trace:
[<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
[<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
[<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
[<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
[<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
[<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
[<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
[<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
[<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
[<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
[<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
[<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
[<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
[<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
[<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
[<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
[<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
[<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
[<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
[<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
[<ffffffff81159c63>] ? generic_permission+0x23/0xb0
[<ffffffff8115b415>] vfs_create+0xa5/0xc0
[<ffffffff8115ce6e>] do_last+0x5fe/0x880
[<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
[<ffffffff8115e029>] do_filp_open+0x49/0xa0
[<ffffffff8116a965>] ? alloc_fd+0x95/0x160
[<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
[<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
[<ffffffff8114f1e0>] sys_open+0x20/0x30
[<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
The reason is:
Task1 Space balance task
do_chunk_alloc()
__finish_chunk_alloc()
update device info
in the chunk tree
alloc system metadata block
relocate system metadata block group
set system metadata block group
readonly, This block group is the
only one that can allocate space. So
there is no free space that can be
allocated now.
find no space and don't try
to alloc new chunk, and then
return ENOSPC
BUG_ON() in __finish_chunk_alloc()
was triggered.
Fix this bug by allocating a new system metadata chunk before relocating the
old one if we find there is no free space which can be allocated after setting
the old block group to be read-only.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everybody else does this, we need to do it too. If we're syncing, we need to
tag the pages we're going to write for writeback so we don't end up writing the
same stuff over and over again if somebody is constantly redirtying our file.
This will keep us from having latencies with heavy sync workloads. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
So I had this brilliant idea to use atomic counters for outstanding and reserved
extents, but this turned out to be a bad idea. Consider this where we have 1
outstanding extent and 1 reserved extent
Reserver Releaser
atomic_dec(outstanding) now 0
atomic_read(outstanding)+1 get 1
atomic_read(reserved) get 1
don't actually reserve anything because
they are the same
atomic_cmpxchg(reserved, 1, 0)
atomic_inc(outstanding)
atomic_add(0, reserved)
free reserved space for 1 extent
Then the reserver now has no actual space reserved for it, and when it goes to
finish the ordered IO it won't have enough space to do it's allocation and you
get those lovely warnings.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Kill the check to see if we have 512mb of reserved space in delalloc and
shrink_delalloc if we do. This causes unexpected latencies and we have other
logic to see if we need to throttle. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we
need GFP_NOFS so we don't deadlock. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported a deadlock when copying a bunch of files. This is because they
were low on memory and kthreadd got hung up trying to migrate pages for an
allocation when starting the caching kthread. The page was locked by the person
starting the caching kthread. To fix this we just need to use the async thread
stuff so that the threads are already created and we don't have to worry about
deadlocks. Thanks,
Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
merge fchmod() and fchmodat() guts, kill ancient broken kludge
xfs: fix misspelled S_IS...()
xfs: get rid of open-coded S_ISREG(), etc.
vfs: document locking requirements for d_move, __d_move and d_materialise_unique
omfs: fix (mode & S_IFDIR) abuse
btrfs: S_ISREG(mode) is not mode & S_IFREG...
ima: fmode_t misspelled as mode_t...
pci-label.c: size_t misspelled as mode_t
jffs2: S_ISLNK(mode & S_IFMT) is pointless
snd_msnd ->mode is fmode_t, not mode_t
v9fs_iop_get_acl: get rid of unused variable
vfs: dont chain pipe/anon/socket on superblock s_inodes list
Documentation: Exporting: update description of d_splice_alias
fs: add missing unlock in default_llseek()
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In addition to properly handling allocation failure from btrfs_alloc_path, I
also fixed up the kzalloc error handling code immediately below it.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I also removed the BUG_ON from error return of find_next_chunk in
init_first_rw_device(). It turns out that the only caller of
init_first_rw_device() also BUGS on any nonzero return so no actual behavior
change has occurred here.
do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
which can now return -ENOMEM. Instead of setting space_info->full on any
error from btrfs_alloc_chunk() I catch and return every error value _except_
-ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with
modified clone, on failure releases acl and replaces with NULL.
Returns 0 or -ve on error. All callers of posix_acl_create_masq()
switched.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: posix_acl_chmod(&acl, gfp, mode). Replaces acl with modified
clone or with NULL if that has failed; returns 0 or -ve on error. All
callers of posix_acl_chmod_masq() switched to that - they'd been doing
exactly the same thing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This moves logic for checking the cached ACL values from low-level
filesystems into generic code. The end result is a streamlined ACL
check that doesn't need to load the inode->i_op->check_acl pointer at
all for the common cached case.
The filesystems also don't need to check for a non-blocking RCU walk
case in their acl_check() functions, because that is all handled at a
VFS layer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
both callers there have dentry->d_parent stabilized by the fact that
their caller had obtained dentry from lookup_one_len() and had not
dropped ->i_mutex on parent since then.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch changes the security_inode_init_security API by adding a
filesystem specific callback to write security extended attributes.
This change is in preparation for supporting the initialization of
multiple LSM xattrs and the EVM xattr. Initially the callback function
walks an array of xattrs, writing each xattr separately, but could be
optimized to write multiple xattrs at once.
For existing security_inode_init_security() calls, which have not yet
been converted to use the new callback function, such as those in
reiserfs and ocfs2, this patch defines security_old_inode_init_security().
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Dealing with this seems trivial - the only caller of btrfs_balance() is
btrfs_ioctl() which passes the error code directly back to userspace. There
also isn't much state to unwind (if I'm wrong about this point, we can
always safely move the allocation to the top of btrfs_balance() anyway).
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
btrfs_iget() also needed an update so that errors from btrfs_locked_inode()
are caught and bubbled back up.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I moved the path allocation up a few lines to the top of the function so
that we couldn't get into the state where we've dropped delayed items and
the extent cache but fail due to -ENOMEM.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The two ->process_func call sites in tree-log.c which were ignoring a return
code have also been updated to gracefully exit as well.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
First, we can sometimes free the state we're merging, which means anybody who
calls merge_state() may have the state it passed in free'ed. This is
problematic because we could end up caching the state, which makes caching
useless as the state will no longer be part of the tree. So instead of free'ing
the state we passed into merge_state(), set it's end to the other->end and free
the other state. This way we are sure to cache the correct state. Also because
we can merge states together, instead of only using the cache'd state if it's
start == the start we are looking for, go ahead and use it if the start we are
looking for is within the range of the cached state. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We used to store the checksums of the space cache directly in the space cache,
however that doesn't work out too well if we have more space than we can fit the
checksums into the first page. So instead use the normal checksumming
infrastructure. There were problems with doing this originally but those
problems don't exist now so this works out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We keep having problems with early enospc, and that's because our method of
making space is inherently racy. The problem is we can have one guy trying to
make space for himself, and in the meantime people come in and steal his
reservation. In order to stop this we make a waitqueue and put anybody who
comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
make more space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have to do weird things when handling enospc in the transaction joining code.
Because we've already joined the transaction we cannot commit the transaction
within the reservation code since it will deadlock, so we have to return EAGAIN
and then make sure we don't retry too many times. Instead of doing this, just
do the reservation the normal way before we join the transaction, that way we
can do whatever we want to try and reclaim space, and then if it fails we know
for sure we are out of space and we can return ENOSPC. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I've been watching how many btrfs_search_slot()'s we do and I noticed that when
we create a file with selinux enabled we were doing 2 each time we initialize
the security context. That's because we lookup the xattr first so we can delete
it if we're setting a new value to an existing xattr. But in the create case we
don't have any xattrs, so it is completely useless to have the extra lookup. So
re-arrange things so that we only lookup first if we specifically have
XATTR_REPLACE. That way in the basic case we only do 1 search, and in the more
complicated case we do the normal 2 lookups. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.
It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.
It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.
Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
btrfs: fix oops when doing space balance
Btrfs: don't panic if we get an error while balancing V2
btrfs: add missing options displayed in mount output
A user reported an error where if we try to balance an fs after a device has
been removed it will blow up. This is because we get an EIO back and this is
where BUG_ON(ret) bites us in the ass. To fix we just exit. Thanks,
Reported-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are three missed mount options settable by user which are not
currently displayed in mount output.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
btrfs: fix inconsonant inode information
Btrfs: make sure to update total_bitmaps when freeing cache V3
Btrfs: fix type mismatch in find_free_extent()
Btrfs: make sure to record the transid in new inodes
When iputting the inode, We may leave the delayed nodes if they have some
delayed items that have not been dealt with. So when the inode is read again,
we must look up the relative delayed node, and use the information in it to
initialize the inode. Or we will get inconsonant inode information, it may
cause that the same directory index number is allocated again, and hit the
following oops:
[ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
[ 5447.569766] ------------[ cut here ]------------
[ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
[SNIP]
[ 5447.790721] Call Trace:
[ 5447.793191] [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
[ 5447.800156] [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
[ 5447.806517] [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
[ 5447.812876] [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
[ 5447.818961] [<ffffffff8111f840>] vfs_create+0x72/0x92
[ 5447.824090] [<ffffffff8111fa8c>] do_last+0x22c/0x40b
[ 5447.829133] [<ffffffff8112076a>] path_openat+0xc0/0x2ef
[ 5447.834438] [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
[ 5447.841216] [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
[ 5447.847846] [<ffffffff81121a79>] do_filp_open+0x3d/0x87
[ 5447.853156] [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
[ 5447.859072] [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
[ 5447.864636] [<ffffffff8111f179>] ? do_getname+0x14b/0x173
[ 5447.870112] [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
[ 5447.875682] [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
[ 5447.880882] [<ffffffff81112d39>] do_sys_open+0x69/0xae
[ 5447.886153] [<ffffffff81112db1>] sys_open+0x20/0x22
[ 5447.891114] [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
Fix it by reusing the old delayed node.
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported this bug again where we have more bitmaps than we are supposed
to. This is because we failed to load the free space cache, but don't update
the ctl->total_bitmaps counter when we remove entries from the tree. This patch
fixes this problem and we should be good to go again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
data parameter should be u64 because a full-sized chunk flags field is
passed instead of 0/1 for distinguishing data from metadata. All
underlying functions expect u64.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we create a new inode, we aren't filling in the
field that records the transaction that last changed this
inode.
If we then go to fsync that inode, it will be skipped because the field
isn't filled in.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It was pointed out by 'make versioncheck' that some includes of
linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and
fs/omfs/file.c).
This patch removes them.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: avoid delayed metadata items during commits
btrfs: fix uninitialized return value
btrfs: fix wrong reservation when doing delayed inode operations
btrfs: Remove unused sysfs code
btrfs: fix dereference of ERR_PTR value
Btrfs: fix relocation races
Btrfs: set no_trans_join after trying to expand the transaction
Btrfs: protect the pending_snapshots list with trans_lock
Btrfs: fix path leakage on subvol deletion
Btrfs: drop the delalloc_bytes check in shrink_delalloc
Btrfs: check the return value from set_anon_super
Snapshot creation has two phases. One is the initial snapshot setup,
and the second is done during commit, while nobody is allowed to modify
the root we are snapshotting.
The delayed metadata insertion code can break that rule, it does a
delayed inode update on the inode of the parent of the snapshot,
and delayed directory item insertion.
This makes sure to run the pending delayed operations before we
record the snapshot root, which avoids corruptions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When allocation fails in btrfs_read_fs_root_no_name, ret is not set
although it is returned, holding a garbage value.
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Removes code no longer used. The sysfs file itself is kept, because the
btrfs developers expressed interest in putting new entries to sysfs.
Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent commit to get rid of our trans_mutex introduced
some races with block group relocation. The problem is that relocation
needs to do some record keeping about each root, and it was relying
on the transaction mutex to coordinate things in subtle ways.
This fix adds a mutex just for the relocation code and makes sure
it doesn't have a big impact on normal operations. The race is
really fixed in btrfs_record_root_in_trans, which is where we
step back and wait for the relocation code to finish accounting
setup.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can lockup if we try to allow new writers join the transaction and we have
flushoncommit set or have a pending snapshot. This is because we set
no_trans_join and then loop around and try to wait for ordered extents again.
The problem is the ordered endio stuff needs to join the transaction, which it
can't do because no_trans_join is set. So instead wait until after this loop to
set no_trans_join and then make sure to wait for num_writers == 1 in case
anybody got started in between us exiting the loop and setting no_trans_join.
This could easily be reproduced by mounting -o flushoncommit and running xfstest
13. It cannot be reproduced with this patch. Thanks,
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently there is nothing protecting the pending_snapshots list on the
transaction. We only hold the directory mutex that we are snapshotting and a
read lock on the subvol_sem, so we could race with somebody else creating a
snapshot in a different directory and end up with list corruption. So protect
this list with the trans_lock. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The delayed ref patch accidently removed the btrfs_free_path in
btrfs_unlink_subvol, this puts it back and means we don't leak a path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: use join_transaction in btrfs_evict_inode()
Btrfs - use %pU to print fsid
Btrfs: fix extent state leak on failed nodatasum reads
btrfs: fix unlocked access of delalloc_inodes
Btrfs: avoid stack bloat in btrfs_ioctl_fs_info()
btrfs: remove 64bit alignment padding to allow extent_buffer to fit into one fewer cacheline
Btrfs: clear current->journal_info on async transaction commit
Btrfs: make sure to recheck for bitmaps in clusters
btrfs: remove unneeded includes from scrub.c
btrfs: reinitialize scrub workers
btrfs: scrub: errors in tree enumeration
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: unlock the trans lock properly
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: fix duplicate checking logic
Btrfs: fix the allocator loop logic
Btrfs: fix bitmap regression
Btrfs: don't commit the transaction if we dont have enough pinned bytes
Btrfs: noinline the cluster searching functions
Btrfs: cache bitmaps when searching for a cluster
The WARN_ON() in start_transaction() was triggered while balancing.
The cause is btrfs_relocate_chunk() started a transaction and
then called iput() on the inode that stores free space cache,
and iput() called btrfs_start_transaction() again.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Get rid of FIXME comment. Uuids from dmesg are now the same as uuids
given by btrfs-progs.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When encountering an EIO while reading from a nodatasum extent, we
insert an error record into the inode's failure tree.
btrfs_readpage_end_io_hook returns early for nodatasum inodes. We'd
better clear the failure tree in that case, otherwise the kernel
complains about
BUG extent_state: Objects remaining on kmem_cache_close()
on rmmod.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
list_splice_init will make delalloc_inodes empty, but without a spinlock
around, this may produce corrupted list head, accessed in many placess,
The race window is very tight and nobody seems to have hit it so far.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of struct btrfs_ioctl_fs_info_args is as big as 1KB, so
don't declare the variable on stack.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reorder extent_buffer to remove 8 bytes of alignment padding on 64 bit
builds. This shrinks its size to 128 bytes allowing it to fit into one
fewer cache lines and allows more objects per slab in its kmem_cache.
slabinfo extent_buffer reports :-
before:-
Sizes (bytes) Slabs
----------------------------------
Object : 136 Total : 123
SlabObj: 136 Full : 121
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 30
after :-
Object : 128 Total : 4
SlabObj: 128 Full : 2
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 32
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Normally current->jouranl_info is cleared by commit_transaction. For an
async snap or subvol creation, though, it runs in a work queue. Clear
it in btrfs_commit_transaction_async() to avoid leaking a non-NULL
journal_info when we return to userspace. When the actual commit runs in
the other thread it won't care that it's current->journal_info is already
NULL.
Signed-off-by: Sage Weil <sage@newdream.net>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef recently changed the free extent cache to look in
the block group cluster for any bitmaps before trying to
add a new bitmap for the same offset. This avoids BUG_ON()s due
covering duplicate ranges.
But it didn't go quite far enough. A given free range might span
between one or more bitmaps or free space entries. The code has
looping to cover this, but it doesn't check for clustered bitmaps
every time.
This shuffles our gotos to check for a bitmap in the cluster
for every new bitmap entry we try to add.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Scrub starts the workers each time a scrub starts and stops them after it
finished. This patch adds an initialization for the workers before each
start, otherwise the workers behave strangely.
Signed-off-by: Arne Jansen <sensille@gmx.net>
due to the semantics of btrfs_search_slot the path can point to an
invalid slot when ret > 0. This condition went unnoticed, which in
turn could have led to an incomplete scrubbing.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
In btrfs_wait_for_commit if we came upon a transaction that had committed we
just exited, but that's bad since we are holding the trans_lock. So break
instead so that the lock is dropped. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
When merging my code into the integration test the second check for duplicate
entries got screwed up. This patch fixes it by dropping ret2 and just using ret
for the return value, and checking if we got an error before adding the bitmap
to the local list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I was testing with empty_cluster = 0 to try and reproduce a problem and kept
hitting early enospc panics. This was because our loop logic was a little
confused. So this is what I did
1) Make the loop variable the ultimate decider on wether we should loop again
isntead of checking to see if we had an uncached bg, empty size or empty
cluster.
2) Increment loop before checking to see what we are on to make the loop
definitions make more sense.
3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0
unless we didn't actually allocate a chunk. If we did allocate a chunk we
should be able to easily setup a new cluster so clearing
empty_size/empty_cluster makes us less efficient.
This kept me from hitting panics while trying to reproduce the other problem.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In cleaning up the clustering code I accidently introduced a regression by
adding bitmap entries to the cluster rb tree. The problem is if we've maxed out
the number of bitmaps we can have for the block group we can only add free space
to the bitmaps, but since the bitmap is on the cluster we can't find it and we
try to create another one. This would result in a panic because the total
bitmaps was bigger than the max bitmaps that were allowed. This patch fixes
this by checking to see if we have a cluster, and then looking at the cluster rb
tree to see if it has a bitmap entry and if it does and that space belongs to
that bitmap, go ahead and add it to that bitmap.
I could hit this panic every time with an fs_mark test within a couple of
minutes. With this patch I no longer hit the panic and fs_mark goes to
completion. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed when running an enospc test that we would get stuck committing the
transaction in check_data_space even though we truly didn't have enough space.
So check to see if bytes_pinned is bigger than num_bytes, if it's not don't
commit the transaction. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When profiling the find cluster code it's hard to tell where we are spending our
time because the bitmap and non-bitmap functions get inlined by the compiler, so
make that not happen. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we are looking for a cluster in a particularly sparse or fragmented block
group, we will do a lot of looping through the free space tree looking for
various things, and if we need to look at bitmaps we will endup doing the whole
dance twice. So instead add the bitmap entries to a temporary list so if we
have to do the bitmap search we can just look through the list of entries we've
found quickly instead of having to loop through the entire tree again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
lmLogOpen() broken failure exit
usb: remove bad dput after dentry_unhash
more conservative S_NOSEC handling
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
btrfs: fix uninitialized variable warning
btrfs: add helper for fs_info->closing
Btrfs: add mount -o inode_cache
btrfs: scrub: add explicit plugging
btrfs: use btrfs_ino to access inode number
Btrfs: don't save the inode cache if we are deleting this root
btrfs: false BUG_ON when degraded
Btrfs: don't save the inode cache in non-FS roots
Btrfs: make sure we don't overflow the free space cache crc page
Btrfs: fix uninit variable in the delayed inode code
btrfs: scrub: don't reuse bios and pages
Btrfs: leave spinning on lookup and map the leaf
Btrfs: check for duplicate entries in the free space cache
Btrfs: don't try to allocate from a block group that doesn't have enough space
Btrfs: don't always do readahead
Btrfs: try not to sleep as much when doing slow caching
Btrfs: kill BTRFS_I(inode)->block_group
Btrfs: don't look at the extent buffer level 3 times in a row
Btrfs: map the node block when looking for readahead targets
Btrfs: set range_start to the right start in count_range_bits
...
With Linus' tree, today's linux-next build (powercp ppc64_defconfig)
produced this warning:
fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used
uninitialized in this function
Introduced by commit 16cdcec736 ("btrfs: implement delayed inode items
operation").
This fixes a bug in btrfs_update_inode(): if the returned value from
btrfs_delayed_update_inode is a nonzero garbage, inode stat data are not
updated and several call paths may hit a BUG_ON or fail with strange
code.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David Sterba <dsterba@suse.cz>
This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the removal of the implicit plugging scrub ends up doing more and
smaller I/O than necessary. This patch adds explicit plugging per chunk.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses inode
number directly while it should use the helper with the new inode
number allocator.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With xfstest 254 I can panic the box every time with the inode number caching
stuff on. This is because we clean the inodes out when we delete the subvolume,
but then we write out the inode cache which adds an inode to the subvolume inode
tree, and then when it gets evicted again the root gets added back on the dead
roots list and is deleted again, so we have a double free. To stop this from
happening just return 0 if refs is 0 (and we're not the tree root since tree
root always has refs of 0). With this fix 254 no longer panics. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In degraded mode the struct btrfs_device of missing devs don't have
device->name set. A kstrdup of NULL correctly returns NULL. Don't
BUG in this case.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds extra checks to make sure the inode map we are caching really
belongs to a FS root instead of a special relocation tree. It
prevents crashes during balancing operations.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The free space cache uses only one page for crcs right now,
which means we can't have a cache file bigger than the
crcs we can fit in the first page. This adds a check to
enforce that restriction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Maosn <chris.mason@oracle.com>
Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.
AFAICS, the only reasonably sane way to deal with that is
* new superblock flag; unless set, S_NOSEC is not going to be set.
* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.
It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
Cache xattr security drop check for write v2
fs: block_page_mkwrite should wait for writeback to finish
mm: Wait for writeback when grabbing pages to begin a write
configfs: remove unnecessary dentry_unhash on rmdir, dir rename
fat: remove unnecessary dentry_unhash on rmdir, dir rename
hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
minix: remove unnecessary dentry_unhash on rmdir, dir rename
fuse: remove unnecessary dentry_unhash on rmdir, dir rename
coda: remove unnecessary dentry_unhash on rmdir, dir rename
afs: remove unnecessary dentry_unhash on rmdir, dir rename
affs: remove unnecessary dentry_unhash on rmdir, dir rename
9p: remove unnecessary dentry_unhash on rmdir, dir rename
ncpfs: fix rename over directory with dangling references
ncpfs: document dentry_unhash usage
ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
hfs: remove unnecessary dentry_unhash on rmdir, dir rename
omfs: remove unnecessary dentry_unhash on rmdir, dir rneame
udf: remove unnecessary dentry_unhash from rmdir, dir rename
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (58 commits)
Btrfs: use the device_list_mutex during write_dev_supers
Btrfs: setup free ino caching in a more asynchronous way
btrfs scrub: don't coalesce pages that are logically discontiguous
Btrfs: return -ENOMEM in clear_extent_bit
Btrfs: add mount -o auto_defrag
Btrfs: using rcu lock in the reader side of devices list
Btrfs: drop unnecessary device lock
Btrfs: fix the race between remove dev and alloc chunk
Btrfs: fix the race between reading and updating devices
Btrfs: fix bh leak on __btrfs_open_devices path
Btrfs: fix unsafe usage of merge_state
Btrfs: allocate extent state and check the result properly
fs/btrfs: Add missing btrfs_free_path
Btrfs: check return value of btrfs_inc_extent_ref()
Btrfs: return error to caller if read_one_inode() fails
Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Btrfs: return error code to caller when btrfs_del_item fails
Btrfs: return error code to caller when btrfs_previous_item fails
btrfs: fix typo 'testeing' -> 'testing'
btrfs: typo: 'btrfS' -> 'btrfs'
...
write_dev_supers was changed to use RCU to protect the list of
devices, but it was then sleeping while it actually wrote the supers.
This fixes it to just use the mutex, since we really don't any
concurrency in write_dev_supers anyway.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.
This is just the prototype change with no user for it yet. I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.
Also remove incorrect comments that ->dirty_inode can't block. That
has been changed a long time ago, and many implementations rely on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For a filesystem that has lots of files in it, the first time we mount
it with free ino caching support, it can take quite a long time to
setup the caching before we can create new files.
Here we fill the cache with [highest_ino, BTRFS_LAST_FREE_OBJECTID]
before we start the caching thread to search through the extent tree.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
scrub_page collects several pages into one bio as long as they are physically
contiguous. As we only save one logical address for the whole bio, don't
collect pages that are physically contiguous but logically discontiguous.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This will detect small random writes into files and
queue the up for an auto defrag process. It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
xen: cleancache shim to Xen Transcendent Memory
ocfs2: add cleancache support
ext4: add cleancache support
btrfs: add cleancache support
ext3: add cleancache support
mm/fs: add hooks to support cleancache
mm: cleancache core ops functions and config
fs: add field to superblock to support cleancache
mm/fs: cleancache documentation
Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs. Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted. Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Drop device_list_mutex for the reader side on clone_fs_devices and
btrfs_rm_device pathes since the fs_info->volume_mutex can ensure the device
list is not updated
btrfs_close_extra_devices is the initialized path, we can not add or remove
device at this time, so we can simply drop the mutex safely, like other
initialized function does(add_missing_dev, __find_device, __btrfs_open_devices
...).
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On remove device path, it updates device->dev_alloc_list but does not hold
chunk lock
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs_congested_fn and __unplug_io_fn paths, we should hold
device_list_mutex to avoid remove/add device path to
update fs_devices->devices
On __btrfs_close_devices and btrfs_prepare_sprout paths, the devices in
fs_devices->devices or fs_devices->devices is updated, so we should hold
the mutex to avoid the reader side to reach them
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'bh' is forgot to release if no error is detected
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
merge_state can free the current state if it can be merged with the next node,
but in set_extent_bit(), after merge_state, we still use the current extent to
get the next node and cache it into cached_state
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It doesn't allocate extent_state and check the result properly:
- in set_extent_bit, it doesn't allocate extent_state if the path is not
allowed wait
- in clear_extent_bit, it doesn't check the result after atomic-ly allocate,
we trigger BUG_ON() if it's fail
- if allocate fail, we trigger BUG_ON instead of returning -ENOMEM since
the return value of clear_extent_bit() is ignored by many callers
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs_alloc_path should be matched with btrfs_free_path in error-handling code.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
local idexpression struct btrfs_path * x;
expression ra,rb;
position p1,p2;
@@
x = btrfs_alloc_path@p1(...)
... when != btrfs_free_path(x,...)
when != if (...) { ... btrfs_free_path(x,...) ...}
when != x = ra
if(...) { ... when != x = rb
when forall
when != btrfs_free_path(x,...)
\(return <+...x...+>; \| return@p2...; \) }
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
cocci.print_main("alloc",p1)
cocci.print_secs("return",p2)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If return value of btrfs_inc_extent_ref() is not 0, BUG() is called.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When read_one_inode() fails, error code is returned to caller instead
of BUG_ON().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error code is returned instead of calling BUG_ON when
btrfs_del_item returns the error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error code is returned instead of calling BUG_ON when
btrfs_previous_item returns the error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Observed as a large delay when --mixed filesystem is filled up.
Test example:
1. create tiny --mixed FS:
$ dd if=/dev/zero of=2G.img seek=$((2048 * 1024 * 1024 - 1)) count=1 bs=1
$ mkfs.btrfs --mixed 2G.img
$ mount -oloop 2G.img /mnt/ut/
2. Try to fill it up:
$ dd if=/dev/urandom of=10M.file bs=10240 count=1024
$ seq 1 256 | while read file_no; do echo $file_no; time cp 10M.file ${file_no}.copy; done
Up to '200.copy' it goes fast, but when disk fills-up each -ENOSPC
message takes 3 seconds to pop-up _every_ ENOSPC (and in usermode linux
it's even more: 30-60 seconds!). (Maybe, time depends on kernel's timer resolution).
No IO, no CPU load, just rescheduling. Some debugging revealed busy spinning
in shrink_delalloc.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In 2008, commit b4f6c45dfb dropped the use
of fs/btrfs/version.sh, but left the script behind. Kill it.
Commit by Jamey Sharp and Josh Triplett.
Signed-off-by: Jamey Sharp <jamey@minilop.net>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs's tree search ioctl has a field to indicate that no more than a
given number of records should be returned. The ioctl doesn't honour
this, as the tested value is not incremented until the end of the
copy_to_sk function. This patch removes an unnecessary local variable,
and updates the num_found counter as each key is found in the tree.
Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
240f62c875 replaced the node_lock with rcu_read_lock, but forgot
to remove the actual lock in the data structure. Remove it here.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On lookup we only want to read the inode item, so leave the path spinning. Also
we're just wholesale reading the leaf off, so map the leaf so we don't do a
bunch of kmap/kunmaps. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If there are duplicate entries in the free space cache, discard the entire cache
and load it the old fashioned way. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have a very large filesystem, we can spend a lot of time in
find_free_extent just trying to allocate from empty block groups. So instead
check to see if the block group even has enough space for the allocation, and if
not go on to the next block group.
Signed-off-by: Josef Bacik <josef@redhat.com>
Our readahead is sort of sloppy, and really isn't always needed. For example if
ls is doing a stating ls (which is the default) it's going to stat in non-disk
order, so if say you have a directory with a stupid amount of files, readahead
is going to do nothing but waste time in the case of doing the stat. Taking the
unconditional readahead out made my test go from 57 minutes to 36 minutes. This
means that everywhere we do loop through the tree we want to make sure we do set
path->reada properly, so I went through and found all of the places where we
loop through the path and set reada to 1. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When the fs is super full and we unmount the fs, we could get stuck in this
thing where unmount is waiting for the caching kthread to make progress and the
caching kthread keeps scheduling because we're in the middle of a commit. So
instead just let the caching kthread keep going and only yeild if
need_resched(). This makes my horrible umount case go from taking up to 10
minutes to taking less than 20 seconds. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull. In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable. So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have a bit of debugging in btrfs_search_slot to make sure the level of the
cow block is the same as the original block we were cow'ing. I don't think I've
ever seen this tripped, so kill it. This saves us 2 kmap's per level in our
search. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have particularly full nodes, we could call btrfs_node_blockptr up to 32
times, which is 32 pairs of kmap/kunmap, which _sucks_. So go ahead and map the
extent buffer while we look for readahead targets. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results. For example, if the range
[0-8192]
is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096]. So instead set range_start = max(cur_start, state->start).
This makes everything come out right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up. This is because they tend to do snapshots
alot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen. This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen. So in order
to fix this we need to split all of the reservation stuf up. So with this patch
we have
1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.
2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.
3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.
Hopefully this will make the ceph problem go away. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We use trans_mutex for lots of things, here's a basic list
1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists
Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction. So replace the
trans_mutex with a trans_lock spinlock and use it to do the following
1) Protect fs_info->running_transaction. All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list. This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list. This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work. So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.
I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We currently track trans handles in current->journal_info, but we don't actually
use it. This patch fixes it. This will cover the case where we have multiple
people starting transactions down the call chain. This keeps us from having to
allocate a new handle and all of that, we just increase the use count of the
current handle, save the old block_rsv, and return. I tested this with xfstests
and it worked out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :). So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In the prealloc filling code and compressed code we don't set trans->block_rsv
to the delalloc block reserve properly, which is going to make us use metadata
from the wrong pool, this patch fixes that. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
b43: fix comment typo reqest -> request
Haavard Skinnemoen has left Atmel
cris: typo in mach-fs Makefile
Kconfig: fix copy/paste-ism for dell-wmi-aio driver
doc: timers-howto: fix a typo ("unsgined")
perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
treewide: fix a few typos in comments
regulator: change debug statement be consistent with the style of the rest
Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
audit: acquire creds selectively to reduce atomic op overhead
rtlwifi: don't touch with treewide double semicolon removal
treewide: cleanup continuations and remove logging message whitespace
ath9k_hw: don't touch with treewide double semicolon removal
include/linux/leds-regulator.h: fix syntax in example code
tty: fix typo in descripton of tty_termios_encode_baud_rate
xtensa: remove obsolete BKL kernel option from defconfig
m68k: fix comment typo 'occcured'
arch:Kconfig.locks Remove unused config option.
treewide: remove extra semicolons
...
The current code relogs the entire inode every time during fsync log,
and it is much better suited to small files rather than large ones.
During my performance test, the fsync performace of large files sucks,
and we can ascribe this to the tremendous amount of csum infos of the
large ones, cause we have to flush all of these csum infos into log trees
even when there are only _one_ change in the whole file data. Apparently,
to optimize fsync, we need to create a filter to skip the unnecessary csum
ones, that is, the corresponding file data remains unchanged before this fsync.
Here I have some test results to show, I use sysbench to do "random write + fsync".
===
sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= [prepare, run]
===
Sysbench args:
- Number of threads: 1
- Extra file open flags: 0
- 2 files, 4Gb each
- Block size 4Kb
- Number of random requests for random IO: 10000
- Read/Write ratio for combined random IO test: 1.50
- Periodic FSYNC enabled, calling fsync() each 100 requests.
- Calling fsync() at the end of test, Enabled.
- Using synchronous I/O mode
- Doing random write test
Sysbench results:
===
Operations performed: 0 Read, 10000 Write, 200 Other = 10200 Total
Read 0b Written 39.062Mb Total transferred 39.062Mb
===
a) without patch: (*SPEED* : 451.01Kb/sec)
112.75 Requests/sec executed
b) with patch: (*SPEED* : 4.7533Mb/sec)
1216.84 Requests/sec executed
PS: I've made a _sub transid_ stuff patch, but it does not perform as effectively as this patch,
and I'm wanderring where the problem is and trying to improve it more.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.
So this fixes things up a bit, using
grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.
There are more of them around (mostly network drivers), but this gets
many core ones.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steps to reproduce the bug:
- Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL
- Call FS_IOC_SETFLAGS ioctl with flags=0
- Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set!
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
COW in btrfs, but FS_NOCOW_FL is sufficient.
The fact is we don't have corresponding BTRFS_INODE_COW flag.
COW is default, and FS_NOCOW_FL can be used to switch off COW for
a single file.
If we mount btrfs with nodatacow, a newly created file will be set with
the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
FS_NOCOW_FL flag.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a btrfs disk is created by mixed data & metadata option, it will have no
pure data or pure metadata space info.
In btrfs's for-linus branch, commit 78b1ea13838039cd88afdd62519b40b344d6c920
(Btrfs: fix OOPS of empty filesystem after balance) initializes space infos at
the very beginning. The problem is this initialization does not take the mixed
case into account, which will cause btrfs will easily get into ENOSPC in mixed
case.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If posix_acl_from_xattr() returns an error code, a negative address is
dereferenced causing an oops; fix by checking for error code first.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
currently alloc_start is disregarded if the requested
chunk size is bigger than (device size - alloc_start),
but smaller than the device size.
The only situation where I see this could have made sense
was when a chunk equal the size of the device has been
requested. This was possible as the allocator failed to
take alloc_start into account when calculating the request
chunk size. As this gets fixed by this patch, the workaround
is not necessary anymore.
btrfs scrub - make fixups sync, don't reuse fixup bios
Fixups are already sync for csum failures, this patch makes them sync
for EIO case as well.
Fixups are now sharing pages with the parent sbio - instead of
allocating a separate page to do a fixup we grab the page from the sbio
buffer.
Fixup bios are no longer reused.
struct fixup is no longer needed, instead pass [sbio pointer, index].
Originally this was added to look at the possibility of sharing the code
between drive swap and scrub, but it actually fixes a serious bug in
scrub code where errors that could be corrected were ignored and
reported as uncorrectable.
btrfs scrub - restore bios properly after media errors
The current code reallocates a bio after a media error. This is a
temporary measure introduced in v3 after a serious problem related to
bio reuse was found in v2 of scrub patchset.
Basically we did not reset bv_offset and bv_len fields of the bio_vec
structure. They are changed in case I/O error happens, for example, at
offset 512 or 1024 into the page. Also bi_flags field wasn't properly
setup before reusing the bio.
Signed-off-by: Arne Jansen <sensille@gmx.net>
adds ioctls necessary to start and cancel scrubs, to get current
progress and to get info about devices to be scrubbed.
Note that the scrub is done per-device and that the ioctl only
returns after the scrub for this devices is finished or has been
canceled.
Signed-off-by: Arne Jansen <sensille@gmx.net>
This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.
This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Remove code which has been #if0-ed out for a very long time and does not
seem to be related to current codebase anymore.
Signed-off-by: David Sterba <dsterba@suse.cz>
Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.
text data bss dec hex filename
402081 7464 200 409745 64091 btrfs.ko.base
398620 7144 200 405964 631cc btrfs.ko.remove-all
Signed-off-by: David Sterba <dsterba@suse.cz>
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")
Signed-off-by: David Sterba <dsterba@suse.cz>
all callers pass GFP_NOFS, but the GFP mask argument is not used in the
function; GFP_ATOMIC is passed to radix tree initialization and it's the
only correct one, since we're using the preload/insert mechanism of
radix tree.
Let's drop the gfp mask from btrfs function, this will not change
behaviour.
Signed-off-by: David Sterba <dsterba@suse.cz>
The superblock's ->s_fs_info is properly set in btrfs_fill_super, after
a call to open_ctree, which derefereces it before check. Although
tree_root is set via btrfs_set_super, let's be defensive and leave the
check in place.
Signed-off-by: David Sterba <dsterba@suse.cz>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: cleanup error handling in inode.c
Btrfs: put the right bio if we have an error
Btrfs: free bitmaps properly when evicting the cache
Btrfs: Free free_space item properly in btrfs_trim_block_group()
btrfs: add missing spin_unlock to a rare exit path
Btrfs: check return value of kmalloc()
btrfs: fix wrong allocating flag when reading page
Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
The error processing of several places is changed like setting the
error number only at the error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_submit_direct_hook if the first btrfs_map_block fails we need to put
the orig_bio, not bio.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If our space cache is wrong, we do the right thing and free up everything that
we loaded, however we don't reset the total_bitmaps counter or the thresholds or
anything. So in btrfs_remove_free_space_cache make sure to call free_bitmap()
if it's a bitmap, this will keep us from panicing when we check to make sure we
don't have too many bitmaps. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since commit dc89e98244, we've changed
to use a specific slab for alocation of free_space items.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The check on the return value of kmalloc() is added to some places.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
the space cache use extent_readpages() to read free space information,
so we can not use GFP_KERNEL flag to allocate memory, or it may lead
to deadlock.
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is necessary to unlock mutex_lock before it return an error when
btrfs_alloc_path() fails.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is similar to block group caching.
We dedicate a special inode in fs tree to save free ino cache.
At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.
To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.
So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.
There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.
Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Extract out block group specific code from lookup_free_space_inode(),
create_free_space_inode(), load_free_space_cache() and
btrfs_write_out_cache(), so the code can be used to read/write
free ino cache.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
So we can re-use the code to cache free inode numbers.
The change is quite straightforward. Two new structures are introduced.
- struct btrfs_free_space_ctl
We move those variables that are used for caching free space from
struct btrfs_block_group_cache to this new struct.
- struct btrfs_free_space_op
We do block group specific work (e.g. calculation of extents threshold)
through functions registered in this struct.
And then we can remove references to struct btrfs_block_group_cache.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
The Btrfs submit bio threads have a small number of
threads responsible for pushing down bios we've collected
for a large number of devices.
Since we do all the bios for a single device at once,
we want to make sure we unplug and send down the bios
for each device as we're done processing them.
The new plugging API removed the btrfs code to
unplug while processing bios, this adds it back with
the new API.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: fix free space cache leak
Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
Btrfs end_bio_extent_readpage should look for locked bits
Btrfs: don't force chunk allocation in find_free_extent
Btrfs: Check validity before setting an acl
Btrfs: Fix incorrect inode nlink in btrfs_link()
Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
Btrfs: make uncache_state unconditional
btrfs: using cached extent_state in set/unlock combinations
Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
Btrfs: fix subvolume mount by name problem when default mount subvolume is set
fix user annotation in ioctl.c
Btrfs: check for duplicate iov_base's when doing dio reads
btrfs: properly handle overlapping areas in memmove_extent_buffer
Btrfs: fix memory leaks in btrfs_new_inode()
Btrfs: check for duplicate iov_base's when doing dio reads
Btrfs: reuse the extent_map we found when calling btrfs_get_extent
Btrfs: do not use async submit for small DIO io's
Btrfs: don't split dio bios if we don't have to
...
The free space caching code was recently reworked to
cache all the pages it needed instead of using find_get_page everywhere.
One loop was missed though, so it ended up leaking pages. This fixes
it to use our page array instead of find_get_page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everytime we try to allocate disk space we try and see if we can pre-emptively
allocate a chunk, but in the common case we don't allocate anything, so there is
no sense in taking the chunk_mutex at all. So instead if we are allocating a
chunk, mark it in the space_info so we don't get two people trying to allocate
at the same time. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Liu Bo <liubo2009@cn.fujitsu.com>
A recent commit caches the extent state in end_bio_extent_readpage,
but the search it does should look for locked extents. This
fixes things to make it more effective.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_free_extent likes to allocate in contiguous clusters,
which makes writeback faster, especially on SSD storage. As
the FS fragments, these clusters become harder to find and we have
to decide between allocating a new chunk to make more clusters
or giving up on the cluster to allocate from the free space
we have.
Right now it creates too many chunks, and you can end up with
a whole FS that is mostly empty metadata chunks. This commit
changes the allocation code to be more strict and only
allocate new chunks when we've made good use of the chunks we
already have.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Call posix_acl_valid() to check if an acl is valid or not.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Link count of the inode is not decreased if btrfs_set_inode_index()
fails.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Singed-off-by: Li Zefan <lizf@cn.fujitsu.com>
btrfs_next_leaf() can return -errno, and we should propagate
it to userspace.
This also simplifies how we walk the btree path.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
btrfs_next_leaf() can return -errno, and we should propagate
it to userspace.
This also simplifies how we walk the btree path.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
The extent_io code can take cached pointers into the extent state trees,
and these can make lookups much faster in common operations. The
caching only happens when specific bits are set that prevent merging
and splitting of the extent state.
A help function was added to uncache the state, and it was testing
the same set of conditionals. This can leak in very strange corner
cases where the lock bit goes away unexpectedly.
The uncaching should be unconditional. Once we have a ref on the
extent we should always give it up.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In several places the sequence (set_extent_uptodate, unlock_extent) is used.
This leads to a duplicate lookup of the extent state. This patch lets
set_extent_uptodate return a cached extent_state which can be passed to
unlock_extent_cached.
The occurences of the above sequences are updated to use the cache. Only
end_bio_extent_readpage is updated that it first gets a cached state to
pass it to the readpage_end_io_hook as the prototype requested and is later
on being used for set/unlock.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I've been working on making our O_DIRECT latency not suck and I noticed we were
taking the trans_mutex in btrfs_end_transaction. So to do this we convert
num_writers and use_count to atomic_t's and just decrement them in
btrfs_end_transaction. Instead of deleting the transaction from the trans list
in put_transaction we do that in btrfs_commit_transaction() since that's the
only time it actually needs to be removed from the list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We create two subvolumes (meego_root and meego_home) in
btrfs root directory. And set meego_root as default mount
subvolume. After we remount btrfs, meego_root is mounted
to top directory by default. Then when we try to mount
meego_home (subvol=meego_home) to a subdirectory, it failed.
The problem is when default mount subvolume is set to
meego_root, we search meego_home in meego_root but can not find
it. So the solution is to add a new mount option (subvolrootid)
to specify subvol id of root and search subvol name in it. For
our case, now we can use "-o subvolrootid=0,subvol=meego_home)
to mount meego_home.
Detail information can be found in meego bugzilla:
https://bugs.meego.com/show_bug.cgi?id=15055
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets. This is what Windows does under qemu. The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine. So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes memory leaks in btrfs_new_inode().
Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets. This is what Windows does under qemu. The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine. So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In btrfs_get_block_direct we call btrfs_get_extent to lookup the extent for the
range that we are looking for. If we don't find an extent, btrfs_get_extent
will insert a extent_map for that area and mark it as a hole. So it does the
job of allocating a new extent map and inserting it into the io tree. But if
we're creating a new extent we free it up and redo all of that work. So instead
pass the em to btrfs_new_extent_direct(), and if it will work just allocate the
disk space and set it up properly and bypass the freeing/allocating of a new
extent map and the expensive operation of inserting the thing into the io_tree.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When looking at our DIO performance Chris said that for small IO's doing the
async submit stuff tends to be more overhead than it's worth. With this on top
of my other fixes I get about a 17-20% speedup doing a sequential dd with 4k
IO's. Basically if we don't have to split the bio for the map length it's small
enough to be directly submitted, otherwise go back to the async submit. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have been unconditionally allocating a new bio and re-adding all pages from
our original bio to the new bio. This is needed if our original bio is larger
than our stripe size, but if it is smaller than the stripe size then there is no
need to do this. So check the map length and if we are under that then go ahead
and submit the original bio. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In the DIO code we often don't update the i_disk_size because the i_size isn't
updated until after the DIO is completed, so basically we are allocating a path,
doing a search, and updating the inode item for no reason since nothing changed.
btrfs_ordered_update_i_size will return 1 if it didn't update i_disk_size, so
only run btrfs_update_inode if btrfs_ordered_update_i_size returns 0. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Instead of calling kmap_atomic for every thing we set in the inode item, map the
entire inode item at the start and unmap it at the end. This makes a sequential
dd of 400mb O_DIRECT something like 1% faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I saw a lockup where we kept getting into this start transaction->commit
transaction loop because of enospce. The fact is if we fail to make our
reservation, we've tried _everything_ several times, so we only need to try and
commit the transaction once, and if that doesn't work then we really are out of
space and need to just exit. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we don't handle running out of space in the cache, so to fix this we
keep track of how far in the cache we are. Then we only dirty the pages if we
successfully modify all of them, otherwise if we have an error or run out of
space we can just drop them and not worry about the vm writing them out.
Thanks,
Tested-by Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: don't warn in btrfs_add_orphan
Btrfs: fix free space cache when there are pinned extents and clusters V2
Btrfs: Fix uninitialized root flags for subvolumes
btrfs: clear __GFP_FS flag in the space cache inode
Btrfs: fix memory leak in start_transaction()
Btrfs: fix memory leak in btrfs_ioctl_start_sync()
Btrfs: fix subvol_sem leak in btrfs_rename()
Btrfs: Fix oops for defrag with compression turned on
Btrfs: fix /proc/mounts info.
Btrfs: fix compiler warning in file.c
When I moved the orphan adding to btrfs_truncate I missed the fact that during
orphan cleanup we just add the orphan items to the orphan list without going
through btrfs_orphan_add, which results in lots of warnings on mount if you have
any orphan items that need to be truncated. Just remove this warning since it's
ok, this will allow all of the normal space accounting take place. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I noticed a huge problem with the free space cache that was presenting
as an early ENOSPC. Turns out when writing the free space cache out I
forgot to take into account pinned extents and more importantly
clusters. This would result in us leaking free space everytime we
unmounted the filesystem and remounted it.
I fix this by making sure to check and see if the current block group
has a cluster and writing out any entries that are in the cluster to the
cache, as well as writing any pinned extents we currently have to the
cache since those will be available for us to use the next time the fs
mounts.
This patch also adds a check to the end of load_free_space_cache to make
sure we got the right amount of free space cache, and if not make sure
to clear the cache and re-cache the old fashioned way.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
root_item->flags and root_item->byte_limit are not initialized when
a subvolume is created. This bug is not revealed until we added
readonly snapshot support - now you mount a btrfs filesystem and you
may find the subvolumes in it are readonly.
To work around this problem, we steal a bit from root_item->inode_item->flags,
and use it to indicate if those fields have been properly initialized.
When we read a tree root from disk, we check if the bit is set, and if
not we'll set the flag and initialize the two fields of the root item.
Reported-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Andreas Philipp <philipp.andreas@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
the object id of the space cache inode's key is allocated from the relative
root, just like the regular file. So we can't identify space cache inode by
checking the object id of the inode's key, and we have to clear __GFP_FS flag
at the time we look up the space cache inode.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Free btrfs_trans_handle when join_transaction() fails
in start_transaction()
Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_rename() does not release the subvol_sem if the transaction failed to start.
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we defrag a file, whose size can be fit into an inline extent,
with compression enabled, the compress type is set to be
fs_info->compress_type, which is 0 if the btrfs filesystem is mounted
without compress option. This leads to oops.
Reported-by: Daniel Blueman <daniel.blueman@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Some mount options are not displayed by /proc/mounts.
This patch displays the option such as compress_type by /proc/mounts.
Ex.
[before]
$ mount | grep sdc2
/dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo)
$ cat /proc/mounts | grep sdc2
/dev/sdc2 /test12 btrfs rw,relatime,compress 0 0
[after]
$ mount | grep sdc2
/dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo)
$ cat /proc/mounts | grep sdc2
/dev/sdc2 /test12 btrfs rw,relatime,compress=lzo,space_cache 0 0
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While compiling Btrfs, I got following messages:
CC [M] fs/btrfs/file.o
fs/btrfs/file.c: In function '__btrfs_buffered_write':
fs/btrfs/file.c:909: warning: 'ret' may be used uninitialized in this function
CC [M] fs/btrfs/tree-defrag.o
This patch fixes compiler warning.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
Btrfs: fix __btrfs_map_block on 32 bit machines
btrfs: fix possible deadlock by clearing __GFP_FS flag
btrfs: check link counter overflow in link(2)
btrfs: don't mess with i_nlink of unlocked inode in rename()
Btrfs: check return value of btrfs_alloc_path()
Btrfs: fix OOPS of empty filesystem after balance
Btrfs: fix memory leak of empty filesystem after balance
Btrfs: fix return value of setflags ioctl
Btrfs: fix uncheck memory allocations
btrfs: make inode ref log recovery faster
Btrfs: add btrfs_trim_fs() to handle FITRIM
Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
Btrfs: make update_reserved_bytes() public
btrfs: return EXDEV when linking from different subvolumes
Btrfs: Per file/directory controls for COW and compression
Btrfs: add datacow flag in inode flag
btrfs: use GFP_NOFS instead of GFP_KERNEL
Btrfs: check return value of read_tree_block()
btrfs: properly access unaligned checksum buffer
...
Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
the block layer.
Recent changes for discard support didn't compile,
this fixes them not to try and % 64 bit numbers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Using the GFP_HIGHUSER_MOVABLE flag to allocate the metadata's page may cause
deadlock.
Task1
open()
...
btrfs_search_slot()
...
btrfs_cow_block()
...
alloc_page()
wait for reclaiming
shrink_slab()
...
shrink_icache_memory()
...
btrfs_evict_inode()
...
btrfs_search_slot()
If the path is locked by task1, the deadlock happens.
So the btree's page cache is different with the file's page cache, it can not
allocate pages by GFP_HIGHUSER_MOVABLE flag, we must clear __GFP_FS flag in
GFP_HIGHUSER_MOVABLE flag.
Reported-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
old_inode is not locked; it's not safe to play with its link
count. Instead of bumping it and calling btrfs_unlink_inode(),
add a variant of the latter that does not do btrfs_drop_nlink()/
btrfs_update_inode(), call it instead of btrfs_inc_nlink()/
btrfs_unlink_inode() and do btrfs_update_inode() ourselves.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Adding the check on the return value of btrfs_alloc_path() to several places.
And, some of callers are modified by this change.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs will remove unused block groups after balance.
When a empty filesystem is balanced, the block group with tag "DATA" may be
dropped, and after umount and mount again, it will not find "DATA" space_info
and lead to OOPS.
So we initial the necessary space_infos(DATA, SYSTEM, METADATA) to avoid OOPS.
Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After Josef's patch(commit 3c14874acc),
btrfs will exclude super bytes when reading block groups(by marking a extent
state UPTODATE). However, these bytes do not get freed while balance remove
unused block groups, and we won't process those removed ones any more, when
we do umount and unload the btrfs module, btrfs hits a memory leak.
This patch add the missing free operation.
Reproduce steps:
$ mkfs.btrfs disk
$ mount disk /mnt/btrfs -o loop
$ btrfs filesystem balance /mnt/btrfs
$ umount /mnt/btrfs
$ rmmod btrfs
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
setflags ioctl should return error when any checks fail.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To make Btrfs code more robust, several return value checks where memory
allocation can fail are introduced. I use BUG_ON where I don't know how
to handle the error properly, which increases the number of using the
notorious BUG_ON, though.
Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we recover from crash via write-ahead log tree and process
the inode refs, for each btrfs_inode_ref item, we will
1) check if we already have a perfect match in fs/file tree, if
we have, then we're done.
2) search the corresponding back reference in fs/file tree, and
check all the names in this back reference to see if they are
also in the log to avoid conflict corners.
3) recover the logged inode refs to fs/file tree.
In current btrfs, however,
- for 2)'s check, once is enough, since the checked back reference
will remain unchanged after processing all the inode refs belonged
to the key.
- it has no need to do another 1) between 2) and 3).
I've made a small test to show how it improves,
$dd if=/dev/zero of=foobar bs=4K count=1
$sync
$make 100 hard links continuously, like ln foobar link_i
$fsync foobar
$echo b > /proc/sysrq-trigger
after reboot
$time mount DEV PATH
without patch:
real 0m0.285s
user 0m0.001s
sys 0m0.009s
with patch:
real 0m0.123s
user 0m0.000s
sys 0m0.010s
Changelog v1->v2:
- fix double free - pointed by David Sterba
Changelog v2->v3:
- adjust free order
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We take an free extent out from allocator, trim it, then put it back,
but before we trim the block group, we should make sure the block group is
cached, so plus a little change to make cache_block_group() run without a
transaction.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Callers of btrfs_discard_extent() should check if we are mounted with -o discard,
as we want to make fitrim to work even the fs is not mounted with -o discard.
Also we should use REQ_DISCARD to map the free extent to get a full mapping,
last we only return errors if
1. the error is not a EOPNOTSUPP
2. no device supports discard
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_map_block() will only return a single stripe length, but we want the
full extent be mapped to each disk when we are trimming the extent,
so we add length to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Make the function public as we should update the reserved extents calculations
after taking out an extent for trimming.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_link returns EPERM if a cross-subvolume link is attempted.
However, in this case I believe EXDEV to be the more appropriate value.
>From the link(2) man page:
EXDEV oldpath and newpath are not on the same mounted file system. (Linux
permits a file system to be mounted at multiple points, but link()
does not work across different mount points, even if the same file
system is mounted on both.)
This matters because an application may have different behaviors based on
return codes.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.
According to Chris's comment, there should be just one true compression
method(probably LZO) stored in the super. However, before this, we would
wait for that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.
After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control file and directory's datacow and compression attribute.
NOTE:
- The compression type is selected by such rules:
If we mount btrfs with compress options, ie, zlib/lzo, the type is it.
Otherwise, we'll use the default compress type (zlib today).
v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
will be screwed by inheritance from parent directory.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In the filesystem context, we must allocate memory by GFP_NOFS,
or we may start another filesystem operation and make kswap thread hang up.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch is checking return value of read_tree_block(),
and if it is NULL, error processing.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On Fri, Mar 18, 2011 at 11:56:53AM -0400, Chris Mason wrote:
> Thanks for fielding this one. Does put_unaligned_le32 optimize away on
> platforms with efficient access? It would be great if we didn't need
> the #ifdef.
(quicktest: assembly output is same for put_unaligned_le32 and direct
assignment on my x86_64)
I was originally following examples in
Documentation/unaligned-memory-access.txt. From other code it seems to me that
the define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is intended for larger
portions of code. Macros/wrappers for {put,get}_unaligned* are chosen via
arch/<arch>/include/asm/unaligned.h accordingly, therefore it's safe to use
put_unaligned_le32 without the ifdef.
dave
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The pointer to the extent buffer for the root of each tree
is protected by a spinlock so that we can safely read the pointer
and take a reference on the extent buffer.
But now that the extent buffers are freed via RCU, we can safely
use rcu_read_lock instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I noticed that dio_end_io calls the appropriate endio function with an error,
but the endio functions don't actually do anything with that error, they assume
that if there was an error then the bio will not be uptodate. So if we had
checksum failures we would never pass back EIO. So if there is an error in our
endio functions make sure to clear the uptodate flag on the bio. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When doing direct writes we store the checksums in the ordered sum stuff in the
ordered extent for writing them when the write completes, so we don't even use
the dip->csums array. So if we're writing, don't bother allocating dip->csums
since we won't use it anyway. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch makes the free space cluster refilling code a little easier to
understand, and fixes some things with the bitmap part of it. Currently we
either want to refill a cluster with
1) All normal extent entries (those without bitmaps)
2) A bitmap entry with enough space
The current code has this ugly jump around logic that will first try and fill up
the cluster with extent entries and then if it can't do that it will try and
find a bitmap to use. So instead split this out into two functions, one that
tries to find only normal entries, and one that tries to find bitmaps.
This also fixes a suboptimal thing we would do with bitmaps. If we used a
bitmap we would just tell the cluster that we were pointing at a bitmap and it
would do the tree search in the block group for that entry every time we tried
to make an allocation. Instead of doing that now we just add it to the clusters
group.
I tested this with my ENOSPC tests and xfstests and it survived.
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
And give it a kernel-doc comment.
[akpm@linux-foundation.org: btrfs changed in linux-next]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of always creating a huge (268K) deflate_workspace with the
maximum compression parameters (windowBits=15, memLevel=8), allow the
caller to obtain a smaller workspace by specifying smaller parameter
values.
For example, when capturing oops and panic reports to a medium with
limited capacity, such as NVRAM, compression may be the only way to
capture the whole report. In this case, a small workspace (24K works
fine) is a win, whether you allocate the workspace when you need it (i.e.,
during an oops or panic) or at boot time.
I've verified that this patch works with all accepted values of windowBits
(positive and negative), memLevel, and compression level.
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have been creating bitmaps for small extents unconditionally forever. This
was great when testing to make sure the bitmap stuff was working, but is
overkill normally. So instead of always adding small chunks of free space to
bitmaps, only start doing it if we go past half of our extent threshold. This
will keeps us from creating a bitmap for just one small free extent at the front
of the block group, and will make the allocator a little faster as a result.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We do all this fun stuff with min_bytes, but either don't use it in the case of
just normal extents, or use it completely wrong in the case of bitmaps. So fix
this for both cases
1) In the extent case, stop looking for space with window_free >= min_bytes
instead of bytes + empty_size.
2) In the bitmap case, we were looking for streches of free space that was at
least min_bytes in size, which was not right at all. So instead search for
stretches of free space that are at least bytes in size (this will make a
difference when we have > page size blocks) and then only search for min_bytes
amount of free space.
Thanks,
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
The free space cluster stuff is heavy duty, so there is no sense in going
through the entire song and dance if there isn't enough space in the block group
to begin with. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
We need to make sure the dir items we get are valid dir items. So any time we
try and read one check it with verify_dir_item, which will do various sanity
checks to make sure it looks sane. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Doing an audit of where we use btrfs_search_slot only showed one place where we
don't check the return value of btrfs_search_slot properly. Just fix
mark_extent_written to see if btrfs_search_slot failed and act accordingly.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently if we have corrupted items things will blow up in spectacular ways.
So as we read in blocks and they are leaves, check the entire leaf to make sure
all of the items are correct and point to valid parts in the leaf for the item
data the are responsible for. If the item is corrupt we will kick back EIO and
not read any of the copies since they are likely to not be correct either. This
will catch generic corruptions, it will be up to the individual callers of
btrfs_search_slot to make sure their items are right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently if we have corrupt metadata map_extent_buffer will complain about it,
but not return an error so the caller has no idea a problem was hit. Fix this.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Everytime I have to deal with btrfs_cont_expand I stare at it for 20 minutes
trying to remember what exactly it does and why the hell we need it. So add a
comment to save future-Josef some time. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Mark_inode_dirty will call btrfs_dirty_inode which will take care of updating
the inode. This makes setsize a little cleaner since we don't have to start a
transaction and update the inode in there, we can just call mark_inode_dirty.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We don't need an orphan item when expanding files, we just need them for
truncating them, so only add the orphan item in btrfs_truncate instead of in
btrfs_setsize. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This fixes a problem where if truncate fails the inode will still be on the in
memory orphan list. This is will make us complain when the inode gets destroyed
because it's still on the orphan list. So if we fail just remove us from the in
memory list and carry on.
Signed-off-by: Josef Bacik <josef@redhat.com>
If we cannot truncate an inode for some reason we will never delete the orphan
item associated with that inode, which means that we will loop forever in
btrfs_orphan_cleanup. Instead of doing this just return error so we fail to
mount. It sucks, but hey it's better than hanging. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Now that we can handle having errors in the truncate path lets make sure we
return errors instead of doing BUG_ON() and such. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
->truncate() is going away, instead all of the work needs to be done in
->setattr(). So this converts us over to do this. It's fairly straightforward,
just get rid of our .truncate inode operation and call btrfs_truncate() directly
from btrfs_setsize. This works out better for us since truncate can technically
return ENOSPC, and before we had no way of letting anybody know. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since we alloc/free free space entries a whole lot, lets use a slab to keep
track of them. This makes some of my tests slightly faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We track delayed allocation per inodes via 2 counters, one is
outstanding_extents and reserved_extents. Outstanding_extents is already an
atomic_t, but reserved_extents is not and is protected by a spinlock. So
convert this to an atomic_t and instead of using a spinlock, use atomic_cmpxchg
when releasing delalloc bytes. This makes our inode 72 bytes smaller, and
reduces locking overhead (albiet it was minimal to begin with). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Really we don't need to memset the pages array at all, since we know how many
pages we're going to use in the array and pass that around. So don't memset,
just trust we're not idiots and we pass num_pages around properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Our aio_write function is huge and kind of hard to follow at times. So this
patch fixes this by breaking out the buffered and direct write paths out into
seperate functions so it's a little clearer what's going on. I've also fixed
some wrong typing that we had and added the ability to handle getting an error
back from btrfs_set_extent_delalloc. Tested this with xfstests and everything
came out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
AppArmor: kill unused macros in lsm.c
AppArmor: cleanup generated files correctly
KEYS: Add an iovec version of KEYCTL_INSTANTIATE
KEYS: Add a new keyctl op to reject a key with a specified error code
KEYS: Add a key type op to permit the key description to be vetted
KEYS: Add an RCU payload dereference macro
AppArmor: Cleanup make file to remove cruft and make it easier to read
SELinux: implement the new sb_remount LSM hook
LSM: Pass -o remount options to the LSM
SELinux: Compute SID for the newly created socket
SELinux: Socket retains creator role and MLS attribute
SELinux: Auto-generate security_is_socket_class
TOMOYO: Fix memory leak upon file open.
Revert "selinux: simplify ioctl checking"
selinux: drop unused packet flow permissions
selinux: Fix packet forwarding checks on postrouting
selinux: Fix wrong checks for selinux_policycap_netpeer
selinux: Fix check for xfrm selinux context algorithm
ima: remove unnecessary call to ima_must_measure
IMA: remove IMA imbalance checking
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits)
tidy the trailing symlinks traversal up
Turn resolution of trailing symlinks iterative everywhere
simplify link_path_walk() tail
Make trailing symlink resolution in path_lookupat() iterative
update nd->inode in __do_follow_link() instead of after do_follow_link()
pull handling of one pathname component into a helper
fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH
Allow passing O_PATH descriptors via SCM_RIGHTS datagrams
readlinkat(), fchownat() and fstatat() with empty relative pathnames
Allow O_PATH for symlinks
New kind of open files - "location only".
ext4: Copy fs UUID to superblock
ext3: Copy fs UUID to superblock.
vfs: Export file system uuid via /proc/<pid>/mountinfo
unistd.h: Add new syscalls numbers to asm-generic
x86: Add new syscalls for x86_64
x86: Add new syscalls for x86_32
fs: Remove i_nlink check from file system link callback
fs: Don't allow to create hardlink for deleted file
vfs: Add open by file handle support
...
Now that VFS check for inode->i_nlink == 0 and returns proper
error, remove similar check from file system
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: break out of shrink_delalloc earlier
btrfs: fix not enough reserved space
btrfs: fix dip leak
Btrfs: make sure not to return overlapping extents to fiemap
Btrfs: deal with short returns from copy_from_user
Btrfs: fix regressions in copy_from_user handling
Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.
But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations. The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made. This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.
The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.
This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down. This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.
If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.
Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full. The
other writers are able to continue until we get 100%.
This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_link() will insert 3 items(inode ref, dir name item and dir index item)
into the b+ tree and update 2 items(its inode, and parent's inode) in the b+
tree. So we should reserve space for these 5 items, not 3 items.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs DIO code leaks dip structs when dip->csums allocation
fails; bio->bi_end_io isn't set at the point where the free_ordered
branch is consequently taken, thus bio_endio doesn't call the function
which would free it in the normal case. Fix.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The btrfs fiemap code was incorrectly returning duplicate or overlapping
extents in some cases. cp was blindly trusting this result and we would
end up with a destination file that was bigger than the original because
some bytes were copied twice.
The fix here adjusts our offsets to make sure we're always moving
forward in the fiemap results.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When copy_from_user is only able to copy some of the bytes we requested,
we may end up creating a partially up to date page. To avoid garbage in
the page, we need to treat a partial copy as a zero length copy.
This makes the rest of the file_write code drop the page and
retry the whole copy instead of marking the partially up to
date page as dirty.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
cc: stable@kernel.org
Commit 914ee295af fixed deadlocks in
btrfs_file_write where we would catch page faults on pages we had
locked.
But, there were a few problems:
1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
data when the amount to copy is more than 4K and the offset to start
copying from is not page aligned. The result was btrfs_file_write
looping forever retrying the iov_iter_copy_from_user_atomic
We deal with this by changing btrfs_file_write to drop down to single
page copies when iov_iter_copy_from_user_atomic starts returning failure.
2) The btrfs_file_write code was leaking delalloc reservations when
iov_iter_copy_from_user_atomic returned zero. The looping above would
result in the entire filesystem running out of delalloc reservations and
constantly trying to flush things to disk.
3) btrfs_file_write will lock down page cache pages, make sure
any writeback is finished, do the copy_from_user and then release them.
Before the loop runs we check the first and last pages in the write to
see if they are only being partially modified. If the start or end of
the write isn't aligned, we make sure the corresponding pages are
up to date so that we don't introduce garbage into the file.
With the copy_from_user changes, we're allowing the VM to reclaim the
pages after a partial update from copy_from_user, but we're not
making sure the page cache page is up to date when we loop around to
resume the write.
We deal with this by pushing the up to date checks down into the page
prep code. This fits better with how the rest of file_write works.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
cc: stable@kernel.org
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix fiemap bugs with delalloc
Btrfs: set FMODE_EXCL in btrfs_device->mode
Btrfs: make btrfs_rm_device() fail gracefully
Btrfs: Avoid accessing unmapped kernel address
Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
Btrfs: allow balance to explicitly allocate chunks as it relocates
Btrfs: put ENOSPC debugging under a mount option
The Btrfs fiemap code wasn't properly returning delalloc extents,
so applications that trust fiemap to decide if there are holes in the
file see holes instead of delalloc.
This reworks the btrfs fiemap code, adding a get_extent helper that
searches for delalloc ranges and also adding a helper for extent_fiemap
that skips past holes in the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a bug introduced in d4d77629, where the device added online
(and therefore initialized via btrfs_init_new_device()) would be left
with the positive bdev->bd_holders after unmount. Since d4d77629 we no
longer OR FMODE_EXCL explicitly on blkdev_put(), set it in
btrfs_device->mode.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If shrinking done as part of the online device removal fails add that
device back to the allocation list and increment the rw_devices counter.
This fixes two bugs:
1) we could have a perfectly good device out of alloc list for no good
reason;
2) in the btrfs consisting of two devices, failure in btrfs_rm_device()
could lead to a situation where it was impossible to remove any of the
devices because of the "unable to remove the only writeable device"
error.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When decompressing a chunk of data, we'll copy the data out to
a working buffer if the data is stored in more than one page,
otherwise we'll use the mapped page directly to avoid memory
copy.
In the latter case, we'll end up accessing the kernel address
after we've unmapped the page in a corner case.
Reported-by: Juan Francisco Cantero Hurtado <iam@juanfra.info>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- Check user-specified flags correctly
- Check the inode owership
- Search root item in root tree but not fs tree
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs device shrinking and balancing ends up reallocating all the blocks
in order to allow COW to move them to new destinations. It is somewhat
awkward in terms of ENOSPC because most of the enospc code is built
around the idea that some operation on a reference counted tree triggers
allocations in the non-reference counted trees.
This commit changes the balancing code to deal with enospc by trying to
allocate a new chunk. If that allocation succeeds, we go ahead and
retry whatever failed due to enospc.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
ENOSPC in btrfs is getting to the point where the extra debugging isn't
required. I've put it under mount -o enospc_debug just in case someone
is having difficult problems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I add the check on the return value of alloc_extent_map() to several places.
In addition, alloc_extent_map() returns only the address or NULL.
Therefore, check by IS_ERR() is unnecessary. So, I remove IS_ERR() checking.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Memory allocated by calling kstrdup() should be freed.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit bf5fc093c5 refactored
btrfs_ioctl_space_info() and introduced several security issues.
space_args.space_slots is an unsigned 64-bit type controlled by a
possibly unprivileged caller. The comparison as a signed int type
allows providing values that are treated as negative and cause the
subsequent allocation size calculation to wrap, or be truncated to 0.
By providing a size that's truncated to 0, kmalloc() will return
ZERO_SIZE_PTR. It's also possible to provide a value smaller than the
slot count. The subsequent loop ignores the allocation size when
copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR.
The fix changes the slot count type and comparison typecast to u64,
which prevents truncation or signedness errors, and also ensures that we
don't copy more data than we've allocated in the subsequent loop. Note
that zero-size allocations are no longer possible since there is already
an explicit check for space_args.space_slots being 0 and truncation of
this value is no longer an issue.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Mark the cloned backref_node as checked in clone_backref_node()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs tracks uptodate state in an rbtree as well as in the
page bits. This is supposed to enable us to use block sizes other than
the page size, but there are a few parts still missing before that
completely works.
But, our readpage routine trusts this additional range based tracking
of uptodateness, much in the same way the buffer head up to date bits
are trusted for the other filesystems.
The problem is that sometimes we need to allocate memory in order to
split records in the rbtree, even when we are just clearing bits. This
can be difficult when our clearing function is called GFP_ATOMIC, which
can happen in the releasepage path.
So, what happens today looks like this:
releasepage called with GFP_ATOMIC
btrfs_releasepage calls clear_extent_bit
clear_extent_bit fails to allocate ram, leaving the up to date bit set
btrfs_releasepage returns success
The end result is the page being gone, but btrfs thinking the range is
up to date. Later on if someone tries to read that same page, the
btrfs readpage code will return immediately thinking the page is already
up to date.
This commit fixes things to fail the releasepage when we can't clear the
extent state bits. It covers both data pages and metadata tree blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a race where btrfs_releasepage can drop the
page->private contents just as alloc_extent_buffer is setting
up pages for metadata. Because of how the Btrfs page flags work,
this results in us skipping the crc on the page during IO.
This patch sovles the race by waiting until after the extent buffer
is inserted into the radix tree before it sets page private.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (33 commits)
Btrfs: Fix page count calculation
btrfs: Drop __exit attribute on btrfs_exit_compress
btrfs: cleanup error handling in btrfs_unlink_inode()
Btrfs: exclude super blocks when we read in block groups
Btrfs: make sure search_bitmap finds something in remove_from_bitmap
btrfs: fix return value check of btrfs_start_transaction()
btrfs: checking NULL or not in some functions
Btrfs: avoid uninit variable warnings in ordered-data.c
Btrfs: catch errors from btrfs_sync_log
Btrfs: make shrink_delalloc a little friendlier
Btrfs: handle no memory properly in prepare_pages
Btrfs: do error checking in btrfs_del_csums
Btrfs: use the global block reserve if we cannot reserve space
Btrfs: do not release more reserved bytes to the global_block_rsv than we need
Btrfs: fix check_path_shared so it returns the right value
btrfs: check return value of btrfs_start_ioctl_transaction() properly
btrfs: fix return value check of btrfs_join_transaction()
fs/btrfs/inode.c: Add missing IS_ERR test
btrfs: fix missing break in switch phrase
btrfs: fix several uncheck memory allocations
...
take offset of start position into account when calculating page count.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
As this function is called in some error paths while not
removing the module, the __exit attribute prevents the kernel
image from linking when btrfs is compiled in statically.
Signed-off-by: Alexey Charkov <alchark@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs_alloc_path() fails, btrfs_free_path() need not be called.
Therefore, it changes the branch ahead.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This has been resulting in a BUT_ON(ret) after btrfs_reserve_extent in
btrfs_cow_file_range. The reason is we don't actually calculate the bytes_super
for a block group until we go to cache it, which means that the space_info can
hand out reservations for space that it doesn't actually have, and we can run
out of data space. This is also a problem if you are using space caching since
we don't ever calculate bytes_super for the block groups. So instead everytime
we read a block group call exclude_super_stripes, which calculates the
bytes_super for the block group so it can be left out of the space_info. Then
whenever caching completes we just call free_excluded_extents so that the super
excluded extents are freed up. Also if we are unmounting and we hit any block
groups that haven't been cached we still need to call free_excluded_extents to
make sure things are cleaned up properly. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we're cleaning up the tree log we need to be able to remove free space from
the block group. The problem is if that free space spans bitmaps we would not
find the space since we're looking for too many bytes. So make sure the amount
of bytes we search for is limited to either the number of bytes we want, or the
number of bytes left in the bitmap. This was tested by a user who was hitting
the BUG() after search_bitmap. With this patch he can now mount his fs.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
SELinux would like to implement a new labeling behavior of newly created
inodes. We currently label new inodes based on the parent and the creating
process. This new behavior would also take into account the name of the
new object when deciding the new label. This is not the (supposed) full path,
just the last component of the path.
This is very useful because creating /etc/shadow is different than creating
/etc/passwd but the kernel hooks are unable to differentiate these
operations. We currently require that userspace realize it is doing some
difficult operation like that and than userspace jumps through SELinux hoops
to get things set up correctly. This patch does not implement new
behavior, that is obviously contained in a seperate SELinux patch, but it
does pass the needed name down to the correct LSM hook. If no such name
exists it is fine to pass NULL.
Signed-off-by: Eric Paris <eparis@redhat.com>
The error check of btrfs_start_transaction() is added, and the mistake
of the error check on several places is corrected.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because NULL is returned when the memory allocation fails,
it is checked whether it is NULL.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This one isn't really an uninit variable, but for pretty
obscure reasons. Let's make it clearly correct.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_sync_log returns -EAGAIN when we need full transaction commits
instead of small log commits, but sometimes we were dropping the return
value.
In practice, we check for this a few different ways, but this is still a
bug that can leave off full log commits when we really need them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Xfstests 224 will just sit there and spin for ever until eventually we give up
flushing delalloc and exit. On my box this took several hours. I could not
interrupt this process either, even though we use INTERRUPTIBLE. So do 2 things
1) Keep us from looping over and over again without reclaiming anything
2) If we get interrupted exit the loop
I tested this and the test now exits in a reasonable amount of time, and can be
interrupted with ctrl+c. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Instead of doing a BUG_ON(1) in prepare_pages if grab_cache_page() fails, just
loop through the pages we've already grabbed and unlock and release them, then
return -ENOMEM like we should. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Got a report of a box panicing because we got a NULL eb in read_extent_buffer.
His fs was borked and btrfs_search_path returned EIO, but we don't check for
errors so the box paniced. Yes I know this will just make something higher up
the stack panic, but that's a problem for future Josef. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We call use_block_rsv right before we make an allocation in order to make sure
we have enough space. Now normally people have called btrfs_start_transaction()
with the appropriate amount of space that we need, so we just use some of that
pre-reserved space and move along happily. The problem is where people use
btrfs_join_transaction(), which doesn't actually reserve any space. So we try
and reserve space here, but we cannot flush delalloc, so this forces us to
return -ENOSPC when in reality we have plenty of space. The most common symptom
is seeing a bunch of "couldn't dirty inode" messages in syslog. With
xfstests 224 we end up falling back to start_transaction and then doing all the
flush delalloc stuff which causes to hang for a very long time.
So instead steal from the global reserve, which is what this is meant for
anyway. With this patch and the other 2 I have sent xfstests 224 now passes
successfully. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we do btrfs_block_rsv_release, if global_block_rsv is not full we will
release all the extra bytes to global_block_rsv, even if it's only a little
short of the amount of space that we need to reserve. This causes us to starve
ourselves of reservable space during the transaction which will force us to
shrink delalloc bytes and commit the transaction more often than we should. So
instead just add the amount of bytes we need to add to the global reserve so
reserved == size, and then add the rest back into the space_info for general
use. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When running xfstests 224 I kept getting ENOSPC when trying to remove the files,
and this is because we were returning ret from check_path_shared while it was
uninitalized, which isn't right. Fix this to return 0 properly, and now
xfstests 224 doesn't freak out when it tries to clean itself up. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_start_ioctl_transaction() returns ERR_PTR(), not NULL.
So, it is necessary to use IS_ERR() to check the return value.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock()
is added, and the mistake of the error check in several places is
corrected.
For more stable Btrfs, I think that we should reduce BUG_ON().
But, I think that long time is necessary for this.
So, I propose this patch as a short-term solution.
With this patch:
- To more stable Btrfs, the part that should be corrected is clarified.
- The panic isn't done by the NULL pointer reference etc. (even if
BUG_ON() is increased temporarily)
- The error code is returned in the place where the error can be easily
returned.
As a long-term plan:
- BUG_ON() is reduced by using the forced-readonly framework, etc.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After the conditional that precedes the following code, inode may be an
ERR_PTR value. This can eg result from a memory allocation failure via the
call to btrfs_iget, and thus does not imply that root is different than
sub_root. Thus, an IS_ERR check is added to ensure that there is no
dereference of inode in this case.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@r@
identifier f;
@@
f(...) { ... return ERR_PTR(...); }
@@
identifier r.f, fld;
expression x;
statement S1,S2;
@@
x = f(...)
... when != IS_ERR(x)
(
if (IS_ERR(x) ||...) S1 else S2
|
*x->fld
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To make btrfs more stable, add several missing necessary memory allocation
checks, and when no memory, return proper errno.
We've checked that some of those -ENOMEM errors will be returned to
userspace, and some will be catched by BUG_ON() in the upper callers,
and none will be ignored silently.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_submit_compressed_read() is lack of memory allocation checks and
corresponding error route.
After this fix, if it comes to "no memory" case, errno will be returned
to userland step by step, and tell users this operation cannot go on.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Suppose:
- the source extent is: [0, 100]
- the src offset is 10
- the clone length is 90
- the dest offset is 0
This statement:
new_key.offset = key.offset + destoff - off
will produce such an extent for the dest file:
[ino, BTRFS_EXTENT_DATA_KEY, -10]
, which is obviously wrong.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
fixup, which is allocated when starting page write to fix up the
extent without ORDERED bit set, should be freed after this work
is done.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We must save and free the original kstrdup()'ed pointer
because strsep() modifies its first argument.
Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We missed a memory deallocation in commit 450ba0ea.
If an existing super block is found at mount and there is no
error condition then the pre-allocated tree_root and fs_info
are no not used and are not freeded.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
After returing extents from a cluster to the block group, some
extents in the block group may be mergeable.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
When adding a new extent, we'll firstly see if we can merge
this extent to the left or/and right extent. Extract this as
a helper try_merge_free_space().
As a side effect, we fix a small bug that if the new extent
has non-bitmap left entry but is unmergeble, we'll directly
link the extent without trying to drop it into bitmap.
This also prepares for the next patch.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
When allocating extent entry from a cluster, we should update
the free_space and free_extents fields of the block group.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
If there's no more free space in a bitmap, we should free it.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Remove some duplicated code.
This prepares for the next patch.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
If a block group is smaller than 1GB, the extent entry threadhold
calculation will always set the threshold to 0.
So as free space gets fragmented, btrfs will switch to use bitmap
to manage free space, but then will never switch back to extents
due to this bug.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
Btrfs: forced readonly mounts on errors
btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
btrfs: check NULL or not
btrfs: Don't pass NULL ptr to func that may deref it.
btrfs: mount failure return value fix
btrfs: Mem leak in btrfs_get_acl()
btrfs: fix wrong free space information of btrfs
btrfs: make the chunk allocator utilize the devices better
btrfs: restructure find_free_dev_extent()
btrfs: fix wrong calculation of stripe size
btrfs: try to reclaim some space when chunk allocation fails
btrfs: fix wrong data space statistics
fs/btrfs: Fix build of ctree
Btrfs: fix off by one while setting block groups readonly
Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
Btrfs: Add readonly snapshots support
Btrfs: Refactor btrfs_ioctl_snap_create()
btrfs: Extract duplicate decompress code
...
This patch comes from "Forced readonly mounts on errors" ideas.
As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.
The major content:
- add a framework for generating errors that should result in filesystems
going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
disk change before FS is corrected. For this, we should stop write operation.
After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes. On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions. Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.
This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of various home grown checks that might need updates for new
flags just check for any bit outside the mask of the features supported
by the filesystem. This makes the check future proof for any newly
added flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix the build failure in some configurations:
CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2
caused by commit 57cc7215b7 ("headers: kobject.h redux")
We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire
filesystem and may run uninterruptibly for a long time. This does not
seem to be something that an unprivileged user should be able to do.
Reported-by: Aron Xu <happyaron.xu@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we run low on space we could get a bunch of warnings out of
btrfs_block_rsv_check, but this is mostly just called via the transaction code
to see if we need to end the transaction, it expects to see failures, so let's
not WARN and freak everybody out for no reason. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_read_fs_root_no_radix(), 'root' is not freed if
btrfs_search_slot() returns error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hi,
In fs/btrfs/inode.c::fixup_tree_root_location() we have this code:
...
if (!path) {
err = -ENOMEM;
goto out;
}
...
out:
btrfs_free_path(path);
return err;
btrfs_free_path() passes its argument on to other functions and some of
them end up dereferencing the pointer.
In the code above that pointer is clearly NULL, so btrfs_free_path() will
eventually cause a NULL dereference.
There are many ways to cut this cake (fix the bug). The one I chose was to
make btrfs_free_path() deal gracefully with NULL pointers. If you
disagree, feel free to come up with an alternative patch.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I happened to pass swap partition as root partition in cmdline,
then kernel panic and tell me about "Cannot open root device".
It is not correct, in fact it is a fs type mismatch instead of 'no device'.
Eventually I found btrfs mounting failed with -EIO, it should be -EINVAL.
The logic in init/do_mounts.c:
for (p = fs_names; *p; p += strlen(p)+1) {
int err = do_mount_root(name, p, flags, root_mount_data);
switch (err) {
case 0:
goto out;
case -EACCES:
flags |= MS_RDONLY;
goto retry;
case -EINVAL:
continue;
}
print "Cannot open root device"
panic
}
SO fs type after btrfs will have no chance to mount
Here fix the return value as -EINVAL
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It seems to me that we leak the memory allocated to 'value' in
btrfs_get_acl() if the call to posix_acl_from_xattr() fails.
Here's a patch that attempts to correct that problem.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With this patch, we change the handling method when we can not get enough free
extents with default size.
Implementation:
1. Look up the suitable free extent on each device and keep the search result.
If not find a suitable free extent, keep the max free extent
2. If we get enough suitable free extents with default size, chunk allocation
succeeds.
3. If we can not get enough free extents, but the number of the extent with
default size is >= min_stripes, we just change the mapping information
(reduce the number of stripes in the extent map), and chunk allocation
succeeds.
4. If the number of the extent with default size is < min_stripes, sort the
devices by its max free extent's size descending
5. Use the size of the max free extent on the (num_stripes - 1)th device as the
stripe size to allocate the device space
By this way, the chunk allocator can allocate chunks as large as possible when
the devices' space is not enough and make full use of the devices.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- make it return the start position and length of the max free space when it can
not find a suitable free space.
- make it more readability
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two tiny problem:
- One is When we check the chunk size is greater than the max chunk size or not,
we should take mirrors into account, but the original code didn't.
- The other is btrfs shouldn't use the size of the residual free space as the
length of of a dup chunk when doing chunk allocation. It is because the device
space that a dup chunk needs is twice as large as the chunk size, if we use
the size of the residual free space as the length of a dup chunk, we can not
get enough free space. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot write data into files when when there is tiny space in the filesystem.
Reproduce steps:
# mkfs.btrfs /dev/sda1
# mount /dev/sda1 /mnt
# dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
# dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
(fill the filesystem)
# umount /mnt
# mount /dev/sda1 /mnt
# rm -f /mnt/tmpfile0
# dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
(failed with nospec)
But if we do the last step again, we can write data successfully. The reason of
the problem is that btrfs didn't try to commit the current transaction and
reclaim some space when chunk allocation failed.
This patch fixes it by committing the current transaction to reclaim some
space when chunk allocation fails.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef has implemented mixed data/metadata chunks, we must add those chunks'
space just like data chunks.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type
fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2
We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
Btrfs doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate. This support can
be added later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.
This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
When we read in block groups, we'll set non-redundant groups
readonly if we find a raid1, DUP or raid10 group. But the
ro code has an off by one bug in the math around testing to
make sure out accounting doesn't go wrong.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows us to set a snapshot or a subvolume readonly or writable
on the fly.
Usage:
Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and then
call ioctl(BTRFS_IOCTL_SUBVOL_SETFLAGS);
Changelog for v3:
- Change to pass __u64 as ioctl parameter.
Changelog for v2:
- Add _GETFLAGS ioctl.
- Check if the passed fd is the root of a subvolume.
- Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Usage:
Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
ioctl(BTRFS_I0CTL_SNAP_CREATE_V2).
Implementation:
- Set readonly bit of btrfs_root_item->flags.
- Add readonly checks in btrfs_permission (inode_permission),
btrfs_setattr, btrfs_set/remove_xattr and some ioctls.
Changelog for v3:
- Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
- Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Update defrag ioctl, so one can choose lzo or zlib when turning
on compression in defrag operation.
Changelog:
v1 -> v2
- Add incompability flag.
- Fix to check invalid compress type.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Lzo is a much faster compression algorithm than gzib, so would allow
more users to enable transparent compression, and some users can
choose from compression ratio and speed for different applications
Usage:
# mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt
or
# mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt
"-o compress" without argument is still allowed for compatability.
Compatibility:
If we mount a filesystem with lzo compression, it will not be able be
mounted in old kernels. One reason is, otherwise btrfs will directly
dump compressed data, which sits in inline extent, to user.
Performance:
The test copied a linux source tarball (~400M) from an ext4 partition
to the btrfs partition, and then extracted it.
(time in second)
lzo zlib nocompress
copy: 10.6 21.7 14.9
extract: 70.1 94.4 66.6
(data size in MB)
lzo zlib nocompress
copy: 185.87 108.69 394.49
extract: 193.80 132.36 381.21
Changelog:
v1 -> v2:
- Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig.
- Add incompability flag.
- Fix error handling in compress code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Make the code aware of compression type, instead of always assuming
zlib compression.
Also make the zlib workspace function as common code for all
compression types.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Return failure if alloc_page() fails to allocate memory,
and the upper code will just give up compression.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: prevent RAID level downgrades when space is low
Btrfs: account for missing devices in RAID allocation profiles
Btrfs: EIO when we fail to read tree roots
Btrfs: fix compiler warnings
Btrfs: Make async snapshot ioctl more generic
Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
Btrfs: Fix a crash when mounting a subvolume
Btrfs: fix sync subvol/snapshot creation
Btrfs: Fix page leak in compressed writeback path
Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Btrfs: fixup return code for btrfs_del_orphan_item
Btrfs: do not do fast caching if we are allocating blocks for tree_root
Btrfs: deal with space cache errors better
Btrfs: fix use after free in O_DIRECT
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.
This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.
But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk. This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.
The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.
The fix here is to make sure that we provide duplication when the
caller has asked for it. It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.
This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.
The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup. But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.
The fix here is to count the missing devices as we build up the list
of devices in the system. This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain. This allowed mounts to
continue when they should failed, and allowed operations
on partially setup root structs. The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
... regarding an unused function when !MIGRATION, and regarding a
printk() format string vs argument mismatch.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.
Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This problem is found in meego testing:
http://bugs.meego.com/show_bug.cgi?id=6672
A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to write to the same page
of the same file. In btrfs_file_aio_write(), the pages is locked by prepare_pages(). So when
btrfs_copy_from_user() is called, page fault happens and the same page needs to be locked again
in filemap_fault(). The fix is to move iov_iter_fault_in_readable() before prepage_pages() to make page
fault happen before pages are locked. And also disable page fault in critical region in
btrfs_copy_from_user().
Reviewed-by: Yan, Zheng<zheng.z.yan@intel.com>
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We should drop dentry before deactivating the superblock, otherwise
we can hit this bug:
BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
...
Steps to reproduce the bug:
# mount /dev/loop1 /mnt
# mkdir save
# btrfs subvolume snapshot /mnt save/snap1
# umount /mnt
# mount -o subvol=save/snap1 /dev/loop1 /mnt
(crash)
Reported-by: Michael Niederle <mniederle@gmx.at>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.
There's ample room for further cleanup here, but this keeps the fix simple.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
"start + num_bytes >= actual_end" can happen when compressed page writeback races
with file truncation. In that case we need unlock and release pages past the end
of file.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Not being able to delete an orphan item isn't a horrible thing. The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.
Signed-off-by: Josef Bacik <josef@redhat.com>
If the orphan item doesn't exist, we return 1, which doesn't make any sense to
the callers. Instead return -ENOENT if we didn't find the item. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root. So just check to
see if the root we are on is the tree root, and just don't do the fast caching.
Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing. Fix this by marking the space cache as having an error so we only
get the message once. This patch also makes it so that we don't try and setup
space cache for a block group that isn't cached, since we won't be able to write
it out anyway. None of these problems are actual problems, they are just
annoying and sub-optimal. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This fixes a bug where we use dip after we have freed it. Instead just use the
file_offset that was passed to the function. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: don't use migrate page without CONFIG_MIGRATION
Btrfs: deal with DIO bios that span more than one ordered extent
Btrfs: setup blank root and fs_info for mount time
Btrfs: fix fiemap
Btrfs - fix race between btrfs_get_sb() and umount
Btrfs: update inode ctime when using links
Btrfs: make sure new inode size is ok in fallocate
Btrfs: fix typo in fallocate to make it honor actual size
Btrfs: avoid NULL pointer deref in try_release_extent_buffer
Btrfs: make btrfs_add_nondir take parent inode as an argument
Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
Btrfs: use dget_parent where we can UPDATED
Btrfs: fix more ESTALE problems with NFS
Btrfs: handle NFS lookups properly
btrfs: make 1-bit signed fileds unsigned
btrfs: Show device attr correctly for symlinks
btrfs: Set file size correctly in file clone
btrfs: Check if dest_offset is block-size aligned before cloning file
Btrfs: handle the space_cache option properly
btrfs: Fix early enospc because 'unused' calculated with wrong sign.
...
The new DIO bio splitting code has problems when the bio
spans more than one ordered extent. This will happen as the
generic DIO code merges our get_blocks calls together into
a bigger single bio.
This fixes things by walking forward in the ordered extent
code finding all the overlapping ordered extents and completing them
all at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a problem with how we use sget, it searches through the list of supers
attached to the fs_type looking for a super with the same fs_devices as what
we're trying to mount. This depends on sb->s_fs_info being filled, but we don't
fill that in until we get to btrfs_fill_super, so we could hit supers on the
fs_type super list that have a null s_fs_info. In order to fix that we need to
go ahead and setup a blank root with a blank fs_info to hold fs_devices, that
way our test will work out right and then we can set s_fs_info in
btrfs_set_super, and then open_ctree will simply use our pre-allocated root and
fs_info when setting everything up. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two big problems currently with FIEMAP
1) We return extents for holes. This isn't supposed to happen, we just don't
return extents for holes and then userspace interprets the lack of an extent as
a hole.
2) We sometimes don't set FIEMAP_EXTENT_LAST properly. This is because we wait
to see a EXTENT_FLAG_VACANCY flag on the em, but this won't happen if say we ask
fiemap to map up to the last extent in a file, and there is nothing but holes up
to the i_size. To fix this we need to lookup the last extent in this file and
save the logical offset, so if we happen to try and map that extent we can be
sure to set FIEMAP_EXTENT_LAST.
With this patch we now pass xfstest 225, which we never have before.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When mounting a btrfs file system btrfs_test_super() may attempt to
use sb->s_fs_info, the btrfs root, of a super block that is going away
and that has had the btrfs root set to NULL in its ->put_super(). But
if the super block is going away it cannot be an existing super block
so we can return false in this case.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently we fail xfstest 236 because we're not updating the inode ctime on
link. This is a simple fix, and makes it so we pass 236 now.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have been failing xfstest 228 forever, because we don't check to make sure
the new inode size is acceptable as far as RLIMIT is concerned. Just check to
make sure it's ok to create a inode with this new size and error out if not.
With this patch we now pass 228.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a typo in __btrfs_prealloc_file_range() where we set the i_size to
actual_len/cur_offset, and then just set it to cur_offset again, and do the same
with btrfs_ordered_update_i_size(). This fixes it back to keeping i_size in a
local variable and then updating i_size properly. Tested this with
xfs_io -F -f -c "falloc 0 1" -c "pwrite 0 1" foo
stat'ing foo gives us a size of 1 instead of 4096 like it was. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everybody who calls btrfs_add_nondir just passes in the dentry of the new file
and then dereference dentry->d_parent->d_inode, but everybody who calls
btrfs_add_nondir() are already passed the parent's inode. So instead of
dereferencing dentry->d_parent, just make btrfs_add_nondir take the dir inode as
an argument and pass that along so we don't have to worry about d_parent.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since we walk up the path logging all of the parts of the inode's path, we need
to hold i_mutex to make sure that the inode is not renamed while we're logging
everything. btrfs_log_dentry_safe does dget_parent and all of that jazz, but we
may get unexpected results if the rename changes the inode's location while
we're higher up the path logging those dentries, so do this for safety reasons.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock. This could cause problems with rename. So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When creating new inodes we don't setup inode->i_generation. So if we generate
an fh with a newly created inode we save the generation of 0, but if we flush
the inode to disk and have to read it back when getting the inode on the server
we'll have the right i_generation, so gens wont match and we get ESTALE. This
patch properly sets inode->i_generation when we create the new inode and now I'm
no longer getting ESTALE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
People kept reporting NFS issues, specifically getting ESTALE alot. I figured
out how to reproduce the problem
SERVER
mkfs.btrfs /dev/sda1
mount /dev/sda1 /mnt/btrfs-test
<add /mnt/btrfs-test to /etc/exports>
btrfs subvol create /mnt/btrfs-test/foo
service nfs start
CLIENT
mount server:/mnt/btrfs /mnt/test
cd /mnt/test/foo
ls
SERVER
echo 3 > /proc/sys/vm/drop_caches
CLIENT
ls <-- get an ESTALE here
This is because the standard way to lookup a name in nfsd is to use readdir, and
what it does is do a readdir on the parent directory looking for the inode of
the child. So in this case the parent being / and the child being foo. Well
subvols all have the same inode number, so doing a readdir of / looking for
inode 256 will return '.', which obviously doesn't match foo. So instead we
need to have our own .get_name so that we can find the right name.
Our .get_name will either lookup the inode backref or the root backref,
whichever we're looking for, and return the name we find. Running the above
reproducer with this patch results in everything acting the way its supposed to.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:
(After cloning file1 to file2 with unaligned dest_offset)
# cat /mnt/file2
cat: /mnt/file2: Input/output error
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When I added the clear_cache option I screwed up and took the break out of
the space_cache case statement, so whenever you mount with space_cache you also
get clear_cache, which does you no good if you say set space_cache in fstab so
it always gets set. This patch adds the break back in properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
bio_endio() will free dip and dip->csums, so dip and dip->csums twice will
be freed twice. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Migrate page will directly call the btrfs btree writepage function,
which isn't actually allowed.
Our writepage assumes that you have locked the extent_buffer and
flagged the block as written. Without doing these steps, we can
corrupt metadata blocks.
A later commit will remove the btree writepage function since
it is really only safely used internally by btrfs. We
use writepages for everything else.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.
All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.
* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().
* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.
* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.
The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
In the failure path of __btrfs_open_devices(), close_bdev_exclusive()
is called with @flags which doesn't match the one used during
open_bdev_exclusive(). Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
Btrfs: deal with errors from updating the tree log
Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Btrfs: make SNAP_DESTROY async
Btrfs: add SNAP_CREATE_ASYNC ioctl
Btrfs: add START_SYNC, WAIT_SYNC ioctls
Btrfs: async transaction commit
Btrfs: fix deadlock in btrfs_commit_transaction
Btrfs: fix lockdep warning on clone ioctl
Btrfs: fix clone ioctl where range is adjacent to extent
Btrfs: fix delalloc checks in clone ioctl
Btrfs: drop unused variable in block_alloc_rsv
Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
Btrfs: Use ERR_CAST helpers
Btrfs: use memdup_user helpers
Btrfs: fix raid code for removing missing drives
Btrfs: Switch the extent buffer rbtree into a radix tree
Btrfs: restructure try_release_extent_buffer()
Btrfs: use the flusher threads for delalloc throttling
Btrfs: tune the chunk allocation to 5% of the FS as metadata
...
Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
useless and removed in commit 5e8067adfd: "rcu head remove init")
During unlink we remove any references to the inode from
the tree log. It can return -ENOENT and other errors,
and this changes the unlink code to deal with it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2). We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.
If users _do_ want the deletion to be durable, they can call 'sync'.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Create a snap without waiting for it to commit to disk. The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.
We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk. This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.
The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers > 1 or should_grow at the top of the loop. Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true. However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.
Fix this by deciding how long to wait when we wait. Include a smp_mb()
before checking if the waitqueue is active to ensure the num_writers
is visible.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.
Still needs more review.
Found by gcc 4.6's new warnings
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not
read which are really bugs.
- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things
Still needs more review.
Found by gcc 4.6's new warnings.
[akpm@linux-foundation.org: fix build. Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs is mounted in degraded mode, it has some internal structures
to track the missing devices. This missing device is setup as readonly,
but the mapping code can get upset when we try to write to it.
This changes the mapping code to return -EIO instead of oops when we try
to write to the readonly device.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.
I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].
Before applying this patch:
Create files:
Total files: 50000
Total time: 0.971531
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.366761
Average time: 0.000027
After applying this patch:
Create files:
Total files: 50000
Total time: 0.927455
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.292280
Average time: 0.000026
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.
This switches us to kick the bdi flusher threads instead. All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
An earlier commit tried to keep us from allocating too many
empty metadata chunks. It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.
This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.
This changes things to properly kick out the EIO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.
Signed-off-by: Josef Bacik <josef@redhat.com>
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody. When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups. Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them. Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only. Tested this with xfstests and it works
nicely. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl. This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw). So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch actually loads the free space cache if it exists. The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it. Previously we did not do this since the caching
kthread would pick it up. With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group. This has been tested with all sorts of things.
Signed-off-by: Josef Bacik <josef@redhat.com>
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups. There are a bunch of changes
in inode.c in order to account for special cases. Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions. Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock. This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group. So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have. We
truncate and pre-allocate space everytime to make sure it's uptodate.
This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way. When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.
Signed-off-by: Josef Bacik <josef@redhat.com>
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space. It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems. But if it does get back ENOSPC it
handles it properly. The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight. So instead just remove
the warning. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation. So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction. This fixes a panic I could reproduce
at will. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
xen-blkfront: disable barrier/flush write support
Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
block: remove BLKDEV_IFL_WAIT
aic7xxx_old: removed unused 'req' variable
block: remove the BH_Eopnotsupp flag
block: remove the BLKDEV_IFL_BARRIER flag
block: remove the WRITE_BARRIER flag
swap: do not send discards as barriers
fat: do not send discards as barriers
ext4: do not send discards as barriers
jbd2: replace barriers with explicit flush / FUA usage
jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
jbd: replace barriers with explicit flush / FUA usage
nilfs2: replace barriers with explicit flush / FUA usage
reiserfs: replace barriers with explicit flush / FUA usage
gfs2: replace barriers with explicit flush / FUA usage
btrfs: replace barriers with explicit flush / FUA usage
xfs: replace barriers with explicit flush / FUA usage
block: pass gfp_mask and flags to sb_issue_discard
dm: convey that all flushes are processed as empty
...
If we failed to find the root subvol id, or the subvol=<name>, we would
deactivate the locked super and close the devices. The problem is at this point
we have gotten the SB all setup, which includes setting super_operations, so
when we'd deactiveate the super, we'd do a close_ctree() which closes the
devices, so we'd end up closing the devices twice. So if you do something like
this
mount /dev/sda1 /mnt/test1
mount /dev/sda1 /mnt/test2 -o subvol=xxx
umount /mnt/test1
it would blow up (if subvol xxx doesn't exist). This patch fixes that problem.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC. So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing. The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves. This also moves all of the retrying and
flushing code into reserve_metdata_bytes so it's all uniform. This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Because the ENOSPC code over reserves super aggressively we end up allocating
chunks way more often than we should. For example with my fs_mark tests on a
2gb fs I can end up reserved 1gb just for metadata, when only 34mb of that is
being used. So instead check to see if the amount of space actually used is
less than 30% of the total space, and if so don't allocate a chunk, but only if
we have at least 256mb of free space to make sure we don't put too much pressure
on free space.
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we try and flush delalloc, but we only do that in a sort of weak way,
which works fine in most cases but if we're under heavy pressure we need to be
able to wait for flushing to happen. Also instead of checking the bytes
reserved in the block_rsv, check the space info since it is more accurate. The
sync option will be used in a future patch.
Signed-off-by: Josef Bacik <josef@redhat.com>
The global reservation stuff tries to add together DATA and METADATA used in
order to figure out how much to reserve for everything, but this doesn't work
right for mixed block groups. Instead if we have mixed block groups just set
data used to 0. Also with mixed block groups we will use bytes_may_use for
keeping track of delalloc bytes, so we need to take that into account in our
reservation calculations.
Signed-off-by: Josef Bacik <josef@redhat.com>
The new ENOSPC stuff breaks out the raid types which breaks the way we were
reporting df to the system. This fixes it back so that Available is the total
space available to data and used is the actual bytes used by the filesystem.
This means that Available is Total - data used - all of the metadata space.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The new ENOSPC stuff broke the df ioctl since we no longer create seperate space
info's for each RAID type. So instead, loop through each space info's raid
lists so we can get the right RAID information which will allow the df ioctl to
tell us RAID types again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In very severe ENOSPC cases we can run out of inodes to do delalloc on, which
means we'll just keep looping trying to shrink delalloc. Instead, if we fail to
shrink delalloc 3 times in a row break out since we're not likely to make any
progress. Tested this with a 100mb fs an xfstests test 13. Before the patch it
would hang the box, with the patch we get -ENOSPC like we should. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller. To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout. For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...
Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
BTRFS does not define a '->write_super()' method, so it should
not mark its superblock as dirty. This looks like some left-over.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
NB: do we want btrfs_wait_ordered_range() on eviction of
inodes with positive i_nlink on subvolume with zero root_refs?
If not, btrfs_evict_inode() can be simplified by unconditionally
bailing out in case of i_nlink > 0 in the very beginning...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is
equivalent to I_FREEING for almost all code looking at either;
it's there to keep track of having called clear_inode() exactly
once per inode lifetime, at some point after having set I_FREEING.
I_CLEAR and I_FREEING never get set at the same time with the
current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
instead of I_CLEAR without loss of information. As the result of
such change, checks become simpler and the amount of code that needs
to know about I_CLEAR shrinks a lot.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace inode_setattr with opencoded variants of it in all callers. This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.
In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:
spufs: explicitly checks for ATTR_SIZE earlier
btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above
In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
whether the donor file is append-only before writing to it.
2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
overflow that allows a user to specify an out-of-bounds range to copy
from the source file (if off + len wraps around). I haven't been able
to successfully exploit this, but I'd imagine that a clever attacker
could use this to read things he shouldn't. Even if it's not
exploitable, it couldn't hurt to be safe.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The CLONE and CLONE_RANGE ioctls round up the range of extents being
cloned to the block size when the range to clone extends to the end of file
(this is always the case with CLONE). It was then using that offset when
extending the destination file's i_size. Fix this by not setting i_size
beyond the originally requested ending offset.
This bug was introduced by a22285a6 (2.6.35-rc1).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
split_leaf was not properly balancing leaves when it was forced to
split a leaf twice. This commit adds an extra push left and right
before forcing the double split in hopes of getting the slot where
we want to insert at either the start or end of the leaf.
If the extra pushes do work, then we are able to avoid splitting twice
and we keep the tree properly balanced.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was just an odd wrapper around writeback_inodes_wb. Removing this
also allows to get rid of the bdi member of struct writeback_control
which was rather out of place there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: The file argument for fsync() is never null
Btrfs: handle ERR_PTR from posix_acl_from_xattr()
Btrfs: avoid BUG when dropping root and reference in same transaction
Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
Btrfs: should add a permission check for setfacl
Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
Btrfs: unwind after btrfs_start_transaction() errors
Btrfs: btrfs_iget() returns ERR_PTR
Btrfs: handle kzalloc() failure in open_ctree()
Btrfs: handle error returns from btrfs_lookup_dir_item()
Btrfs: Fix BUG_ON for fs converted from extN
Btrfs: Fix null dereference in relocation.c
Btrfs: fix remap_file_pages error
Btrfs: uninitialized data is check_path_shared()
Btrfs: fix fallocate regression
Btrfs: fix loop device on top of btrfs
The "file" argument for fsync is never null so we can remove this check.
What drew my attention here is that 7ea8085910: "drop unused dentry
argument to ->fsync" introduced an unconditional dereference at the
start of the function and that generated a smatch warning.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
posix_acl_from_xattr() returns both ERR_PTRs and null, but it's OK to
pass null values to set_cached_acl()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction. If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).
This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.
Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when used Posix File System Test Suite(pjd-fstest) to test btrfs,
some cases about setfacl failed when noacl mount option used.
I simplified used commands in pjd-fstest, and the following steps
can reproduce it.
------------------------
# cd btrfs-part/
# mkdir aaa
# setfacl -m m::rw aaa <- successed, but not expected by pjd-fstest.
------------------------
I checked ext3, a warning message occured, like as:
setfacl: aaa/: Operation not supported
Certainly, it's expected by pjd-fstest.
So, i compared acl.c of btrfs and ext3. Based on that, a patch created.
Fortunately, it works.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs, do the following
------------------
# su user1
# cd btrfs-part/
# touch aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rw-
group::rw-
other::r--
# su user2
# cd btrfs-part/
# setfacl -m u::rwx aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rwx <- successed to setfacl
group::rw-
other::r--
------------------
but we should prohibit it that user2 changing user1's acl.
In fact, on ext3 and other fs, a message occurs:
setfacl: aaa: Operation not permitted
This patch fixed it.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_lookup_dir_item() can return either ERR_PTRs or null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a
check for that. It's not clear to me if it can also return NULL
pointers or not so I left the original NULL pointer check as is.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was added by a22285a6a3: "Btrfs: Integrate metadata reservation
with start_transaction". If we goto out here then we skip all the
unwinding and there are locks still held etc.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_iget() returns an ERR_PTR() on failure and not null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Unwind and return -ENOMEM if the allocation fails here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_lookup_dir_item() fails, we should can just let the mount fail
with an error.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tree blocks can live in data block groups in FS converted from extN.
So it's easy to trigger the BUG_ON.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix a potential null dereference in relocation.c
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Acked-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when we use remap_file_pages() to remap a file, remap_file_pages always return
error. It is because btrfs didn't set VM_CAN_NONLINEAR for vma.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
refs can be used with uninitialized data if btrfs_lookup_extent_info()
fails on the first pass through the loop. In the original code if that
happens then check_path_shared() probably returns 1, this patch
changes it to return 1 for safety.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Seems that when btrfs_fallocate was converted to use the new ENOSPC stuff we
dropped passing the mode to the function that actually does the preallocation.
This breaks anybody who wants to use FALLOC_FL_KEEP_SIZE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot use the loop device which has been connected to a file in the btrf
The reproduce steps is following:
# dd if=/dev/zero of=vdev0 bs=1M count=1024
# losetup /dev/loop0 vdev0
# mkfs.btrfs /dev/loop0
...
failed to zero device start -5
The reason is that the btrfs don't implement either ->write_begin or ->write
the VFS API, so we fix it by setting ->write to do_sync_write().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
Btrfs: add more error checking to btrfs_dirty_inode
Btrfs: allow unaligned DIO
Btrfs: drop verbose enospc printk
Btrfs: Fix block generation verification race
Btrfs: fix preallocation and nodatacow checks in O_DIRECT
Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
Btrfs: rework O_DIRECT enospc handling
Btrfs: use async helpers for DIO write checksumming
Btrfs: don't walk around with task->state != TASK_RUNNING
Btrfs: do aio_write instead of write
Btrfs: add basic DIO read/write support
direct-io: do not merge logically non-contiguous requests
direct-io: add a hook for the fs to provide its own submit_bio function
fs: allow short direct-io reads to be completed via buffered IO
Btrfs: Metadata ENOSPC handling for balance
Btrfs: Pre-allocate space for data relocation
Btrfs: Metadata ENOSPC handling for tree log
Btrfs: Metadata reservation for orphan inodes
Btrfs: Introduce global metadata reservation
...
The ENOSPC code will now return ENOSPC to btrfs_start_transaction.
btrfs_dirty_inode needs to check for this and error out appropriately.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order to support DIO that isn't aligned to the filesystem blocksize,
we fall back to buffered for any unaligned DIOs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After the path is released, the generation number got from block
pointer is no long valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when
generation numbers mismatch.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The O_DIRECT code wasn't checking for multiple references
on preallocated or nodatacow extents. This means it
wasn't honoring snapshots properly.
The fix here is to add an explicit check for multiple references
This also fixes the math for selecting the correct disk block,
making sure not to go past the end of the extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_dirty_inode tries to sneak in without much waiting or
space reservation, mostly for performance reasons. This
usually works well but can cause problems when there are
many many writers.
When btrfs_update_inode fails with ENOSPC, we fallback
to a slower btrfs_start_transaction call that will reserve
some space.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This moves the delalloc space reservation done for O_DIRECT
into btrfs_direct_IO. This way we don't leak reserved space
if the generic O_DIRECT write code errors out before it
calls into btrfs_direct_IO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes O_DIRECT write code to mark extents as delalloc
while it is processing them. Yan Zheng has reworked the
enospc accounting based on tracking delalloc extents and
this makes it much easier to track enospc in the O_DIRECT code.
There are a few space cases with the O_DIRECT code though,
it only sets the EXTENT_DELALLOC bits, instead of doing
EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
we don't want to mess with clearing the dirty and uptodate
bits when things go wrong. This is important because there
are no pages in the page cache, so any extent state structs
that we put in the tree won't get freed by releasepage. We have
to clear them ourselves as the DIO ends.
With this commit, we reserve space at in btrfs_file_aio_write,
and then as each btrfs_direct_IO call progresses it sets
EXTENT_DELALLOC on the range.
btrfs_get_blocks_direct is responsible for clearing the delalloc
at the same time it drops the extent lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds:
alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.
Ideally all these modules would be compiled-in, but distros seems too
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.
The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.
Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.
This facility is to hide the mess distros are creating with too modualized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The async helper threads offload crc work onto all the
CPUs, and make streaming writes much faster. This
changes the O_DIRECT write code to use them. The only
small complication was that we need to pass in the
logical offset in the file for each bio, because we can't
find it in the bio's pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed two places we were doing a lot of work
without task->state set to TASK_RUNNING. This sets the state
properly after we get ready to sleep but decide not to.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order for AIO to work, we need to implement aio_write. This patch converts
our btrfs_file_write to btrfs_aio_write. I've tested this with xfstests and
nothing broke, and the AIO stuff magically started working. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This provides basic DIO support for reading and writing. It does not do the
work to recover from mismatching checksums, that will come later. A few design
changes have been made from Jim's code (sorry Jim!)
1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO
code in order to account for all of BTRFS's oddities, but thanks to that work it
seems like the best bet is to just ignore compression and such and just opt to
fallback on buffered IO.
2) Fallback on buffered IO for compressed or inline extents. Jim's code did
it's own buffering to make dio with compressed extents work. Now we just
fallback onto normal buffered IO.
3) Use ordered extents for the writes so that all of the
lock_extent()
lookup_ordered()
type checks continue to work.
4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
DIO writes.
I've tested this with fsx and everything works great. This patch depends on my
dio and filemap.c patches to work. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:
1. Avoid COW tree leave in the phrase of merging tree.
2. Handle interaction with snapshot creation.
3. make the backref cache can live across transactions.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pre-allocate space for data relocation. This can detect ENOPSC
condition caused by fragmentation of free space.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Previous patches make the allocater return -ENOSPC if there is no
unreserved free metadata space. This patch updates tree log code
and various other places to propagate/handle the ENOSPC error.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reserve metadata space for extent tree, checksum tree and root tree
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Introduce metadata reservation context for delayed allocation
and update various related functions.
This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.
Changes since V1:
Add code that check if unlink and rmdir will free space.
Add ENOSPC handling for clone ioctl.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.
Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:
Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().
Make space accounting more accurate, mainly for handling read only
block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
All code in init_btrfs_i can be moved into btrfs_alloc_inode()
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Shrink delayed allocation space in a synchronized manner is more
controllable than flushing all delay allocated space in an async
thread.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
get rid of home-grown mutex in cris eeprom.c
switch ecryptfs_write() to struct inode *, kill on-stack fake files
switch ecryptfs_get_locked_page() to struct inode *
simplify access to ecryptfs inodes in ->readpage() and friends
AFS: Don't put struct file on the stack
Ban ecryptfs over ecryptfs
logfs: replace inode uid,gid,mode initialization with helper function
ufs: replace inode uid,gid,mode initialization with helper function
udf: replace inode uid,gid,mode init with helper
ubifs: replace inode uid,gid,mode initialization with helper function
sysv: replace inode uid,gid,mode initialization with helper function
reiserfs: replace inode uid,gid,mode initialization with helper function
ramfs: replace inode uid,gid,mode initialization with helper function
omfs: replace inode uid,gid,mode initialization with helper function
bfs: replace inode uid,gid,mode initialization with helper function
ocfs2: replace inode uid,gid,mode initialization with helper function
nilfs2: replace inode uid,gid,mode initialization with helper function
minix: replace inode uid,gid,mode init with helper
ext4: replace inode uid,gid,mode init with helper
...
Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure the chunk allocator doesn't create zero length chunks
Btrfs: fix data enospc check overflow
A recent commit allowed for smaller chunks to be created, but didn't
make sure they were always bigger than a stripe. After some divides,
this led to zero length stripes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: add check for changed leaves in setup_leaf_for_split
Btrfs: create snapshot references in same commit as snapshot
Btrfs: fix small race with delalloc flushing waitqueue's
Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
Btrfs: fix chunk allocate size calculation
Btrfs: kill max_extent mount option
Btrfs: fail to mount if we have problems reading the block groups
Btrfs: check btrfs_get_extent return for IS_ERR()
Btrfs: handle kmalloc() failure in inode lookup ioctl
Btrfs: dereferencing freed memory
Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
Btrfs: add NULL check for do_walk_down()
Btrfs: remove duplicate include in ioctl.c
Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
cleanups.
Because we account for reserved space we get from the allocator before we
actually account for allocating delalloc space, we can have a small window where
the amount of "used" space in a space_info is more than the total amount of
space in the space_info. This will cause a overflow in our check, so it will
seem like we have _tons_ of free space, and we'll allow reservations to occur
that will end up larger than the amount of space we have. I've seen users
report ENOSPC panic's in cow_file_range a few times recently, so I tried to
reproduce this problem and found I could reproduce it if I ran one of my tests
in a loop for like 20 minutes. With this patch my test ran all night without
issues. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
setup_leaf_for_split needs to drop the path and search again, and has
checks to see if the item we want to split changed size. But, it misses
the case where the leaf changed and now has enough room for the item
we want to insert.
This adds an extra check to make sure the leaf really needs splitting
before we call btrfs_split_leaf(), which keeps us from trying to split
a leaf with a single item.
btrfs_split_leaf() will blindly split the single item leaf, leaving us
with one good leaf and one empty leaf and then a crash.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This creates the reference to a new snapshot in the same commit as the
snapshot itself. This avoids the need for a second commit in order for a
snapshot to be persistent, and also avoids the problem of "leaking" a
new snapshot tree root if the host crashes before the second commit takes
place.
It is not at all clear to me why it wasn't always done this way. If there
is still a reason for the two-stage {create,finish}_pending_snapshots()
approach I'm missing something! :)
I've been running this for a couple weeks under pretty heavy usage (a few
snapshots per minute) without obvious problems.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everytime we start a new flushing thread, we init the waitqueue if there isn't a
flushing thread running. The problem with this is we check
space_info->flushing, which we clear right before doing a wake_up on the
flushing waitqueue, which causes problems if we init the waitqueue in the middle
of clearing the flushing flagh and calling wake_up. This is hard to hit, but
the code is wrong anyway, so init the flushing/allocating waitqueue when
creating the space info and let it be. I haven't seen the panic since I've been
using this patch. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pagecache pages should be allocated with __page_cache_alloc, so they
obey pagecache memory policies.
add_to_page_cache_lru is exported, so it should be used. Benefits over
using a private pagevec: neater code, 128 bytes fewer stack used, percpu
lru ordering is preserved, and finally don't need to flush pagevec
before returning so batching may be shared with other LRU insertions.
Signed-off-by: Nick Piggin <npiggin@suse.de>:
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If the amount of free space left in a device is less than what we think should
be the minimum size, just ignore the minimum size and use the amount we have. I
ran into this running tests on a 600mb volume, the chunk allocator wouldn't let
me allocate the last 52mb of the disk for data because we want to have at least
64mb chunks for data. This patch fixes that problem. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
As Yan pointed out, theres not much reason for all this complicated math to
account for file extents being split up into max_extent chunks, since they are
likely to all end up in the same leaf anyway. Since there isn't much reason to
use max_extent, just remove the option altogether so we have one less thing we
need to test.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We don't actually check the return value of btrfs_read_block_groups, so we can
possibly succeed to mount, but then fail to say read the superblock xattr for
selinux which will cause the vfs code to deactivate the super.
This is a problem because in find_free_extent we just assume that we
will find the right space_info for the allocation we want. But if we
failed to read the block groups, we won't have setup any space_info's,
and we'll hit a NULL pointer deref in find_free_extent.
This patch fixes that problem by checking the return value of
btrfs_read_block_groups, and failing out properly. I've also added a
check in find_free_extent so if for some reason we don't find an
appropriate space_info, we just return -ENOSPC.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_get_extent() never returns NULL, only a valid pointer or ERR_PTR()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The original code dereferenced range on the next line.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can use this simple method to make source more readable.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We need to check return value of btrfs_search_slot() in
btrfs_read_chunk_tree() and do corresponding error handing.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We only need to call finish_wait() after wait loop.
By the way, this patch makes code of waiting loop similar to
example in wait.h(no functional change)
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_find_create_tree_block() may return NULL, so we must check the returned
value, or we will access a NULL pointer.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs/btrfs/ioctl.c: ctree.h is included more than once.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (30 commits)
Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
Btrfs: allow treeid==0 in the inode lookup ioctl
Btrfs: return keys for large items to the search ioctl
Btrfs: fix key checks and advance in the search ioctl
Btrfs: buffer results in the space_info ioctl
Btrfs: use __u64 types in ioctl.h
Btrfs: fix search_ioctl key advance
Btrfs: fix gfp flags masking in the compression code
Btrfs: don't look at bio flags after submit_bio
btrfs: using btrfs_stack_device_id() get devid
btrfs: use memparse
Btrfs: add a "df" ioctl for btrfs
Btrfs: cache the extent state everywhere we possibly can V2
Btrfs: cache ordered extent when completing io
Btrfs: cache extent state in find_delalloc_range
Btrfs: change the ordered tree to use a spinlock instead of a mutex
Btrfs: finish read pages in the order they are submitted
btrfs: fix btrfs_mkdir goto for no free objectids
Btrfs: flush data on snapshot creation
Btrfs: make df be a little bit more understandable
...
This is used by the inode lookup ioctl to follow all the backrefs up
to the subvol root. But the search being done would sometimes land one
past the last item in the leaf instead of finding the backref.
This changes the search to look for the highest possible backref and hop
back one item. It also fixes a leaked path on failure to find the root.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a root id of 0 is sent to the inode lookup ioctl, it will
use the root of the file we're ioctling and pass the root id
back to userland along with the results.
This allows userland to do searches based on that root later on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl was skipping large items entirely (ones that are too
big for the results buffer). This changes things to at least copy
the item header so that we can send information about the item back to
userland.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl was working well for finding tree roots, but using it for
generic searches requires a few changes to how the keys are advanced.
This treats the search control min fields for objectid, type and offset
more like a key, where we drop the offset to zero once we bump the type,
etc.
The downside of this is that we are changing the min_type and min_offset
fields during the search, and so the ioctl caller needs extra checks to make sure
the keys in the result are the ones it wanted.
This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make
things more readable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The space_info ioctl was using copy_to_user inside rcu_read_lock. This
commit changes things to copy into a buffer first and then dump the
result down to userland.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
key->type is u8, not u64.
fs/btrfs/ioctl.c: In function 'copy_to_sk':
fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After callling submit_bio, the bio can be freed at any time. The
btrfs submission thread helper was checking the bio flags too late,
which might not give the correct answer.
When CONFIG_DEBUG_PAGE_ALLOC is turned on, it can lead to oopsen.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can use btrfs_stack_device_id() to get dev_item->devid
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use memparse() instead of its own private implementation.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
df is a very loaded question in btrfs. This gives us a way to get the per-space
usage information so we can tell exactly what is in use where. This will help
us figure out ENOSPC problems, and help users better understand where their disk
space is going.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch just goes through and fixes everybody that does
lock_extent()
blah
unlock_extent()
to use
lock_extent_bits()
blah
unlock_extent_cached()
and pass around a extent_state so we only have to do the searches once per
function. This gives me about a 3 mb/s boots on my random write test. I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written. I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this
lock_extent_bits()
clear delalloc bits
unlock_extent_cached()
without losing our cached state. I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When finishing io we run btrfs_dec_test_ordered_pending, and then immediately
run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that
already, so we're searching twice when we don't have to. This patch lets us
pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do
complete io on that ordered extent we can just use the one we found then instead
of having to do another btrfs_lookup_ordered_extent. This made my fio job with
the other patch go from 24 mb/s to 29 mb/s.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch makes us cache the extent state we find in find_delalloc_range since
we'll have to lock the extent later on in the function. This will keep us from
re-searching for the rang when we try to lock the extent.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The ordered tree used to need a mutex, but currently all we use it for is to
protect the rb_tree, and a spin_lock is just fine for that. Using a spin_lock
instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have
less latency, 3445.138 ms instead of 3820.633 ms.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The endio is done at reverse order of bio vectors.
That means for a sequential read, the page first submitted will finish
last in a bio. Considering we will do checksum (making cache hot) for
every page, this does introduce delay (and chance to squeeze cache used
soon) for pages submitted at the begining.
I don't observe obvious performance difference with below patch at my
simple test, but seems more natural to finish read in the order they are
submitted.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_mkdir() must jump to the place of ending transaction after
btrfs_find_free_objectid() failed. Or this transaction can't end.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Flush any delalloc extents when we create a snapshot, so that recently
written file data is always included in the snapshot.
A later commit will add the ability to snapshot without the flush, but
most people expect flushing.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The way we report df usage is way confusing for everybody, including some other
utilities (bacula for one). So this patch makes df a little bit more
understandable. First we make used actually count the total amount of used
space in all space info's. This will give us a real view of how much disk space
is in use. Second, for blocks available, only count data space. This makes
things like bacula work because it says 0 when you can no longer write anymore
data to the disk. I think this is a nice compromise, since you will end up with
something like the following
[root@alpha ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
148G 30G 111G 21% /
/dev/sda1 194M 116M 68M 64% /boot
tmpfs 985M 12K 985M 1% /dev/shm
/dev/mapper/VolGroup-LogVol02
145G 140G 0 100% /mnt/btrfs-test
Compare this with btrfsctl -i output
[root@alpha btrfs-progs-unstable]# ./btrfsctl -i /mnt/btrfs-test/
Metadata, DUP: total=4.62GB, used=2.46GB
System, DUP: total=8.00MB, used=24.00KB
Data: total=134.80GB, used=134.80GB
Metadata: total=8.00MB, used=0.00
System: total=4.00MB, used=0.00
operation complete
This way we show that there is no more data space to be used, but we have
another 5GB of space left for metadata. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we scan devices in a multi-device filesystem, we memorize the original
name. If the device gets a new name, later scans don't update the
in-kernel structures related to it, and we're not able to mount the
filesystem.
This patch updates device name during scaning.
Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs defrag ioctl was limited to doing the entire file. This
commit adds a new interface that can defrag a specific range inside
the file.
It can also force compression on the file, allowing you to selectively
compress individual files after they were created, even when mount -o
compress isn't turned on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs defrag ioctl had some bugs around delalloc accounting, and it
wasn't properly skipping pages that were not in the mapping.
It wasn't properly clearing the page checked flag, which could make the
writeback code ignore the page forever while pinning it as dirty.
This commit fixes those problems and makes defrag a little smarter. It
skips holes and it doesn't waste time defragging large extents. If a
tiny extent comes before a very large extent, it will defrag both of
them to make sure the tiny extent ends up next to something big.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The submit_bio helper thread can decide to loop back around to
service more bios. This commit forces it to unplug first, which helps
reduce the latency seen by submitters.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since theres not a good way to make sure the user sees the original default root
tree id, and not to mention it's 5 so is way different than any other volume,
just make subvol=0 mount the original default root. This makes it a bit easier
for users to handle in the long run. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch needs to go along with my previous patch. This lets us set the
default dir item's location to whatever root we want to use as our default
mounting subvol. With this we don't have to use mount -o subvol=<tree id>
anymore to mount a different subvol, we can just set the new one and it will
just magically work. I've done some moderate testing with this, mostly just
switching the default mount around, mounting subvols and the default mount at
the same time and such, everything seems to work. Thanks,
Older kernels would generally be able to still mount the filesystem with the
default subvolume set, but it would result in a different volume being mounted,
which could be an even more unpleasant suprise for users. So if you set your
default subvolume, you can't go back to older kernels. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This work is in preperation for being able to set a different root as the
default mounting root.
There is currently a problem with how we mount subvolumes. We cannot currently
mount a subvolume of a subvolume, you can only mount subvolumes/snapshots of the
default subvolume. So say you take a snapshot of the default subvolume and call
it snap1, and then take a snapshot of snap1 and call it snap2, so now you have
/
/snap1
/snap1/snap2
as your available volumes. Currently you can only mount / and /snap1,
you cannot mount /snap1/snap2. To fix this problem instead of passing
subvolid=<name> you must pass in subvolid=<treeid>, where <treeid> is
the tree id that gets spit out via the subvolume listing you get from
the subvolume listing patches (btrfs filesystem list). This allows us
to mount /, /snap1 and /snap1/snap2 as the root volume.
In addition to the above, we also now read the default dir item in the
tree root to get the root key that it points to. For now this just
points at what has always been the default subvolme, but later on I plan
to change it to point at whatever root you want to be the new default
root, so you can just set the default mount and not have to mount with
-o subvolid=<treeid>. I tested this out with the above scenario and it
worked perfectly. Thanks,
mount -o subvol operates inside the selected subvolid. For example:
mount -o subvol=snap1,subvolid=256 /dev/xxx /mnt
/mnt will have the snap1 directory for the subvolume with id
256.
mount -o subvol=snap /dev/xxx /mnt
/mnt will be the snap directory of whatever the default subvolume
is.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Our set/get functions for compat_ro_flags actually look at compat_flags. This
will mess any attempt to use compat flags up. The fix is obvious. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl is a generic tool for doing btree searches from
userland applications. The first user of the search ioctl is a
subvolume listing feature, but we'll also use it to find new
files in a subvolume.
The search ioctl allows you to specify min and max keys to search for,
along with min and max transid. It returns the items along with a
header that includes the item key.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This will be used by the inode lookup ioctl.
Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: kfree correct pointer during mount option parsing
Btrfs: use RB_ROOT to intialize rb_trees instead of setting rb_node to NULL
We kstrdup the options string, but then strsep screws with the pointer,
so when we kfree() it, we're not giving it the right pointer.
Tested-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs inialize rb trees in quite a number of places by settin rb_node =
NULL; The problem with this is that 17d9ddc72f in the
linux-next tree adds a new field to that struct which needs to be NULL for
the new rbtree library code to work properly. This patch uses RB_ROOT as
the intializer so all of the relevant fields will be NULL'd. Without the
patch I get a panic.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Constify struct sysfs_ops.
This is part of the ops structure constification
effort started by Arjan van de Ven et al.
Benefits of this constification:
* prevents modification of data that is shared
(referenced) by many other structure instances
at runtime
* detects/prevents accidental (but not intentional)
modification attempts on archs that enforce
read-only kernel data at runtime
* potentially better optimized code as the compiler
can assume that the const data cannot be changed
* the compiler/linker move const data into .rodata
and therefore exclude them from false sharing
Signed-off-by: Emese Revfy <re.emese@gmail.com>
Acked-by: David Teigland <teigland@redhat.com>
Acked-by: Matt Domsch <Matt_Domsch@dell.com>
Acked-by: Maciej Sosnowski <maciej.sosnowski@intel.com>
Acked-by: Hans J. Koch <hjk@linutronix.de>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This gives the filesystem more information about the writeback that
is happening. Trond requested this for the NFS unstable write handling,
and other filesystems might benefit from this too by beeing able to
distinguish between the different callers in more detail.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
My test do: fallocate a big file and do write. The file is 512M, but
after file write is done btrfs-debug-tree shows:
item 6 key (257 EXTENT_DATA 0) itemoff 3516 itemsize 53
extent data disk byte 1103101952 nr 536870912
extent data offset 0 nr 399634432 ram 536870912
extent compression 0
Looks like a regression introducted by
6c7d54ac87, where we set wrong slot.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Acked-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: apply updated fallocate i_size fix
Btrfs: do not try and lookup the file extent when finishing ordered io
Btrfs: Fix oopsen when dropping empty tree.
Btrfs: remove BUG_ON() due to mounting bad filesystem
Btrfs: make error return negative in btrfs_sync_file()
Btrfs: fix race between allocate and release extent buffer.
This version of the i_size fix for fallocate makes sure we only update
the i_size when the current fallocate is really operating outside of
i_size.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When running the following fio job
[torrent]
filename=torrent-test
rw=randwrite
size=4g
filesize=4g
bs=4k
ioengine=sync
you would see long stalls where no work was being done. That is because we were
doing all this extra work to read in the file extent outside of the transaction,
however in the random io case this ends up hurting us because the file extents
are not there to begin with. So axe this logic, since we end up reading in the
file extent when we go to update it anyway. This took the fio job from 11 mb/s
with several ~10 second stalls to 24 mb/s to a couple of 1-2 second stalls.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When dropping a empty tree, walk_down_tree() skips checking
extent information for the tree root. This will triggers a
BUG_ON in walk_up_proc().
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Mounting a bad filesystem caused a BUG_ON(). The following is steps to
reproduce it.
# mkfs.btrfs /dev/sda2
# mount /dev/sda2 /mnt
# mkfs.btrfs /dev/sda1 /dev/sda2
(the program says that /dev/sda2 was mounted, and then exits. )
# umount /mnt
# mount /dev/sda1 /mnt
At the third step, mkfs.btrfs exited in the way of make filesystem. So the
initialization of the filesystem didn't finish. So the filesystem was bad, and
it caused BUG_ON() when mounting it. But BUG_ON() should be called by the wrong
code, not user's operation, so I think it is a bug of btrfs.
This patch fixes it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Increase extent buffer's reference count while holding the lock.
Otherwise it can race with try_release_extent_buffer.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: check total number of devices when removing missing
Btrfs: check return value of open_bdev_exclusive properly
Btrfs: do not mark the chunk as readonly if in degraded mode
Btrfs: run orphan cleanup on default fs root
Btrfs: fix a memory leak in btrfs_init_acl
Btrfs: Use correct values when updating inode i_size on fallocate
Btrfs: remove tree_search() in extent_map.c
Btrfs: Add mount -o compress-force
If you have a disk failure in RAID1 and then add a new disk to the
array, and then try to remove the missing volume, it will fail. The
reason is the sanity check only looks at the total number of rw devices,
which is just 2 because we have 2 good disks and 1 bad one. Instead
check the total number of devices in the array to make sure we can
actually remove the device. Tested this with a failed disk setup and
with this test we can now run
btrfs-vol -r missing /mount/point
and it works fine.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hit this problem while testing RAID1 failure stuff. open_bdev_exclusive
returns ERR_PTR(), not NULL. So change the return value properly. This
is important if you accidently specify a device that doesn't exist when
trying to add a new device to an array, you will panic the box
dereferencing bdev.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If a RAID setup has chunks that span multiple disks, and one of those
disks has failed, btrfs_chunk_readonly will return 1 since one of the
disks in that chunk's stripes is dead and therefore not writeable. So
instead if we are in degraded mode, return 0 so we can go ahead and
allocate stuff. Without this patch all of the block groups in a RAID1
setup will end up read-only, which will mean we can't add new disks to
the array since we won't be able to make allocations.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch revert's commit
6c090a11e1
Since it introduces this problem where we can run orphan cleanup on a
volume that can have orphan entries re-added. Instead of my original
fix, Yan Zheng pointed out that we can just revert my original fix and
then run the orphan cleanup in open_ctree after we look up the fs_root.
I have tested this with all the tests that gave me problems and this
patch fixes both problems. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_init_acl() cloned acl is not released
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit f2bc9dd07e3424c4ec5f3949961fe053d47bc825
Author: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Date: Wed Jan 20 12:57:53 2010 +0530
Btrfs: Use correct values when updating inode i_size on fallocate
Even though we allocate more, we should be updating inode i_size
as per the arguments passed
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch removes tree_search() in extent_map.c because it is not called by
anything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The default btrfs mount -o compress mode will quickly back off
compressing a file if it notices that compression does not reduce the
size of the data being written. This can save considerable CPU because
all future writes to the file go through uncompressed.
But some files are both very large and have mixed data stored in
them. In that case, we want to add the ability to always try
compressing data before writing it.
This commit adds mount -o compress-force. A later commit will add
a new inode flag that does the same thing.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix possible panic on unmount
Btrfs: deal with NULL acl sent to btrfs_set_acl
Btrfs: fix regression in orphan cleanup
Btrfs: Fix race in btrfs_mark_extent_written
Btrfs, fix memory leaks in error paths
Btrfs: align offsets for btrfs_ordered_update_i_size
btrfs: fix missing last-entry in readdir(3)
We can race with the unmount of an fs and the stopping of a kthread where we
will free the block group before we're done using it. The reason for this is
because we do not hold a reference on the block group while its caching, since
the allocator drops its reference once it exits or moves on to the next block
group. This patch fixes the problem by taking a reference to the block group
before we start caching and dropping it when we're done to make sure all
accesses to the block group are safe. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is legal for btrfs_set_acl to be sent a NULL acl. This
makes sure we don't dereference it. A similar patch was sent by
Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently orphan cleanup only ever gets triggered if we cross subvolumes during
a lookup, which means that if we just mount a plain jane fs that has orphans in
it, they will never get cleaned up. This results in panic's like these
http://www.kerneloops.org/oops.php?number=1109085
where adding an orphan entry results in -EEXIST being returned and we panic. In
order to fix this, we check to see on lookup if our root has had the orphan
cleanup done, and if not go ahead and do it. This is easily reproduceable by
running this testcase
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>
int main(int argc, char **argv)
{
char data[4096];
char newdata[4096];
int fd1, fd2;
memset(data, 'a', 4096);
memset(newdata, 'b', 4096);
while (1) {
int i;
fd1 = creat("file1", 0666);
if (fd1 < 0)
break;
for (i = 0; i < 512; i++)
write(fd1, data, 4096);
fsync(fd1);
close(fd1);
fd2 = creat("file2", 0666);
if (fd2 < 0)
break;
ftruncate(fd2, 4096 * 512);
for (i = 0; i < 512; i++)
write(fd2, newdata, 4096);
close(fd2);
i = rename("file2", "file1");
unlink("file1");
}
return 0;
}
and then pulling the power on the box, and then trying to run that test again
when the box comes back up. I've tested this locally and it fixes the problem.
Thanks to Tomas Carnecky for helping me track this down initially.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix bug reported by Johannes Hirte. The reason of that bug
is btrfs_del_items is called after btrfs_duplicate_item and
btrfs_del_items triggers tree balance. The fix is check that
case and call btrfs_search_slot when needed.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Stanse found 2 memory leaks in relocate_block_group and
__btrfs_map_block. cluster and multi are not freed/assigned on all
paths. Fix that.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Some callers of btrfs_ordered_update_i_size can now pass in
a NULL for the ordered extent to update against. This makes
sure we properly align the offset they pass in when deciding
how much to bump the on disk i_size.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
parent 49313cdac7b34c9f7ecbb1780cfc648b1c082cd7 (v2.6.32-1-g49313cd)
commit ff48c08e1c05c67e8348ab6f8a24de8034e0e34d
Author: Jan Engelhardt <jengelh@medozas.de>
Date: Wed Dec 9 22:57:36 2009 +0100
Btrfs: fix missing last-entry in readdir(3)
When one does a 32-bit readdir(3), the last entry of a directory is
missing. This is however not due to passing a large value to filldir,
but it seems to have to do with glibc doing telldir or something
quirky. In any case, this patch fixes it in practice.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure fallocate properly starts a transaction
Btrfs: make metadata chunks smaller
Btrfs: Show discard option in /proc/mounts
Btrfs: deny sys_link across subvolumes.
Btrfs: fail mount on bad mount options
Btrfs: don't add extent 0 to the free space cache v2
Btrfs: Fix per root used space accounting
Btrfs: Fix btrfs_drop_extent_cache for skip pinned case
Btrfs: Add delayed iput
Btrfs: Pass transaction handle to security and ACL initialization functions
Btrfs: Make truncate(2) more ENOSPC friendly
Btrfs: Make fallocate(2) more ENOSPC friendly
Btrfs: Avoid orphan inodes cleanup during committing transaction
Btrfs: Avoid orphan inodes cleanup while replaying log
Btrfs: Fix disk_i_size update corner case
Btrfs: Rewrite btrfs_drop_extents
Btrfs: Add btrfs_duplicate_item
Btrfs: Avoid superfluous tree-log writeout
This reverts commit e4c570c4cb, as
requested by Alexey:
"I think I gave a good enough arguments to not merge it.
To iterate:
* patch makes impossible to start using ext3 on EXT3_FS=n kernels
without reboot.
* this is done only for one pointer on task_struct"
None of config options which define task_struct are tristate directly
or effectively."
Requested-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The recent patch to make fallocate enospc friendly would send
down a NULL trans handle to the allocator. This moves the
transaction start to properly fix things.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch makes us a bit less zealous about making sure we have enough free
metadata space by pearing down the size of new metadata chunks to 256mb instead
of 1gb. Also, we used to try an allocate metadata chunks when allocating data,
but that sort of thing is done elsewhere now so we can just remove it. With my
-ENOSPC test I used to have 3gb reserved for metadata out of 75gb, now I have
1.7gb. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Christoph's patch e244a0aeb6 doesn't display
the discard option in /proc/mounts, leading to some confusion for me.
Here's the missing bit.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I rebased Christian Parpart's patch to deny hard link across
subvolumes. Original patch modifies also btrfs_rename, but
I excluded it because we can move across subvolumes now and
it make no problem.
-----------------
Hard link across subvolumes should not allowed in Btrfs.
btrfs_link checks root of 'to' directory is same as root
of 'from' file. If not same, btrfs_link returns -EPERM.
Signed-off-by: TARUISI Hiroaki <taruishi.hiroak@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If block group 0 is completely free, btrfs_read_block_groups will
add extent [0, BTRFS_SUPER_INFO_OFFSET) to the free space cache.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The bytes_used field in root item was originally planned to
trace the amount of used data and tree blocks. But it never
worked right since we can't trace freeing of data accurately.
This patch changes it to only trace the amount of tree blocks.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The check for skip pinned case is wrong, it may breaks the
while loop too soon.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
iput() can trigger new transactions if we are dropping the
final reference, so calling it in btrfs_commit_transaction
may end up deadlock. This patch adds delayed iput to avoid
the issue.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pass transaction handle down to security and ACL initialization
functions, so we can avoid starting nested transactions
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
truncating and deleting regular files are unbound operations,
so it's not good to do them in a single transaction. This
patch makes btrfs_truncate and btrfs_delete_inode start a
new transaction after all items in a tree leaf are deleted.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fallocate(2) may allocate large number of file extents, so it's not
good to do it in a single transaction. This patch make fallocate(2)
start a new transaction for each file extents it allocates.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_lookup_dentry may trigger orphan cleanup, so it's not good
to call it while committing a transaction.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We do log replay in a single transaction, so it's not good to do unbound
operations. This patch cleans up orphan inodes cleanup after replaying
the log. It also avoids doing other unbound operations such as truncating
a file during replaying log. These unbound operations are postponed to
the orphan inode cleanup stage.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are some cases file extents are inserted without involving
ordered struct. In these cases, we update disk_i_size directly,
without checking pending ordered extent and DELALLOC bit. This
patch extends btrfs_ordered_update_i_size() to handle these cases.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a flags argument to struct xattr_handler and pass it to all xattr
handler methods. This allows using the same methods for multiple
handlers, e.g. for the ACL methods which perform exactly the same action
for the access and default ACLs, just using a different underlying
attribute. With a little more groundwork it'll also allow sharing the
methods for the regular user/trusted/secure handlers in extN, ocfs2 and
jffs2 like it's already done for xfs in this patch.
Also change the inode argument to the handlers to a dentry to allow
using the handlers mechnism for filesystems that require it later,
e.g. cifs.
[with GFS2 bits updated by Steven Whitehouse <swhiteho@redhat.com>]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rewrite btrfs_drop_extents by using btrfs_duplicate_item, so we can
avoid calling lock_extent within transaction.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_duplicate_item duplicates item with new key, guaranteeing
the source item and the new items are in the same tree leaf and
contiguous. It allows us to split file extent in place, without
using lock_extent to prevent bookend extent race.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We allow two log transactions at a time, but use same flag
to mark dirty tree-log btree blocks. So we may flush dirty
blocks belonging to newer log transaction when committing a
log transaction. This patch fixes the issue by using two
flags to mark dirty tree-log btree blocks.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
journal_info in task_struct is used in journaling file system only. So
introduce CONFIG_FS_JOURNAL_INFO and make it conditional.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While Linux provided an O_SYNC flag basically since day 1, it took until
Linux 2.4.0-test12pre2 to actually get it implemented for filesystems,
since that day we had generic_osync_around with only minor changes and the
great "For now, when the user asks for O_SYNC, we'll actually give
O_DSYNC" comment. This patch intends to actually give us real O_SYNC
semantics in addition to the O_DSYNC semantics. After Jan's O_SYNC
patches which are required before this patch it's actually surprisingly
simple, we just need to figure out when to set the datasync flag to
vfs_fsync_range and when not.
This patch renames the existing O_SYNC flag to O_DSYNC while keeping it's
numerical value to keep binary compatibility, and adds a new real O_SYNC
flag. To guarantee backwards compatiblity it is defined as expanding to
both the O_DSYNC and the new additional binary flag (__O_SYNC) to make
sure we are backwards-compatible when compiled against the new headers.
This also means that all places that don't care about the differences can
just check O_DSYNC and get the right behaviour for O_SYNC, too - only
places that actuall care need to check __O_SYNC in addition. Drivers and
network filesystems have been updated in a fail safe way to always do the
full sync magic if O_DSYNC is set. The few places setting O_SYNC for
lower layers are kept that way for now to stay failsafe.
We enforce that O_DSYNC is set when __O_SYNC is set early in the open path
to make sure we always get these sane options.
Note that parisc really screwed up their headers as they already define a
O_DSYNC that has always been a no-op. We try to repair it by using it for
the new O_DSYNC and redefinining O_SYNC to send both the traditional
O_SYNC numerical value _and_ the O_DSYNC one.
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Grant Grundler <grundler@parisc-linux.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger@sun.com>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Ulrich Drepper <drepper@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jan Kara <jack@suse.cz>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix panic when trying to destroy a newly allocated
Btrfs: allow more metadata chunk preallocation
Btrfs: fallback on uncompressed io if compressed io fails
Btrfs: find ideal block group for caching
Btrfs: avoid null deref in unpin_extent_cache()
Btrfs: skip btrfs_release_path in btrfs_update_root and btrfs_del_root
Btrfs: fix some metadata enospc issues
Btrfs: fix how we set max_size for free space clusters
Btrfs: cleanup transaction starting and fix journal_info usage
Btrfs: fix data allocation hint start
There is a problem where iget5_locked will look for an inode, not find it, and
then subsequently try to allocate it. Another CPU will have raced in and
allocated the inode instead, so when iget5_locked gets the inode spin lock again
and does a search, it finds the new inode. So it goes ahead and calls
destroy_inode on the inode it just allocated. The problem is we don't set
BTRFS_I(inode)->root until the new inode is completely initialized. This patch
makes us set root to NULL when alloc'ing a new inode, so when we get to
btrfs_destroy_inode and we see that root is NULL we can just free up the memory
and continue on. This fixes the panic
http://www.kerneloops.org/submitresult.php?number=812690
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On an FS where all of the space has not been allocated into chunks yet,
the enospc can return enospc just because the existing metadata chunks
are full.
We get around this by allowing more metadata chunks to be allocated up
to a certain limit, and finding the right limit is a little fuzzy. The
problem is the reservations for delalloc would preallocate way too much
of the FS as metadata. We need to start saying no and just force some
IO to happen.
But we also need to let a reasonable amount of the FS become metadata.
This bumps the hard limit up, later releases will have a better system.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently compressed IO does not deal with not having its entire extent able to
be allocated. So if we have enough free space to allocate for the extent, but
its not contiguous, it will fail spectacularly. This patch fixes this by
falling back on uncompressed IO which lets us spread the delalloc extent across
multiple extents. I tested this by making us randomly think the reservation had
failed to make it fallback on the uncompressed io way and it seemed to work
fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch changes a few things. Hopefully the comments are helpfull, but
I'll try and be as verbose here.
Problem:
My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, which we won't
find any cached block groups because this is a newly mounted fs, so we end up
caching several block groups during bootup, which with alot of fragmentation
takes around 30-45 seconds to complete, which bogs down the system. So
Solution:
1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.
2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.
There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.
With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I re-orderred the checks to avoid dereferencing "em" if it was null.
Found by smatch static checker.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We don't need to call btrfs_release_path because btrfs_free_path will do
that for us.
Signed-off-by: Li Dongyang <Jerry87905@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We weren't reserving metadata space for rename, rmdir and unlink, which could
cause problems.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes a problem where max_size can be set to 0 even though we
filled the cluster properly. We set max_size to 0 if we restart the cluster
window, but if the new start entry is big enough to be our new cluster then we
could return with a max_size set to 0, which will mean the next time we try to
allocate from this cluster it will fail. So set max_extent to the entry's
size. Tested this on my box and now we actually allocate from the cluster
after we fill it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We use journal_info to tell if we're in a nested transaction to make sure we
don't commit the transaction within a nested transaction. We use another
method to see if there are any outstanding ioctl trans handles, so if we're
starting one do not set current->journal_info, since it will screw with other
filesystems. This patch also cleans up the starting stuff so there aren't any
magic numbers.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Sometimes our start allocation hint when we cow a file can be either
EXTENT_HOLE or some other such place holder, which is not optimal. So if we
find that our em->block_start is one of these special values, check to see
where the first block of the inode is stored, and use that as a hint. If that
block is also a special value, just fallback on a hint of 0 and let the
allocator figure out a good place to put the data.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: always pin metadata in discard mode
Btrfs: enable discard support
Btrfs: add -o discard option
Btrfs: properly wait log writers during log sync
Btrfs: fix possible ENOSPC problems with truncate
Btrfs: fix btrfs acl #ifdef checks
Btrfs: streamline tree-log btree block writeout
Btrfs: avoid tree log commit when there are no changes
Btrfs: only write one super copy during fsync
We have an optimization in btrfs to allow blocks to be
immediately freed if they were allocated in this transaction and never
written. Otherwise they are pinned and freed when the transaction
commits.
This isn't optimal for discard mode because immediately freeing
them means immediately discarding them. It is better to give the
block to the pinning code and letting the (slow) discard happen later.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The discard support code in btrfs currently is guarded by ifdefs for
BIO_RW_DISCARD, which is never defines as it's the name of an enum
memeber. Just remove the useless ifdefs to actually enable the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Enable discard by default is not a good idea given the the trim speed
of SSD prototypes we've seen, and the carecteristics for many high-end
arrays. Turn of discards by default and require the -o discard option
to enable them on.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A recently fsync optimization make btrfs_sync_log skip calling
wait_for_writer in the single log writer case. This is incorrect
since the writer count can also be increased by btrfs_pin_log.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There's a problem where we don't do any space reservation for truncates, which
can cause you to OOPs because you will be allowed to go off in the weeds a bit
since we don't account for the delalloc bytes that are created as a result of
the truncate.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs acl code was #ifdefing for a define
that didn't exist. This correctly matches it
to the values used by the Kconfig file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Syncing the tree log is a 3 phase operation.
1) write and wait for all the tree log blocks for a given root.
2) write and wait for all the tree log blocks for the
tree of tree log roots.
3) write and wait for the super blocks (barriers here)
This isn't as efficient as it could be because there is
no requirement to wait for the blocks from step one to hit the disk
before we start writing the blocks from step two. This commit
changes the sequence so that we don't start waiting until
all the tree blocks from both steps one and two have been sent
to disk.
We do this by breaking up btrfs_write_wait_marked_extents into
two functions, which is trivial because it was already broken
up into two parts.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
rpm has a habit of running fdatasync when the file hasn't
changed. We already detect if a file hasn't been changed
in the current transaction but it might have been sent to
the tree-log in this transaction and not changed since
the last call to fsync.
In this case, we want to avoid a tree log sync, which includes
a number of synchronous writes and barriers. This commit
extends the existing tracking of the last transaction to change
a file to also track the last sub-transaction.
The end result is that rpm -ivh and -Uvh are roughly twice as fast,
and on par with ext3.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
During a tree-log commit for fsync, we've been writing at least
two copies of the super block and forcing them to disk.
The other filesystems write only one, and this change brings us on
par with them. A full transaction commit will write all the super
copies, so we still have redundant info written on a regular
basis.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix file clone ioctl for bookend extents
Btrfs: fix uninit compiler warning in cow_file_range_nocow
Btrfs: constify dentry_operations
Btrfs: optimize back reference update during btrfs_drop_snapshot
Btrfs: remove negative dentry when deleting subvolumne
Btrfs: optimize fsync for the single writer case
Btrfs: async delalloc flushing under space pressure
Btrfs: release delalloc reservations on extent item insertion
Btrfs: delay clearing EXTENT_DELALLOC for compressed extents
Btrfs: cleanup extent_clear_unlock_delalloc flags
Btrfs: fix possible softlockup in the allocator
Btrfs: fix deadlock on async thread startup
The file clone ioctl was incorrectly taking the offset into the
extent on disk into account when calculating the length of the
cloned extent.
The length never changes based on the offset into the physical extent.
Test case:
fallocate -l 1g image
mke2fs image
bcp image image2
e2fsck -f image2
(errors on image2)
The math bug ends up wrapping the length of the extent, and things
go wrong from there.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_type variable was exposed uninit via a goto. It should be
impossible to trigger because it is protected by a check on another
variable, but this makes sure.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch reading level 0 tree blocks that already use full backrefs.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The use of btrfs_dentry_delete is removing dentries from the
dcache when deleting subvolumne. btrfs_dentry_delete ignores
negative dentries. This is incorrect since if we don't remove
the negative dentry, its parent dentry can't be removed.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch optimizes the tree logging stuff so it doesn't always wait 1 jiffie
for new people to join the logging transaction if there is only ever 1 writer.
This helps a little bit with latency where we have something like RPM where it
will fdatasync every file it writes, and so waiting the 1 jiffie for every
fdatasync really starts to add up.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch moves the delalloc flushing that occurs when we are under space
pressure off to a async thread pool. This helps since we only free up
metadata space when we actually insert the extent item, which means it takes
quite a while for space to be free'ed up if we wait on all ordered extents.
However, if space is freed up due to inline extents being inserted, we can
wake people who are waiting up early, and they can finish their work.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes an issue with the delalloc metadata space reservation
code. The problem is we used to free the reservation as soon as we
allocated the delalloc region. The problem with this is if we are not
inserting an inline extent, we don't actually insert the extent item until
after the ordered extent is written out. This patch does 3 things,
1) It moves the reservation clearing stuff into the ordered code, so when
we remove the ordered extent we remove the reservation.
2) It adds a EXTENT_DO_ACCOUNTING flag that gets passed when we clear
delalloc bits in the cases where we want to clear the metadata reservation
when we clear the delalloc extent, in the case that we do an inline extent
or we invalidate the page.
3) It adds another waitqueue to the space info so that when we start a fs
wide delalloc flush, anybody else who also hits that area will simply wait
for the flush to finish and then try to make their allocation.
This has been tested thoroughly to make sure we did not regress on
performance.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When compression is on, the cow_file_range code is farmed off to
worker threads. This allows us to do significant CPU work in parallel
on SMP machines.
But it is a delicate balance around when we clear flags and how. In
the past we cleared the delalloc flag immediately, which was safe
because the pages stayed locked.
But this is causing problems with the newest ENOSPC code, and with the
recent extent state cleanups we can now clear the delalloc bit at the
same time the uncompressed code does.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_clear_unlock_delalloc has a growing set of ugly parameters
that is very difficult to read and maintain.
This switches to a flag field and well named flag defines.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Like the cluster allocating stuff, we can lockup the box with the normal
allocation path. This happens when we
1) Start to cache a block group that is severely fragmented, but has a decent
amount of free space.
2) Start to commit a transaction
3) Have the commit try and empty out some of the delalloc inodes with extents
that are relatively large.
The inodes will not be able to make the allocations because they will ask for
allocations larger than a contiguous area in the free space cache. So we will
wait for more progress to be made on the block group, but since we're in a
commit the caching kthread won't make any more progress and it already has
enough free space that wait_block_group_cache_progress will just return. So,
if we wait and fail to make the allocation the next time around, just loop and
go to the next block group. This keeps us from getting stuck in a softlockup.
Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs async worker threads are used for a wide variety of things,
including processing bio end_io functions. This means that when
the endio threads aren't running, the rest of the FS isn't
able to do the final processing required to clear PageWriteback.
The endio threads also try to exit as they become idle and
start more as the work piles up. The problem is that starting more
threads means kthreadd may need to allocate ram, and that allocation
may wait until the global number of writeback pages on the system is
below a certain limit.
The result of that throttling is that end IO threads wait on
kthreadd, who is waiting on IO to end, which will never happen.
This commit fixes the deadlock by handing off thread startup to a
dedicated thread. It also fixes a bug where the on-demand thread
creation was creating far too many threads because it didn't take into
account threads being started by other procs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix data space leak fix
Btrfs: remove duplicates of filemap_ helpers
Btrfs: take i_mutex before generic_write_checks
Btrfs: fix arguments to btrfs_wait_on_page_writeback_range
Btrfs: fix deadlock with free space handling and user transactions
Btrfs: fix error cases for ioctl transactions
Btrfs: Use CONFIG_BTRFS_POSIX_ACL to enable ACL code
Btrfs: introduce missing kfree
Btrfs: Fix setting umask when POSIX ACLs are not enabled
Btrfs: proper -ENOSPC handling
There is a problem where page_mkwrite can be called on a dirtied page that
already has a delalloc range associated with it. The fix is to clear any
delalloc bits for the range we are dirtying so the space accounting gets
handled properly. This is the same thing we do in the normal write case, so we
are consistent across the board. With this patch we no longer leak reserved
space.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use filemap_fdatawrite_range and filemap_fdatawait_range instead of
local copies of the functions. For filemap_fdatawait_range that
also means replacing the awkward old wait_on_page_writeback_range
calling convention with the regular filemap byte offsets.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_file_write was incorrectly calling generic_write_checks without
taking i_mutex. This lead to problems with racing around i_size when
doing O_APPEND writes.
The fix here is to move i_mutex higher.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
wait_on_page_writeback_range/btrfs_wait_on_page_writeback_range takes
a pagecache offset, not a byte offset into the file. Shift the arguments
around to wait for the correct range
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If an ioctl-initiated transaction is open, we can't force a commit during
the free space checks in order to free up pinned extents or else we
deadlock. Just ENOSPC instead.
A more satisfying solution that reserves space for the entire user
transaction up front is forthcoming...
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix leak of vfsmount write reference and open_ioctl_trans reference on
ENOMEM. Clean up the error paths while we're at it.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've already defined CONFIG_BTRFS_POSIX_ACL in Kconfig, but we're
currently not using it and are testing CONFIG_FS_POSIX_ACL instead.
CONFIG_FS_POSIX_ACL states "Never use this symbol for ifdefs".
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We currently set sb->s_flags |= MS_POSIXACL unconditionally, which is
incorrect -- it tells the VFS that it shouldn't set umask because we
will, yet we don't set it ourselves if we aren't using POSIX ACLs, so
the umask ends up ignored.
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
At the start of a transaction we do a btrfs_reserve_metadata_space() and
specify how many items we plan on modifying. Then once we've done our
modifications and such, just call btrfs_unreserve_metadata_space() for
the same number of items we reserved.
For keeping track of metadata needed for data I've had to add an extent_io op
for when we merge extents. This lets us track space properly when we are doing
sequential writes, so we don't end up reserving way more metadata space than
what we need.
The only place where the metadata space accounting is not done is in the
relocation code. This is because Yan is going to be reworking that code in the
near future, so running btrfs-vol -b could still possibly result in a ENOSPC
related panic. This patch also turns off the metadata_ratio stuff in order to
allow users to more efficiently use their disk space.
This patch makes it so we track how much metadata we need for an inode's
delayed allocation extents by tracking how many extents are currently
waiting for allocation. It introduces two new callbacks for the
extent_io tree's, merge_extent_hook and split_extent_hook. These help
us keep track of when we merge delalloc extents together and split them
up. Reservations are handled prior to any actually dirty'ing occurs,
and then we unreserve after we dirty.
btrfs_unreserve_metadata_for_delalloc() will make the appropriate
unreservations as needed based on the number of reservations we
currently have and the number of extents we currently have. Doing the
reservation outside of doing any of the actual dirty'ing lets us do
things like filemap_flush() the inode to try and force delalloc to
happen, or as a last resort actually start allocation on all delalloc
inodes in the fs. This has survived dbench, fs_mark and an fsx torture
test.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (42 commits)
Btrfs: hash the btree inode during fill_super
Btrfs: relocate file extents in clusters
Btrfs: don't rename file into dummy directory
Btrfs: check size of inode backref before adding hardlink
Btrfs: fix releasepage to avoid unlocking extents we haven't locked
Btrfs: Fix test_range_bit for whole file extents
Btrfs: fix errors handling cached state in set/clear_extent_bit
Btrfs: fix early enospc during balancing
Btrfs: deal with NULL space info
Btrfs: account for space used by the super mirrors
Btrfs: fix extent entry threshold calculation
Btrfs: remove dead code
Btrfs: fix bitmap size tracking
Btrfs: don't keep retrying a block group if we fail to allocate a cluster
Btrfs: make balance code choose more wisely when relocating
Btrfs: fix arithmetic error in clone ioctl
Btrfs: add snapshot/subvolume destroy ioctl
Btrfs: change how subvolumes are organized
Btrfs: do not reuse objectid of deleted snapshot/subvol
Btrfs: speed up snapshot dropping
...
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
The extent relocation code copy file extents one by one when
relocating data block group. This is inefficient if file
extents are small. This patch makes the relocation code copy
file extents in clusters. So we can can make better use of
read-ahead.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A recent change enforces only one access point to each subvolume. The first
directory entry (the one added when the subvolume/snapshot was created) is
treated as valid access point, all other subvolume links are linked to dummy
empty directories. The dummy directories are temporary inodes that only in
memory, so we can not rename file into them.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
For every hardlink in btrfs, there is a corresponding inode back
reference. All inode back references for hardlinks in a given
directory are stored in single b-tree item. The size of b-tree item
is limited by the size of b-tree leaf, so we can only create limited
number of hardlinks to a given file in a directory.
The original code lacks of the check, it oops if the number of
hardlinks goes over the limit. This patch fixes the issue by adding
check to btrfs_link and btrfs_rename.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
During releasepage, we try to drop any extent_state structs for the
bye offsets of the page we're releaseing. But the code was incorrectly
telling clear_extent_bit to delete the state struct unconditionallly.
Normally this would be fine because we have the page locked, but other
parts of btrfs will lock down an entire extent, the most common place
being IO completion.
releasepage was deleting the extent state without first locking the extent,
which may result in removing a state struct that another process had
locked down. The fix here is to leave the NODATASUM and EXTENT_LOCKED
bits alone in releasepage.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If test_range_bit finds an extent that goes all the way to (u64)-1, it
can incorrectly wrap the u64 instead of treaing it like the end of
the address space.
This just adds a check for the highest possible offset so we don't wrap.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Both set and clear_extent_bit allow passing a cached
state struct to reduce rbtree search times. clear_extent_bit
was improperly bypassing some of the checks around making sure
the extent state fields were correct for a given operation.
The fix used here (from Yan Zheng) is to use the hit_next
goto target instead of jumping all the way down to start clearing
bits without making sure the cached state was exactly correct
for the operation we were doing.
This also fixes up the setting of the start variable for both
ops in the case where we find an overlapping extent that
begins before the range we want to change. In both cases
we were incorrectly going backwards from the original
requested change.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We now do extra checks before a balance to make sure
there is room for the balance to take place. One of
the checks was testing to see if we were trying to
balance away the last block group of a given type.
If there is no space available for new chunks, we
should not try and balance away the last block group
of a give type. But, the code wasn't checking for
available chunk space, and so it was exiting too soon.
The fix here is to combine some of the checks and make
sure we try to allocate new chunks when we're balancing
the last block group.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After a balance it is briefly possible for the space info
field in the inode to be NULL. This adds some checks
to make sure things properly deal with the NULL value.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
trivial: fix typo in aic7xxx comment
trivial: fix comment typo in drivers/ata/pata_hpt37x.c
trivial: typo in kernel-parameters.txt
trivial: fix typo in tracing documentation
trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
trivial: remove unnecessary semicolons
trivial: Fix duplicated word "options" in comment
trivial: kbuild: remove extraneous blank line after declaration of usage()
trivial: improve help text for mm debug config options
trivial: doc: hpfall: accept disk device to unload as argument
trivial: doc: hpfall: reduce risk that hpfall can do harm
trivial: SubmittingPatches: Fix reference to renumbered step
trivial: fix typos "man[ae]g?ment" -> "management"
trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
trivial: fix missing printk space in amd_k7_smp_check
trivial: fix typo s/ketymap/keymap/ in comment
trivial: fix typo "to to" in multiple files
trivial: fix typos in comments s/DGBU/DBGU/
...
As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
space accounting for the space info's. Currently we exclude the free space for
the super mirrors, but the space they take up isn't accounted for in any of the
counters. This patch introduces bytes_super, which keeps track of the amount
of bytes used for a super mirror in the block group cache and space info. This
makes sure that our free space caclucations will be completely accurate.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a slight problem with the extent entry threshold calculation for the
free space cache. We only adjust the threshold down as we add bitmaps, but
never actually adjust the threshold up as we add bitmaps. This means we could
fragment the free space so badly that we end up using all bitmaps to describe
the free space, use all the free space which would result in the bitmaps being
freed, but then go to add free space again as we delete things and immediately
add bitmaps since the extent threshold would still be 0. Now as we free
bitmaps the extent threshold will be ratcheted up to allow more extent entries
to be added.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch removes a bunch of dead code from the snapshot removal stuff. It
was confusing me when doing the metadata ENOSPC stuff so I killed it.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we first go to add free space, we allocate a new info and set the offset
and bytes to the space we are adding. This is fine, except we actually set the
size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd
end up counting the same space twice. This isn't a huge deal, it just makes
the allocator behave weirdly since it will think that a bitmap entry has more
space than it ends up actually having. I used a BUG_ON() to catch when this
problem happened, and with this patch I no longer get the BUG_ON().
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The box can get locked up in the allocator if we happen upon a block group
under these conditions:
1) During a commit, so caching threads cannot make progress
2) Our block group currently is in the middle of being cached
3) Our block group currently has plenty of free space in it
4) Our block group is so fragmented that it ends up having no free space chunks
larger than min_bytes calculated by btrfs_find_space_cluster.
What happens is we try and do btrfs_find_space_cluster, which fails because it
is unable to find enough free space chunks that are large than min_bytes and
are close enough together. Since the block group is not cached we do a
wait_block_group_cache_progress, which waits for the number of bytes we need,
except the block group already has _plenty_ of free space, its just severely
fragmented, so we loop and try again, ad infinitum. This patch keeps us from
waiting on the block group to finish caching if we failed to find a free space
cluster before. It also makes sure that we don't even try to find a free space
cluster if we are on our last loop in the allocator, since we will have tried
everything at this point at it is futile.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix an arithmetic error that was breaking extents cloned via the clone
ioctl starting in the second half of a file.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being
used and doesn't contains links to other subvolumes can be destroyed.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs allows subvolumes and snapshots anywhere in the directory tree.
If we snapshot a subvolume that contains a link to other subvolume
called subvolA, subvolA can be accessed through both the original
subvolume and the snapshot. This is similar to creating hard link to
directory, and has the very similar problems.
The aim of this patch is enforcing there is only one access point to
each subvolume. Only the first directory entry (the one added when
the subvolume/snapshot was created) is treated as valid access point.
The first directory entry is distinguished by checking root forward
reference. If the corresponding root forward reference is missing,
we know the entry is not the first one.
This patch also adds snapshot/subvolume rename support, the code
allows rename subvolume link across subvolumes.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The new back reference format does not allow reusing objectid of
deleted snapshot/subvol. So we use ++highest_objectid to allocate
objectid for new snapshot/subvol.
Now we use ++highest_objectid to allocate objectid for both new inode
and new snapshot/subvolume, so this patch removes 'find hole' code in
btrfs_find_free_objectid.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch contains two changes to avoid unnecessary tree block reads during
snapshot dropping.
First, check tree block's reference count and flags before reading the tree
block. if reference count > 1 and there is no need to update backrefs, we can
avoid reading the tree block.
Second, save when snapshot was created in root_key.offset. we can compare block
pointer's generation with snapshot's creation generation during updating
backrefs. If a given block was created before snapshot was created, the
snapshot can't be the tree block's owner. So we can avoid reading the block.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The allocator has some nice knobs for sending hints about where
to try and allocate new blocks, but when we're doing file allocations
we're not sending any hint at all.
This commit adds a simple extent map search to see if we can
quickly and easily find a hint for the allocator.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs fills a delayed allocation, it tries to increase
the wbc nr_to_write to cover a big part of allocation. The
theory is that we're doing contiguous IO and writing a few
more blocks will save seeks overall at a very low cost.
The problem is that extent_write_cache_pages could ignore
the new higher nr_to_write if nr_to_write had already gone
down to zero. We fix that by rechecking the nr_to_write
for every page that is processed in the pagevec.
This updates the math around bumping the nr_to_write value
to make sure we don't leave a tiny amount of IO hanging
around for the very end of a new extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch gets rid of two limitations of async block group caching.
The old code delays handling pinned extents when block group is in
caching. To allocate logged file extents, the old code need wait
until block group is fully cached. To get rid of the limitations,
This patch introduces a data structure to track the progress of
caching. Base on the caching progress, we know which extents should
be added to the free space cache when handling the pinned extents.
The logged file extents are also handled in a similar way.
This patch also changes how pinned extents are tracked. The old
code uses one tree to track pinned extents, and copy the pinned
extents tree at transaction commit time. This patch makes it use
two trees to track pinned extents. One tree for extents that are
pinned in the running transaction, one tree for extents that can
be unpinned. At transaction commit time, we swap the two trees.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().
Note that ->s_bdi assignment is required for proper writeback!
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It was possible for an async worker thread to be selected to
receive a new work item, but exit before the work item was
actually placed into that thread's work list.
This commit fixes the race by incrementing the num_pending
counter earlier, and making sure to check the number of pending
work items before a thread exits.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The exit-on-idle code for async worker threads was incorrectly
calling spin_lock_irq with interrupts already off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After a new worker thread starts, it is placed into the
list of idle threads. But, this may race with a
check for idle done by the worker thread itself, resulting
in a double list_add operation.
This fix adds a check to make sure the idle thread addition
is done properly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
block: use blkdev_issue_discard in blk_ioctl_discard
Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
block: don't assume device has a request list backing in nr_requests store
block: Optimal I/O limit wrapper
cfq: choose a new next_req when a request is dispatched
Seperate read and write statistics of in_flight requests
aoe: end barrier bios with EOPNOTSUPP
block: trace bio queueing trial only when it occurs
block: enable rq CPU completion affinity by default
cfq: fix the log message after dispatched a request
block: use printk_once
cciss: memory leak in cciss_init_one()
splice: update mtime and atime on files
block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
cfq-iosched: get rid of must_alloc flag
block: use interrupts disabled version of raise_softirq_irqoff()
block: fix comment in blk-iopoll.c
block: adjust default budget for blk-iopoll
block: fix long lines in block/blk-iopoll.c
block: add blk-iopoll, a NAPI like approach for block devices
...
blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request. To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour. This will be very useful later on for using the waiting
funcitonality for other callers.
Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When btrfs_get_extent is reading inline file items for readpage,
it needs to copy the inline extent into the page. If the
inline extent doesn't cover all of the page, that means there
is a hole in the file, or that our file is smaller than one
page.
readpage does zeroing for the case where the file is smaller than one
page, but nobody is currently zeroing for the case where there is
a hole after the inline item.
This commit changes btrfs_get_extent to zero fill the page past
the end of the inline item.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This closes a whole where the page may be written before
the page_mkwrite caller has a chance to dirty it
(thanks to Nick Piggin)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones. There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.
Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.
This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs writes go through delalloc to the data=ordered code. This
makes sure that all of the data is on disk before the metadata
that references it. The tracking means that we have to make sure
each page in an extent is fully written before we add that extent into
the on-disk btree.
This was done in the past by setting the EXTENT_ORDERED bit for the
range of an extent when it was added to the data=ordered code, and then
clearing the EXTENT_ORDERED bit in the extent state tree as each page
finished IO.
One of the reasons we had to do this was because sometimes pages are
magically dirtied without page_mkwrite being called. The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page become
dirty without going through the proper path.
These bit operations make for a number of rbtree searches for each page,
and can cause considerable lock contention.
This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
As pages go into the ordered code, PagePrivate2 is set on each one.
This is a cheap operation because we already have all the pages locked
and ready to go.
As IO finishes, the PagePrivate2 bit is cleared and the ordered
accoutning is updated for each page.
At writepage time, if the PagePrivate2 bit is missing, we go into the
writepage fixup code to handle improperly dirtied pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the btrfs code to find delalloc ranges in the extent state
tree to use the new state caching code from set/test bit. It reduces
one of the biggest causes of rbtree searches in the writeback path.
test_range_bit is also modified to take the cached state as a starting
point while searching.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
At writepage time, we have the page locked and we have the
extent_map entry for this extent pinned in the extent_map tree.
So, the page can't go away and its mapping can't change.
There is no need for the extra extent_state lock bits during writepage.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Many of the btrfs extent state tree users follow the same pattern.
They lock an extent range in the tree, do some operation and then
unlock.
This translates to at least 2 rbtree searches, and maybe more if they
are doing operations on the extent state tree. A locked extent
in the tree isn't going to be merged or changed, and so we can
safely return the extent state structure as a cached handle.
This changes set_extent_bit to give back a cached handle, and also
changes both set_extent_bit and clear_extent_bit to use the cached
handle if it is available.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs is currently mirroring some of the page state bits into
its extent state tree. The goal behind this was to use it in supporting
blocksizes other than the page size.
But, we don't currently support that, and we're using quite a lot of CPU
on the rb tree and its spin lock. This commit starts a series of
cleanups to reduce the amount of work done in the extent state tree as
part of each IO.
This commit:
* Adds the ability to lock an extent in the state tree and also set
other bits. The idea is to do locking and delalloc in one call
* Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits. Btrfs is using
a combination of the page bits and the ordered write code for this
instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
As the extent state tree is manipulated, there are call backs
that are used to take extra actions when different state bits are set
or cleared. One example of this is a counter for the total number
of delayed allocation bytes in a single inode and in the whole FS.
When new states are inserted, this callback is being done before we
properly setup the new state. This hasn't caused problems before
because the lock bit was always done first, and the existing call backs
don't care about the lock bit.
This patch makes sure the state is properly setup before using the
callback, which is important for later optimizations that do more work
without using the lock bit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two main users of the extent_map tree. The
first is regular file inodes, where it is evenly spread
between readers and writers.
The second is the chunk allocation tree, which maps blocks from
logical addresses to phyiscal ones, and it is 99.99% reads.
The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs io submission thread tries to back off congested devices in
favor of rotating off to another disk.
But, it tries to make sure it submits at least some IO before rotating
on (the others may be congested too), and so it has a magic number of
requests it tries to write before it hops.
This makes the magic number smaller. Testing shows that we're spending
too much time on congested devices and leaving the other devices idle.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs fills a large delayed allocation extent, it is a good idea
to try and convince the write_cache_pages caller to go ahead and
write a good chunk of that extent. The extra IO is basically free
because we know it is contiguous.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the btrfs worker threads to batch work items
into a local list. It allows us to pull work items in
large chunks and significantly reduces the number of times we
need to take the worker thread spinlock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs worker thread spinlock was being used both for the
queueing of IO and for the processing of ordered events.
The ordered events never happen from end_io handlers, and so they
don't need to use the _irq version of spinlocks. This adds a
dedicated lock to the ordered lists so they don't have to run
with irqs off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The Btrfs set_extent_bit call currently searches the rbtree
every time it needs to find more extent_state objects to fill
the requested operation.
This adds a simple test with rb_next to see if the next object
in the tree was adjacent to the one we just found. If so,
we skip the search and just use the next object.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The Btrfs worker threads don't currently die off after they have
been idle for a while, leading to a lot of threads sitting around
doing nothing for each mount.
Also, they are unable to start atomically (from end_io hanlders).
This commit reworks the worker threads so they can be started
from end_io handlers (just setting a flag that asks for a thread
to be added at a later date) and so they can exit if they
have been idle for a long time.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Node may not be inserted over existing node. This causes inode tree
corruption and I was seeing crashes in inode_tree_del which I can not
reproduce after this patch.
The other way to fix this would be to tie inode lifetime in the rbtree
with inode while not in freeing state. I had a look at this but it is
not so trivial at this point. At least this patch gets things working again.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Acked-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
invalidate_inode_pages2_range may return -EBUSY occasionally
which results Oops. This patch fixes the issue by moving
invalidate_inode_pages2_range into a loop and keeping calling
it until the return value is not -EBUSY.
The EBUSY return is temporary, and can happen when the btrfs release page
function is unable to release a page because the EXTENT_LOCK
bit is set.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_zlib_workspace returns an ERR_PTR value in an error case instead of NULL.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@match exists@
expression x, E;
statement S1, S2;
@@
x = find_zlib_workspace(...)
... when != x = E
(
* if (x == NULL || ...) S1 else S2
|
* if (x == NULL && ...) S1 else S2
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This takes care of the following entry from Dan's list:
fs/btrfs/inode.c +4788 btrfs_rename(36) warning: variable derefenced before check 'old_inode'
Reported-by: Dan Carpenter <error27@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The async caching thread can end up looping forever if a given
search puts it at the last key in a leaf. It will end up calling
btrfs_next_leaf and then checking if it needs to politely drop
the read semaphore.
Most of the time this looping isn't noticed because it is able to
make progress the next time around. But, during log replay,
we wait on the async caching thread to finish, and the async thread
is waiting on the commit, and no progress is really made.
The fix used here is to copy the key out of the next leaf,
that way our search lands there properly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng hit a problem where we tried to remove some free space but failed
because we couldn't find the free space entry. This is because the free space
was held within a bitmap that had a starting offset well before the actual
offset of the free space, and there were free space extents that were in the
same range as that offset, so tree_search_offset returned with NULL because we
couldn't find a free space extent that had that offset. This is fixed by
making sure that if we fail to find the entry, we re-search again with
bitmap_only set to 1 and do an offset_to_bitmap so we can get the appropriate
bitmap. A similar problem happens in btrfs_alloc_from_bitmap for the
clustering code, but that is not as bad since we will just go and redo our
cluster allocation.
Also this adds some debugging checks to make sure that the free space we are
trying to remove from the bitmap is in fact there. This can probably go away
after a while, but since this code is only used by the tree-logging stuff it
would be nice to run with it for a while to make sure there are no problems.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: be more polite in the async caching threads
Btrfs: preserve commit_root for async caching
The semaphore used by the async caching threads can prevent a
transaction commit, which can make the FS appear to stall. This
releases the semaphore more often when a transaction commit is
in progress.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The async block group caching code uses the commit_root pointer
to get a stable version of the extent allocation tree for scanning.
This copy of the tree root isn't going to change and it significantly
reduces the complexity of the scanning code.
During a commit, we have a loop where we update the extent allocation
tree root. We need to loop because updating the root pointer in
the tree of tree roots may allocate blocks which may change the
extent allocation tree.
Right now the commit_root pointer is changed inside this loop. It
is more correct to change the commit_root pointer only after all the
looping is done.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (22 commits)
Btrfs: Fix async caching interaction with unmount
Btrfs: change how we unpin extents
Btrfs: Correct redundant test in add_inode_ref
Btrfs: find smallest available device extent during chunk allocation
Btrfs: clear all space_info->full after removing a block group
Btrfs: make flushoncommit mount option correctly wait on ordered_extents
Btrfs: Avoid delayed reference update looping
Btrfs: Fix ordering of key field checks in btrfs_previous_item
Btrfs: find_free_dev_extent doesn't handle holes at the start of the device
Btrfs: Remove code duplication in comp_keys
Btrfs: async block group caching
Btrfs: use hybrid extents+bitmap rb tree for free space
Btrfs: Fix crash on read failures at mount
Btrfs: remove of redundant btrfs_header_level
Btrfs: adjust NULL test
Btrfs: Remove broken sanity check from btrfs_rmap_block()
Btrfs: convert nested spin_lock_irqsave to spin_lock
Btrfs: make sure all dirty blocks are written at commit time
Btrfs: fix locking issue in btrfs_find_next_key
Btrfs: fix double increment of path->slots[0] in btrfs_next_leaf
...
- don't stop the caching thread until btrfs_commit_super return.
- if caching is interrupted by umount, set last to (u64)-1.
otherwise the un-scanned range of block group will be considered
as free extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We are racy with async block caching and unpinning extents. This patch makes
things much less complicated by only unpinning the extent if the block group is
cached. We check the block_group->cached var under the block_group->lock spin
lock. If it is set to BTRFS_CACHE_FINISHED then we update the pinned counters,
and unpin the extent and add the free space back. If it is not set to this, we
start the caching of the block group so the next time we unpin extents we can
unpin the extent. This keeps us from racing with the async caching threads,
lets us kill the fs wide async thread counter, and keeps us from having to set
DELALLOC bits for every extent we hit if there are caching kthreads going.
One thing that needed to be changed was btrfs_free_super_mirror_extents. Now
instead of just looking for LOCKED extents, we also look for DIRTY extents,
since we could have left some extents pinned in the previous transaction that
will never get freed now that we are unmounting, which would cause us to leak
memory. So btrfs_free_super_mirror_extents has been changed to
btrfs_free_pinned_extents, and it will clear the extents locked for the super
mirror, and any remaining pinned extents that may be present. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
dir has already been tested. It seems that this test should be on the
recently returned value inode.
A simplified version of the semantic match that finds this problem is as
follows: (http://www.emn.fr/x-info/coccinelle/)
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Allocating new block group is easy when the disk has plenty of space.
But things get difficult as the disk fills up, especially if
the FS has been run through btrfs-vol -b. The balance operation
is likely to make the total bytes available on the device greater
than the largest extent we'll actually be able to allocate.
But the device extent allocation code incorrectly assumes that a device
with 5G free will be able to allocate a 5G extent. It isn't normally a
problem because device extents don't get freed unless btrfs-vol -b
is run.
This fixes the device extent allocator to remember the largest free
extent it can find, and then uses that value as a fallback.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs allocates individual extents from block groups, and each
block group has a specific type. It may hold metadata, data
mirrored or striped etc.
When we balance space (btrfs-vol -b) or remove a drive (btrfs-vol -r)
we free block groups. Once a block group is freed, the space it was
using on the device may be available for use by new block groups.
btrfs_remove_block_group was clearing the flag that said
'our devices are full, don't even try to allocate new block groups',
but it was only clearing that flag for a specific type of block group.
This commit clears the full flag for all of the types of block groups,
making it much more likely that we'll be able to balance space when
the drive is close to full.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The commit_transaction call to wait_ordered_extents when snap_pending
passes nocow_only=1 to process only NOCOW or PREALLOC extents. This isn't
correct for the 'flushoncommit' mode, as it skips extents we just started
IO on in start_delalloc_inodes.
So, in the flushoncommit case, wait on all ordered extents. Otherwise,
only pass the nocow_only flag to wait_ordered_extents if snap_pending.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_split_leaf and btrfs_del_items can end up in a loop
where one is constantly spliting a given leaf and the other
is constantly merging it back with the adjacent nodes.
There is a better fix for this, but in the interest of something
small, this patch just changes btrfs_del_items back to balancing less
often.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Check objectid of item before checking the item type, otherwise we may return
zero for a key that is actually too low.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_free_dev_extent does not properly handle the case where
the device is not complete free, and there is a free extent
at the beginning of the device.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
comp_keys is duplicating what is done in btrfs_comp_cpu_keys, so just
call it.
Signed-off-by: Diego Calleja <diegocg@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch moves the caching of the block group off to a kthread in order to
allow people to allocate sooner. Instead of blocking up behind the caching
mutex, we instead kick of the caching kthread, and then attempt to make an
allocation. If we cannot, we wait on the block groups caching waitqueue, which
the caching kthread will wake the waiting threads up everytime it finds 2 meg
worth of space, and then again when its finished caching. This is how I tested
the speedup from this
mkfs the disk
mount the disk
fill the disk up with fs_mark
unmount the disk
mount the disk
time touch /mnt/foo
Without my changes this took 11 seconds on my box, with these changes it now
takes 1 second.
Another change thats been put in place is we lock the super mirror's in the
pinned extent map in order to keep us from adding that stuff as free space when
caching the block group. This doesn't really change anything else as far as the
pinned extent map is concerned, since for actual pinned extents we use
EXTENT_DIRTY, but it does mean that when we unmount we have to go in and unlock
those extents to keep from leaking memory.
I've also added a check where when we are reading block groups from disk, if the
amount of space used == the size of the block group, we go ahead and mark the
block group as cached. This drastically reduces the amount of time it takes to
cache the block groups. Using the same test as above, except doing a dd to a
file and then unmounting, it used to take 33 seconds to umount, now it takes 3
seconds.
This version uses the commit_root in the caching kthread, and then keeps track
of how many async caching threads are running at any given time so if one of the
async threads is still running as we cross transactions we can wait until its
finished before handling the pinned extents. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently btrfs has a problem where it can use a ridiculous amount of RAM simply
tracking free space. As free space gets fragmented, we end up with thousands of
entries on an rb-tree per block group, which usually spans 1 gig of area. Since
we currently don't ever flush free space cache back to disk this gets to be a
bit unweildly on large fs's with lots of fragmentation.
This patch solves this problem by using PAGE_SIZE bitmaps for parts of the free
space cache. Initially we calculate a threshold of extent entries we can
handle, which is however many extent entries we can cram into 16k of ram. The
maximum amount of RAM that should ever be used to track 1 gigabyte of diskspace
will be 32k of RAM, which scales much better than we did before.
Once we pass the extent threshold, we start adding bitmaps and using those
instead for tracking the free space. This patch also makes it so that any free
space thats less than 4 * sectorsize we go ahead and put into a bitmap. This is
nice since we try and allocate out of the front of a block group, so if the
front of a block group is heavily fragmented and then has a huge chunk of free
space at the end, we go ahead and add the fragmented areas to bitmaps and use a
normal extent entry to track the big chunk at the back of the block group.
I've also taken the opportunity to revamp how we search for free space.
Previously we indexed free space via an offset indexed rb tree and a bytes
indexed rb tree. I've dropped the bytes indexed rb tree and use only the offset
indexed rb tree. This cuts the number of tree operations we were doing
previously down by half, and gives us a little bit of a better allocation
pattern since we will always start from a specific offset and search forward
from there, instead of searching for the size we need and try and get it as
close as possible to the offset we want.
I've given this a healthy amount of testing pre-new format stuff, as well as
post-new format stuff. I've booted up my fedora box which is installed on btrfs
with this patch and ran with it for a few days without issues. I've not seen
any performance regressions in any of my tests.
Since the last patch Yan Zheng fixed a problem where we could have overlapping
entries, so updating their offset inline would cause problems. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If the tree roots hit read errors during mount, btrfs is not properly
erroring out. We need to check the uptodate bits after
reading in the tree root node.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This removes the continues call's of btrfs_header_level. One call of
btrfs_header_level(c) its enough.
Signed-off-by Daniel Cadete <danielncadete10@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Move the call to BUG_ON to before the dereference of the tested value.
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It was never actually doing anything anyway (see the loop condition),
and it would be difficult to make it work for RAID[56].
Even if it was actually working, it's checking for the wrong thing
anyway. Instead of checking whether we list a block which _doesn't_ land
at the relevant physical location, it should be checking that we _have_
listed all the logical blocks which refer to the required physical
location on all devices.
This function is only called from remove_sb_from_cache() to ensure that
we reserve the logical blocks which would reside at the same physical
location as the superblock copies. So listing more blocks than we need
is actually OK.
With RAID[56] we're going to throw away an entire stripe for each block
we have to ignore, so we _are_ going to list blocks other than the
ones which actually contain the superblock.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If spin_lock_irqsave is called twice in a row with the same second
argument, the interrupt state at the point of the second call overwrites
the value saved by the first call. Indeed, the second call does not need
to save the interrupt state, so it is changed to a simple spin_lock.
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Write dirty block groups may allocate new block, and so may add new delayed
back ref. btrfs_run_delayed_refs may make some block groups dirty.
commit_cowonly_roots does not handle the recursion properly, and some dirty
blocks can be left unwritten at commit time. This patch moves
btrfs_run_delayed_refs into the loop that writes dirty block groups, and makes
the code not break out of the loop until there are no dirty block groups or
delayed back refs.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When walking up the tree, btrfs_find_next_key assumes the upper level tree
block is properly locked. This isn't always true even path->keep_locks is 1.
This is because btrfs_find_next_key may advance path->slots[] several times
instead of only once.
When 'path->slots[level] >= btrfs_header_nritems(path->nodes[level])' is found,
we can't guarantee the original value of 'path->slots[level]' is
'btrfs_header_nritems(path->nodes[level]) - 1'. If it's not, the tree block at
'level + 1' isn't locked.
This patch fixes the issue by explicitly checking the locking state,
re-searching the tree if it's not locked.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
if 1 is returned by btrfs_search_slot, the path already points to the
first item with 'key > searching key'. So increasing path->slots[0] by
one is superfluous in that case.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Change 'goto done' to 'break' for the case of all device extents have
been freed, so that the code updates space information will be execute.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
use __le64 instead of u64 in on-disk structure definition.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT
This will make hardirq.h inclusion cheaper for every PREEMPT=n config
(which includes allmodconfig/allyesconfig, BTW)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix error message formatting
Btrfs: fix use after free in btrfs_start_workers fail path
Btrfs: honor nodatacow/sum mount options for new files
Btrfs: update backrefs while dropping snapshot
Btrfs: account for space we may use in fallocate
Btrfs: fix the file clone ioctl for preallocated extents
Btrfs: don't log the inode in file_write while growing the file
Make an error msg look nicer by inserting a space between number and word.
Signed-off-by: Hu Tao <hu.taoo@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
worker memory is already freed on one fail path in btrfs_start_workers,
but is still dereferenced. Switch the dereference and kfree.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs attr patches unconditionally inherited the inode flags field
without honoring nodatacow and nodatasum. This fix makes sure
we properly record the nodatacow/sum mount options in new inodes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The new backref format has restriction on type of backref item. If a tree
block isn't referenced by its owner tree, full backrefs must be used for the
pointers in it. When a tree block loses its owner tree's reference, backrefs
for the pointers in it should be updated to full backrefs. Current
btrfs_drop_snapshot misses the code that updates backrefs, so it's unsafe for
general use.
This patch adds backrefs update code to btrfs_drop_snapshot. It isn't a
problem in the restricted form btrfs_drop_snapshot is used today, but for
general snapshot deletion this update is required.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Using Eric Sandeen's xfstest for fallocate, you can easily trigger a ENOSPC
panic on btrfs. This is because we do not account for data we may use when
doing the fallocate. This patch fixes the problem by properly reserving space,
and then just freeing it when we are done. The reservation stuff was made with
delalloc in mind, so its a little crude for this case, but it keeps the box
from panicing.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
helpers: get_cached_acl(inode, type), set_cached_acl(inode, type, acl),
forget_cached_acl(inode, type).
ubifs/xattr.c needed includes reordered, the rest is a plain switchover.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: remove some includings of blktrace_api.h
mg_disk: seperate mg_disk.h again
block: Introduce helper to reset queue limits to default values
cfq: remove extraneous '\n' in blktrace output
ubifs: register backing_dev_info
btrfs: properly register fs backing device
block: don't overwrite bdi->state after bdi_init() has been run
cfq: cleanup for last_end_request in cfq_data
btrfs assigns this bdi to all inodes on that file system, so make
sure it's registered. This isn't really important now, but will be
when we put dirty inodes there. Even now, we miss the stats when the
bdi isn't visible.
Also fixes failure to check bdi_init() return value, and bad inherit of
->capabilities flags from the default bdi.
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
commit_fs_roots skips updating root items for fs trees that aren't modified.
This is unsafe now that relocation code modifies root item's last_snapshot
field without modifying corresponding fs tree.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Make sure a superblock really is writeable by checking MS_RDONLY
under s_umount. sync_filesystems needed some re-arragement for
that, but all but one sync_filesystem caller had the correct locking
already so that we could add that check there. cachefiles grew
s_umount locking.
I've also added a WARN_ON to sync_filesystem to assert this for
future callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
During tree log replay, we read in the tree log roots,
process them and then free them. A recent change
takes an extra reference on the root node of the tree
when the root is read in, and stores that reference
in root->commit_root.
This reference was not being freed, leaving us with
one buffer pinned in ram for each subvol with
a tree log root after a crash.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
lookup_inline_extent_backref only checks for duplicate backref for data
extents. It assumes backrefs for tree block never conflict.
This patch makes lookup_inline_extent_backref check for duplicate backrefs
for both data and tree block, so that we can detect potential bug earlier.
This is a safety check, strictly speaking it is not required.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes a bug which may result race condition
between btrfs_start_workers() and worker_loop().
btrfs_start_workers() executed in a parent thread writes
on workers->worker and worker_loop() in a child thread
reads workers->worker. However, there is no synchronization
enforcing the order of two operations.
This patch makes btrfs_start_workers() fill workers->worker
before it starts a child thread with worker_loop()
Signed-off-by: Chris Mason <chris.mason@oracle.com>
write_dev_supers is called in sequence. First is it called with wait == 0,
which starts IO on all of the super blocks for a given device. Then it is
called with wait == 1 to make sure they all reach the disk.
It doesn't currently pin the buffers between the two calls, and it also
assumes the buffers won't go away between the two calls, leading to
an oops if the VM manages to free the buffers in the middle of the sync.
This fixes that assumption and updates the code to return an error if things
are not up to date when the wait == 1 run is done.
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On multi-device filesystems, btrfs writes supers to all of the devices
before considering a sync complete. There wasn't any additional
locking between super writeout and the device list management code
because device management was done inside a transaction and
super writeout only happened with no transation writers running.
With the btrfs fsync log and other async transaction updates, this
has been racey for some time. This adds a mutex to protect
the device list. The existing volume mutex could not be reused due to
transaction lock ordering requirements.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
... otherwise generic_permission() will allow *anything* for all
files you don't own and that have some group permissions.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs, fdatasync and fsync are identical, but
fdatasync should skip committing transaction when
inode->i_state is set just I_DIRTY_SYNC and this indicates
only atime or/and mtime updates.
Following patch improves fdatasync throughput.
--file-block-size=4K --file-total-size=16G --file-test-mode=rndwr
--file-fsync-mode=fdatasync run
Results:
-2.6.30-rc8
Test execution summary:
total time: 1980.6540s
total number of events: 10001
total time taken by event execution: 1192.9804
per-request statistics:
min: 0.0000s
avg: 0.1193s
max: 15.3720s
approx. 95 percentile: 0.7257s
Threads fairness:
events (avg/stddev): 625.0625/151.32
execution time (avg/stddev): 74.5613/9.46
-2.6.30-rc8-patched
Test execution summary:
total time: 1695.9118s
total number of events: 10000
total time taken by event execution: 871.3214
per-request statistics:
min: 0.0000s
avg: 0.0871s
max: 10.4644s
approx. 95 percentile: 0.4787s
Threads fairness:
events (avg/stddev): 625.0000/131.86
execution time (avg/stddev): 54.4576/8.98
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There's no need to preserve this abstraction; it used to let us use
hardware crc32c support directly, but libcrc32c is already doing that for us
through the crypto API -- so we're already using the Intel crc32c
acceleration where appropriate.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add support for the standard attributes set via chattr and read via
lsattr. Currently we store the attributes in the flags value in
the btrfs inode, but I wonder whether we should split it into two so
that we don't have to keep converting between the two formats.
Remove the btrfs_clear_flag/btrfs_set_flag/btrfs_test_flag macros
as they were confusing the existing code and got in the way of the
new additions.
Also add the FS_IOC_GETVERSION ioctl for getting i_generation as it's
trivial.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
During mount, btrfs will check the queue nonrot flag
for all the devices found in the FS. If they are all
non-rotating, SSD mode is enabled by default.
If the FS was mounted with -o nossd, the non-rotating
flag is ignored.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Some SSDs perform best when reusing block numbers often, while
others perform much better when clustering strictly allocates
big chunks of unused space.
The default mount -o ssd will find rough groupings of blocks
where there are a bunch of free blocks that might have some
allocated blocks mixed in.
mount -o ssd_spread will make sure there are no allocated blocks
mixed in. It should perform better on lower end SSDs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In SSD mode for data, and all the time for metadata the allocator
will try to find a cluster of nearby blocks for allocations. This
commit adds extra checks to make sure that each free block in the
cluster is close to the last one.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs IO submission threads try to service a bunch of devices with a small
number of threads. They do a congestion check to try and avoid waiting
on requests for a busy device.
The checks make sure we've sent a few requests down to a given device just so
that we aren't bouncing between busy devices without actually sending down
any IO. The counter used to decide if we can switch to the next device
is somewhat overloaded. It is also being used to decide if we've done
a good batch of requests between the WRITE_SYNC or regular priority lists.
It may get reset to zero often, leaving us hammering on a busy device
instead of moving on to another disk.
This commit adds a new counter for the number of bios sent while
servicing a device. It doesn't get reset or fiddled with. On
multi-device filesystems, this fixes IO stalls in streaming
write workloads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs uses dedicated threads to submit bios when checksumming is on,
which allows us to make sure the threads dedicated to checksumming don't get
stuck waiting for requests. For each btrfs device, there are
two lists of bios. One list is for WRITE_SYNC bios and the other
is for regular priority bios.
The IO submission threads used to process all of the WRITE_SYNC bios first and
then switch to the regular bios. This commit makes sure we don't completely
starve the regular bios by rotating between the two lists.
WRITE_SYNC bios are still favored 2:1 over the regular bios, and this tries
to run in batches to avoid seeking. Benchmarking shows this eliminates
stalls during streaming buffered writes on both multi-device and
single device filesystems.
If the regular bios starve, the system can end up with a large amount of ram
pinned down in writeback pages. If we are a little more fair between the two
classes, we're able to keep throughput up and make progress on the bulk of
our dirty ram.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Once a metadata block has been written, it must be recowed, so the
btrfs dirty balancing call has a check to make sure a fair amount of metadata
was actually dirty before it started writing it back to disk.
A previous commit had changed the dirty tracking for metadata without
updating the btrfs dirty balancing checks. This commit switches it
to use the correct counter.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The block allocator in SSD mode will try to find groups of free blocks
that are close together. This commit makes it loop less on a given
group size before bumping it.
The end result is that we are less likely to fill small holes in the
available free space, but we don't waste as much CPU building the
large cluster used by ssd mode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the new back reference code, the cost of a balance has gone down
in terms of the number of back reference updates done. This commit
makes us more aggressively balance leaves and nodes as they become
less full.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When the delayed reference code was added, some checks were added
to avoid extra balancing while the delayed references were being flushed.
This made for less efficient btrees, but it reduced the chances of
loops where no forward progress was made because the balances made
more delayed ref updates.
With the new dead root removal code and the mixed back references,
the extent allocation tree is no longer using precise back refs, and
the delayed reference updates don't carry the risk of looping forever
anymore. So, the balance avoidance is no longer required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are some 'start = state->end + 1;' like code in set_extent_bit
and clear_extent_bit. They overflow when end == (u64)-1.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: Fix oops and use after free during space balancing
Btrfs: set device->total_disk_bytes when adding new device
The btrfs allocator uses list_for_each to walk the available block
groups when searching for free blocks. It starts off with a hint
to help find the best block group for a given allocation.
The hint is resolved into a block group, but we don't properly check
to make sure the block group we find isn't in the middle of being
freed due to filesystem shrinking or balancing. If it is being
freed, the list pointers in it are bogus and can't be trusted. But,
the code happily goes along and uses them in the list_for_each loop,
leading to all kinds of fun.
The fix used here is to check to make sure the block group we find really
is on the list before we use it. list_del_init is used when removing
it from the list, so we can do a proper check.
The allocation clustering code has a similar bug where it will trust
the block group in the current free space cluster. If our allocation
flags have changed (going from single spindle dup to raid1 for example)
because the drives in the FS have changed, we're not allowed to use
the old block group any more.
The fix used here is to check the current cluster against the
current allocation flags.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: Spelling fix in btrfs_lookup_first_block_group comments
Btrfs: make show_options result match actual option names
Btrfs: remove outdated comment in btrfs_ioctl_resize()
Btrfs: remove some WARN_ONs in the IO failure path
Btrfs: Don't loop forever on metadata IO failures
Btrfs: init inode ordered_data_close flag properly
The notreelog and flushoncommit mount options were being printed slightly
differently.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In Li Zefan's commit dae7b665cf,
a combination call of kmalloc() and copy_from_user() is replaced by
memdup_user(). So btrfs_ioctl_resize() doesn't use GFP_NOFS any more.
Signed-off-by: Li Hong <lihong.hi@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These debugging WARN_ONs make too much console noise during regular
IO failures. An IO failure will still generate a number of messages
as we verify checksums etc, but these two are not needed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a btrfs metadata read fails, the first thing we try to do is find
a good copy on another mirror of the block. If this fails, read_tree_block()
ends up returning a buffer that isn't up to date.
The btrfs btree reading code was reworked to drop locks and repeat
the search when IO was done, but the changes didn't add a check for failed
reads. The end result was looping forever on buffers that were never
going to become up to date.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This flag is used to decide when we need to send a given file through
the ordered code to make sure it is fully written before a transaction
commits. It was not being properly set to zero when the inode was
being setup.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: look for acls during btrfs_read_locked_inode
Btrfs: fix acl caching
Btrfs: Fix a bunch of printk() warnings.
Btrfs: Fix a trivial warning using max() of u64 vs ULL.
Btrfs: remove unused btrfs_bit_radix slab
Btrfs: ratelimit IO error printks
Btrfs: remove #if 0 code
Btrfs: When shrinking, only update disk size on success
Btrfs: fix deadlocks and stalls on dead root removal
Btrfs: fix fallocate deadlock on inode extent lock
Btrfs: kill btrfs_cache_create
Btrfs: don't export symbols
Btrfs: simplify makefile
Btrfs: try to keep a healthy ratio of metadata vs data block groups
This changes btrfs_read_locked_inode() to peek ahead in the btree for acl items.
If it is certain a given inode has no acls, it will set the in memory acl
fields to null to avoid acl lookups completely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Linus noticed the btrfs code to cache acls wasn't properly caching
a NULL acl when the inode didn't have any acls. This meant the common
case of no acls resulted in expensive btree searches every time the
kernel checked permissions (which is quite often).
This is a modified version of Linus' original patch:
Properly set initial acl fields to BTRFS_ACL_NOT_CACHED in the inode.
This forces an acl lookup when permission checks are done.
Fix btrfs_get_acl to avoid lookups and locking when the inode acls fields
are set to null.
Fix btrfs_get_acl to use the right return value from __btrfs_getxattr
when deciding to cache a NULL acl. It was storing a NULL acl when
__btrfs_getxattr return -ENOENT, but __btrfs_getxattr was actually returning
-ENODATA for this case.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Just happened to notice a bunch of %llu vs u64 warnings. Here's a patch
to cast them all.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A small warning popped up on ia64 because inode-map.c was comparing a
u64 object id with the ULL FIRST_FREE_OBJECTID. My first thought was
that all the OBJECTID constants should contain the u64 cast because
btrfs code deals entirely in u64s. But then I saw how large that was,
and figured I'd just fix the max() call.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs has printks for various IO errors, including bad checksums and
mismatches between what we expect the block headers to contain and what
we actually find on the disk.
Longer term we need a real reporting mechanism for this, but for now
printk is going to have to do.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Previously, we updated a device's size prior to attempting a shrink
operation. This patch moves the device resizing logic to only happen if
the shrink completes successfully. In the process, it introduces a new
field to btrfs_device -- disk_total_bytes -- to track the on-disk size.
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After a transaction commit, the old root of the subvol btrees are sent through
snapshot removal. This is what actually frees up any blocks replaced by
COW, and anything the old blocks pointed to.
Snapshot deletion will pause when a transaction commit has started, which
helps to avoid a huge amount of delayed reference count updates piling up
as the transaction is trying to close.
But, this pause happens after the snapshot deletion process has asked other
procs on the system to throttle back a bit so that it can make progress.
We don't want to throttle everyone while we're waiting for the transaction
commit, it leads to deadlocks in the user transaction ioctls used by Ceph
and makes things slower in general.
This patch changes things to avoid the throttling while we sleep.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs fallocate call takes an extent lock on the entire range
being fallocated, and then runs through insert_reserved_extent on each
extent as they are allocated.
The problem with this is that btrfs_drop_extents may decide to try
and take the same extent lock fallocate was already holding. The solution
used here is to push down knowledge of the range that is already locked
going into btrfs_drop_extents.
It turns out that at least one other caller had the same bug.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently the extent_map code is only for btrfs so don't export it's
symbols.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Get rid of the hacks for building out of tree, and always use += for
assigning to the object lists.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch makes the chunk allocator keep a good ratio of metadata vs data
block groups. By default for every 8 data block groups, we'll allocate 1
metadata chunk, or about 12% of the disk will be allocated for metadata. This
can be changed by specifying the metadata_ratio mount option.
This is simply the number of data block groups that have to be allocated to
force a metadata chunk allocation. By making sure we allocate metadata chunks
more often, we are less likely to get into situations where the whole disk
has been allocated as data block groups.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix btrfs fallocate oops and deadlock
Btrfs: use the right node in reada_for_balance
Btrfs: fix oops on page->mapping->host during writepage
Btrfs: add a priority queue to the async thread helpers
Btrfs: use WRITE_SYNC for synchronous writes
Btrfs fallocate was incorrectly starting a transaction with a lock held
on the extent_io tree for the file, which could deadlock. Strictly
speaking it was using join_transaction which would be safe, but it is better
to move the transaction outside of the lock.
When preallocated extents are overwritten, btrfs_mark_buffer_dirty was
being called on an unlocked buffer. This was triggering an assertion and
oops because the lock is supposed to be held.
The bug was calling btrfs_mark_buffer_dirty on a leaf after btrfs_del_item had
been run. btrfs_del_item takes care of dirtying things, so the solution is a
to skip the btrfs_mark_buffer_dirty call in this case.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Remove open-coded memdup_user().
Note this changes some GFP_NOFS to GFP_KERNEL, since copy_from_user() may
cause pagefault, it's pointless to pass GFP_NOFS to kmalloc().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
reada_for_balance was using the wrong index into the path node array,
so it wasn't reading the right blocks. We never directly used the
results of the read done by this function because the btree search is
started over at the end.
This fixes reada_for_balance to reada in the correct node and to
avoid searching past the last slot in the node. It also makes sure to
hold the parent lock while we are finding the nodes to read.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_io writepage call updates the writepage index in the inode
as it makes progress. But, it was doing the update after unlocking the page,
which isn't legal because page->mapping can't be trusted once the page
is unlocked.
This lead to an oops, especially common with compression turned on. The
fix here is to update the writeback index before unlocking the page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs is using WRITE_SYNC_PLUG to send down synchronous IOs with a
higher priority. But, the checksumming helper threads prevent it
from being fully effective.
There are two problems. First, a big queue of pending checksumming
will delay the synchronous IO behind other lower priority writes. Second,
the checksumming uses an ordered async work queue. The ordering makes sure
that IOs are sent to the block layer in the same order they are sent
to the checksumming threads. Usually this gives us less seeky IO.
But, when we start mixing IO priorities, the lower priority IO can delay
the higher priority IO.
This patch solves both problems by adding a high priority list to the async
helper threads, and a new btrfs_set_work_high_prio(), which is used
to make put a new async work item onto the higher priority list.
The ordering is still done on high priority IO, but all of the high
priority bios are ordered separately from the low priority bios. This
ordering is purely an IO optimization, it is not involved in data
or metadata integrity.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Part of reducing fsync/O_SYNC/O_DIRECT latencies is using WRITE_SYNC for
writes we plan on waiting on in the near future. This patch
mirrors recent changes in other filesystems and the generic code to
use WRITE_SYNC when WB_SYNC_ALL is passed and to use WRITE_SYNC for
other latency critical writes.
Btrfs uses async worker threads for checksumming before the write is done,
and then again to actually submit the bios. The bio submission code just
runs a per-device list of bios that need to be sent down the pipe.
This list is split into low priority and high priority lists so the
WRITE_SYNC IO happens first.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: BUG to BUG_ON changes
Btrfs: remove dead code
Btrfs: remove dead code
Btrfs: fix typos in comments
Btrfs: remove unused ftrace include
Btrfs: fix __ucmpdi2 compile bug on 32 bit builds
Btrfs: free inode struct when btrfs_new_inode fails
Btrfs: fix race in worker_loop
Btrfs: add flushoncommit mount option
Btrfs: notreelog mount option
Btrfs: introduce btrfs_show_options
Btrfs: rework allocation clustering
Btrfs: Optimize locking in btrfs_next_leaf()
Btrfs: break up btrfs_search_slot into smaller pieces
Btrfs: kill the pinned_mutex
Btrfs: kill the block group alloc mutex
Btrfs: clean up find_free_extent
Btrfs: free space cache cleanups
Btrfs: unplug in the async bio submission threads
Btrfs: keep processing bios for a given bdev if our proc is batching
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Remove two unneeded exports and make two symbols static in fs/mpage.c
Cleanup after commit 585d3bc06f
Trim includes of fdtable.h
Don't crap into descriptor table in binfmt_som
Trim includes in binfmt_elf
Don't mess with descriptor table in load_elf_binary()
Get rid of indirect include of fs_struct.h
New helper - current_umask()
check_unsafe_exec() doesn't care about signal handlers sharing
New locking/refcounting for fs_struct
Take fs_struct handling to new file (fs/fs_struct.c)
Get rid of bumping fs_struct refcount in pivot_root(2)
Kill unsharing fs_struct in __set_personality()
Remove an unneeded return statement and conditional
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We get this on 32 builds:
fs/built-in.o: In function `extent_fiemap':
(.text+0x1019f2): undefined reference to `__ucmpdi2'
Happens because of a switch statement with a 64 bit argument.
Convert this to an if statement to fix this.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_new_inode doesn't call iput to free the inode
when it fails.
Signed-off-by: Shen Feng <shen@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Need to check kthread_should_stop after schedule_timeout() before calling
schedule(). This causes threads to sleep with potentially no one to wake them
up causing mount(2) to hang in btrfs_stop_workers waiting for threads to stop.
Signed-off-by: Amit Gud <gud@ksu.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The 'flushoncommit' mount option forces any data dirtied by a write in a
prior transaction to commit as part of the current commit. This makes
the committed state a fully consistent view of the file system from the
application's perspective (i.e., it includes all completed file system
operations). This was previously the behavior only when a snapshot is
created.
This is used by Ceph to ensure that completed writes make it to the
platter along with the metadata operations they are bound to (by
BTRFS_IOC_TRANS_{START,END}).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a 'notreelog' mount option to disable the tree log (used by fsync,
O_SYNC writes). This is much slower, but the tree logging produces
inconsistent views into the FS for ceph.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs options can change at times other than mount, yet /proc/mounts shows the
options string used when the fs was mounted (an example would be when btrfs
determines that barriers aren't useful and turns them off.) This patch
instead outputs the actual options in use by btrfs.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because btrfs is copy-on-write, we end up picking new locations for
blocks very often. This makes it fairly difficult to maintain perfect
read patterns over time, but we can at least do some optimizations
for writes.
This is done today by remembering the last place we allocated and
trying to find a free space hole big enough to hold more than just one
allocation. The end result is that we tend to write sequentially to
the drive.
This happens all the time for metadata and it happens for data
when mounted -o ssd. But, the way we record it is fairly racey
and it tends to fragment the free space over time because we are trying
to allocate fairly large areas at once.
This commit gets rid of the races by adding a free space cluster object
with dedicated locking to make sure that only one process at a time
is out replacing the cluster.
The free space fragmentation is somewhat solved by allowing a cluster
to be comprised of smaller free space extents. This part definitely
adds some CPU time to the cluster allocations, but it allows the allocator
to consume the small holes left behind by cow.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_next_leaf was using blocking locks when it could have been using
faster spinning ones instead. This adds a few extra checks around
the pieces that block and switches over to spinning locks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch removes the pinned_mutex. The extent io map has an internal tree
lock that protects the tree itself, and since we only copy the extent io map
when we are committing the transaction we don't need it there. We also don't
need it when caching the block group since searching through the tree is also
protected by the internal map spin lock.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch removes the block group alloc mutex used to protect the free space
tree for allocations and replaces it with a spin lock which is used only to
protect the free space rb tree. This means we only take the lock when we are
directly manipulating the tree, which makes us a touch faster with
multi-threaded workloads.
This patch also gets rid of btrfs_find_free_space and replaces it with
btrfs_find_space_for_alloc, which takes the number of bytes you want to
allocate, and empty_size, which is used to indicate how much free space should
be at the end of the allocation.
It will return an offset for the allocator to use. If we don't end up using it
we _must_ call btrfs_add_free_space to put it back. This is the tradeoff to
kill the alloc_mutex, since we need to make sure nobody else comes along and
takes our space.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
I've replaced the strange looping constructs with a list_for_each_entry on
space_info->block_groups. If we have a hint we just jump into the loop with
the block group and start looking for space. If we don't find anything we
start at the beginning and start looking. We never come out of the loop with a
ref on the block_group _unless_ we found space to use, then we drop it after we
set the trans block_group.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch cleans up the free space cache code a bit. It better documents the
idiosyncrasies of tree_search_offset and makes the code make a bit more sense.
I took out the info allocation at the start of __btrfs_add_free_space and put it
where it makes more sense. This was left over cruft from when alloc_mutex
existed. Also all of the re-searches we do to make sure we inserted properly.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Btrfs pages being written get set to writeback, and then may go through
a number of steps before they hit the block layer. This includes compression,
checksumming and async bio submission.
The end result is that someone who writes a page and then does
wait_on_page_writeback is likely to unplug the queue before the bio they
cared about got there.
We could fix this by marking bios sync, or by doing more frequent unplugs,
but this commit just changes the async bio submission code to unplug
after it has processed all the bios for a device. The async bio submission
does a fair job of collection bios, so this shouldn't be a huge problem
for reducing merging at the elevator.
For streaming O_DIRECT writes on a 5 drive array, it boosts performance
from 386MB/s to 460MB/s.
Thanks to Hisashi Hifumi for helping with this work.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs uses async helper threads to submit write bios so the checksumming
helper threads don't block on the disk.
The submit bio threads may process bios for more than one block device,
so when they find one device congested they try to move on to other
devices instead of blocking in get_request_wait for one device.
This does a pretty good job of keeping multiple devices busy, but the
congested flag has a number of problems. A congested device may still
give you a request, and other procs that aren't backing off the congested
device may starve you out.
This commit uses the io_context stored in current to decide if our process
has been made a batching process by the block layer. If so, it keeps
sending IO down for at least one batch. This helps make sure we do
a good amount of work each time we visit a bdev, and avoids large IO
stalls in multi-device workloads.
It's also very ugly. A better solution is in the works with Jens Axboe.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: try to free metadata pages when we free btree blocks
Btrfs: add extra flushing for renames and truncates
Btrfs: make sure btrfs_update_delayed_ref doesn't increase ref_mod
Btrfs: optimize fsyncs on old files
Btrfs: tree logging unlink/rename fixes
Btrfs: Make sure i_nlink doesn't hit zero too soon during log replay
Btrfs: limit balancing work while flushing delayed refs
Btrfs: readahead checksums during btrfs_finish_ordered_io
Btrfs: leave btree locks spinning more often
Btrfs: Only let very young transactions grow during commit
Btrfs: Check for a blocking lock before taking the spin
Btrfs: reduce stack in cow_file_range
Btrfs: reduce stalls during transaction commit
Btrfs: process the delayed reference queue in clusters
Btrfs: try to cleanup delayed refs while freeing extents
Btrfs: reduce stack usage in some crucial tree balancing functions
Btrfs: do extent allocation and reference count updates in the background
Btrfs: don't preallocate metadata blocks during btrfs_search_slot
page_mkwrite is called with neither the page lock nor the ptl held. This
means a page can be concurrently truncated or invalidated out from
underneath it. Callers are supposed to prevent truncate races themselves,
however previously the only thing they can do in case they hit one is to
raise a SIGBUS. A sigbus is wrong for the case that the page has been
invalidated or truncated within i_size (eg. hole punched). Callers may
also have to perform memory allocations in this path, where again, SIGBUS
would be wrong.
The previous patch ("mm: page_mkwrite change prototype to match fault")
made it possible to properly specify errors. Convert the generic buffer.c
code and btrfs to return sane error values (in the case of page removed
from pagecache, VM_FAULT_NOPAGE will cause the fault handler to exit
without doing anything, and the fault will be retried properly).
This fixes core code, and converts btrfs as a template/example. All other
filesystems defining their own page_mkwrite should be fixed in a similar
manner.
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Change the page_mkwrite prototype to take a struct vm_fault, and return
VM_FAULT_xxx flags. There should be no functional change.
This makes it possible to return much more detailed error information to
the VM (and also can provide more information eg. virtual_address to the
driver, which might be important in some special cases).
This is required for a subsequent fix. And will also make it easier to
merge page_mkwrite() with fault() in future.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Artem Bityutskiy <dedekind@infradead.org>
Cc: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
COW means we cycle though blocks fairly quickly, and once we
free an extent on disk, it doesn't make much sense to keep the pages around.
This commit tries to immediately free the page when we free the extent,
which lowers our memory footprint significantly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Renames and truncates are both common ways to replace old data with new
data. The filesystem can make an effort to make sure the new data is
on disk before actually replacing the old data.
This is especially important for rename, which many application use as
though it were atomic for both the data and the metadata involved. The
current btrfs code will happily replace a file that is fully on disk
with one that was just created and still has pending IO.
If we crash after transaction commit but before the IO is done, we'll end
up replacing a good file with a zero length file. The solution used
here is to create a list of inodes that need special ordering and force
them to disk before the commit is done. This is similar to the
ext3 style data=ordering, except it is only done on selected files.
Btrfs is able to get away with this because it does not wait on commits
very often, even for fsync (which use a sub-commit).
For renames, we order the file when it wasn't already
on disk and when it is replacing an existing file. Larger files
are sent to filemap_flush right away (before the transaction handle is
opened).
For truncates, we order if the file goes from non-zero size down to
zero size. This is a little different, because at the time of the
truncate the file has no dirty bytes to order. But, we flag the inode
so that it is added to the ordered list on close (via release method). We
also immediately add it to the ordered list of the current transaction
so that we can try to flush down any writes the application sneaks in
before commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_update_delayed_ref is optimized to add and remove different
references in one pass through the delayed ref tree. It is a zero
sum on the total number of refs on a given extent.
But, the code was recording an extra ref in the head node. This
never made it down to the disk but was used when deciding if it was
safe to free the extent while dropping snapshots.
The fix used here is to make sure the ref_mod count is unchanged
on the head ref when btrfs_update_delayed_ref is called.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The fsync log has code to make sure all of the parents of a file are in the
log along with the file. It uses a minimal log of the parent directory
inodes, just enough to get the parent directory on disk.
If the transaction that originally created a file is fully on disk,
and the file hasn't been renamed or linked into other directories, we
can safely skip the parent directory walk. We know the file is on disk
somewhere and we can go ahead and just log that single file.
This is more important now because unrelated unlinks in the parent directory
might make us force a commit if we try to log the parent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The tree logging code allows individual files or directories to be logged
without including operations on other files and directories in the FS.
It tries to commit the minimal set of changes to disk in order to
fsync the single file or directory that was sent to fsync or O_SYNC.
The tree logging code was allowing files and directories to be unlinked
if they were part of a rename operation where only one directory
in the rename was in the fsync log. This patch adds a few new rules
to the tree logging.
1) on rename or unlink, if the inode being unlinked isn't in the fsync
log, we must force a full commit before doing an fsync of the directory
where the unlink was done. The commit isn't done during the unlink,
but it is forced the next time we try to log the parent directory.
Solution: record transid of last unlink/rename per directory when the
directory wasn't already logged. For renames this is only done when
renaming to a different directory.
mkdir foo/some_dir
normal commit
rename foo/some_dir foo2/some_dir
mkdir foo/some_dir
fsync foo/some_dir/some_file
The fsync above will unlink the original some_dir without recording
it in its new location (foo2). After a crash, some_dir will be gone
unless the fsync of some_file forces a full commit
2) we must log any new names for any file or dir that is in the fsync
log. This way we make sure not to lose files that are unlinked during
the same transaction.
2a) we must log any new names for any file or dir during rename
when the directory they are being removed from was logged.
2a is actually the more important variant. Without the extra logging
a crash might unlink the old name without recreating the new one
3) after a crash, we must go through any directories with a link count
of zero and redo the rm -rf
mkdir f1/foo
normal commit
rm -rf f1/foo
fsync(f1)
The directory f1 was fully removed from the FS, but fsync was never
called on f1, only its parent dir. After a crash the rm -rf must
be replayed. This must be able to recurse down the entire
directory tree. The inode link count fixup code takes care of the
ugly details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
During log replay, inodes are copied from the log to the main filesystem
btrees. Sometimes they have a zero link count in the log but they actually
gain links during the replay or have some in the main btree.
This patch updates the link count to be at least one after copying the
inode out of the log. This makes sure the inode is deleted during an
iput while the rest of the replay code is still working on it.
The log replay has fixup code to make sure that link counts are correct
at the end of the replay, so we could use any non-zero number here and
it would work fine.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The delayed reference mechanism is responsible for all updates to the
extent allocation trees, including those updates created while processing
the delayed references.
This commit tries to limit the amount of work that gets created during
the final run of delayed refs before a commit. It avoids cowing new blocks
unless it is required to finish the commit, and so it avoids new allocations
that were not really required.
The goal is to avoid infinite loops where we are always making more work
on the final run of delayed refs. Over the long term we'll make a
special log for the last delayed ref updates as well.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reads in blocks in the checksum btree before starting the
transaction in btrfs_finish_ordered_io. It makes it much more likely
we'll be able to do operations inside the transaction without
needing any btree reads, which limits transaction latencies overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_mark_buffer dirty would set dirty bits in the extent_io tree
for the buffers it was dirtying. This may require a kmalloc and it
was not atomic. So, anyone who called btrfs_mark_buffer_dirty had to
set any btree locks they were holding to blocking first.
This commit changes dirty tracking for extent buffers to just use a flag
in the extent buffer. Now that we have one and only one extent buffer
per page, this can be safely done without losing dirty bits along the way.
This also introduces a path->leave_spinning flag that callers of
btrfs_search_slot can use to indicate they will properly deal with a
path returned where all the locks are spinning instead of blocking.
Many of the btree search callers now expect spinning paths,
resulting in better btree concurrency overall.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commits are fairly expensive, and so btrfs has code to sit around for a while
during the commit and let new writers come in.
But, while we're sitting there, new delayed refs might be added, and those
can be expensive to process as well. Unless the transaction is very very
young, it makes sense to go ahead and let the commit finish without hanging
around.
The commit grow loop isn't as important as it used to be, the fsync logging
code handles most performance critical syncs now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reduces contention on the extent buffer spin locks by testing for a
blocking lock before trying to take the spinlock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The fs/btrfs/inode.c code to run delayed allocation during writout
needed some stack usage optimization. This is the first pass, it does
the check for compression earlier on, which allows us to do the common
(no compression) case higher up in the call chain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To avoid deadlocks and reduce latencies during some critical operations, some
transaction writers are allowed to jump into the running transaction and make
it run a little longer, while others sit around and wait for the commit to
finish.
This is a bit unfair, especially when the callers that jump in do a bunch
of IO that makes all the others procs on the box wait. This commit
reduces the stalls this produces by pre-reading file extent pointers
during btrfs_finish_ordered_io before the transaction is joined.
It also tunes the drop_snapshot code to politely wait for transactions
that have started writing out their delayed refs to finish. This avoids
new delayed refs being flooded into the queue while we're trying to
close off the transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The delayed reference queue maintains pending operations that need to
be done to the extent allocation tree. These are processed by
finding records in the tree that are not currently being processed one at
a time.
This is slow because it uses lots of time searching through the rbtree
and because it creates lock contention on the extent allocation tree
when lots of different procs are running delayed refs at the same time.
This commit changes things to grab a cluster of refs for processing,
using a cursor into the rbtree as the starting point of the next search.
This way we walk smoothly through the rbtree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When extents are freed, it is likely that we've removed the last
delayed reference update for the extent. This checks the delayed
ref tree when things are freed, and if no ref updates area left it
immediately processes the delayed ref.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Many of the tree balancing functions follow the same pattern.
1) cow a block
2) do something to the result
This commit breaks them up into two functions so the variables and
code required for part two don't suck down stack during part one.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent allocation tree maintains a reference count and full
back reference information for every extent allocated in the
filesystem. For subvolume and snapshot trees, every time
a block goes through COW, the new copy of the block adds a reference
on every block it points to.
If a btree node points to 150 leaves, then the COW code needs to go
and add backrefs on 150 different extents, which might be spread all
over the extent allocation tree.
These updates currently happen during btrfs_cow_block, and most COWs
happen during btrfs_search_slot. btrfs_search_slot has locks held
on both the parent and the node we are COWing, and so we really want
to avoid IO during the COW if we can.
This commit adds an rbtree of pending reference count updates and extent
allocations. The tree is ordered by byte number of the extent and byte number
of the parent for the back reference. The tree allows us to:
1) Modify back references in something close to disk order, reducing seeks
2) Significantly reduce the number of modifications made as block pointers
are balanced around
3) Do all of the extent insertion and back reference modifications outside
of the performance critical btrfs_search_slot code.
#3 has the added benefit of greatly reducing the btrfs stack footprint.
The extent allocation tree modifications are done without the deep
(and somewhat recursive) call chains used in the past.
These delayed back reference updates must be done before the transaction
commits, and so the rbtree is tied to the transaction. Throttling is
implemented to help keep the queue of backrefs at a reasonable size.
Since there was a similar mechanism in place for the extent tree
extents, that is removed and replaced by the delayed reference tree.
Yan Zheng <yan.zheng@oracle.com> helped review and fixup this code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order to avoid doing expensive extent management with tree locks held,
btrfs_search_slot will preallocate tree blocks for use by COW without
any tree locks held.
A later commit moves all of the extent allocation work for COW into
a delayed update mechanism, and this preallocation will no longer be
required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The full flag on the space info structs tells the allocator not to try
and allocate more chunks because the devices in the FS are fully allocated.
When more devices are added, we need to clear the full flag so the allocator
knows it has more space available.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Storage allocated to different raid levels in btrfs is tracked by
a btrfs_space_info structure, and all of the current space_infos are
collected into a list_head.
Most filesystems have 3 or 4 of these structs total, and the list is
only changed when new raid levels are added or at unmount time.
This commit adds rcu locking on the list head, and properly frees
things at unmount time. It also clears the space_info->full flag
whenever new space is added to the FS.
The locking for the space info list goes like this:
reads: protected by rcu_read_lock()
writes: protected by the chunk_mutex
At unmount time we don't need special locking because all the readers
are gone.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_tree_locked was being used to make sure a given extent_buffer was
properly locked in a few places. But, it wasn't correct for UP compiled
kernels.
This switches it to using assert_spin_locked instead, and renames it to
btrfs_assert_tree_locked to better reflect how it was really being used.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a problem where we could return -ENOSPC when we may actually have
plenty of space, the space is just pinned. Instead of returning -ENOSPC
immediately, commit the transaction first and then try and do the allocation
again.
This patch also does chunk allocation for metadata if we pass the 80%
threshold for metadata space. This will help with stack usage since the chunk
allocation will happen early on, instead of when the allocation is happening.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This is a step in the direction of better -ENOSPC handling. Instead of
checking the global bytes counter we check the space_info bytes counters to
make sure we have enough space.
If we don't we go ahead and try to allocate a new chunk, and then if that fails
we return -ENOSPC. This patch adds two counters to btrfs_space_info,
bytes_delalloc and bytes_may_use.
bytes_delalloc account for extents we've actually setup for delalloc and will
be allocated at some point down the line.
bytes_may_use is to keep track of how many bytes we may use for delalloc at
some point. When we actually set the extent_bit for the delalloc bytes we
subtract the reserved bytes from the bytes_may_use counter. This keeps us from
not actually being able to allocate space for any delalloc bytes.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
btrfs_record_root_in_trans needs the trans_mutex held to make sure two
callers don't race to setup the root in a given transaction. This adds
it to all the places that were missing it.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Btrfs is currently using spin_lock_nested with a nested value based
on the tree depth of the block. But, this doesn't quite work because
the max tree depth is bigger than what spin_lock_nested can deal with,
and because locks are sometimes taken before the level field is filled in.
The solution here is to use lockdep_set_class_and_name instead, and to
set the class before unlocking the pages when the block is read from the
disk and just after init of a freshly allocated tree block.
btrfs_clear_path_blocking is also changed to take the locks in the proper
order, and it also makes sure all the locks currently held are properly
set to blocking before it tries to retake the spinlocks. Otherwise, lockdep
gets upset about bad lock orderin.
The lockdep magic cam from Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The call to kzalloc is followed by a kmalloc whose result is stored in the
same variable.
The semantic match that finds the problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
statement S;
expression E;
identifier f,l;
position p1,p2;
expression *ptr != NULL;
@@
(
if ((x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...)) == NULL) S
|
x@p1 = \(kmalloc\|kzalloc\|kcalloc\)(...);
...
if (x == NULL) S
)
<... when != x
when != if (...) { <+...x...+> }
x->f = E
...>
(
return \(0\|<+...x...+>\|ptr\);
|
return@p2 ...;
)
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
print "* file: %s kmalloc %s return %s" % (p1[0].file,p1[0].line,p2[0].line)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_init_path was initially used when the path objects were on the
stack. Now all the work is done by btrfs_alloc_path and btrfs_init_path
isn't required.
This patch removes it, and just uses kmem_cache_zalloc to zero out the object.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_releasepage may call kmem_cache_alloc indirectly,
and provide same GFP flags it gets to kmem_cache_alloc.
So it's possible to use __GFP_HIGHMEM with the slab
allocator.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Cleaning old snapshots can make sync(1) somewhat slow, and some users
and applications still use it in a global fsync kind of workload.
This patch changes btrfs not to clean old snapshots during sync, which is
safe from a FS consistency point of view. The major downside is that it
makes it difficult to tell when old snapshots have been reaped and
the space they were using has been reclaimed. A new ioctl will be added
for this purpose instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Larger metadata clusters can significantly improve writeback performance
on ssd drives with large erasure blocks. The larger clusters make it
more likely a given IO will completely overwrite the ssd block, so it
doesn't have to do an internal rwm cycle.
On spinning media, lager metadata clusters end up spreading out the
metadata more over time, which makes fsck slower, so we don't want this
to be the default.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs wasn't parsing any new mount options during remount, making it
difficult to set mount options on a root drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Theres a slight problem with finish_current_insert, if we set all to 1 and then
go through and don't actually skip any of the extents on the pending list, we
could exit right after we've added new extents.
This is a problem because by inserting the new extents we could have gotten new
COW's to happen and such, so we may have some pending updates to do or even
more inserts to do after that.
So this patch will only exit if we have never skipped any of the extents in the
pending list, and we have no extents to insert, this will make sure that all of
the pending work is truly done before we return. I've been running with this
patch for a few days with all of my other testing and have not seen issues.
Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Btrfs was using spin_is_contended to see if it should drop locks before
doing extent allocations during btrfs_search_slot. The idea was to avoid
expensive searches in the tree unless the lock was actually contended.
But, spin_is_contended is specific to the ticket spinlocks on x86, so this
is causing compile errors everywhere else.
In practice, the contention could easily appear some time after we started
doing the extent allocation, and it makes more sense to always drop the lock
instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The S_ISGID check in btrfs_new_inode caused an oops during subvol creation
because sometimes the dir is null.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On fast devices that go from congested to uncongested very quickly, pdflush
is waiting too often in congestion_wait, and the FS is backing off to
easily in write_cache_pages.
For now, fix this on the btrfs side by only checking congestion after
some bios have already gone down. Longer term a real fix is needed
for pdflush, but that is a larger project.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Whenever an item deletion is done, we need to balance all the nodes
in the tree to make sure we don't end up with an empty node if a pointer
is deleted. This balance prep happens from the root of the tree down
so we can drop our locks as we go.
reada_for_balance was triggering read-ahead on neighboring nodes even
when no balancing was required. This adds an extra check to avoid
calling balance_level() and avoid reada_for_balance() when a balance
won't be required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_unlock_up_safe would break out at the first NULL node entry or
unlocked node it found in the path.
Some of the callers have missing nodes at the lower levels of the path, so this
commit fixes things to check all the nodes in the path before returning.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_del_leaf does two things. First it removes the pointer in the
parent, and then it frees the block that has the leaf. It has the
parent node locked for both operations.
But, it only needs the parent locked while it is deleting the pointer.
After that it can safely free the block without the parent locked.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_truncate_inode_items is setup to stop doing btree searches when
it has finished removing the items for the inode. It used to detect the
end of the inode by looking for an objectid that didn't match the
one we were searching for.
But, this would result in an extra search through the btree, which
adds extra balancing and cow costs to the operation.
This commit adds a check to see if we found the inode item, which means
we can stop searching early.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The compression code had some checks to make sure we were only
compressing bytes inside of i_size, but it wasn't catching every
case. To make things worse, some incorrect math about the number
of bytes remaining would make it try to compress more pages than the
file really had.
The fix used here is to fall back to the non-compression code in this
case, which does all the proper cleanup of delalloc and other accounting.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With selinux on we end up calling __btrfs_setxattr when we create an inode,
which calls btrfs_start_transaction(). The problem is we've already called
that in btrfs_new_inode, and in btrfs_start_transaction we end up doing a
wait_current_trans(). If btrfs-transaction has started committing it will wait
for all handles to finish, while the other process is waiting for the
transaction to commit. This is fixed by using btrfs_join_transaction, which
won't wait for the transaction to commit. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Before this patch, new files/dirs would ignore the SGID bit on their
parent directory and always be owned by the creating user's uid/gid.
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Every transaction in btrfs creates a new snapshot, and then schedules the
snapshot from the last transaction for deletion. Snapshot deletion
works by walking down the btree and dropping the reference counts
on each btree block during the walk.
If if a given leaf or node has a reference count greater than one,
the reference count is decremented and the subtree pointed to by that
node is ignored.
If the reference count is one, walking continues down into that node
or leaf, and the references of everything it points to are decremented.
The old code would try to work in small pieces, walking down the tree
until it found the lowest leaf or node to free and then returning. This
was very friendly to the rest of the FS because it didn't have a huge
impact on other operations.
But it wouldn't always keep up with the rate that new commits added new
snapshots for deletion, and it wasn't very optimal for the extent
allocation tree because it wasn't finding leaves that were close together
on disk and processing them at the same time.
This changes things to walk down to a level 1 node and then process it
in bulk. All the leaf pointers are sorted and the leaves are dropped
in order based on their extent number.
The extent allocation tree and commit code are now fast enough for
this kind of bulk processing to work without slowing the rest of the FS
down. Overall it does less IO and is better able to keep up with
snapshot deletions under high load.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Most of the btrfs metadata operations can be protected by a spinlock,
but some operations still need to schedule.
So far, btrfs has been using a mutex along with a trylock loop,
most of the time it is able to avoid going for the full mutex, so
the trylock loop is a big performance gain.
This commit is step one for getting rid of the blocking locks entirely.
btrfs_tree_lock takes a spinlock, and the code explicitly switches
to a blocking lock when it starts an operation that can schedule.
We'll be able get rid of the blocking locks in smaller pieces over time.
Tracing allows us to find the most common cause of blocking, so we
can start with the hot spots first.
The basic idea is:
btrfs_tree_lock() returns with the spin lock held
btrfs_set_lock_blocking() sets the EXTENT_BUFFER_BLOCKING bit in
the extent buffer flags, and then drops the spin lock. The buffer is
still considered locked by all of the btrfs code.
If btrfs_tree_lock gets the spinlock but finds the blocking bit set, it drops
the spin lock and waits on a wait queue for the blocking bit to go away.
Much of the code that needs to set the blocking bit finishes without actually
blocking a good percentage of the time. So, an adaptive spin is still
used against the blocking bit to avoid very high context switch rates.
btrfs_clear_lock_blocking() clears the blocking bit and returns
with the spinlock held again.
btrfs_tree_unlock() can be called on either blocking or spinning locks,
it does the right thing based on the blocking bit.
ctree.c has a helper function to set/clear all the locked buffers in a
path as blocking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before metadata is written to disk, it is updated to reflect that writeout
has begun. Once this update is done, the block must be cow'd before it
can be modified again.
This update was originally synchronized by using a per-fs spinlock. Today
the buffers for the metadata blocks are locked before writeout begins,
and everyone that tests the flag has the buffer locked as well.
So, the per-fs spinlock (called hash_lock for no good reason) is no
longer required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_io.c has debugging code to report and free leaked extent_state
and extent_buffer objects at rmmod time. This helps track down
leaks and it saves you from rebooting just to properly remove the
kmem_cache object.
But, the code runs under a fairly expensive spinlock and the checks to
see if it is currently enabled are not entirely consistent. Some use
#ifdef and some #if.
This changes everything to #if and disables the leak checking.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a block goes through cow, we update the reference counts of
everything that block points to. The internal pointers of the block
can be in just about any order, and it is likely to have clusters of
things that are close together and clusters of things that are not.
To help reduce the seeks that come with updating all of these reference
counts, sort them by byte number before actual updates are done.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tracing shows the delay between when an async thread goes to sleep
and when more work is added is often very short. This commit adds
a little bit of delay and extra checking to the code right before
we schedule out.
It allows more work to be added to the worker
without requiring notifications from other procs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add call to LSM security initialization and save
resulting security xattr for new inodes.
Add xattr support to symlink inode ops.
Set inode->i_op for existing special files.
Signed-off-by: jim owens <jowens@hp.com>
This patch adds a menu entry to kconfig to enable acls for btrfs.
This allows you to enable FS_POSIX_ACL at kernel compile time.
(updated by Jeff Mahoney to make the changes in fs/btrfs/Kconfig instead)
Signed-off-by: Christian Hesse <mail@earthworm.de>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The async bio submission thread was missing some bios that were
added after it had decided there was no work left to do.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After btrfs_readdir has gone through all the directory items, it
sets the directory f_pos to the largest possible int. This way
applications that mix readdir with creating new files don't
end up in an endless loop finding the new directory items as they go.
It was a workaround for a bug in git, but the assumption was that if git
could make this looping mistake than it would be a common problem.
The largest possible int chosen was INT_LIMIT(typeof(file->f_pos),
and it is possible for that to be a larger number than 32 bit glibc
expects to come out of readdir.
This patches switches that to INT_LIMIT(off_t), which should keep
applications happy on 32 and 64 bit machines.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Just before reading a leaf, btrfs scans the node for blocks that are
close by and reads them too. It tries to build up a large window
of IO looking for blocks that are within a max distance from the top
and bottom of the IO window.
This patch changes things to just look for blocks within 64k of the
target block. It will trigger less IO and make for lower latencies on
the read size.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Now that bmap support is gone, this is the only way to get extent
mappings for userland. These are still not valid for IO, but they
can tell us if a file has holes or how much fragmentation there is.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Swapfiles use bmap to build a list of extents belonging to the file,
and they assume these extents won't change over the life of the file.
They also use resulting list to do IO directly to the block device.
This causes problems for btrfs in a few ways:
btrfs returns logical block numbers through bmap, and these are not suitable
for IO. They might translate to different devices, raid etc.
COW means that file block mappings are going to change frequently.
Using swapfiles on btrfs will lead to corruption, so we're avoiding the
problem for now by dropping bmap support entirely. A later commit
will add fiemap support for people that really want to know how
a file is laid out.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To improve performance, btrfs_sync_log merges tree log sync
requests. But it wrongly merges sync requests for different
tree logs. If multiple tree logs are synced at the same time,
only one of them actually gets synced.
This patch has following changes to fix the bug:
Move most tree log related fields in btrfs_fs_info to
btrfs_root. This allows merging sync requests separately
for each tree log.
Don't insert root item into the log root tree immediately
after log tree is allocated. Root item for log tree is
inserted when log tree get synced for the first time. This
allows syncing the log root tree without first syncing all
log trees.
At tree-log sync, btrfs_sync_log first sync the log tree;
then updates corresponding root item in the log root tree;
sync the log root tree; then update the super block.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
replace_one_extent searches tree leaves for references to a given extent. It
stops searching if it goes beyond the last possible position.
The last possible position is computed by adding the starting offset of a found
file extent to the full size of the extent. The code uses physical size of the
extent as the full size. This is incorrect when compression is used.
The fix is get the full size from ram_bytes field of file extent item.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Change one typedef to a regular enum, and remove an unused one.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_extent_post_op calls finish_current_insert and del_pending_extents. They
both may enter infinite loops.
finish_current_insert enters infinite loop if it only finds some backrefs to
update. The fix is to check for pending backref updates before restarting the
loop.
The infinite loop in del_pending_extents is due to a the skipped variable
not being properly reset before looping around.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Merge list_for_each* and list_entry to list_for_each_entry*
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
kthread_run() returns the kthread or ERR_PTR(-ENOMEM), not NULL.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The "devid <xxx> transid <xxx>" printk in btrfs_scan_one_device()
actually follows another printk that doesn't end in a newline (since the
intention is for the two printks to make one line of output), so the
KERN_INFO just ends up messing up the output:
device label exp <6>devid 1 transid 9 /dev/sda5
Fix this by changing the extra KERN_INFO to KERN_CONT.
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Andrew's review of the xattr code revealed some minor issues that this patch
addresses. Just an error return fix, got rid of a useless statement and
commented one of the trickier parts of __btrfs_getxattr.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- Remove the unused local variable 'len';
- Check return value of kmalloc().
Signed-off-by: Wang Cong <wangcong@zeuux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix ioctl arg size (userland incompatible change!)
Btrfs: Clear the device->running_pending flag before bailing on congestion
The structure used to send device in btrfs ioctl calls was not
properly aligned, and so 32 bit ioctls would not work properly on
64 bit kernels.
We could fix this with compat ioctls, but we're just one byte away
and it doesn't make sense at this stage to carry about the compat ioctls
forever at this stage in the project.
This patch brings the ioctl arg up to an evenly aligned 4k.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs maintains a queue of async bio submissions so the checksumming
threads don't have to wait on get_request_wait. In order to avoid
extra wakeups, this code has a running_pending flag that is used
to tell new submissions they don't need to wake the thread.
When the threads notice congestion on a single device, they
may decide to requeue the job and move on to other devices. This
makes sure the running_pending flag is cleared before the
job is requeued.
It should help avoid IO stalls by making sure the task is woken up
when new submissions come in.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use the standard magic.h for btrfs and squashfs.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Cc: Phillip Lougher <phillip@lougher.demon.co.uk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c4be0c1dc4 added the ability for
write_super_lockfs to return errors, and renamed them to match. But
btrfs didn't get converted.
Do the minimal conversion to make it compile again.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Each subvolume has an extent_state_tree used to mark metadata
that needs to be sent to disk while syncing the tree. This is
used in addition to the dirty bits on the pages themselves so that
a single subvolume can be sent to disk efficiently in disk order.
Normally this marking happens in btrfs_alloc_free_block, which also does
special recording of dirty tree blocks for the tree log roots.
Yan Zheng noticed that when the root of the log tree is allocated, it is added
to the wrong writeback list. The fix used here is to explicitly set
it dirty as part of tree log creation.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checksum verification happens in a helper thread, and there is no
need to mess with interrupts. This switches to kmap() instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch contains following things.
1) Limit the max size of btrfs_ordered_sum structure to PAGE_SIZE. This
struct is kmalloced so we want to keep it reasonable.
2) Replace copy_extent_csums by btrfs_lookup_csums_range. This was
duplicated code in tree-log.c
3) Remove replay_one_csum. csum items are replayed at the same time as
replaying file extents. This guarantees we only replay useful csums.
4) nbytes accounting fix.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
btrfs_drop_extents doesn't change file extent's ram_bytes
in the case of booked extent. To be consistent, we should
also not change ram_bytes when truncating existing extent.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Snapshot creation happens at a specific time during transaction commit. We
need to make sure the code called by snapshot creation doesn't wait
for the running transaction to commit.
This changes btrfs_delete_inode and finish_pending_snaps to use
btrfs_join_transaction instead of btrfs_start_transaction to avoid deadlocks.
It would be better if btrfs_delete_inode didn't use the join, but the
call path that triggers it is:
btrfs_commit_transaction->create_pending_snapshots->
create_pending_snapshot->btrfs_lookup_dentry->
fixup_tree_root_location->btrfs_read_fs_root->
btrfs_read_fs_root_no_name->btrfs_orphan_cleanup->iput
This will be fixed in a later patch by moving the orphan cleanup to the
cleaner thread.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is a patch to fix discard semantic to make Btrfs work with FTL and SSD.
We can improve FTL's performance by telling it which sectors are freed by file
system. But if we don't tell FTL the information of free sectors in proper
time, the transaction mechanism of Btrfs will be destroyed and Btrfs could not
roll back the previous transaction under the power loss condition.
There are some problems in the old implementation:
1, In __free_extent(), the pinned down extents should not be discarded.
2, In free_extents(), the free extents are all pinned, so they need to
be discarded in transaction committing time instead of free_extents().
3, The reserved extent used by log tree should be discard too.
This patch change discard behavior as follows:
1, For the extents which need to be free at once,
we discard them in update_block_group().
2, Delay discarding the pinned extent in btrfs_finish_extent_commit()
when committing transaction.
3, Remove discarding from free_extents() and __free_extent()
4, Add discard interface into btrfs_free_reserved_extent()
5, Discard sectors before updating the free space cache, otherwise,
FTL will destroy file system data.
drop_one_dir_item does not properly update inode's link count. It can be
reproduced by executing following commands:
#touch test
#sync
#rm -f test
#dd if=/dev/zero bs=4k count=1 of=test conv=fsync
#echo b > /proc/sysrq-trigger
This fixes it by adding an BTRFS_ORPHAN_ITEM_KEY for the inode
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The data in fs_info->super_for_commit are zeros before the
first transaction commit. If tree log sync and system crash
both occur before the first transaction commit, super block
will get corrupted.
This fixes it by properly filling in the super_for_commit field at
open time.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
In clear_state_cb, we should check 'tree->ops->clear_bit_hook' instead
of 'tree->ops->set_bit_hook'.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Only root can add/remove devices
Only root can defrag subtrees
Only files open for writing can be defragged
Only files open for writing can be the destination for a clone
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a race in relocate_inode_pages, it happens when
find_delalloc_range finds the delalloc extent before the
boundary bit is set. Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This adds the missing block accounting code to finish_current_insert and makes
block accounting for root item properly protected by the delalloc spin lock.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch adds the missing mnt_drop_write to match
mnt_want_write in btrfs_ioctl_defrag and btrfs_ioctl_clone
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
bio_end_io for reads without checksumming on and btree writes were
happening without using async thread pools. This means the extent_io.c
code had to use spin_lock_irq and friends on the rb tree locks for
extent state.
There were some irq safe vs unsafe lock inversions between the delallock
lock and the extent state locks. This patch gets rid of them by moving
all end_io code into the thread pools.
To avoid contention and deadlocks between the data end_io processing and the
metadata end_io processing yet another thread pool is added to finish
off metadata writes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_insert_empty_items takes the space needed by the btrfs_item
structure into account when calculating the required free space.
So the tree balancing code shouldn't add sizeof(struct btrfs_item)
to the size when checking the free space. This patch removes these
superfluous additions.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Btrfs maintains a cache of blocks available for allocation in ram. The
code that frees extents was marking the extents free and then deleting
the checksum items.
This meant it was possible the extent would be reallocated before the
checksum item was actually deleted, leading to races and other
problems as the checksums were updated for the newly allocated extent.
The fix is to delete the checksum before marking the extent free.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The delalloc lock doesn't need to have irqs disabled, nobody that
changes the number of delalloc bytes in the FS is running with irqs off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The compression code was using isize to limit the amount of data it
sent through zlib. But, it wasn't properly limiting the looping to
just the pages inside i_size. The end result was trying to compress
too many pages, including those that had not been setup and properly locked
down. This made the compression code oops while trying find_get_page on a
page that didn't exist.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checksums on data can be disabled by mount option, so it's
possible some data extents don't have checksums or have
invalid checksums. This causes trouble for data relocation.
This patch contains following things to make data relocation
work.
1) make nodatasum/nodatacow mount option only affects new
files. Checksums and COW on data are only controlled by the
inode flags.
2) check the existence of checksum in the nodatacow checker.
If checksums exist, force COW the data extent. This ensure that
checksum for a given block is either valid or does not exist.
3) update data relocation code to properly handle the case
of checksum missing.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch makes seed device possible to be shared by
multiple mounted file systems. The sharing is achieved
by cloning seed device's btrfs_fs_devices structure.
Thanks you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The block group structs are referenced in many different
places, and it's not safe to free while balancing. So, those block
group structs were simply leaked instead.
This patch replaces the block group pointer in the inode with the starting byte
offset of the block group and adds reference counting to the block group
struct.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This finishes off the new checksumming code by removing csum items
for extents that are no longer in use.
The trick is doing it without racing because a single csum item may
hold csums for more than one extent. Extra checks are added to
btrfs_csum_file_blocks to make sure that we are using the correct
csum item after dropping locks.
A new btrfs_split_item is added to split a single csum item so it
can be split without dropping the leaf lock. This is used to
remove csum bytes from the middle of an item.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The fsync logging code makes sure to onl copy the relevant checksum for each
extent based on the file extent pointers it finds.
But for compressed extents, it needs to copy the checksum for the
entire extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds a sequence number to the btrfs inode that is increased on
every update. NFS will be able to use that to detect when an inode has
changed, without relying on inaccurate time fields.
While we're here, this also:
Puts reserved space into the super block and inode
Adds a log root transid to the super so we can pick the newest super
based on the fsync log as well as the main transaction ID. For now
the log root transid is always zero, but that'll get fixed.
Adds a starting offset to the dev_item. This will let us do better
alignment calculations if we know the start of a partition on the disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is possible that generic_bin_search will be called on a tree block
that has not been locked. This happens because cache_block_block skips
locking on the tree blocks.
Since the tree block isn't locked, we aren't allowed to change
the extent_buffer->map_token field. Using map_private_extent_buffer
avoids any changes to the internal extent buffer fields.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch implements superblock duplication. Superblocks
are stored at offset 16K, 64M and 256G on every devices.
Spaces used by superblocks are preserved by the allocator,
which uses a reverse mapping function to find the logical
addresses that correspond to superblocks. Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Btrfs stores checksums for each data block. Until now, they have
been stored in the subvolume trees, indexed by the inode that is
referencing the data block. This means that when we read the inode,
we've probably read in at least some checksums as well.
But, this has a few problems:
* The checksums are indexed by logical offset in the file. When
compression is on, this means we have to do the expensive checksumming
on the uncompressed data. It would be faster if we could checksum
the compressed data instead.
* If we implement encryption, we'll be checksumming the plain text and
storing that on disk. This is significantly less secure.
* For either compression or encryption, we have to get the plain text
back before we can verify the checksum as correct. This makes the raid
layer balancing and extent moving much more expensive.
* It makes the front end caching code more complex, as we have touch
the subvolume and inodes as we cache extents.
* There is potentitally one copy of the checksum in each subvolume
referencing an extent.
The solution used here is to store the extent checksums in a dedicated
tree. This allows us to index the checksums by phyiscal extent
start and length. It means:
* The checksum is against the data stored on disk, after any compression
or encryption is done.
* The checksum is stored in a central location, and can be verified without
following back references, or reading inodes.
This makes compression significantly faster by reducing the amount of
data that needs to be checksummed. It will also allow much faster
raid management code in general.
The checksums are indexed by a key with a fixed objectid (a magic value
in ctree.h) and offset set to the starting byte of the extent. This
allows us to copy the checksum items into the fsync log tree directly (or
any other tree), without having to invent a second format for them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs macros to access individual struct members on disk were
sending the same variable to functions that expected different types
of endianness. This fix explicitly creates a variable of the correct
type instead of abusing a single variable for mixed purposes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch gives us the space we will need in order to have different csum
algorithims at some point in the future. We save the csum algorithim type
in the superblock, and use those instead of define's.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This needs to be applied on top of my previous patches, but is needed for more
than just my new stuff. We're going to the wrong label when we have an error,
we try to stop the workers, but they are started below all of this code. This
fixes it so we go to the right error label and not panic when we fail one of
these cases.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This adds the necessary disk format for handling compatibility flags
in the future to handle disk format changes. We have a compat_flags,
compat_ro_flags and incompat_flags set for the super block. Compat
flags will be to hold the features that are compatible with older
versions of btrfs, compat_ro flags have features that are compatible
with older versions of btrfs if the fs is mounted read only, and
incompat_flags has features that are incompatible with older versions
of btrfs. This also axes the compat_flags field for the inode and
just makes the flags field a 64bit field, and changes the root item
flags field to 64bit.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Cleans the code up a little and also avoids a sparse warning due to the
incorrect cast in the current version of the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Provide a void __user *argp pointer so that we can avoid duplicating
the cast for various sub-command calls.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Shut up various sparse warnings about symbols that should be either
static or have their declarations in scope.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Remove unneeded debugging sanity check. It gets corrupted anyway when
multiple btrfs file systems are mounted, throwing bad warnings along the
way.
Signed-off-by: Sage Weil <sage@newdream.net>
This the lockdep complaint by having a different mutex to gaurd caching the
block group, so you don't end up with this backwards dependancy. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
The btrfs write_cache_pages call has a flush function so that it submits
the bio it has been building before it waits on any writeback pages.
This adds a check so that flush only happens on writeback pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The log replay produces dirty roots. These dirty roots
should be dropped immediately if the fs is mounted as
ro. Otherwise they can be added to the dirty root list
again when remounting the fs as rw. Thank you,
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The btrfs git kernel trees is used to build a standalone tree for
compiling against older kernels. This commit makes the standalone tree
work with 2.6.27
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* open/close_bdev_excl -> open/close_bdev_exclusive
* blkdev_issue_discard takes a GFP mask now
* Fix blkdev_issue_discard usage now that it is enabled
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes what I hope is the last early ENOSPC bug left. I did not know
that pinned extents would merge into one big extent when inserted on to the
pinned extent tree, so I was adding free space to a block group that could
possibly span multiple block groups.
This is a big issue because first that space doesn't exist in that block group,
and second we won't actually use that space because there are a bunch of other
checks to make sure we're allocating within the constraints of the block group.
This patch fixes the problem by adding the btrfs_add_free_space to
btrfs_update_pinned_extents which makes sure we are adding the appropriate
amount of free space to the appropriate block group. Thanks much to Lee Trager
for running my myriad of debug patches to help me track this problem down.
Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
fsync log replay can change the filesystem, so it cannot be delayed until
mount -o rw,remount, and it can't be forgotten entirely. So, this patch
changes btrfs to do with reiserfs, ext3 and xfs do, which is to do the
log replay even when mounted readonly.
On a readonly device if log replay is required, the mount is aborted.
Getting all of this right had required fixing up some of the error
handling in open_ctree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While building large bios in writepages, btrfs may end up waiting
for other page writeback to finish if WB_SYNC_ALL is used.
While it is waiting, the bio it is building has a number of pages with the
writeback bit set and they aren't getting to the disk any time soon. This
lowers the latencies of writeback in general by sending down the bio being
built before waiting for other pages.
The bio submission code tries to limit the total number of async bios in
flight by waiting when we're over a certain number of async bios. But,
the waits are happening while writepages is building bios, and this can easily
lead to stalls and other problems for people calling wait_on_page_writeback.
The current fix is to let the congestion tests take care of waiting.
sync() and others make sure to drain the current async requests to make
sure that everything that was pending when the sync was started really get
to disk. The code would drain pending requests both before and after
submitting a new request.
But, if one of the requests is waiting for page writeback to finish,
the draining waits might block that page writeback. This changes the
draining code to only wait after submitting the bio being processed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent based waiting was using more CPU, and other fixes have helped
with the unplug storm problems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
For larger multi-device filesystems, there was logic to limit the
number of devices unplugged to just the page that was sent to our sync_page
function.
But, the code wasn't always unplugging the right device. Since this was
just an optimization, disable it for now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In insert_extents(), when ret==1 and last is not zero, it should
check if the current inserted item is the last item in this batching
inserts. If so, it should just break from loop. If not, 'cur =
insert_list->next' will make no sense because the list is empty now,
and 'op' will point to an unexpectable place.
There are also some trivial fixs in this patch including one comment
typo error and deleting two redundant lines.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
For a directory tree:
/mnt/subvolA/subvolB
btrfsctl -s /mnt/subvolA/subvolB /mnt
Will create a directory loop with subvolA under subvolB. This
commit uses the forward refs for each subvol and snapshot to error out
before creating the loop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Subvols and snapshots can now be referenced from any point in the directory
tree. We need to maintain back refs for them so we can find lost
subvols.
Forward refs are added so that we know all of the subvols and
snapshots referenced anywhere in the directory tree of a single subvol. This
can be used to do recursive snapshotting (but they aren't yet) and it is
also used to detect and prevent directory loops when creating new snapshots.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Each subvolume has its own private inode number space, and so we need
to fill in different device numbers for each subvolume to avoid confusing
applications.
This commit puts a struct super_block into struct btrfs_root so it can
call set_anon_super() and get a different device number generated for
each root.
btrfs_rename is changed to prevent renames across subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, all snapshots and subvolumes lived in a single flat directory. This
was awkward and confusing because the single flat directory was only writable
with the ioctls.
This commit changes the ioctls to create subvols and snapshots at any
point in the directory tree. This requires making separate ioctls for
snapshot and subvol creation instead of a combining them into one.
The subvol ioctl does:
btrfsctl -S subvol_name parent_dir
After the ioctl is done subvol_name lives inside parent_dir.
The snapshot ioctl does:
btrfsctl -s path_for_snapshot root_to_snapshot
path_for_snapshot can be an absolute or relative path. btrfsctl breaks it up
into directory and basename components.
root_to_snapshot can be any file or directory in the FS. The snapshot
is taken of the entire root where that file lives.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In my batch delete/update/insert patch I introduced a free space leak. The
extent that we do the original search on in free_extents is never pinned, so we
always update the block saying that it has free space, but the free space never
actually gets added to the free space tree, since op->del will always be 0 and
it's never actually added to the pinned extents tree.
This patch fixes this problem by making sure we call pin_down_bytes on the
pending extent op and set op->del to the return value of pin_down_bytes so
update_block_group is called with the right value. This seems to fix the case
where we were getting ENOSPC when there was plenty of space available.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
In worker_loop(), the func should check whether it has been requested to stop
before it decides to schedule out.
Otherwise if the stop request(also the last wake_up()) sent by
btrfs_stop_workers() happens when worker_loop() running after the "while"
judgement and before schedule(), woker_loop() will schedule away and never be
woken up, which will also cause btrfs_stop_workers() wait forever.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When extent needs to be split, btrfs_mark_extent_written truncates the extent
first, then inserts a new extent and increases the reference count.
The race happens if someone else deletes the old extent before the new extent
is inserted. The fix here is increase the reference count in advance. This race
is similar to the race in btrfs_drop_extents that was recently fixed.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Seed device is a special btrfs with SEEDING super flag
set and can only be mounted in read-only mode. Seed
devices allow people to create new btrfs on top of it.
The new FS contains the same contents as the seed device,
but it can be mounted in read-write mode.
This patch does the following:
1) split code in btrfs_alloc_chunk into two parts. The first part does makes
the newly allocated chunk usable, but does not do any operation that modifies
the chunk tree. The second part does the the chunk tree modifications. This
division is for the bootstrap step of adding storage to the seed device.
2) Update device management code to handle seed device.
The basic idea is: For an FS grown from seed devices, its
seed devices are put into a list. Seed devices are
opened on demand at mounting time. If any seed device is
missing or has been changed, btrfs kernel module will
refuse to mount the FS.
3) make btrfs_find_block_group not return NULL when all
block groups are read-only.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch adds mount ro and remount support. The main
changes in patch are: adding btrfs_remount and related
helper function; splitting the transaction related code
out of close_ctree into btrfs_commit_super; updating
allocator to properly handle read only block group.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
While profiling the allocator I noticed a good amount of time was being spent in
finish_current_insert and del_pending_extents, and as the filesystem filled up
more and more time was being spent in those functions. This patch aims to try
and reduce that problem. This happens two ways
1) track if we tried to delete an extent that we are going to update or insert.
Once we get into finish_current_insert we discard any of the extents that were
marked for deletion. This saves us from doing unnecessary work almost every
time finish_current_insert runs.
2) Batch insertion/updates/deletions. Instead of doing a btrfs_search_slot for
each individual extent and doing the needed operation, we instead keep the leaf
around and see if there is anything else we can do on that leaf. On the insert
case I introduced a btrfs_insert_some_items, which will take an array of keys
with an array of data_sizes and try and squeeze in as many of those keys as
possible, and then return how many keys it was able to insert. In the update
case we search for an extent ref, update the ref and then loop through the leaf
to see if any of the other refs we are looking to update are on that leaf, and
then once we are done we release the path and search for the next ref we need to
update. And finally for the deletion we try and delete the extent+ref in pairs,
so we will try to find extent+ref pairs next to the extent we are trying to free
and free them in bulk if possible.
This along with the other cluster fix that Chris pushed out a bit ago helps make
the allocator preform more uniformly as it fills up the disk. There is still a
slight drop as we fill up the disk since we start having to stick new blocks in
odd places which results in more COW's than on a empty fs, but the drop is not
nearly as severe as it was before.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch adds an additional CLONE_RANGE ioctl to clone an arbitrary
(block-aligned) file range to another file. The original CLONE ioctl
becomes a special case of cloning the entire file range. The logic is a
bit more complex now since ranges may be cloned to different offsets, and
because we may only be cloning the beginning or end of a particular extent
or checksum item.
An additional sanity check ensures the source and destination files aren't
the same (which would previously deadlock), although eventually this could
be extended to allow the duplication of file data at a different offset
within the same file.
Any extents within the destination range in the target file are dropped.
We currently do not cope with the case where a compressed inline extent
needs to be split. This will probably require decompressing the extent
into a temporary address_space, and inserting just the cloned portion as a
new compressed inline extent. For now, just return -EINVAL in this case.
Note that this never comes up in the more common case of cloning an entire
file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we fail to allocate a new block group, we should still do the
checks to make sure allocations try again with the minimum requested
allocation size.
This also fixes a deadlock that come from a missed down_read in
the chunk allocation failure handling.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes latency problems on metadata reads by making sure they
don't go through the async submit queue, and by tuning down the amount
of readahead done during btree searches.
Also, the btrfs bdi congestion function is tuned to ignore the
number of pending async bios and checksums pending. There is additional
code that throttles new async bios now and the congestion function
doesn't need to worry about it anymore.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_drop_extents will drop paths and search again when it needs to
force COW of higher nodes. It was using the key it found during the last
search as the offset for the next search.
But, this wasn't always correct. The key could be from before our desired
range, and because we're dropping the path, it is possible for file's items
to change while we do the search again.
The fix here is to make sure we don't search for something smaller than
the offset btrfs_drop_extents was called with.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The allocator wasn't catching all of the cases where it needed to do
extra loops because the check to enforce them wasn't happening early
enough.
When the allocator decided to increase the size of the allocation
for metadata clustering, it wasn't always setting the empty_size to
include the extra (optional) bytes. This also fixes the empty_size field
to be correct.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs unplugs, it tries to find the correct device to unplug
via search through the extent_map tree. This avoids unplugging
a device that doesn't need it, but is a waste of time for filesystems
with a small number of devices.
This patch checks the total number of devices before doing the
search.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_io.c code has a #define to find and cleanup extent state leaks
on module unmount. This adds a very highly contended spinlock to a
hot path for most FS operations.
Turn it off by default. A later changeset will add a .config option
for it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This makes sure the orig_start field in struct extent_map gets set
everywhere the extent_map structs are created or modified.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With all the recent fixes to the delalloc locking, it is now safe
again to use invalidatepage inside the writepage code for
pages outside of i_size. This used to deadlock against some of the
code to write locked ranges of pages, but all of that has been fixed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The loop searching for free space would exit out too soon when
metadata clustering was trying to allocate a large extent. This makes
sure a full scan of the free space is done searching for only the
minimum extent size requested by the higher layers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan's fix to use the correct file offset during compressed reads used the
extent_map struct pointer after it had been freed. This saves the
fields we want for later use instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The decompress code doesn't take the logical offset in extent
pointer into account. If the logical offset isn't zero, data
will be decompressed into wrong pages.
The solution used here is to record the starting offset of the extent
in the file separately from the logical start of the extent_map struct.
This allows us to avoid problems inserting overlapping extents.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This adds a PageDirty check to the writeback path that locks pages
for delalloc. If a page wasn't dirty at this point, it is in the
process of being truncated away.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When metadata allocation clustering has to fall back to unclustered
allocs because large free areas could not be found, it was sometimes
substracting too much from the total bytes to allocate. This would
make it wrap below zero.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While doing a commit, btrfs makes sure all the metadata blocks
were properly written to disk, calling wait_on_page_writeback for
each page. This writeback happens after allowing another transaction
to start, so it competes for the disk with other processes in the FS.
If the page writeback bit is still set, each wait_on_page_writeback might
trigger an unplug, even though the page might be waiting for checksumming
to finish or might be waiting for the async work queue to submit the
bio.
This trades wait_on_page_writeback for waiting on the extent writeback
bits. It won't trigger any unplugs and substantially improves performance
in a number of workloads.
This also changes the async bio submission to avoid requeueing if there
is only one device. The requeue just wastes CPU time because there are
no other devices to service.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In comes cases the empty cluster was added twice to the total number of
bytes the allocator was trying to find.
With empty clustering on, the hint byte was sometimes outside of the
block group. Add an extra goto to find the correct block group.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When writing a compressed extent, a number of bios are created that
point to a single struct compressed_bio. At end_io time an atomic counter in
the compressed_bio struct makes sure that all of the bios have finished
before final end_io processing is done.
But when multiple bios are needed to write a compressed extent, the
counter was being incremented after the first bio was sent to submit_bio.
It is possible the bio will complete before the counter is incremented,
making the end_io handler free the compressed_bio struct before
processing is finished.
The fix is to increment the atomic counter before bio submission,
both for compressed reads and writes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This lowers the empty cluster target for metadata allocations. The lower
target makes it easier to do allocations and still seems to perform well.
It also fixes the allocator loop to drop the empty cluster when things
start getting difficult, avoiding false enospc warnings.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The allocator uses the last allocation as a starting point for metadata
allocations, and tries to allocate in clusters of at least 256k.
If the search for a free block fails to find the expected block, this patch
forces a new cluster to be found in the free list.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When reading compressed extents, try to put pages into the page cache
for any pages covered by the compressed extent that readpages didn't already
preload.
Add an async work queue to handle transformations at delayed allocation processing
time. Right now this is just compression. The workflow is:
1) Find offsets in the file marked for delayed allocation
2) Lock the pages
3) Lock the state bits
4) Call the async delalloc code
The async delalloc code clears the state lock bits and delalloc bits. It is
important this happens before the range goes into the work queue because
otherwise it might deadlock with other work queue items that try to lock
those extent bits.
The file pages are compressed, and if the compression doesn't work the
pages are written back directly.
An ordered work queue is used to make sure the inodes are written in the same
order that pdflush or writepages sent them down.
This changes extent_write_cache_pages to let the writepage function
update the wbc nr_written count.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs uses kernel threads to create async work queues for cpu intensive
operations such as checksumming and decompression. These work well,
but they make it difficult to keep IO order intact.
A single writepages call from pdflush or fsync will turn into a number
of bios, and each bio is checksummed in parallel. Once the checksum is
computed, the bio is sent down to the disk, and since we don't control
the order in which the parallel operations happen, they might go down to
the disk in almost any order.
The code deals with this somewhat by having deep work queues for a single
kernel thread, making it very likely that a single thread will process all
the bios for a single inode.
This patch introduces an explicitly ordered work queue. As work structs
are placed into the queue they are put onto the tail of a list. They have
three callbacks:
->func (cpu intensive processing here)
->ordered_func (order sensitive processing here)
->ordered_free (free the work struct, all processing is done)
The work struct has three callbacks. The func callback does the cpu intensive
work, and when it completes the work struct is marked as done.
Every time a work struct completes, the list is checked to see if the head
is marked as done. If so the ordered_func callback is used to do the
order sensitive processing and the ordered_free callback is used to do
any cleanup. Then we loop back and check the head of the list again.
This patch also changes the checksumming code to use the ordered workqueues.
One a 4 drive array, it increases streaming writes from 280MB/s to 350MB/s.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Make sure we keep page->mapping NULL on the pages we're getting
via alloc_page. It gets set so a few of the callbacks can do the right
thing, but in general these pages don't have a mapping.
Don't try to truncate compressed inline items in btrfs_drop_extents.
The whole compressed item must be preserved.
Don't try to create multipage inline compressed items. When we try to
overwrite just the first page of the file, we would have to read in and recow
all the pages after it in the same compressed inline items. For now, only
create single page inline items.
Make sure we lock pages in the correct order during delalloc. The
search into the state tree for delalloc bytes can return bytes before
the page we already have locked.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch updates btrfs-progs for fallocate support.
fallocate is a little different in Btrfs because we need to tell the
COW system that a given preallocated extent doesn't need to be
cow'd as long as there are no snapshots of it. This leverages the
-o nodatacow checks.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch simplifies the nodatacow checker. If all references
were created after the latest snapshot, then we can avoid COW
safely. This patch also updates run_delalloc_nocow to do more
fine-grained checking.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
When dropping middle part of an extent, btrfs_drop_extents truncates
the extent at first, then inserts a bookend extent.
Since truncation and insertion can't be done atomically, there is a small
period that the bookend extent isn't in the tree. This causes problem for
functions that search the tree for file extent item. The way to fix this is
lock the range of the bookend extent before truncation.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch splits the hole insertion code out of btrfs_setattr
into btrfs_cont_expand and updates btrfs_get_extent to properly
handle the case that file extent items are not continuous.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
When compression was on, we were improperly ignoring -o nodatasum. This
reworks the logic a bit to properly honor all the flags.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The byte walk counting was awkward and error prone. This uses the
number of pages sent the higher layer to build bios.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
finish_current_insert and del_pending_extents process extent tree modifications
that build up while we are changing the extent tree. It is a confusing
bit of code that prevents recursion.
Both functions run through a list of pending operations and both funcs
add to the list of pending operations. If you have two procs in either
one of them, they can end up looping forever making more work for each other.
This patch makes them walk forward through the list of pending changes instead
of always trying to process the entire list. At transaction commit
time, we catch any changes that were left over.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds transaction IDs to root tree pointers.
Transaction IDs in tree pointers are compared with the
generation numbers in block headers when reading root
blocks of trees. This can detect some types of IO errors.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch removes the giant fs_info->alloc_mutex and replaces it with a bunch
of little locks.
There is now a pinned_mutex, which is used when messing with the pinned_extents
extent io tree, and the extent_ins_mutex which is used with the pending_del and
extent_ins extent io trees.
The locking for the extent tree stuff was inspired by a patch that Yan Zheng
wrote to fix a race condition, I cleaned it up some and changed the locking
around a little bit, but the idea remains the same. Basically instead of
holding the extent_ins_mutex throughout the processing of an extent on the
extent_ins or pending_del trees, we just hold it while we're searching and when
we clear the bits on those trees, and lock the extent for the duration of the
operations on the extent.
Also to keep from getting hung up waiting to lock an extent, I've added a
try_lock_extent so if we cannot lock the extent, move on to the next one in the
tree and we'll come back to that one. I have tested this heavily and it does
not appear to break anything. This has to be applied on top of my
find_free_extent redo patch.
I tested this patch on top of Yan's space reblancing code and it worked fine.
The only thing that has changed since the last version is I pulled out all my
debugging stuff, apparently I forgot to run guilt refresh before I sent the
last patch out. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
So there is an odd case where we can possibly return -ENOSPC when there is in
fact space to be had. It only happens with Metadata writes, and happens _very_
infrequently. What has to happen is we have to allocate have allocated out of
the first logical byte on the disk, which would set last_alloc to
first_logical_byte(root, 0), so search_start == orig_search_start. We then
need to allocate for normal metadata, so BTRFS_BLOCK_GROUP_METADATA |
BTRFS_BLOCK_GROUP_DUP. We will do a block lookup for the given search_start,
block_group_bits() won't match and we'll go to choose another block group.
However because search_start matches orig_search_start we go to see if we can
allocate a chunk.
If we are in the situation that we cannot allocate a chunk, we fail and ENOSPC.
This is kind of a big flaw of the way find_free_extent works, as it along with
find_free_space loop through _all_ of the block groups, not just the ones that
we want to allocate out of. This patch completely kills find_free_space and
rolls it into find_free_extent. I've introduced a sort of state machine into
this, which will make it easier to get cache miss information out of the
allocator, and will work well with my locking changes.
The basic flow is this: We have the variable loop which is 0, meaning we are
in the hint phase. We lookup the block group for the hint, and lookup the
space_info for what we want to allocate out of. If the block group we were
pointed at by the hint either isn't of the correct type, or just doesn't have
the space we need, we set head to space_info->block_groups, so we start at the
beginning of the block groups for this particular space info, and loop through.
This is also where we add the empty_cluster to total_needed. At this point
loop is set to 1 and we just loop through all of the block groups for this
particular space_info looking for the space we need, just as find_free_space
would have done, except we only hit the block groups we want and not _all_ of
the block groups. If we come full circle we see if we can allocate a chunk.
If we cannot of course we exit with -ENOSPC and we are good. If not we start
over at space_info->block_groups and loop through again, with loop == 2. If we
come full circle and haven't found what we need then we exit with -ENOSPC.
I've been running this for a couple of days now and it seems stable, and I
haven't yet hit a -ENOSPC when there was plenty of space left.
Also I've made a groups_sem to handle the group list for the space_info. This
is part of my locking changes, but is relatively safe and seems better than
holding the space_info spinlock over that entire search time. Thanks,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
This patch improves the space balancing code to keep more sharing
of tree blocks. The only case that breaks sharing of tree blocks is
data extents get fragmented during balancing. The main changes in
this patch are:
Add a 'drop sub-tree' function. This solves the problem in old code
that BTRFS_HEADER_FLAG_WRITTEN check breaks sharing of tree block.
Remove relocation mapping tree. Relocation mappings are stored in
struct btrfs_ref_path and updated dynamically during walking up/down
the reference path. This reduces CPU usage and simplifies code.
This patch also fixes a bug. Root items for reloc trees should be
updated in btrfs_free_reloc_root.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This is a large change for adding compression on reading and writing,
both for inline and regular extents. It does some fairly large
surgery to the writeback paths.
Compression is off by default and enabled by mount -o compress. Even
when the -o compress mount option is not used, it is possible to read
compressed extents off the disk.
If compression for a given set of pages fails to make them smaller, the
file is flagged to avoid future compression attempts later.
* While finding delalloc extents, the pages are locked before being sent down
to the delalloc handler. This allows the delalloc handler to do complex things
such as cleaning the pages, marking them writeback and starting IO on their
behalf.
* Inline extents are inserted at delalloc time now. This allows us to compress
the data before inserting the inline extent, and it allows us to insert
an inline extent that spans multiple pages.
* All of the in-memory extent representations (extent_map.c, ordered-data.c etc)
are changed to record both an in-memory size and an on disk size, as well
as a flag for compression.
From a disk format point of view, the extent pointers in the file are changed
to record the on disk size of a given extent and some encoding flags.
Space in the disk format is allocated for compression encoding, as well
as encryption and a generic 'other' field. Neither the encryption or the
'other' field are currently used.
In order to limit the amount of data read for a single random read in the
file, the size of a compressed extent is limited to 128k. This is a
software only limit, the disk format supports u64 sized compressed extents.
In order to limit the ram consumed while processing extents, the uncompressed
size of a compressed extent is limited to 256k. This is a software only limit
and will be subject to tuning later.
Checksumming is still done on compressed extents, and it is done on the
uncompressed version of the data. This way additional encodings can be
layered on without having to figure out which encoding to checksum.
Compression happens at delalloc time, which is basically singled threaded because
it is usually done by a single pdflush thread. This makes it tricky to
spread the compression load across all the cpus on the box. We'll have to
look at parallel pdflush walks of dirty inodes at a later time.
Decompression is hooked into readpages and it does spread across CPUs nicely.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Sometimes we end up freeing a reserved extent because we don't need it, however
this means that its possible for transaction->last_alloc to point to the middle
of a free area.
When we search for free space in find_free_space we do a tree_search_offset
with contains set to 0, because we want it to find the next best free area if
we do not have an offset starting on the given offset.
Unfortunately that currently means that if the offset we were given as a hint
points to the middle of a free area, we won't find anything. This is especially
bad if we happened to last allocate from the big huge chunk of a newly formed
block group, since we won't find anything and have to go back and search the
long way around.
This fixes this problem by making it so that we return the free space area
regardless of the contains variable. This made cache missing happen _alot_
less, and speeds things up considerably.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Subvol creation already requires privs, and security_inode_mkdir isn't
exported. For now we don't need it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Creating a subvolume is in many ways like a normal VFS ->mkdir, and we
really need to play with the VFS topology locking rules. So instead of
just creating the snapshot on disk and then later getting rid of
confliting aliases do it correctly from the start. This will become
especially important once we allow for subvolumes anywhere in the tree,
and not just below a hidden root.
Note that snapshots will need the same treatment, but do to the delay
in creating them we can't do it currently. Chris promised to fix that
issue, so I'll wait on that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
This fixes the btrfs makefile for building in the tree and out of the tree
both as a module and static.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Due to the optimization for truncate, tree leaves only containing
checksum items can be deleted without being COW'ed first. This causes
reference cache misses. The way to fix the miss is create cache
entries for tree leaves only contain checksum.
This patch also fixes a -EEXIST issue in shared reference cache.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
The offset field in struct btrfs_extent_ref records the position
inside file that file extent is referenced by. In the new back
reference system, tree leaves holding references to file extent
are recorded explicitly. We can scan these tree leaves very quickly, so the
offset field is not required.
This patch also makes the back reference system check the objectid
when extents are in deleting.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
This patch makes btrfs count space allocated to file in bytes instead
of 512 byte sectors.
Everything else in btrfs uses a byte count instead of sector sizes or
blocks sizes, so this fits better.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
On 32 bit machines without CONFIG_LBD, the bi_sector field is only 32 bits.
Btrfs needs to cast it before shifting up, or we end up doing IO into
the wrong place.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The tree logging code was trying to separate tree log allocations
from normal metadata allocations to improve writeback patterns during
an fsync.
But, the code was not effective and ended up just mixing tree log
blocks with regular metadata. That seems to be working fairly well,
so the last_log_alloc code can be removed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reworks the btrfs O_DIRECT write code a bit. It had always fallen
back to buffered IO and done an invalidate, but needed to be updated
for the data=ordered code. The invalidate wasn't actually removing pages
because they were still inside an ordered extent.
This also combines the O_DIRECT/O_SYNC paths where possible, and kicks
off IO in the main btrfs_file_write loop to keep the pipe down the the
disk full as we process long writes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checksum items take up a significant portion of the metadata for large files.
It is possible to avoid reading them during truncates by checking the keys in
the higher level nodes.
If a given leaf is followed by another leaf where the lowest key is a checksum
item from the same file, we know we can safely delete the leaf without
reading it.
For a 32GB file on a 6 drive raid0 array, Btrfs needs 8s to delete
the file with a cold cache. It is read bound during the run.
With this change, Btrfs is able to delete the file in 0.5s
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a deadlock that happens between the alloc_mutex and chunk_mutex.
Process A comes in, decides to do a do_chunk_alloc, which takes the
chunk_mutex, and is holding the alloc_mutex because the only way you get to
do_chunk_alloc is by holding the alloc_mutex. btrfs_alloc_chunk does its thing
and goes to insert a new item, which results in a cow of the block.
We get into del_pending_extents from there, where if we need to be rescheduled
we drop the alloc_mutex and schedule. At this point process B comes in to do
an allocation and gets the alloc_mutex, and because process A did not do the
chunk allocation completely it thinks its a good time to do a chunk allocation
as well, and hangs on the chunk_mutex.
Process A wakes up and tries to take the alloc_mutex and cannot. The way to
fix this is do a mutex_trylock() on chunk_mutex. If we return 0 we didn't get
the lock, and if this is just a "hey it may be a good time to allocate a chunk"
then we just exit. If we are trying to force an allocation then we reschedule
and keep trying to acquire the chunk_mutex. If once we acquire it the space is
already full then we can just exit, otherwise we can continue with the chunk
allocation. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
When reading in block groups, a global mask of the available raid policies
should be adjusted based on the types of block groups found on disk. This
global mask is then used to decide which raid policy to use for new
block groups.
The recent allocator changes dropped the call that updated the global
mask, making all the block groups allocated at run time single striped
onto a single drive.
This also fixes the async worker threads to set any thread that uses
the requeue mechanism as busy. This allows us to avoid blocking
on get_request_wait for the async bio submission threads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes a problem where we end up seeking too much when *last_ptr is
valid. This happens because btrfs_lookup_first_block_group only returns a
block group that starts on or after the given search start, so if the
search_start is in the middle of a block group it will return the block group
after the given search_start, which is suboptimal.
This patch fixes that by doing a btrfs_lookup_block_group, which will return
the block group that contains the given search start. If we fail to find a
block group, we fall back on btrfs_lookup_first_block_group so we can find the
next block group, not sure if this is absolutely needed, but better safe than
sorry.
Also if we can't find the block group that we need, or it happens to not be of
the right type, we need to add empty_cluster since *last_ptr could point to a
mismatched block group, which means we need to start over with empty_cluster
added to total needed. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This improves the comments at the top of many functions. It didn't
dive into the guts of functions because I was trying to
avoid merging problems with the new allocator and back reference work.
extent-tree.c and volumes.c were both skipped, and there is definitely
more work todo in cleaning and commenting the code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_add_leaf_ref was doing checks on the objects it found in the
rbtree to make sure they were properly linked into the tree. But, the field
it was checking can be safely changed outside of the tree spin lock.
The WARN_ON was for debugging the initial implementation and can be
safely removed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs-vol -a /dev/xxx will zero the first and last two MB of the device.
The kernel code needs to wait for this IO to finish before it adds
the device.
btrfs metadata IO does not happen through the block device inode. A
separate address space is used, allowing the zero filled buffer heads in
the block device inode to be written to disk after FS metadata starts
going down to the disk via the btrfs metadata inode.
The end result is zero filled metadata blocks after adding new devices
into the filesystem.
The fix is a simple filemap_write_and_wait on the block device inode
before actually inserting it into the pool of available devices.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch updates the space balancing code to utilize the new
backref format. Before, btrfs-vol -b would break any COW links
on data blocks or metadata. This was slow and caused the amount
of space used to explode if a large number of snapshots were present.
The new code can keeps the sharing of all data extents and
most of the tree blocks.
To maintain the sharing of data extents, the space balance code uses
a seperate inode hold data extent pointers, then updates the references
to point to the new location.
To maintain the sharing of tree blocks, the space balance code uses
reloc trees to relocate tree blocks in reference counted roots.
There is one reloc tree for each subvol, and all reloc trees share
same root key objectid. Reloc trees are snapshots of the latest
committed roots of subvols (root->commit_root).
To relocate a tree block referenced by a subvol, there are two steps.
COW the block through subvol's reloc tree, then update block pointer in
the subvol to point to the new block. Since all reloc trees share
same root key objectid, doing special handing for tree blocks
owned by them is easy. Once a tree block has been COWed in one
reloc tree, we can use the resulting new block directly when the
same block is required to COW again through other reloc trees.
In this way, relocated tree blocks are shared between reloc trees,
so they are also shared between subvols.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Add an EXTENT_BOUNDARY state bit to keep the writepage code
from merging data extents that are in the process of being
relocated. This allows us to do accounting for them properly.
* The balancing code relocates data extents indepdent of the underlying
inode. The extent_map code was modified to properly account for
things moving around (invalidating extent_map caches in the inode).
* Don't take the drop_mutex in the create_subvol ioctl. It isn't
required.
* Fix walking of the ordered extent list to avoid races with sys_unlink
* Change the lock ordering rules. Transaction start goes outside
the drop_mutex. This allows btrfs_commit_transaction to directly
drop the relocation trees.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs has a cache of reference counts in leaves, allowing it to
avoid reading tree leaves while deleting snapshots. To reduce
contention with multiple subvolumes, this cache is private to each
subvolume.
This patch adds shared reference cache support. The new space
balancing code plays with multiple subvols at the same time, So
the old per-subvol reference cache is not well suited.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Reserved extent accounting: reserved extents have been
allocated in the rbtrees that track free space but have not
been allocated on disk. They were never properly accounted for
in the past, making it hard to know how much space was really free.
* btrfs_find_block_group used to return NULL for block groups that
had been removed by the space balancing code. This made it hard
to account for space during the final stages of a balance run.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs metadata writeback is fairly expensive. Once a tree block is written
it must be cowed before it can be changed again. The btree writepages
code has a threshold based on a count of dirty btree bytes which is
updated as IO is sent out.
This changes btree_writepages to skip the writeout if there are less
than 32MB of dirty bytes from the btrees, improving performance
across many workloads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The code to free block groups needs to drop the space info spin lock
before calling btrfs_remove_free_space_cache (which can schedule).
This is safe because at unmount time, nobody else is going to play
with the block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs had compatibility code for kernels back to 2.6.18. These have
been removed, and will be maintained in a separate backport
git tree from now on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After a crash, the tree log code uses btrfs_alloc_logged_extent to
record allocations of data extents that it finds in the log tree. These
come in basically random order, which does not fit how
btrfs_remove_free_space() expects to be called.
btrfs_remove_free_space was changed to support recording an extent
allocation in the middle of a region of free space.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs had magic to put the chagneset id into a printk on module load.
This removes that from the Makefile and hardcodes the printk to print
"Btrfs"
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The code to update the on disk i_size happens before the
ordered_extent record is removed. So, it is possible for multiple
ordered_extent completion routines to run at the same time, and to
find each other in the ordered tree.
The end result is they both decide not to update disk_i_size, leaving
it too small. This temporary fix just puts the updates inside
the extent_mutex. A real solution would be stronger ordering of
disk_i_size updates against removing the ordered extent from the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch makes the back reference system to explicit record the
location of parent node for all types of extents. The location of
parent node is placed into the offset field of backref key. Every
time a tree block is balanced, the back references for the affected
lower level extents are updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tree log blocks are only reserved, and should not ever get fully
allocated on disk. This check makes sure they stay out of the
extent tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tree blocks were using async bio submission, but the sum was still
being done directly during writepage. This moves the checksumming
into the worker thread.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
1) replace the per fs_info extent_io_tree that tracked free space with two
rb-trees per block group to track free space areas via offset and size. The
reason to do this is because most allocations come with a hint byte where to
start, so we can usually find a chunk of free space at that hint byte to satisfy
the allocation and get good space packing. If we cannot find free space at or
after the given offset we fall back on looking for a chunk of the given size as
close to that given offset as possible. When we fall back on the size search we
also try to find a slot as close to the size we want as possible, to avoid
breaking small chunks off of huge areas if possible.
2) remove the extent_io_tree that tracked the block group cache from fs_info and
replaced it with an rb-tree thats tracks block group cache via offset. also
added a per space_info list that tracks the block group cache for the particular
space so we can lookup related block groups easily.
3) cleaned up the allocation code to make it a little easier to read and a
little less complicated. Basically there are 3 steps, first look from our
provided hint. If we couldn't find from that given hint, start back at our
original search start and look for space from there. If that fails try to
allocate space if we can and start looking again. If not we're screwed and need
to start over again.
4) small fixes. there were some issues in volumes.c where we wouldn't allocate
the rest of the disk. fixed cow_file_range to actually pass the alloc_hint,
which has helped a good bit in making the fs_mark test I run have semi-normal
results as we run out of space. Generally with data allocations we don't track
where we last allocated from, so everytime we did a data allocation we'd search
through every block group that we have looking for free space. Now searching a
block group with no free space isn't terribly time consuming, it was causing a
slight degradation as we got more data block groups. The alloc_hint has fixed
this slight degredation and made things semi-normal.
There is still one nagging problem I'm working on where we will get ENOSPC when
there is definitely plenty of space. This only happens with metadata
allocations, and only when we are almost full. So you generally hit the 85%
mark first, but sometimes you'll hit the BUG before you hit the 85% wall. I'm
still tracking it down, but until then this seems to be pretty stable and make a
significant performance gain.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
cache block group had a few bugs in the error handling code,
this makes sure paths get properly released and the correct return value
goes out.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is the same way the transaction code makes sure that all the
other tree blocks are safely on disk. There's an extent_io tree
for each root, and any blocks allocated to the tree logs are
recorded in that tree.
At tree-log sync, the extent_io tree is walked to flush down the
dirty pages and wait for them.
The main benefit is less time spent walking the tree log and skipping
clean pages, and getting sequential IO down to the drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the log tree copy code to use btrfs_insert_items and
to work in larger batches where possible.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since tree log blocks get freed every transaction, they never really
need to be written to disk. This skips the step where we update
metadata to record they were allocated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Drop i_mutex during the commit
Don't bother doing the fsync at all unless the dir is marked as dirtied
and needing fsync in this transaction. For directories, this means
that someone has unlinked a file from the dir without fsyncing the
file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Pin down data blocks to prevent them from being reallocated like so:
trans 1: allocate file extent
trans 2: free file extent
trans 3: free file extent during old snapshot deletion
trans 3: allocate file extent to new file
trans 3: fsync new file
Before the tree logging code, this was legal because the fsync
would commit the transation that did the final data extent free
and the transaction that allocated the extent to the new file
at the same time.
With the tree logging code, the tree log subtransaction can commit
before the transaction that freed the extent. If we crash,
we're left with two different files using the extent.
* Don't wait in start_transaction if log replay is going on. This
avoids deadlocks from iput while we're cleaning up link counts in the
replay code.
* Don't deadlock in replay_one_name by trying to read an inode off
the disk while holding paths for the directory
* Hold the buffer lock while we mark a buffer as written. This
closes a race where someone is changing a buffer while we write it.
They are supposed to mark it dirty again after they change it, but
this violates the cow rules.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Orphan items use BTRFS_ORPHAN_OBJECTID (-5UUL) as key objectid. This
affects the find free objectid functions, inode objectid can easily
overflow after orphan file cleanup.
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_ilookup is unused, which is good because a normal filesystem
should never have to use ilookup anyway. Remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add two missing endianess conversions in this function, found by sparse.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
d_obtain_alias is intended as a tailcall that can pass in errors encoded
in the inode pointer if needed, so use it that way instead of
duplicating the error handling.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
File syncs and directory syncs are optimized by copying their
items into a special (copy-on-write) log tree. There is one log tree per
subvolume and the btrfs super block points to a tree of log tree roots.
After a crash, items are copied out of the log tree and back into the
subvolume. See tree-log.c for all the details.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs actually stores the whole xattr name, including the prefix ondisk,
so using the generic resolver that strips off the prefix is not very
helpful. Instead do the real ondisk xattrs manually and only use the
generic resolver for synthetic xattrs like ACLs.
(Sorry Josef for guiding you towards the wrong direction here intially)
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The ->list handler is really not useful at all, because we always call
btrfs_xattr_generic_list anyway. After this is done
find_btrfs_xattr_handler becomes unused, and it becomes obvious that the
temporary name buffer allocation isn't needed but we can directly copy
into the supplied buffer.
Tested with various getfattr -d calls on varying xattr lists.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch makes btrfs so it will compile properly when acls are disabled. I
tested this and it worked with CONFIG_FS_POSIX_ACL off and on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The current code waits for the count of async bio submits to get below
a given threshold if it is too high right after adding the latest bio
to the work queue. This isn't optimal because the caller may have
sequential adjacent bios pending they are waiting to send down the pipe.
This changeset requires the caller to wait on the async bio count,
and changes the async checksumming submits to wait for async bios any
time they self throttle.
The end result is much higher sequential throughput.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Date: Tue, 19 Aug 2008 22:20:17 +0100
btrfs_lookup_fs_root() only finds subvol roots which have already been
seen and put into the cache. For btrfs_get_dentry() we actually have to
go to the medium -- so use btrfs_read_fs_root_no_name() instead.
In btrfs_get_parent(), notice when we've hit the root of the
subvolume and return the real root instead.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Date: Tue, 19 Aug 2008 19:21:57 +0100
Using a 64-bit hash as the readdir cookie is just asking for trouble.
And gets it, when we try to export the file system by NFS.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Date: Tue, 19 Aug 2008 16:49:35 +0100
This disappeared when I removed the special case for '.' in btrfs_lookup()
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Date: Mon, 18 Aug 2008 13:10:20 +0100
This means that subvolumes get a different fsid, and NFS exporting them
works properly.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Date: Sun, 17 Aug 2008 15:14:48 +0100
We never get asked by the VFS to lookup either of them, and we can
handle the readdir() case a lot more simply, too.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Date: Mon, 21 Jul 2008 02:01:56 +0530
Here's an implementation of NFS support for btrfs. It relies on the
fixes which are going in to 2.6.28 for the NFS readdir/lookup deadlock.
This uses the btrfs_iget helper introduced previously.
[dwmw2: Tidy up a little, switch to d_obtain_alias() w/compat routine,
change fh_type, store parent's root object ID where needed,
fix some get_parent() and fs_to_dentry() bugs]
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Date: Mon, 21 Jul 2008 02:01:04 +0530
This patch introduces a btrfs_iget helper to be used in NFS support.
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, the btrfs bdi congestion function was used to test for too many
async bios. This keeps that check to throttle pdflush, but also
adds a check while queuing bios.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This optimization had been removed because I thought it was triggering
csum errors. The real cause of the errors was elsewhere, and so
this optimization is back.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
add_extent_mapping was allowing the insertion of overlapping extents.
This never used to happen because it only inserted the extents from disk
and those were never overlapping.
But, with the data=ordered code, the disk and memory representations of the
file are not the same. add_extent_mapping needs to ensure a new extent
does not overlap before it inserts.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These ended up freeing objects while they were still using them. Under
guidance from Chris, just rip out the 'clever' bits and do things the
simple way.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before this change, btrfs would use a bdi congestion function to make
sure there weren't too many pending async checksum work items.
This change makes the process creating async work items wait instead,
leading to fewer congestion returns from the bdi. This improves
pdflush background_writeout scanning.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After writing out all the remaining btree blocks in the transaction,
the commit code would use filemap_fdatawait to make sure it was all
on disk. This means it would wait for blocks written by other procs
as well.
The new code walks the list of blocks for this transaction again
and waits only for those required by this transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The writeback_index field is used by write_cache_pages to pick up where
writeback on a given inode left off. But, it is never set to a sane
value, so writeback can often start at a random offset in the file.
Kernels 2.6.28 and higher will have this fixed, but for everyone else,
we also fill in the value in btrfs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Newer RHEL5 kernels define both ClearPageFSMisc and
ClearPageChecked, so test for both before redefining.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
rename and link don't always have a lock on the source inode, and
our use of a per-inode index variable was racy. This changes things to
store the index in a local variable instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The multi-bio code is responsible for duplicating blocks in raid1 and
single spindle duplication. It has counters to make sure all of
the locations for a given extent are properly written before io completion
is returned to the higher layers.
But, it didn't always complete the same bio it was given, sometimes a
clone was completed instead. This lead to problems with the async
work queues because they saved a pointer to the bio in a struct off
bi_private.
The fix is to remember the original bio and only complete that one.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Far from the perfect fix, but these structs are small. TODO for the
next release. The block group cache structs are referenced in many
different places, and it isn't safe to just free them while resizing.
A real fix will be a larger change to the allocator so that it doesn't
have to carry about the block group cache structs to find good places
to search for free blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit 597:466b27332893 (btrfs_start_transaction: wait for commits in
progress) breaks the transaction start/stop ioctls by making
btrfs_start_transaction conditionally wait for the next transaction to
start. If an application artificially is holding a transaction open,
things deadlock.
This workaround maintains a count of open ioctl-initiated transactions in
fs_info, and avoids wait_current_trans() if any are currently open (in
start_transaction() and btrfs_throttle()). The start transaction ioctl
uses a new btrfs_start_ioctl_transaction() that _does_ call
wait_current_trans(), effectively pushing the join/wait decision to the
outer ioctl-initiated transaction.
This more or less neuters btrfs_throttle() when ioctl-initiated
transactions are in use, but that seems like a pretty fundamental
consequence of wrapping lots of write()'s in a transaction. Btrfs has no
way to tell if the application considers a given operation as part of it's
transaction.
Obviously, if the transaction start/stop ioctls aren't being used, there
is no effect on current behavior.
Signed-off-by: Sage Weil <sage@newdream.net>
---
ctree.h | 1 +
ioctl.c | 12 +++++++++++-
transaction.c | 18 +++++++++++++-----
transaction.h | 2 ++
4 files changed, 27 insertions(+), 6 deletions(-)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Intel doesn't yet ship hardware to the public with this enabled, but when they
do, they will be ready. Original code from:
Austin Zhang <austin_zhang@linux.intel.com>
It is currently disabled, but edit crc32c.h to turn it on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Make walk_down_tree wake up throttled tasks more often
* Make walk_down_tree call cond_resched during long loops
* As the size of the ref cache grows, wait longer in throttle
* Get rid of the reada code in walk_down_tree, the leaves don't get
read anymore, thanks to the ref cache.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A btree block cow has two parts, the first is to allocate a destination
block and the second is to copy the old bock over.
The first part needs locks in the extent allocation tree, and may need to
do IO. This changeset splits that into a separate function that can be
called without any tree locks held.
btrfs_search_slot is changed to drop its path and start over if it has
to COW a contended block. This often means that many writers will
pre-alloc a new destination for a the same contended block, but they
cache their prealloc for later use on lower levels in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While dropping snapshots, walk_down_tree does most of the work of checking
reference counts and limiting tree traversal to just the blocks that
we are freeing.
It dropped and held the allocation mutex in strange and confusing ways,
this commit changes it to only hold the mutex while actually freeing a block.
The rest of the checks around reference counts should be safe without the lock
because we only allow one process in btrfs_drop_snapshot at a time. Other
processes dropping reference counts should not drop it to 1 because
their tree roots already have an extra ref on the block.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Large streaming reads make for large bios, which means each entry on the
list async work queues represents a large amount of data. IO
congestion throttling on the device was kicking in before the async
worker threads decided a single thread was busy and needed some help.
The end result was that a streaming read would result in a single CPU
running at 100% instead of balancing the work off to other CPUs.
This patch also changes the pre-IO checksum lookup done by reads to
work on a per-bio basis instead of a per-page. This results in many
extra btree lookups on large streaming reads. Doing the checksum lookup
right before bio submit allows us to reuse searches while processing
adjacent offsets.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This avoids waiting for transactions with pages locked by breaking out
the code to wait for the current transaction to close into a function
called by btrfs_throttle.
It also lowers the limits for where we start throttling.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a couple of #if's to follow API changes.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The memory reclaiming issue happens when snapshot exists. In that
case, some cache entries may not be used during old snapshot dropping,
so they will remain in the cache until umount.
The patch adds a field to struct btrfs_leaf_ref to record create time. Besides,
the patch makes all dead roots of a given snapshot linked together in order of
create time. After a old snapshot was completely dropped, we check the dead
root list and remove all cache entries created before the oldest dead root in
the list.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It was incorrectly clearing the up to date flag on the buffer even
when the buffer properly verified.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To check whether a given file extent is referenced by multiple snapshots, the
checker walks down the fs tree through dead root and checks all tree blocks in
the path.
We can easily detect whether a given tree block is directly referenced by other
snapshot. We can also detect any indirect reference from other snapshot by
checking reference's generation. The checker can always detect multiple
references, but can't reliably detect cases of single reference. So btrfs may
do file data cow even there is only one reference.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When kthread_run() returns failure, this worker hasn't been
added to the list, so btrfs_stop_workers() won't free it.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A large reference cache is directly related to a lot of work pending
for the cleaner thread. This throttles back new operations based on
the size of the reference cache so the cleaner thread will be able to keep
up.
Overall, this actually makes the FS faster because the cleaner thread will
be more likely to find things in cache.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the reference cache to make a single cache per root
instead of one cache per transaction, and to key by the byte number
of the disk block instead of the keys inside.
This makes it much less likely to have cache misses if a snapshot
or something has an extra reference on a higher node or a leaf while
the first transaction that added the leaf into the cache is dropping.
Some throttling is added to functions that free blocks heavily so they
wait for old transactions to drop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Much of the IO done while dropping snapshots is done looking up
leaves in the filesystem trees to see if they point to any extents and
to drop the references on any extents found.
This creates a cache so that IO isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The 'char name[BTRFS_PATH_NAME_MAX]' member of struct btrfs_ioctl_vol_args
is passed directly to strlen() after being copied from user. I haven't
verified this, but in theory a userspace program could pass in an
unterminated string and cause a kernel crash as strlen walks off the end of
the array.
This patch terminates the ->name string in all btrfs ioctl functions which
currently use a 'struct btrfs_ioctl_vol_args'. Since the string is now
properly terminated, it's length will never be longer than
BTRFS_PATH_NAME_MAX so that error check has been removed.
By the way, it might be better overall to just have the ioctl pass an
unterminated string + length structure but I didn't bother with that since
it'd change the kernel/user interface.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We should decrease the found slot by one as btrfs_search_slot does
when bin_search return 1 and node level > 0.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Remove a unused variable 'path' in fixup_tree_root_location.
Signed-off-by: Balaji Rao <balajirrao@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before setting an extent to delalloc, the code needs to wait for
pending ordered extents.
Also, the relocation code needs to wait for ordered IO before scanning
the block group again. This is because the extents are not removed
until the IO for the new extents is finished
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checksum items are not inserted into the tree until all of the io from a
given extent is complete. This means one dirty page from an extent may
be written, freed, and then read again before the entire extent is on disk
and the checksum item is inserted.
The checksums themselves are stored in the ordered extent so they can
be inserted in bulk when IO is complete. On read, if a checksum item isn't
found, the ordered extents were being searched for a checksum record.
This all worked most of the time, but the checksum insertion code tries
to reduce the number of tree operations by pre-inserting checksum items
based on i_size and a few other factors. This means the read code might
find a checksum item that hasn't yet really been filled in.
This commit changes things to check the ordered extents first and only
dive into the btree if nothing was found. This removes the need for
extra locking and is more reliable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This releases the alloc_mutex in a few places that hold it for over long
operations. btrfs_lookup_block_group is changed so that it doesn't need
the mutex at all.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Lockdep has the notion of locking subclasses so that you can identify
locks you expect to be taken after other locks of the same class. This
changes the per-extent buffer btree locking routines to use a subclass based
on the level in the tree.
Unfortunately, lockdep can only handle 8 total subclasses, and the btrfs
max level is also 8. So when lockdep is on, use a lower max level.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Stress testing was showing data checksum errors, most of which were caused
by a lookup bug in the extent_map tree. The tree was caching the last
pointer returned, and searches would check the last pointer first.
But, search callers also expect the search to return the very first
matching extent in the range, which wasn't always true with the last
pointer usage.
For now, the code to cache the last return value is just removed. It is
easy to fix, but I think lookups are rare enough that it isn't required anymore.
This commit also replaces do_sync_mapping_range with a local copy of the
related functions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This replaces the use of the page cache lock bit for locking, which wasn't
suitable for block size < page size and couldn't be used recursively.
The mutexes alone don't fix either problem, but they are the first step.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, extent buffers were a temporary object, meant to map a number of pages
at once and collect operations on them.
But, a few extra fields have crept in, and they are also the best place to
store a per-tree block lock field as well. This commit puts the extent
buffers into an rbtree, and ensures a single extent buffer for each
tree block.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* In btrfs_delete_inode, wait for ordered extents after calling
truncate_inode_pages. This is much faster, and more correct
* Properly clear our the PageChecked bit everywhere we redirty the page.
* Change the writepage fixup handler to lock the page range and check to
see if an ordered extent had been inserted since the improperly dirtied
page was discovered
* Wait for ordered extents outside the transaction. This isn't required
for locking rules but does improve transaction latencies
* Reduce contention on the alloc_mutex by dropping it while incrementing
refs on a node/leaf and while dropping refs on a leaf.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It was possible for stale mappings from disk to be used instead of the
new pending ordered extent. This adds a flag to the extent map struct
to keep it pinned until the pending ordered extent is actually on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Data checksumming is done right before the bio is sent down the IO stack,
which means a single bio might span more than one ordered extent. In
this case, the checksumming data is split between two ordered extents.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_drop_extents is always called with a range lock held on the inode.
But, it may operate on extents outside that range as it drops and splits
them.
This patch adds a per-inode mutex that is held while calling
btrfs_drop_extents and while inserting new extents into the tree. It
prevents races from two procs working against adjacent ranges in the tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checksum items are not inserted until the entire ordered extent is on disk,
but individual pages might be clean and available for reclaim long before
the whole extent is on disk.
In order to allow those pages to be freed, we need to be able to search
the list of ordered extents to find the checksum that is going to be inserted
in the tree. This way if the page needs to be read back in before
the checksums are in the btree, we'll be able to verify the checksum on
the page.
This commit adds the ability to search the pending ordered extents for
a given offset in the file, and changes btrfs_releasepage to allow
ordered pages to be freed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_commit_transaction has to loop waiting for any writers in the
transaction to finish before it can proceed. btrfs_start_transaction
should be polite and not join a transaction that is in the process
of being finished off.
There are a few places that can't wait, basically the ones doing IO that
might be needed to finish the transaction. For them, btrfs_join_transaction
is added.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the ordered data code to update i_size after the extent
is on disk. An on disk i_size is maintained in the in-memory btrfs inode
structures, and this is updated as extents finish.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Higher layers sometimes call set_page_dirty without asking the filesystem
to help. This causes many problems for the data=ordered and cow code.
This commit detects pages that haven't been properly setup for IO and
kicks off an async helper to deal with them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The old data=ordered code would force commit to wait until
all the data extents from the transaction were fully on disk. This
introduced large latencies into the commit and stalled new writers
in the transaction for a long time.
The new code changes the way data allocations and extents work:
* When delayed allocation is filled, data extents are reserved, and
the extent bit EXTENT_ORDERED is set on the entire range of the extent.
A struct btrfs_ordered_extent is allocated an inserted into a per-inode
rbtree to track the pending extents.
* As each page is written EXTENT_ORDERED is cleared on the bytes corresponding
to that page.
* When all of the bytes corresponding to a single struct btrfs_ordered_extent
are written, The previously reserved extent is inserted into the FS
btree and into the extent allocation trees. The checksums for the file
data are also updated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_find_dead_roots called btrfs_read_fs_root_no_radix, which
means we end up calling btrfs_search_slot with a path already held.
The fix is to remember the key inside btrfs_find_dead_roots and drop
the path.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This calls unlock_up sooner in btrfs_search_slot in order to decrease the
amount of work done with the higher level tree locks held.
Also, it changes btrfs_tree_lock to spin for a big against the page lock
before scheduling. This makes a big difference in context switch rate under
highly contended workloads.
Longer term, a better locking structure is needed than the page lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btree defragger wasn't making forward progress because the new key wasn't
being saved by the btrfs_search_forward function.
This also disables the automatic btree defrag, it wasn't scaling well to
huge filesystems. The auto-defrag needs to be done differently.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This makes it possible for callers to check for extent_buffers in cache
without deadlocking against any btree locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The online btree defragger is simplified and rewritten to use
standard btree searches instead of a walk up / down mechanism.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This creates one kthread for commits and one kthread for
deleting old snapshots. All the work queues are removed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The existing throttle mechanism was often not sufficient to prevent
new writers from coming in and making a given transaction run forever.
This adds an explicit wait at the end of most operations so they will
allow the current transaction to close.
There is no wait inside file_write, inode updates, or cow filling, all which
have different deadlock possibilities.
This is a temporary measure until better asynchronous commit support is
added. This code leads to stalls as it waits for data=ordered
writeback, and it really needs to be fixed.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Allocations may need to read in block groups from the extent allocation tree,
which will require a tree search and take locks on the extent allocation
tree. But, those locks might already be held in other places, leading
to deadlocks.
Since the alloc_mutex serializes everything right now, it is safe to
skip the btree locking while caching block groups. A better fix will be
to either create a recursive lock or find a way to back off existing
locks while caching block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows us to delete an unlinked inode with dirty pages from the list
instead of forcing commit to write these out before deleting the inode.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
One lock per btree block can make for significant congestion if everyone
has to wait for IO at the high levels of the btree. This drops
locks held by a path when doing reads during a tree search.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Extent alloctions are still protected by a large alloc_mutex.
Objectid allocations are covered by a objectid mutex
Other btree operations are protected by a lock on individual btree nodes
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The allocation trees and the chunk trees are serialized via their own
dedicated mutexes. This means allocation location is still not very
fine grained.
The main FS btree is protected by locks on each block in the btree. Locks
are taken top / down, and as processing finishes on a given level of the
tree, the lock is released after locking the lower level.
The end result of a search is now a path where only the lowest level
is locked. Releasing or freeing the path drops any locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If a bio submission is after a lock holder waiting for the bio
on the work queue, it is possible to deadlock. Move the bios
into their own pool.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
As mentioned in the comment next to it btrfs_ioctl_trans_start can
do bad damage to filesystems and thus should be limited to privilegued
users.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Split the ioctl handling out of inode.c into a file of it's own.
Also fix up checkpatch.pl warnings for the moved code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
mount -o thread_pool_size changes the default, which is
min(num_cpus + 2, 8). Larger thread pools would make more sense on
very large disk arrays.
This mount option controls the max size of each thread pool. There
are multiple thread pools, so the total worker count will be larger
than the mount option.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the worker thread pool to maintain a list of idle threads,
avoiding a complex search for a good thread to wake up.
Threads have two states:
idle - we try to reuse the last thread used in hopes of improving the batching
ratios
busy - each time a new work item is added to a busy task, the task is
rotated to the end of the line.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
max_inline=0 used to force the max_inline size to one sector instead. Now
it properly disables inline data items, while still being able to read
any that happen to exist on disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs has been using workqueues to spread the checksumming load across
other CPUs in the system. But, workqueues only schedule work on the
same CPU that queued the work, giving them a limited benefit for systems with
higher CPU counts.
This code adds a generic facility to schedule work with pools of kthreads,
and changes the bio submission code to queue bios up. The queueing is
important to make sure large numbers of procs on the system don't
turn streaming workloads into random workloads by sending IO down
concurrently.
The end result of all of this is much higher performance (and CPU usage) when
doing checksumming on large machines. Two worker pools are created,
one for writes and one for endio processing. The two could deadlock if
we tried to service both from a single pool.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Allows to specify one or multiple device=/dev/foo options during mount
so that ioctls on the control device can be avoided. Especially useful
when trying to mount a multi-device setup as root.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Also adds lots of comments to describe what's going on here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
use normal kbuild syntax to build acl.o conditinally and remove comment
out lines.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These ioctls let a user application hold a transaction open while it
performs a series of operations. A final ioctl does a sync on the fs
(closing the current transaction). This is the main requirement for
Ceph's OSD to be able to keep the data it's storing in a btrfs volume
consistent, and AFAICS it works just fine. The application would do
something like
fd = ::open("some/file", O_RDONLY);
::ioctl(fd, BTRFS_IOC_TRANS_START);
/* do a bunch of stuff */
::ioctl(fd, BTRFS_IOC_TRANS_END);
or just
::close(fd);
And to ensure it commits to disk,
::ioctl(fd, BTRFS_IOC_SYNC);
When a transaction is held open, the trans_handle is attached to the
struct file (via private_data) so that it will get cleaned up if the
process dies unexpectedly. A held transaction is also ended on fsync() to
avoid a deadlock.
A misbehaving application could also deliberately hold a transaction open,
effectively locking up the FS, so it may make sense to restrict something
like this to root or something.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The acl code is not yet complete, and the xattr handlers are causing
problems for cp -p on some distros.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We need to invalidate an existing dcache entry after creating a new
snapshot or subvolume, because a negative dache entry will stop us from
accessing the new snapshot or subvolume.
---
ctree.h | 23 +++++++++++++++++++++++
inode.c | 4 ++++
transaction.c | 4 ++++
3 files changed, 31 insertions(+)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a new transaction was started, the code would incorrectly
set the pointer in fs_info before all the data structures were setup.
fsync heavy workloads hit races on the setup of the ordered inode spinlock
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This avoids IO stalls and poorly ordered IO from inline writers mixing in
with the async submission queue
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* Force chunk allocation when find_free_extent has to do a full scan
* Record the max key at the start of defrag so it doesn't run forever
* Block groups might not be contiguous, make a forward search for the
next block group in extent-tree.c
* Get rid of extra checks for total fs size
* Fix relocate_one_reference to avoid relocating the same file data block
twice when referenced by an older transaction
* Use the open device count when allocating chunks so that we don't
try to allocate from devices that don't exist
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The async submit workqueue was absorbing too many requests, leading to long
stalls where the async submitters were stalling.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_io writepage calls needed an extra check for discarding
pages that started on th last byte in the file.
btrfs_truncate_page needed checks to make sure the page was still part
of the file after reading it, and most importantly, needed to wait for
all IO to the page to finish before freeing the corresponding extents on
disk.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Devices can change after the scan ioctls are done, and btrfs_open_devices
needs to be able to verify them as they are opened and used by the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When duplicate copies exist, writes are allowed to fail to one of those
copies. This changeset includes a few changes that allow the FS to
continue even when some IOs fail.
It also adds verification of the parent generation number for btree blocks.
This generation is stored in the pointer to a block, and it ensures
that missed writes to are detected.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Once part of a delalloc request fails the cow checks, just cow the
entire range
It is possible for the back references to all be from the same root,
but still have snapshots against an extent. The checks are now more strict,
forcing cow any time there are multiple refs against the data extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, nodatacow only checked to make sure multiple roots didn't have
references on a single extent. This check makes sure that multiple
inodes don't have references.
nodatacow needed an extra check to see if the block group was currently
readonly. This way cows forced by the chunk relocation code are honored.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This required a few structural changes to the code that manages bdev pointers:
The VFS super block now gets an anon-bdev instead of a pointer to the
lowest bdev. This allows us to avoid swapping the super block bdev pointer
around at run time.
The code to read in the super block no longer goes through the extent
buffer interface. Things got ugly keeping the mapping constant.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In openSUSE 10.3, AppArmor modifies remove_suid to take a struct path
rather than just a dentry. This patch tests that the kernel is openSUSE
10.3 or newer and adjusts the call accordingly.
Debian/Ubuntu with AppArmor applied will also need a similar patch.
Maintainers of btrfs under those distributions should build on this
patch or, alternatively, alter their package descriptions to add
-DREMOVE_SUID_PATH to the compiler command line.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
- --- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ b/compat.h 2008-02-06 16:46:13.000000000 -0500
@@ -0,0 +1,15 @@
+#ifndef _COMPAT_H_
+#define _COMPAT_H_
+
+
+/*
+ * Even if AppArmor isn't enabled, it still has different prototypes.
+ * Add more distro/version pairs here to declare which has AppArmor applied.
+ */
+#if defined(CONFIG_SUSE_KERNEL)
+# if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,22)
+# define REMOVE_SUID_PATH 1
+# endif
+#endif
+
+#endif /* _COMPAT_H_ */
- --- a/file.c 2008-02-06 11:37:39.000000000 -0500
+++ b/file.c 2008-02-06 16:46:23.000000000 -0500
@@ -37,6 +37,7 @@
#include "ordered-data.h"
#include "ioctl.h"
#include "print-tree.h"
+#include "compat.h"
static int btrfs_copy_from_user(loff_t pos, int num_pages, int write_bytes,
@@ -790,7 +791,11 @@ static ssize_t btrfs_file_write(struct f
goto out_nolock;
if (count == 0)
goto out_nolock;
+#ifdef REMOVE_SUID_PATH
+ err = remove_suid(&file->f_path);
+#else
err = remove_suid(fdentry(file));
+#endif
if (err)
goto out_nolock;
file_update_time(file);
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2.6.18 seems to get caught in an infinite loop when
cancel_rearming_delayed_workqueue is called more than once, so this switches
to cancel_delayed_work, which is arguably more correct.
Also, balance_dirty_pages can run into problems with 2.6.18 based kernels
because it doesn't have the per-bdi dirty limits. This avoids calling
balance_dirty_pages on the btree inode unless there is actually something
to balance, which is a good optimization in general.
Finally there's a compile fix for ordered-data.h
Signed-off-by: Chris Mason <chris.mason@oracle.com>
balance level starts by trying to empty the middle block, and then
pushes from the right to the middle. This might empty the right block
and leave a small number of pointers in the middle.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The generic O_DIRECT code assumes all the bios have the same bdev,
which isn't true for multi-device btrfs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows other code that needs to walk every device in the FS to do so
without locking against allocations.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_invalidatepage is not allowed to leave pages around on the lru.
Any such pages will trigger an oops later on because the VM will see
page->private and assume it is a buffer head.
This also forces extra flushes of the async work queues before
dropping all the pages on the btree inode during unmount. Left over
items on the work queues are one possible cause of busy state ranges
during truncate_inode_pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btree inode should only have a single extent_map in the cache,
it doesn't make sense to ever drop it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The data read retry code needs to find the logical disk block before it
can resubmit new bios. But, finding this block isn't allowed to take
the fs_mutex because that will deadlock with a number of different callers.
This changes the retry code to use the extent map cache instead, but
that requires the extent map cache to have the extent we're looking for.
This is a problem because btrfs_drop_extent_cache just drops the entire
extent instead of the little tiny part it is invalidating.
The bulk of the code in this patch changes btrfs_drop_extent_cache to
invalidate only a portion of the extent cache, and changes btrfs_get_extent
to deal with the results.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This isn't required anymore because we don't reallocate blocks that
have already been written in this transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This significantly improves streaming write performance by allowing
concurrency in the data checksumming.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows checksumming to happen in parallel among many cpus, and
keeps us from bogging down pdflush with the checksumming code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Block headers now store the chunk tree uuid
Chunk items records the device uuid for each stripes
Device extent items record better back refs to the chunk tree
Block groups record better back refs to the chunk tree
The chunk tree format has also changed. The objectid of BTRFS_CHUNK_ITEM_KEY
used to be the logical offset of the chunk. Now it is a chunk tree id,
with the logical offset being stored in the offset field of the key.
This allows a single chunk tree to record multiple logical address spaces,
upping the number of bytes indexed by a chunk tree from 2^64 to
2^128.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This includes fixing a missing spinlock init call that caused oops on mount
for most kernels other than 2.6.25.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On huge machines, delayed allocation may try to allocate massive extents.
This change allows btrfs_alloc_extent to return something smaller than
the caller asked for, and the data allocation routines will loop over
the allocations until it fills the whole delayed alloc.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix for a endianess BUG when using btrfs v0.13 with kernels older than 2.6.23
Problem:
Has of v0.13, btrfs-progs is using crc32c.c equivalent to the one found on
linux-2.6.23/lib/libcrc32c.c Since crc32c_le() changed in linux-2.6.23, when
running btrfs v0.13 with older kernels we have a missmatch between the versions
of crc32c_le() from btrfs-progs and libcrc32c in the kernel. This missmatch
causes a bug when using btrfs on big endian machines.
Solution:
btrfs_crc32c() macro that when compiling for kernels older than 2.6.23, does
endianess conversion to parameters and return value of crc32c().
This endianess conversion nullifies the differences in implementation
of crc32c_le().
If kernel 2.6.23 or better, it calls crc32c().
Signed-off-by: Miguel Sousa Filipe <miguel.filipe@gmail.com>
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds basic O_DIRECT read and write support. In the write case, we
just do a normal buffered write followed by a cache flush. O_DIRECT +
O_SYNC are required to trigger metadata syncs.
In the read case, there is a basic btrfs_get_block call for use by
the generic O_DIRECT code. This does honor multi-volume mapping rules
but it skips all checksumming.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before it was done by the bio end_io routine, the work queue code is able
to scale much better with faster IO subsystems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, metadata checksumming was done by the callers of read_tree_block,
which would set EXTENT_CSUM bits in the extent tree to show that a given
range of pages was already checksummed and didn't need to be verified
again.
But, those bits could go away via try_to_releasepage, and the end
result was bogus checksum failures on pages that never left the cache.
The new code validates checksums when the page is read. It is a little
tricky because metadata blocks can span pages and a single read may
end up going via multiple bios.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a block is freed, it can be immediately reused if it is from
the current transaction. But, an extra check is required to make sure
the block had not been written yet. If it were reused after being written,
the transid in the block header might match the transid of the
next time the block was allocated.
The parent node records the transaction ID of the block it is pointing to,
and this is used as part of validating the block on reads. So, there
can only be one version of a block per transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Checksums were only verified by btrfs_read_tree_block, which meant the
functions to probe the page cache for blocks were not validating checksums.
Normally this is fine because the buffers will only be in cache if they
have already been validated.
But, there is a window while the buffer is being read from disk where
it could be up to date in the cache but not yet verified. This patch
makes sure all buffers go through checksum verification before they
are used.
This is safer, and it prevents modification of buffers before they go
through the csum code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There was an optimization to drop the fs_mutex when doing snapshot deletion
reads, but this can lead to false positives on checksumming errors. Keep
the lock for now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_name_hash, Local variable 'buf' is declared as
__u32 buf[2];
but we then try to do this:
buf[0] = 0x67452301;
buf[1] = 0xefcdab89;
buf[2] = 0x98badcfe;
buf[3] = 0x10325476;
Oops. Fix buf to be the proper size.
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows detection of blocks that have already been written in the
running transaction so they can be recowed instead of modified again.
It is step one in trusting the transid field of the block pointers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Here's a patch against the unstable tree that gets the code to build
against Linus's current tree (2.6.24-git12). This is needed as the
kobject/kset api has changed there.
I tried to make the smallest changes needed, and it builds and loads
successfully, but I don't have a btrfs volume anywhere (yet) to try to
see if things still work properly :)
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we checkum file data during writepage, the checksumming is done one
page at a time, making it difficult to do bulk metadata modifications
to insert checksums for large ranges of the file at once.
This patch changes btrfs to checksum on a per-bio basis instead. The
bios are checksummed before they are handed off to the block layer, so
each bio is contiguous and only has pages from the same inode.
Checksumming on a bio basis allows us to insert and modify the file
checksum items in large groups. It also allows the checksumming to
be done more easily by async worker threads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed that we don't clear the extent state tree dirty and delalloc
bits when we clear the dirty bits on the page during file write.
This leads to csum errors later on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reduce CPU time searching for free blocks by optimizing find_first_extent_bit
Fix find_free_extent to make better use of the last_alloc hint. Before it
was often finding blocks just before the hint.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs set/get macros lose type information needed to avoid
unaligned accesses on sparc64.
ere is a patch for the kernel bits which fixes most of the
unaligned accesses on sparc64.
btrfs_name_hash is modified to return the hash value instead
of getting a return location via a (potentially unaligned)
pointer.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A few codes were not properly updated for changes of extent map. This
may be the causes of "no csum found for inode" issue.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Now that delayed allocation accounting works, i_blocks accounting is changed
to only modify i_blocks when extents inserted or removed.
The fillattr call is changed to include the delayed allocation byte count
in the i_blocks result.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When freeing root block of a tree, btrfs_free_extent' parameter
'ref_generation' is from root block itseft. When freeing non-root
block, 'ref_generation' is from its parent. so when converting a
non-root block to root block, we must guarantee its generation is
equal to its parent's generation.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This makes searches for backrefs and backref insertion much more efficient
when there are many backrefs for a single extent
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When truncating a inline extent, btrfs_drop_extents doesn't properly
handle the case "key.offset > inline_limit". This bug can only happen
when max line size is larger than 8K.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The end_bio routines are changed to take a pointer to the extent state
struct, and the state tree is walked in order to set/clear appropriate
bits as IO completes. This greatly reduces the number of rbtree searches
done by the end_bio handlers, and reduces lock contention.
The extent_io releasepage function is changed to avoid expensive searches
for locked state.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is now extent_map for mapping offsets in the file to disk and
extent_io for state tracking, IO submission and extent_bufers.
The new extent_map code shifts from [start,end] pairs to [start,len], and
pushes the locking out into the caller. This allows a few performance
optimizations and is easier to use.
A number of extent_map usage bugs were fixed, mostly with failing
to remove extent_map entries when changing the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There were a few places that could cause duplicate extent insertion,
this adjusts the code that creates holes to avoid it.
lookup_extent_map is changed to correctly return all of the extents in a
range, even when there are none matching at the start of the range.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
test_range_bit doesn't properly handle the case: there's a hole at the
end of the range and there's no other extent_state after the range.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_find_free_objectid may return a used objectid due to arithmetic
underflow. This bug may happen when parameter 'root' is tree root, so
it may cause serious problems when creating snapshot or sub-volume.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Using ilookup5 during data=ordered writeback could deadlock on I_LOCK. This
saves a pointer to the inode instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds readonly inode flag support. A file with this flag
can't be modified, but can be deleted.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While shrinking the FS, the allocation functions need to make sure
they don't try to allocate bytes past the end of the FS.
nodatacow needed an extra check to force cows when the existing extents are
past the end of the FS.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is very difficult to create a consistent snapshot of the btree when
other writers may update the btree before the commit is done.
This changes the snapshot creation to happen during the commit, while
no other updates are possible.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This forces file data extents down the disk along with the metadata that
references them. The current implementation is fairly simple, and just
writes out all of the dirty pages in an inode before the commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The shrinking code used btrfs_next_leaf to find the next item, but
this does not cow the blocks it touches. This fix calls search_slot after
finding the next item to do appropriate cow and balancing.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The patch fixes the overlapping extent issue in shrink_extent_tree.
It checks whether there is an overlapping extent by using
find_previous_extent. If there is an overlapping extent, it setups
key.objectid and cur_byte properly.
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is intended to prevent accidentally filling the drive. A determined
user can still make things oops.
It includes some accounting of the current bytes under delayed allocation,
but this will change as things get optimized
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed the offset into the extent was incorrectly being added to the
extent start before trying to find it in the extent allocation tree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A number of workloads do not require copy on write data or checksumming.
mount -o nodatasum to disable checksums and -o nodatacow to disable
both copy on write and checksumming.
In nodatacow mode, copy on write is still performed when a given extent
is under snapshot.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
One of my old patches introduces a new bug to
btrfs_drop_extents(changeset 275). Inline extents are not truncated
properly when "extent_end == end", it can trigger the BUG_ON at
file.c:600. I hope I don't introduce new bug this time.
---
Signed-off-by: Chris Mason <chris.mason@oracle.com>
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Hello everybody,
compiling btrfs into the kernel results in section mismatch warnings. __exit
functions are called where they are not allowed to. The attached patch fixes
this for me. Not sure if it is correct though.
Signed-off-by: Christian Hesse <mail@earthworm.de>
--
Regards,
Chris
--Boundary-00=_CcOWHFYK4T+JwSj
Content-Type: text/x-diff; charset="iso-8859-1";
name="btrfs-section_mismatches.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="btrfs-section_mismatches.patch"
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The codes that fixup the right leaf and the codes that dirty the
extnet buffer use the variable 'right_nritems' , both of them expect
'right_nritems' is the number of items in right leaf after the push.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There was a slight problem with ACL's returning EINVAL when you tried to set an
ACL. This isn't correct, we should be returning EOPNOTSUPP, so I did a very
ugly thing and just commented everybody out and made them return EOPNOTSUPP.
This is only temporary, I'm going back to implement ACL's, but Chris wants to
push out a release so this will suffice for now.
Also Yan suggested setting reada to -1 in the delete case to enable backwards
readahead, and in the listxattr case I moved path->reada = 2; to after the if
(!path) check so we can avoid a possible null dereference. Thank you,
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds a new parameter 'full_scan' to 'find_search_start',
thereby 'find_search_start' can know whether 'find_free_extent' is in
full scan phrase. I feel that 'find_search_start' should skip calling
'btrfs_find_block_group' when 'find_free_extent' is in full scan
phrase. In my test on a 2GB volume, Oops occurs when space usage is
about 76%. After apply the patch, Oops occurs when space usage is
near 100%.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds a helper function 'update_pinned_extents' to
extent-tree.c. The usage of the helper function is similar to
'update_block_group', the last parameter of the function indicates
pin vs unpin.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When 'item_end' is equal to 'inode->i_size', 'found_type' is updated
and current item is skipped. This behavior is correct for extent item,
but incorrect for csum item. For example, there is a csum item with
'offset == 0'. When deleting the inode, 'inode->i_size' is set to 0,
so the csum item isn't deleted.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Don't set hint_byte to EXTENT_MAP_INLINE when 'end == extent_end' or
'start == key.offset' . The inline extent will be truncated in these
cases.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When calculating the size of inline extent, inode->i_size should also
be take into consideration, otherwise sys_write may drop some data
silently. You can test this bug by:
#dd if=/dev/zero bs=4k count=1 of=test_file
#dd if=/dev/zero bs=2k count=1 of=test_file conv=notrunc
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When pin_down_bytes decides not to pin a block because it was from the
current transaction, make sure the in memory cache of free extents is updated
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a 'finish_wait', but no 'prepare_to_wait' . So I think that
the 'prepare_to_wait' is missing. The second change is according to
the name of variable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The fixes do a number of things:
1) Most btrfs_drop_extent callers will try to leave the inline extents in
place. It can truncate bytes off the beginning of the inline extent if
required.
2) writepage can now update the inline extent, allowing mmap writes to
go directly into the inline extent.
3) btrfs_truncate_in_transaction truncates inline extents
4) extent_map.c fixed to not merge inline extent mappings and hole
mappings together
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Execution should goto label 'insert' when 'btrfs_next_leaf' return a
non-zero value, otherwise the parameter 'slot' for
'btrfs_item_key_to_cpu' may be out of bounds. The original codes jump
to label 'insert' only when 'btrfs_next_leaf' return a negative
value.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
1. Reorder kmap and the test for 'page != NULL'
2. Zero-fill rest area of a block when inline extent isn't big enough.
3. Do not insert extent_map into the map tree when page == NULL.
(If insert the extent_map into the map tree, subsequent read requests
will find it in the map tree directly and the corresponding inline
extent data aren't copied into page by the the get_extent function.
extent_read_full_page can't handle that case)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The ENOTEMPTY check in btrfs_rmdir isn't reliable. It's possible that
the backward search finds . or .. at first, then some other directory
entry. In that case, btrfs_rmdir delete . or .. improperly. The
patch also fixes a fs_mutex unlock issue in btrfs_rmdir.
--
Signed-off-by: Chris Mason <chris.mason@oracle.com>
1) Forced defrag wasn't working properly (btrfsctl -d) because some
cache only checks were incorrect.
2) Defrag only the leaves unless in forced defrag mode.
3) Don't use complex logic to figure out if a leaf is needs defrag
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This modifies inline extent size calculation, so that
insert_inline_extent can handle the case that parameter 'offset' is
not zero; it also a few codes to zero uninitialized area in inline
extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When making room for a new item, it is ok to create an empty leaf, but
when making room to extend an item, split_leaf needs to make sure it
keeps the item we're extending in the path and make sure we don't end up
with an empty leaf.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This reduces the number of calls to btrfs_extend_item and greatly lowers
the cpu usage while writing large files.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Just use kobject_set_name(), that works in all kernels (I think...).
Kernels newer than 2.6.23 currently fail with:
/home/axboe/git/btrfs/btrfs-unstable/sysfs.c:188: error: unknown field
'name' specified in initializer
Signed-off-by: Chris Mason <chris.mason@oracle.com>
endio handling is typically called with interrupts disabled, but can
also be called with it enabled. So save interrupts before using KM_IRQ0
to be completely safe.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It now returns void and it is never called for partial completions, so
the bio->bi_size check must go.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows us to defrag huge directories, but skip the expensive defrag
case in more common usage, where it does not help as much.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I think check whether extent is a hole before update 'inode->i_blocks'
is unconditional required. (original codes check it only when
del_item isn't equal to 0)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
found_type has already been decreased by codes above the change, I
think decrease it by one again doesn't make sense.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_free_extent would fail to wrap around to the start of the drive because
it was doing the enospc case checking twice in some cases, causing it
to return -ENOSPC early.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_btree_balance_dirty is changed to pass the number of pages dirtied
for more accurate dirty throttling. This lets the VM make better decisions
about when to force some writeback.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Cache block group was overly complex and missed free blocks at the very start
of the group. This patch simplifies things significantly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a helper per ioctl function to make the code more readable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
No reason to grab the BKL before calling into the btrfs ioctl code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Dead roots are trees left over after a crash, and they were either in the
process of being removed or were waiting to be removed when the box crashed.
Before, a search of the entire tree of root pointers was done on mount
looking for dead roots. Now, the search is done the first time we load
a root.
This makes mount faster when there are a large number of snapshots, and it
enables the block accounting code to properly update the block counts on
the latest root as old versions of the root are reaped after a crash.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
XFS updates the ondisk inode size only after the data I/O has finished,
so it needs a hook when the writepage end_bio handler has finished.
Might not be worth applying as-is as the per-page callback is very
ineffcient. What XFS really wants is a callback when writeout of a
whole extent has completed. This delayed i_size updates scheme might
be worthwile for btrfs aswell, btw.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The writepage_io is not mandatory, e.g. my port of xfs to the extent_map
code does not have one for now. So handle a NULL pointer gracefully
here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
generic_bmap is completely trivial, while the extent to bh mapping in
btrfs is rather complex. So provide a extent_bmap instead that takes
a get_extent callback and can be used by filesystem using the extent_map
code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The bio completion handlers can be run in any context, e.g. when using
the old ide driver they run in hardirq context with irqs disabled so
lockdep rightfully warns about using write_lock_irq useage in these
handlers.
This patch switches clear_extent_bit and set_extent_bit to
write_lock_irqsave to fix this problem.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed that set_extent_bit was exiting too early when there
was a hole in the map. The fix is to reorder the tests to check for the
hole first.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
File data checksums are only done during writepage, so we have to make sure
all pages are written when the snapshot is taken. This also adds some
locking so that new writes don't race in and add new dirty pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Modified form of original patch from Christoph Hellwig to make btrfs
mount into the default subvolume by default.
mount /dev/somedevice:subvolumename to get other subvolumes or
mount /dev/somedevice:. to get the root
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows the tree walking code to defrag only the newly allocated
buffers, it seems to be a good balance between perfect defragging and the
performance hit of repeatedly reallocating blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds two types of btree defrag, a run time form that tries to
defrag recently allocated blocks in the btree when they are still in ram,
and an ioctl that forces defrag of all btree blocks.
File data blocks are not defragged yet, but this can make a huge difference
in sequential btree reads.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before, snapshot deletion was a single atomic unit. This caused considerable
lock contention and required an unbounded amount of space. Now,
the drop_progress field in the root item is used to indicate how far along
snapshot deletion is, and to resume where it left off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Almost none of the files including module.h need to do so,
remove them.
Include sched.h in extent-tree.c to silence a warning about cond_resched()
being undeclared.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The super block written during commit was not consistent with the state of
the trees. This change adds an in-memory copy of the super so that we can
make sure to write out consistent data during a commit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Attaching below is some of the code cleanups that i came across while
reading the code.
a) alloc_path already calls init_path.
b) Mention that btrfs_inode is the in memory copy.Ext4 have ext4_inode_info as
the in memory copy ext4_inode as the disk copy
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add (untested and simple) directory item code
Fix comp_keys to use the new key ordering
Add btrfs_insert_empty_item
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Remove implicit commit in del_item and insert_item
Add implicit commit to close()
Add commit op in random-test
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a bulk insert/remove test to random-test
Add the quick-test code back as another regression test
Signed-off-by: Chris Mason <chris.mason@oracle.com>