I hit this error when reproducing a bug that would end in a transaction
abort. We take the delayed ref head's mutex to keep anybody from processing
it while we're destroying it, but we fail to drop the mutex before we carry
on and free the damned thing. Fix this by doing the remove logic for the
head ourselves and unlock the mutex, that way we can avoid use after free's
or hung tasks waiting on that mutex to come back so they know the delayed
ref completed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
WARN_ON isn't enough, we need to stop the loop if for any reason
we would overrun the devices_info array.
I tried to track down the connection between the length of
the alloc_devices list and the rw_devices counter but
it wasn't immediately obvious, so be defensive about it.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
__btrfs_std_error didn't always properly call va_end,
and might call va_start even if fmt was NULL.
Move all the varargs handling into the block where we
have fmt.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I don't think that BTRFS_DEV_EXTENT_KEY is supposed
to fall through to BTRFS_DEV_STATS_KEY ...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
At least backref_tree_panic() can apparently pass
in a null fs_info, so handle that in __btrfs_panic
to get the message out on the console.
The btrfs_panic macro also uses fs_info, but that's
largely pointless; it's testing to see if
BTRFS_MOUNT_PANIC_ON_FATAL_ERROR is not set.
But if it *were* set, __btrfs_panic() would have,
well, paniced and we wouldn't be here, testing it!
So just BUG() at this point.
And since we only use fs_info once now, just use it
directly.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
No need to test the result, we can't get a
null pointer from list_entry()
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Because of how little we allocate chunks now we can get really tight on
metadata space before we will allocate a new chunk. This resulted in being
unable to add device extents when allocating a new metadata chunk as we did
not have enough space. This is because we were allowed to overcommit too
much metadata without actually making sure we had enough space to make
allocations. The idea behind overcommit is that we are allowed to say "sure
you can have that reservation" when most of the free space is occupied by
reservations, not actual allocations. But in this case where a majority of
the total space is in use by actual allocations we can screw ourselves by
not being able to make real allocations when it matters. So make sure we
have enough real space for our global reserve, and if not then don't allow
overcommitting. Thanks,
Reported-and-tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I got a double free error when unmounting a file system that failed to add a
chunk during its operation. This is because we will kfree the mapping that
we created but leave the extent_map in the em_tree for chunks. So to fix
this just remove the extent_map when we error out so we don't run into this
problem. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we error out allocating a dev extent we will have already created the
block group and such which will cause problems since the allocator may have
tried to allocate out of the block group that no longer exists. This will
cause BUG_ON()'s in the bio submission path. This also makes a failure to
allocate a dev extent a non-abort error, we will just clean up the dev
extents we did allocate and exit. Now if we fail to delete the dev extents
we will abort since we can't have half of the dev extents hanging around,
but this will make us much less likely to abort. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect fs_info->fs_state, it will introduce
some problems, such as the value may be covered by the other task
when several tasks modify it. For example:
Task0 - CPU0 Task1 - CPU1
mov %fs_state rax
or $0x1 rax
mov %fs_state rax
or $0x2 rax
mov rax %fs_state
mov rax %fs_state
The expected value is 3, but in fact, it is 2.
Though this problem doesn't happen now (because there is only one
flag currently), the code is error prone, if we add other flags,
the above problem will happen to a certainty.
Now we use bit operation for it to fix the above problem.
In this way, we can make the code more robust and be easy to
add new flags.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There is no lock to protect
fs_info->avail_{data, metadata, system}_alloc_bits,
it may introduce some problem, such as the wrong profile
information, so we add a seqlock to protect them.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We need not use a global lock to protect the delalloc_bytes of the
inode, just use its own lock. In this way, we can reduce the lock
contention and ->delalloc_lock will just protect delalloc inode
list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->delalloc_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant for it to reduce the lock
contention.
This patch also fixed the problem that we access the variant
without the lock protection.At worst, we would not flush the
delalloc inodes, and just return ENOSPC error when we still have
some free space in the fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
->dirty_metadata_bytes is accessed very frequently, so use percpu
counter instead of the u64 variant to reduce the contention of
the lock.
This patch also fixed the problem that we access it without
lock protection in __btrfs_btree_balance_dirty(), which may
cause we skip the dirty pages flush.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
fs_info->alloc_start is a 64bits variant, can be accessed by
multi-task, but it is not protected strictly, it can be changed
while we are accessing it. On 32bit machine, we will get wrong
value because we access it by two instructions.(In fact, it is
also possible that the same problem happens on the 64bit machine,
because the compiler may split the 64bit operation into two 32bit
operation.)
For example:
Assuming -> alloc_start is 0x0000 0000 0001 0000 at the beginning,
then we remount and set ->alloc_start to 0x0000 0100 0000 0000.
Task0 Task1
load high 32 bits
set high 32 bits
set low 32 bits
load low 32 bits
Task1 will get 0.
This patch fixes this problem by using two locks to protect it
fs_info->chunk_mutex
sb->s_umount
On the read side, we just need get one of these two locks, and on
the write side, we must lock all of them.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Though ->max_inline is a 64bit variant, and may be accessed by
multi-task, but it is just suggestive number, so we needn't add
anything to protect fs_info->max_inline, just add a comment to
explain wny we don't use a lock to protect it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The header file will then be installed under /usr/include/linux so that
userspace applications can refer to Btrfs ioctls by name and use the same
structs used internally in the kernel.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
CAP_DAC_READ_SEARCH overrides read and search permission check on
file and directory. It seems fit for BTRFS_IOC_INO_PATHS.
Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This reverts commit 2794ed013b.
Wasn't supposed to get used in btrfs_mknod, it was supposed to be in
btrfs_create, which was done in commit
9185aa587b.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_run_ordered_operations() needn't traverse the ordered operation list
repeatedly, it is because the transaction commiter will invoke it again when
there is no other writer in this transaction, it can ensure that no one can
add new objects into the ordered operation list.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
btrfs_start_delalloc_inodes() needn't traverse and flush the delalloc inodes
repeatedly. It is because we can regard the data that the users write after
we start delalloc inodes flush as the one which is after the delalloc inodes
flush is done, and we can flush it next time.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to check the return value of btrfs_run_ordered_operations() when
flushing all the pending stuffs, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to check the return value of btrfs_start_delalloc_inodes(), fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The current code of raid attr arry is hard to understand and it is easy to
introduce some problem if we modify the array. So I changed it and made it
more readable.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This'd save us a rbtree search which may become expensive in large filesystem.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This does not change the logic of code, but can save us a read_lock.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The API in tree log code has done sort of changes, and it proves that
we can benifit from using token, so do the same thing here.
function_graph tracer's timer shows that it costs nearly half time
of before(39.788us -> 22.391us).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
commit d53ba47484
(Btrfs: use commit root when loading free space cache) has remove
the deadlock check, and the related comments can be removed as well.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we start running low on metadata space we will try to allocate a chunk,
which could then try to allocate a chunk to add the device entry. The thing
is we allocate a chunk before we try really hard to make the allocation, so
we should be able to find space for the device entry. Add a flag to the
trans handle so we know we're currently allocating a chunk so we can just
bail out if we try to allocate another chunk. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we don't actually copy the extent information from the source tree in
the fast case we don't need to wait for ordered io to be completed in order
to fsync, we just need to wait for the io to be completed. So when we're
logging our file just attach all of the ordered extents to the log, and then
when the log syncs just wait for IO_DONE on the ordered extents and then
write the super. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch fixes the following problem:
- improper return value
- unnecessary read-only check
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Use wrapper page_offset to get byte-offset into filesystem object for page.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We may try to flush some dirty pages when there is no enough space to reserve.
But it is possible that this operation fails, in order to get enough space to
reserve successfully, we will sync all the delalloc file. This operation is
safe, we needn't worry about the case that the filesystem goes from r/w to r/o.
because the filesystem should guarantee all the dirty pages have been written
into the disk after it becomes readonly, so the sync operation will do nothing
if the filesystem is already readonly. Though it may waste lots of time,
as a corner case, we needn't care.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Locking and unlocking delayed ref mutex are in the different functions,
and the name of lock functions is not uniform, so the readability is not
so good, this patch optimizes the lock logic and makes it more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We're running into having 50-100 orphans left over with xfstests 83
because of ENOSPC when trying to start the transaction for the inode update.
But in fact, it makes no sense in updating the inode for the new size while
we're deleting the stupid thing. This patch fixes this problem.
Reported-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The delayed item commit code in several functions is similar, so
cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Since we do not want to delay the async transaction commit, we should
use common work, not delayed work.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
We clear the transaction object and the trans handle when they are about to be
freed, it is unnecessary, cleanup it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The delayed reference allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.
And besides that, it can do check for leaked objects when the module is removed.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
btrfs_scan_one_device is calling set_blocksize() which can race
with a concurrent process making dirty page cache pages. It can end up
dropping dirty page cache pages on the floor, which isn't very nice when
someone is just running btrfs dev scan to find filesystems on the
box.
Now that udev is registering btrfs devices as it discovers them, we can
actually end up racing with our own mkfs program too. When this
happens, we drop some of the important blocks written by mkfs.
This commit changes scan_one_device to read the super out of the page
cache instead of trying to use bread. This way we don't have to care
about the blocksize of the device.
This also drops the invalidate_bdev() call. It wasn't very polite to
invalidate during the scan either. mkfs is putting the super into the
page cache, there's no reason to invalidate at this point.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When replaying a log tree with qgroups enabled, tree_mod_log_rewind does a
sanity-check of the number of items against the maximum possible number.
It calculates that number with the nodesize of fs_root. Unfortunately
fs_root is not yet set at this stage. So instead use the nodesize from
tree_root, which is already initialized.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"We've got corner cases for updating i_size that ceph was hitting,
error handling for quotas when we run out of space, a very subtle
snapshot deletion race, a crash while removing devices, and one
deadlock between subvolume creation and the sb_internal code (thanks
lockdep)."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: move d_instantiate outside the transaction during mksubvol
Btrfs: fix EDQUOT handling in btrfs_delalloc_reserve_metadata
Btrfs: fix possible stale data exposure
Btrfs: fix missing i_size update
Btrfs: fix race between snapshot deletion and getting inode
Btrfs: fix missing release of the space/qgroup reservation in start_transaction()
Btrfs: fix wrong sync_writers decrement in btrfs_file_aio_write()
Btrfs: do not merge logged extents if we've removed them from the tree
btrfs: don't try to notify udev about missing devices
Dave Sterba triggered a lockdep complaint about lock ordering
between the sb_internal lock and the cleaner semaphore.
btrfs_lookup_dentry() checks for orphans if we're looking up
the inode for a subvolume, and subvolume creation is triggering
the lookup with a transaction running.
This commit moves the d_instantiate after the transaction closes.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When btrfs_qgroup_reserve returned a failure, we were missing a counter
operation for BTRFS_I(inode)->outstanding_extents++, leading to warning
messages about outstanding extents and space_info->bytes_may_use != 0.
Additionally, the error handling code didn't take into account that we
dropped the inode lock which might require more cleanup.
Luckily, all the cleanup code we need is already there and can be shared
with reserve_metadata_bytes, which is exactly what this patch does.
Reported-by: Lev Vainblat <lev@zadarastorage.com>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We specifically do not update the disk i_size if there are ordered extents
outstanding for any area between the current disk_i_size and our ordered
extent so that we do not expose stale data. The problem is the check we
have only checks if the ordered extent starts at or after the current
disk_i_size, which doesn't take into account an ordered extent that starts
before the current disk_i_size and ends past the disk_i_size. Fix this by
checking if the extent ends past the disk_i_size. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we have an ordered extent before the ordered extent we are currently
completing that is after the current disk_i_size we will put our i_size
update into that ordered extent so that we do not expose stale data. The
problem is that if our disk i_size is updated past the previous ordered
extent we won't update the i_size with the pending i_size update. So check
the pending i_size update and if its above the current disk i_size we need
to go ahead and try to update. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
While running snapshot testscript created by Mitch and David,
the race between autodefrag and snapshot deletion can lead to
corruption of dead_root list so that we can get crash on
btrfs_clean_old_snapshots().
And besides autodefrag, scrub also does the same thing, ie. read
root first and get inode.
Here is the story(take autodefrag as an example):
(1) when we delete a snapshot or subvolume, it will set its root's
refs to zero and do a iput() on its own inode, and if this inode happens
to be the only active in-meory one in root's inode rbtree, it will add
itself to the global dead_roots list for later cleanup.
(2) after (1), the autodefrag thread may read another inode for defrag
and the inode is just in the deleted snapshot/subvolume, but all of these
are without checking if the root is still valid(refs > 0). So the end up
result is adding the deleted snapshot/subvolume's root to the global
dead_roots list AGAIN.
Fortunately, we already have a srcu lock to avoid the race, ie. subvol_srcu.
So all we need to do is to take the lock to protect 'read root and get inode',
since we synchronize to wait for the rcu grace period before adding something
to the global dead_roots list.
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When we fail to start a transaction, we need to release the reserved free space
and qgroup space, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If the checks at the beginning of btrfs_file_aio_write() fail, we needn't
decrease ->sync_writers, because we have not increased it. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
You can run into this problem where if somebody is fsyncing and writing out
the existing extents you will have removed the extent map from the em tree,
but it's still valid for the current fsync so we go ahead and write it. The
problem is we unconditionally try to merge it back into the em tree, but if
we've removed it from the em tree that will cause use after free problems.
Fix this to only merge if we are still a part of the tree. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The device removal code was incorrectly checking against two different limits for
raid5 and raid6.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We batch up operations to the extent allocation tree, which allows
us to deal with the recursive nature of using the extent allocation
tree to allocate extents to the extent allocation tree.
It also provides a mechanism to sort and collect extent
operations, which makes it much more efficient to record extents
that are close together.
The delayed extent operations must all be finished before the
running transaction commits, so we have code to make sure and run a few
of the batched operations when closing our transaction handles.
This creates a great deal of contention for the locks in the
delayed extent operation tree, and also contention for the lock on the
extent allocation tree itself. All the extra contention just slows
down the operations and doesn't get things done any faster.
This commit changes things to use a wait queue instead. As procs
want to run the delayed operations, one of them races in and gets
permission to hit the tree, and the others step back and wait for
progress to be made.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The extent buffers have a refs_lock which we use to make coordinate freeing
the extent buffer with operations on the radix tree. On tree roots and
other extent buffers that very cache hot, this can be highly contended.
These are also the extent buffers that are basically pinned in memory.
This commit adds code to cmpxchg our way through the ref modifications,
and as long as the result of the reference change is still pinned in
ram, we skip the expensive spinlock.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the new raid56 code, we want to make sure we're
properly aligning our allocation clusters with -o ssd
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Buffered writes and DIRECT_IO writes will often break up
big contiguous changes to the file into sub-stripe writes.
This adds a plugging callback to gather those smaller writes full stripe
writes.
Example on flash:
fio job to do 64K writes in batches of 3 (which makes a full stripe):
With plugging: 450MB/s
Without plugging: 220MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The stripe cache allows us to avoid extra read/modify/write cycles
by caching the pages we read off the disk. Pages are cached when:
* They are read in during a read/modify/write cycle
* They are written during a read/modify/write cycle
* They are involved in a parity rebuild
Pages are not cached if we're doing a full stripe write. We're
assuming that a full stripe write won't be followed by another
partial stripe write any time soon.
This provides a substantial boost in performance for workloads that
synchronously modify adjacent offsets in the file, and for the parity
rebuild use case in general.
The size of the stripe cache isn't tunable (yet) and is set at 1024
entries.
Example on flash: dd if=/dev/zero of=/mnt/xxx bs=4K oflag=direct
Without the stripe cache -- 2.1MB/s
With the stripe cache 21MB/s
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This builds on David Woodhouse's original Btrfs raid5/6 implementation.
The code has changed quite a bit, blame Chris Mason for any bugs.
Read/modify/write is done after the higher levels of the filesystem have
prepared a given bio. This means the higher layers are not responsible
for building full stripes, and they don't need to query for the topology
of the extents that may get allocated during delayed allocation runs.
It also means different files can easily share the same stripe.
But, it does expose us to incorrect parity if we crash or lose power
while doing a read/modify/write cycle. This will be addressed in a
later commit.
Scrub is unable to repair crc errors on raid5/6 chunks.
Discard does not work on raid5/6 (yet)
The stripe size is fixed at 64KiB per disk. This will be tunable
in a later commit.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We'll want to merge writes so they can fill a full RAID[56] stripe, but
not necessarily reads.
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we remove a missing device, bdev is null, and if we
send that off to btrfs_kobject_uevent we'll panic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs fixes from Chris Mason:
"It turns out that we had two crc bugs when running fsx-linux in a
loop. Many thanks to Josef, Miao Xie, and Dave Sterba for nailing it
all down. Miao also has a new OOM fix in this v2 pull as well.
Ilya fixed a regression Liu Bo found in the balance ioctls for pausing
and resuming a running balance across drives.
Josef's orphan truncate patch fixes an obscure corruption we'd see
during xfstests.
Arne's patches address problems with subvolume quotas. If the user
destroys quota groups incorrectly the FS will refuse to mount.
The rest are smaller fixes and plugs for memory leaks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (30 commits)
Btrfs: fix repeated delalloc work allocation
Btrfs: fix wrong max device number for single profile
Btrfs: fix missed transaction->aborted check
Btrfs: Add ACCESS_ONCE() to transaction->abort accesses
Btrfs: put csums on the right ordered extent
Btrfs: use right range to find checksum for compressed extents
Btrfs: fix panic when recovering tree log
Btrfs: do not allow logged extents to be merged or removed
Btrfs: fix a regression in balance usage filter
Btrfs: prevent qgroup destroy when there are still relations
Btrfs: ignore orphan qgroup relations
Btrfs: reorder locks and sanity checks in btrfs_ioctl_defrag
Btrfs: fix unlock order in btrfs_ioctl_rm_dev
Btrfs: fix unlock order in btrfs_ioctl_resize
Btrfs: fix "mutually exclusive op is running" error code
Btrfs: bring back balance pause/resume logic
btrfs: update timestamps on truncate()
btrfs: fix btrfs_cont_expand() freeing IS_ERR em
Btrfs: fix a bug when llseek for delalloc bytes behind prealloc extents
Btrfs: fix off-by-one in lseek
...
btrfs_start_delalloc_inodes() locks the delalloc_inodes list, fetches the
first inode, unlocks the list, triggers btrfs_alloc_delalloc_work/
btrfs_queue_worker for this inode, and then it locks the list, checks the
head of the list again. But because we don't delete the first inode that it
deals with before, it will fetch the same inode. As a result, this function
allocates a huge amount of btrfs_delalloc_work structures, and OOM happens.
Fix this problem by splice this delalloc list.
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The max device number of single profile is 1, not 0 (0 means 'as many as
possible'). Fix it.
Cc: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
First, though the current transaction->aborted check can stop the commit early
and avoid unnecessary operations, it is too early, and some transaction handles
don't end, those handles may set transaction->aborted after the check.
Second, when we commit the transaction, we will wake up some worker threads to
flush the space cache and inode cache. Those threads also allocate some transaction
handles and may set transaction->aborted if some serious error happens.
So we need more check for ->aborted when committing the transaction. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We may access and update transaction->aborted on the different CPUs without
lock, so we need ACCESS_ONCE() wrapper to prevent the compiler from creating
unsolicited accesses and make sure we can get the right value.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
I noticed a WARN_ON going off when adding csums because we were going over
the amount of csum bytes that should have been allowed for an ordered
extent. This is a leftover from when we used to hold the csums privately
for direct io, but now we use the normal ordered sum stuff so we need to
make sure and check if we've moved on to another extent so that the csums
are added to the right extent. Without this we could end up with csums for
bytenrs that don't have extents to cover them yet. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For compressed extents, the range of checksum is covered by disk length,
and the disk length is different with ram length, so we need to use disk
length instead to get us the right checksum.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
A user reported a BUG_ON(ret) that occured during tree log replay. Ret was
-EAGAIN, so what I think happened is that we removed an extent that covered
a bitmap entry and an extent entry. We remove the part from the bitmap and
return -EAGAIN and then search for the next piece we want to remove, which
happens to be an entire extent entry, so we just free the sucker and return.
The problem is ret is still set to -EAGAIN so we trip the BUG_ON(). The
user used btrfs-zero-log so I'm not 100% sure this is what happened so I've
added a WARN_ON() to catch the other possibility. Thanks,
Reported-by: Jan Steffens <jan.steffens@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We drop the extent map tree lock while we're logging extents, so somebody
could come in and merge another extent into this one and screw up our
logging, or they could even remove us from the list which would keep us from
logging the extent or freeing our ref on it, so we need to make sure to not
clear LOGGING until after the extent is logged, and then we can merge it to
adjacent extents. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Commit 3fed40cc ("Btrfs: cleanup duplicated division functions"), which
was merged into 3.8-rc1, has introduced a regression by removing logic
that was guarding us against bad user input. Bring it back.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently you can just destroy a qgroup even though it is in use by other qgroups
or has qgroups assigned to it. This patch prevents destruction of qgroups unless
they are completely unused. Otherwise destroy will return EBUSY.
Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If a qgroup that has still assignments is deleted by the user, the corresponding
relations are left in the tree. This leads to an unmountable filesystem.
With this patch, those relations are simple ignored.
Reported-by: Eric Hopper <hopper@omnifarious.org>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The CONFIG_EXPERIMENTAL config item has not carried much meaning for a
while now and is almost always enabled by default. As agreed during the
Linux kernel summit, remove it from any "depends on" lines in Kconfigs.
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Operation-specific check (whether subvol is readonly or not) should go
after the mutual exclusiveness check.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
The error code that is returned in response to starting a mutually
exclusive operation when there is one already running got silently
changed from EINVAL to EINPROGRESS by 5ac00add. Returning EINPROGRESS
to, say, add_dev, when rm_dev is running is misleading. Furthermore,
the operation itself may want to use EINPROGRESS for other purposes.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Balance pause/resume logic got broken by 5ac00add (went in into 3.8-rc1
as part of dev-replace merge). Offending commit took a stab at making
mutually exclusive volume operations (add_dev, rm_dev, resize, balance,
replace_dev) not block behind volume_mutex if another such operation is
in progress and instead return an error right away. Balancing front-end
relied on the blocking behaviour, so the fix is ugly, but short of a
complete rework, it's the best we can do.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
truncate() vs. ftruncate() differ in the VFS; truncate()
doesn't set (ATTR_CTIME | ATTR_MTIME), and it's up to the
fs to do the timestamp updates if the size changes.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
btrfs_cont_expand() tries to free an IS_ERR em as it gets an error from
btrfs_get_extent() and breaks out of its loop.
An instance of -EEXIST was reported in the wild:
https://bugzilla.redhat.com/show_bug.cgi?id=874407
I have no idea if that -EEXIST is surprising, or not. Regardless, this
error handling should be cleaned up to handle other reasonable errors
(ENOMEM, EIO; whatever).
This seemed to be the only buggy freeing of the relatively rare IS_ERR
em so I opted to fix the caller rather than teach free_extent_map() to
use IS_ERR_OR_NULL().
Signed-off-by: Zach Brown <zab@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
xfstests case 285 complains.
It it because btrfs did not try to find unwritten delalloc
bytes(only dirty pages, not yet writeback) behind prealloc
extents, it ends up finding nothing while we're with SEEK_DATA.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forgot to reset the path lock state to zero after we unlock the path block,
and this can lead to the ASSERT checker in tree unlock API.
Reported-by: Slava Barinov <rayslava@gmail.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This'd avoid us empty looping.
Say we have only one disk and the metadata raid type will be defaultly DUP,
and we do not need to start from index=0(RAID10) and get over two empty
loops to index=2(DUP).
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Running xfstests 83 in a loop would sometimes fail the fsck. This happens
because if we invalidate a page that already has an ordered extent setup for
it we will complete the ordered extent ourselves, assuming that the truncate
will clean everything up. The problem with this is there is plenty of time
for the truncate to fail after we've done this work. So to fix this we need
to add the orphan item first to make sure the cleanup gets done properly,
and then we can truncate the pagecache and all that stuff and be safe. This
fixes the btrfsck failures I was seeing while running 83 in a loop. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We still need to say we're flushing if we're limit flushing to keep somebody
from coming in and stealing our reservation. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We forget to give up the write access after we find some device operation
is going on. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Step to reproduce:
# mkfs.btrfs <disk>
# mount <disk> <mnt>
# btrfs sub create <mnt>/subv0
# btrfs sub snap <mnt> <mnt>/subv0/snap0
# change <mnt>/subv0 from R/W to R/O
# btrfs sub del <mnt>/subv0/snap0
We deleted the snapshot successfully. I think we should not be able to delete
the snapshot since the parent subvolume is R/O.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
When we're deleting the device we should get it in write mode since
we're going to re-write the super block magic on that device. And it
should fail if the device is read-only.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
writeback_inodes_sb(_nr)_if_idle() is re-implemented by replacing down_read()
with down_read_trylock() because
- If ->s_umount is write locked, then the sb is not idle. That is
writeback_inodes_sb(_nr)_if_idle() needn't wait for the lock.
- writeback_inodes_sb(_nr)_if_idle() grabs s_umount lock when it want to start
writeback, it may bring us deadlock problem when doing umount. In order to
fix the problem, ext4 and btrfs implemented their own writeback functions
instead of writeback_inodes_sb(_nr)_if_idle(), but it introduced the redundant
code, it is better to implement a new writeback_inodes_sb(_nr)_if_idle().
The name of these two functions is cumbersome, so rename them to
try_to_writeback_inodes_sb(_nr).
This idea came from Christoph Hellwig.
Some code is from the patch of Kamal Mostafa.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
In the big loop, cur_trans will be set fs_info->running_transaction
before it's used. And after kmem_cache_free it and goto loop, it will
be setup again. No need to setup it immediately after freed.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull VFS update from Al Viro:
"fscache fixes, ESTALE patchset, vmtruncate removal series, assorted
misc stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (79 commits)
vfs: make lremovexattr retry once on ESTALE error
vfs: make removexattr retry once on ESTALE
vfs: make llistxattr retry once on ESTALE error
vfs: make listxattr retry once on ESTALE error
vfs: make lgetxattr retry once on ESTALE
vfs: make getxattr retry once on an ESTALE error
vfs: allow lsetxattr() to retry once on ESTALE errors
vfs: allow setxattr to retry once on ESTALE errors
vfs: allow utimensat() calls to retry once on an ESTALE error
vfs: fix user_statfs to retry once on ESTALE errors
vfs: make fchownat retry once on ESTALE errors
vfs: make fchmodat retry once on ESTALE errors
vfs: have chroot retry once on ESTALE error
vfs: have chdir retry lookup and call once on ESTALE error
vfs: have faccessat retry once on an ESTALE error
vfs: have do_sys_truncate retry once on an ESTALE error
vfs: fix renameat to retry on ESTALE errors
vfs: make do_unlinkat retry once on ESTALE errors
vfs: make do_rmdir retry once on ESTALE errors
vfs: add a flags argument to user_path_parent
...
Pull two btrfs reverts from Chris Mason:
"I had missed that for two of the patches in my last pull, we had
included different fixes during 3.7."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Revert "Btrfs: reorder tree mod log operations in deleting a pointer"
Revert "Btrfs: MOD_LOG_KEY_REMOVE_WHILE_MOVING never change node's nritems"
The code that relied on that flag was ripped out of btrfs quite some
time ago, and never added back. Josef indicated that he was going to
take a different approach to the problem in btrfs, and that we
could just eliminate this flag.
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This reverts commit 6a7a665d78.
This was bug was fixed differently in 3.6, so this commit
isn't needed.
Conflicts:
fs/btrfs/ctree.c
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This reverts commit 95c80bb1f6.
The bug addressed by this commit was fixed differently back in 3.6
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull btrfs update from Chris Mason:
"A big set of fixes and features.
In terms of line count, most of the code comes from Stefan, who added
the ability to replace a single drive in place. This is different
from how btrfs normally replaces drives, and is much much much faster.
Josef is plowing through our synchronous write performance. This pull
request does not include the DIO_OWN_WAITING patch that was discussed
on the list, but it has a number of other improvements to cut down our
latencies and CPU time during fsync/O_DIRECT writes.
Miao Xie has a big series of fixes and is spreading out ordered
operations over more CPUs. This improves performance and reduces
contention.
I've put in fixes for error handling around hash collisions. These
are going back to individual stable kernels as I test against them.
Otherwise we have a lot of fixes and cleanups, thanks everyone!
raid5/6 is being rebased against the device replacement code. I'll
have it posted this Friday along with a nice series of benchmarks."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (115 commits)
Btrfs: fix a bug of per-file nocow
Btrfs: fix hash overflow handling
Btrfs: don't take inode delalloc mutex if we're a free space inode
Btrfs: fix autodefrag and umount lockup
Btrfs: fix permissions of empty files not affected by umask
Btrfs: put raid properties into global table
Btrfs: fix BUG() in scrub when first superblock reading gives EIO
Btrfs: do not call file_update_time in aio_write
Btrfs: only unlock and relock if we have to
Btrfs: use tokens where we can in the tree log
Btrfs: optimize leaf_space_used
Btrfs: don't memset new tokens
Btrfs: only clear dirty on the buffer if it is marked as dirty
Btrfs: move checks in set_page_dirty under DEBUG
Btrfs: log changed inodes based on the extent map tree
Btrfs: add path->really_keep_locks
Btrfs: do not mark ems as prealloc if we are writing to them
Btrfs: keep track of the extents original block length
Btrfs: inline csums if we're fsyncing
Btrfs: don't bother copying if we're only logging the inode
...
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Users report a bug, the reproducer is:
$ mkfs.btrfs /dev/loop0
$ mount /dev/loop0 /mnt/btrfs/
$ mkdir /mnt/btrfs/dir
$ chattr +C /mnt/btrfs/dir/
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=10;
$ lsattr /mnt/btrfs/dir/foo
---------------C- /mnt/btrfs/dir/foo
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 1 extent found ---> an extent
$ dd if=/dev/zero of=/mnt/btrfs/dir/foo bs=4K count=1 seek=5 conv=notrunc,nocreat; sync
$ filefrag /mnt/btrfs/dir/foo
/mnt/btrfs/dir/foo: 3 extents found ---> with nocow, btrfs breaks the extent into three parts
The new created file should not only inherit the NODATACOW flag, but also
honor NODATASUM flag, because we must do COW on a file extent with checksum.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The handling for directory crc hash overflows was fairly obscure,
split_leaf returns EOVERFLOW when we try to extend the item and that is
supposed to bubble up to userland. For a while it did so, but along the
way we added better handling of errors and forced the FS readonly if we
hit IO errors during the directory insertion.
Along the way, we started testing only for EEXIST and the EOVERFLOW case
was dropped. The end result is that we may force the FS readonly if we
catch a directory hash bucket overflow.
This fixes a few problem spots. First I add tests for EOVERFLOW in the
places where we can safely just return the error up the chain.
btrfs_rename is harder though, because it tries to insert the new
directory item only after it has already unlinked anything the rename
was going to overwrite. Rather than adding very complex logic, I added
a helper to test for the hash overflow case early while it is still safe
to bail out.
Snapshot and subvolume creation had a similar problem, so they are using
the new helper now too.
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Reported-by: Pascal Junod <pascal@junod.info>
This confuses and angers lockdep even though it's ok. We don't really need
the lock for free space inodes since only the transaction committer will be
reserving space. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This happens because writeback_inodes_sb_nr_if_idle does down_read. This
doesn't work for us and it has not been fixed upstream yet, so do it
ourselves and use that instead so we can stop having this stupid long
standing lockup. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)
This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Raid properties can be shared among raid calculation code, we can put
them into a global table to keep it simple.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This fixes a very special case that can be reproduced by just
disconnecting a disk at runtime, and without unmounting the
filesystem first, start scrub on the filesystem with the
disconnected disk. All read and write EIOs are handled
correctly, only the first superblock is an exception and gives
a BUG() in a subfunction. The BUG() is correct, it would crash
later otherwise. The subfunction must not be called for
superblocks and this is what the fix changes.
Reported-by: Joeri Vanthienen <mail@joerivanthienen.be>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This starts a transaction and dirties the inode everytime we call it, which
is super expensive if you have a write heavy workload. We will be updating
the inode when the IO completes and we reserve the space for the inode
update when we reserve space for the write, so there is no chance of loss of
information or enospc issues. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed while doing fsync tests that we were always dropping the path and
re-searching when we first cow the log root even though we've already gotten
the write lock on the root. That's because we don't take into account that
there might not be a parent node, so fix the check to make sure there is
actually a parent node before we undo all of this work for nothing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we are syncing over and over the overhead of doing all those maps in
fill_inode_item and log_changed_extents really starts to hurt, so use map
tokens so we can avoid all the extra mapping. Since the token maps from our
offset to the end of the page make sure to set the first thing in the item
first so we really only do one map. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This gets called at least 4 times for every level while adding an object,
and it involves 3 kmapping calls, which on my box take about 5us a piece.
So instead use a token, which brings us down to 1 kmap call and makes this
function take 1/3 of the time per call. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Our token logic depends on token->kaddr being set, and if it is not it sets
everything properly as needed. So instead of memsetting just set
token->kaddr to NULL. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
No reason to set the path blocking or loop through all of the pages if the
extent buffer isn't actually marked dirty. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is a high traffic function, let's try and do as little as possible
during normal operations shall we?
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We don't really need to copy extents from the source tree since we have all
of the information already available to us in the extent_map tree. So
instead just write the extents straight to the log tree and don't bother to
copy the extent items from the source tree.
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
You'd think path->keep_locks would keep all the locks wouldn't you? You'd
be wrong. It only keeps them if the slot is pointing to the last item in
the node. This is for use with btrfs_next_leaf, which needs this sort of
thing. But the horrible horrible things I'm going to do to the tree log
means I really need everything held from root to leaf so I can add and
delete items in the same search. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We are going to use EM's to log extents in the future, so we need to not
mark them as prealloc if they aren't actually prealloc extents. Instead
mark them with FILLING so we know to ammend mod_start/mod_len and that way
we don't confuse the extent logging code. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we've written to a prealloc extent we need to know the original block len
for the extent. We can't figure this out currently since ->block_len is
just set to the extent length. So introduce ->orig_block_len so that we
know how many bytes were in the original extent for proper extent logging
that future patches will need. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The tree logging stuff needs the csums to be on the ordered extents in order
to log them properly, so mark that we're sync and inline the csum creation
so we don't have to wait on the csumming to be done when logging extents
that are still in flight. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We don't copy inode items anwyay, we just copy them straight into the log
from the in memory inode. So if we know we're only logging the inode, don't
bother dropping anything, just try to insert it and either if it succeeds or
we get EEXIST we can update the inode item in the log and carry on. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently we copy all the file information into the log, inode item, the
refs, xattrs etc. Except most of this doesn't change from fsync to fsync,
just the inode item changes. So set a flag if an xattr changes or a link is
added, and otherwise only log the inode item. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Originally root_times_lock was introduced as part of send/receive
code however newly developed patch to label the subvol reused
the same lock, so renaming it for a meaningful name.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Currently udev does not know about the device being removed from the
file system. This may result in the situation where we're unable to
mount the file system by UUID or by LABEL because the by-uuid and
by-label links may still point to the device which is no longer part of
the btrfs file system and hence does not have any btrfs super block.
It can be easily reproduced by the following:
mkfs.btrfs -L bugfs /dev/loop[0-6]
mount /dev/loop0 /mnt/test
btrfs device delete /dev/loop0 /mnt/test
umount /mnt/test
mount LABEL=bugfs /mnt/test <---- this fails
then see:
ls -l /dev/disk/by-label/bugfs
which will still point to the /dev/loop0
We did not noticed this before because libblkid would send the udev
event for us when it notice that the link does not fit the reality,
however it does not do that anymore and completely relies on udev
information.
Fix this by sending the KOBJ_CHANGE event to the bdev kobject after
successful device removal.
Note that this does not affect device addition, because we will open the
device prior the addition from userspace and udev will notice that and
reread the device afterwards.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
ret variant may be set to 0 if we read page successfully, but it might be
released before we lock it again. On this case, if we fail to allocate a
new page, we will return 0, it is wrong, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we can pre-allocate the space past EOF, we should be able to reclaim
that space if we need. This patch implements it by removing the EOF check.
Though the manual of fallocate command says we can use truncate command to
reclaim the pre-allocated space which past EOF, but because truncate command
changes the file size, we must run several commands to reclaim the space if we
don't want to change the file size, so it is not a good choice.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Steps to reproduce:
# mkfs.btrfs <disk>
# mount <disk> <mnt>
# dd if=/dev/zero of=<mnt>/<file> bs=512 seek=5 count=8
# fallocate -p -o 2048 -l 16384 <mnt>/<file>
# dd if=/dev/zero of=<mnt>/<file> bs=4096 seek=3 count=8 conv=notrunc,nocreat
# umount <mnt>
# dmesg
WARNING: at fs/btrfs/inode.c:7140 btrfs_destroy_inode+0x2eb/0x330
The reason is that we inputed a range which is beyond the end of the file. And
because the end of this range was not page-aligned, we had to truncate the last
page in this range, this operation is similar to a buffered file write. In other
words, we reserved enough space and clear the data which was in the hole range
on that page. But when we expanded that test file, write the data into the same
page, we forgot that we have reserved enough space for the buffered write of
that page because in most cases there is no page that is beyond the end of
the file. As a result, we reserved the space twice.
In fact, we needn't truncate the page if it is beyond the end of the file, just
release the allocated space in that range. Fix the above problem by this way.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
(start + len) is the start of the adjacent extent, not the end of the current
extent, so we should not use it to check the hole is on the same page or not.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to release the reserved space in the error path of delalloc
reservatiom, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we runt the direct IO, we should not run auto defrag, because it may
introduce buffered IO vs direcIO problem, and make direct IO slow down.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We should use ctl->unit for free space calculation instead of block_group->sectorsize
even though for free space use_bitmap or free space cluster we only have sectorsize assigned to ctl->unit currently. Also, we can keep it consisten in code style.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Refactor it by checking whether the inode has been created and needs to be
dropped (drop_inode_on_err) and also if the err variable is set. That way the
variable doesn't need to be set on each and every error handling block.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When a new file is created with btrfs_create(), the inode will initially be
created with permissions 0666 and later on in btrfs_init_acl() it will be
adapted to mask out the umask bits. The problem is that this change won't make
it into the btrfs_inode unless there's another change to the inode (e.g. writing
content changing the size or touching the file changing the mtime.)
This fix adds a call to btrfs_update_inode() to btrfs_create() to make sure that
the change will not get lost if the in-memory inode is flushed before other
changes are made to the file.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When the flag not supported is specified, it is necessary to return the error
to the caller.
So, we add the validity check of the fiemap's flag.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Passing a null extended attribute value means to remove the attribute,
but we don't have to add a new NULL extended attribute.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the acl can be exactly represented in the traditional file
mode permission bits, we don't set another acl attribute.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
alloc_end is not the real end of the current extent, it is the start of the
next adjoining extent. So we needn't +1 when calculating the size the space
that is about to be reserved.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The kernel developers have implemented some often-used align macros, we should
use them instead of the complex code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This regression was introduced by the device-replace patches.
Scrub immediately stops checking those disks that have write errors.
This is nothing that happens in the real world, but it is wrong
since scrub is the tool to detect and repair defects. Fix it.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This issue was detected by the "0-DAY kernel build testing".
fs/btrfs/volumes.c: In function 'btrfs_rm_device':
fs/btrfs/volumes.c:1505:1: warning: label 'error_close' defined but not used [-Wunused-label]
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The structure member mirror_num is modified concurrently to the
structure member is_iodone. This doesn't require any locking by
design, unless everything is stored in the same 32 bits of a
bit field. This was the case and xfstest 284 was able to
trigger false warnings from the checker code. This patch
seperates the bits and fixes the race.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we freeze the fs, the auto defragment should not run. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch restructure btrfs_run_defrag_inodes() and make the code of the auto
defragment more readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to get the defrag lock when we re-add the defragable inode,
Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The auto defrag allocation is in the fast path of the IO, so use slabs
to improve the speed of the allocation.
And besides that, it can do check for leaked objects when the module is removed.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We need get write access for qgroup operations, or we will modify the R/O fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We need get write access for scrub, or we will modify the R/O fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Steps to reproduce:
# mkfs.btrfs -d single -m single <disk0> <disk1>
# mount -o ro <disk0> <mnt0>
# mount -o ro <disk0> <mnt1>
# mount -o remount,rw <mnt0>
# umount <mnt0>
# btrfs device delete <disk1> <mnt1>
We can remove a device from a R/O filesystem. The reason is that we just check
the R/O flag of the super block object. It is not enough, because the kernel
may set the R/O flag only for the mount point. We need invoke
mnt_want_write_file()
to do a full check.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Steps to reproduce:
# mkfs.btrfs <partition>
# mount -o ro <partition> <mnt0>
# mount -o ro <partition> <mnt1>
# mount -o remount,rw <mnt0>
# umount <mnt0>
# btrfs fi resize 10g <mnt1>
We re-sized a R/O filesystem. The reason is that we just check the R/O flag
of the super block object. It is not enough, because the kernel may set the
R/O flag only for the mount point. We need invoke mnt_want_write_file() to
do a full check.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When wen want to set the default subvolume, we must get write access, or
we will change the R/O file system.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If the id of the existed transaction is more than the one we specified, it
means the specified transaction was commited, so we should return 0, not
EINVAL.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If there is no running transaction in the fs, we needn't start a new one when
we want to start sync.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we have gotten the root in the caller, just pass it into
btrfs_ioctl_{start, wait}_sync() directly.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we found an invalid xattr dir item, we'd better try the next one instead.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
io_ctl_map_page is called by many functions in free-space-cache.
In most scenarios, the ->cur is not null, e.g. io_ctl_add_entry.
I think we'd better remove the warn_on here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is the commit that allows to start the device replace
procedure.
An ioctl() interface is added that supports starting and
canceling the device replace procedure, and to retrieve
the status and progress.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Pull trivial branch from Jiri Kosina:
"Usual stuff -- comment/printk typo fixes, documentation updates, dead
code elimination."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
HOWTO: fix double words typo
x86 mtrr: fix comment typo in mtrr_bp_init
propagate name change to comments in kernel source
doc: Update the name of profiling based on sysfs
treewide: Fix typos in various drivers
treewide: Fix typos in various Kconfig
wireless: mwifiex: Fix typo in wireless/mwifiex driver
messages: i2o: Fix typo in messages/i2o
scripts/kernel-doc: check that non-void fcts describe their return value
Kernel-doc: Convention: Use a "Return" section to describe return values
radeon: Fix typo and copy/paste error in comments
doc: Remove unnecessary declarations from Documentation/accounting/getdelays.c
various: Fix spelling of "asynchronous" in comments.
Fix misspellings of "whether" in comments.
eisa: Fix spelling of "asynchronous".
various: Fix spelling of "registered" in comments.
doc: fix quite a few typos within Documentation
target: iscsi: fix comment typos in target/iscsi drivers
treewide: fix typo of "suport" in various comments and Kconfig
treewide: fix typo of "suppport" in various comments
...
Make the target disk of a running device replace operation
available for reading. This is only used as a last ressort for
the defect repair procedure. And it is dependent on the location
of the data block to read, because during an ongoing device
replace operation, the target drive is only partially filled
with the filesystem data.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This change of the define is effective in all modes, it
is required and used only in the case when a device replace
procedure is running. The reason is that during an active
device replace procedure, the target device of the copy
operation is a mirror for the filesystem data as well that
can be used to read data in order to repair read errors on
other disks.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
It is desirable to be able to configure the device replace
procedure to avoid reading the source drive (the one to be
copied) whenever possible. This is useful when the number of
read errors on this disk is high, because it would delay the
copy procedure alot. Therefore there is an option to avoid
reading from the source disk unless the repair procedure
really needs to access it. The regular read req asks for
mapping the block with mirror_num == 0, in this case the
source disk is avoided whenever possible. The repair code
selects the mirror_num explicitly (mirror_num != 0), this
case is not changed by this commit.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
During a running dev replace operation, all write requests to
the live filesystem are duplicated to also write to the target
drive. Therefore btrfs_map_block() is changed to duplicate
stripes that are written to the source disk of a device replace
procedure to be written to the target disk as well.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Before this commit, btrfs_map_block() was called with REQ_WRITE
in order to retrieve the list of mirrors for a disk block.
This needs to be changed for the device replace procedure since
it makes a difference whether you are asking for read mirrors
or for locations to write to.
GET_READ_MIRRORS is introduced as a new interface to call
btrfs_map_block().
In the current commit, the functionality is not yet changed,
only the interface for GET_READ_MIRRORS is introduced and all
the places that should use this new interface are adapted.
The reason that REQ_WRITE cannot be abused anymore to retrieve
a list of read mirrors is that during a running dev replace
operation all write requests to the live filesystem are
duplicated to also write to the target drive.
Keep in mind that the target disk is only partially a valid
copy of the source disk while the operation is ongoing. All
writes go to the target disk, but not all reads would return
valid data on the target disk. Therefore it is not possible
anymore to abuse a REQ_WRITE interface to find valid mirrors
for a REQ_READ.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This commit contains all the essential changes to the core code
of Btrfs for support of the device replace procedure.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This adds a new file to the sources together with the header file
and the changes to ioctl.h and ctree.h that are required by the
new C source file. Additionally, 4 new functions are added to
volume.c that deal with device creation and destruction.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit adds code to scrub to allow the scrub code to copy read
data to another disk.
One goal is to be able to perform as fast as possible. Therefore the
write requests are collected until huge bios are built, and the
write process is decoupled from the read process with some kind of
flow control, of course, in order to limit the allocated memory.
The best performance on spinning disks could by reached when the
head movements are avoided as much as possible. Therefore a single
worker is used to interface the read process with the write process.
The regular scrub operation works as fast as before, it is not
negatively influenced and actually it is more or less unchanged.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the addition of the device replace procedure, it is possible
for btrfs_map_bio(READ) to report an error. This happens when the
specific mirror is requested which is located on the target disk,
and the copy operation has not yet copied this block. Hence the
block cannot be read and this error state is indicated by
returning EIO.
Some background information follows now. A new mirror is added
while the device replace procedure is running.
btrfs_get_num_copies() returns one more, and
btrfs_map_bio(GET_READ_MIRROR) adds one more mirror if a disk
location is involved that was already handled by the device
replace copy operation. The assigned mirror num is the highest
mirror number, e.g. the value 3 in case of RAID1.
If btrfs_map_bio() is invoked with mirror_num == 0 (i.e., select
any mirror), the copy on the target drive is never selected
because that disk shall be able to perform the write requests as
quickly as possible. The parallel execution of read requests would
only slow down the disk copy procedure. Second case is that
btrfs_map_bio() is called with mirror_num > 0. This is done from
the repair code only. In this case, the highest mirror num is
assigned to the target disk, since it is used last. And when this
mirror is not available because the copy procedure has not yet
handled this area, an error is returned. Everywhere in the code
the handling of such errors is added now.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch adds some code to disallow operations on the device that
is used as the target for the device replace operation.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Btrfs admin operations that are manually started from user mode
and that cannot be executed at the same time return -EINPROGRESS.
A common way to enter and leave this locked section is introduced
since it used to be specific to the balance operation.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove the attempt to cancel a running scrub or device replace
operation in btrfs_handle_error() because it adds the risk of
a deadlock. The only penalty of not canceling the operation is
that some I/O remains active until the procedure completes.
This is basically the same thing that happens to other tasks
that are running in user mode context, they are not affected
or stopped in btrfs_handle_error(), these tasks just need to
handle write errors correctly.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A small number of functions that are used in a device replace
procedure when the operation is resumed at mount time are unable
to pass the same root pointer that would be used in the regular
(ioctl) context. And since the root pointer is not required, only
the fs_info is, the root pointer argument is replaced with the
fs_info pointer argument.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This new function is used by the device replace procedure in
a later patch.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is required for the device replace procedure in a later step.
Two calling functions also had to be changed to have the fs_info
pointer: repair_io_failure() and scrub_setup_recheck_block().
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This is required for the device replace procedure in a later step.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The new function btrfs_find_device_missing_or_by_path() will be
used for the device replace procedure. This function itself calls
the second new function btrfs_find_device_by_path().
Unfortunately, it is not possible to currently make the rest of the
code use these functions as well, since all functions that look
similar at first view are all a little bit different in what they
are doing. But in the future, new code could benefit from these
two new functions, and currently, device replace uses them.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Some code to open block devices, to read the superblock and to
handle errors was repeated multiple times in 3 places, and the
following patch makes use of it as well. This code is now moved
into a subfunction.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Just move some code into functions to make everything more readable.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In the scrub repair code, the code is changed to handle memory
allocation errors a little bit smarter. The change is to handle
it just like a read error. This simplifies the code and removes
a couple of lines of code, since the code to handle read errors
is there anyway.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In case that disk blocks need to be repaired (rewritten), the
current code at first (for simplicity reasons) reads all alternate
mirrors in the first step, afterwards selects the best one in a
second step. This is now changed to read one alternate mirror
after the other and to leave the loop early when a perfect mirror
is found.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
With the modified design (in order to support the devive replace
procedure) it is necessary to alloc the page array dynamically.
The reason is that pages are reused. At first a page is used for
the bio to read the data from the filesystem, then the same page
is reused for the bio that writes the data to the target disk.
Since the read process and the write process are completely
decoupled, this requires a new concept of refcounts and get/put
functions for pages, and it requires to use newly created pages
for each read bio which are freed after the write operation
is finished.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The block device is removed from the scrub context state structure.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in the scrub
context struct and moved into the lower level scope of scrub_bio,
fixup and page structures where the block device context is known.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The device replace procedure makes use of the scrub code. The scrub
code is the most efficient code to read the allocated data of a disk,
i.e. it reads sequentially in order to avoid disk head movements, it
skips unallocated blocks, it uses read ahead mechanisms, and it
contains all the code to detect and repair defects.
This commit is a first preparation step to adapt the scrub code to
be shareable for the device replace procedure.
The block device will be removed from the scrub context state
structure in a later step. It used to be the source block device.
The scrub code as it is used for the device replace procedure reads
the source data from whereever it is optimal. The source device might
even be gone (disconnected, for instance due to a hardware failure).
Or the drive can be so faulty so that the device replace procedure
tries to avoid access to the faulty source drive as much as possible,
and only if all other mirrors are damaged, as a last resort, the
source disk is accessed.
The modified scrub code operates as if it would handle the source
drive and thereby generates an exact copy of the source disk on the
target disk, even if the source disk is not present at all. Therefore
the block device pointer to the source disk is removed in a later
patch, and therefore the context structure is renamed (this is the
goal of the current patch) to reflect that no source block device
scope is there anymore.
Summary:
This first preparation step consists of a textual substitution of the
term "dev" to the term "ctx" whereever the scrub context is used.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we've kill the bigger one volume_mutex, we need to add devices
list mutex back.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
- 'nr' is no more used.
- btrfs_btree_balance_dirty() and __btrfs_btree_balance_dirty() can share
a bunch of code.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When __merge_refs merges two refs, it is also needed to merge the
inode_list of both refs. Otherwise we have missed backrefs and memory
leaks. This happens for example if two inodes share an extent and
both lie in the same leaf and thus also have the same parent.
Signed-off-by: Alexander Block <ablock84@googlemail.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Even if the hole punching is executed, the modification time of the
file is not updated.
So, current time is set to inode.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Someone who is root or capable(CAP_SYS_ADMIN) could corrupt the
superblock and make Btrfs printk("%s") crash while holding the
uuid_mutex since nobody forces a limit on the string. Since the
uuid_mutex is significant, the system would be unusable
afterwards.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When creating a snapshot, failing to commit a transaction can end up
with aborting the transaction, following by doing a cleanup for it, where
we'll free all snapshots pending to disk.
So we check it and avoid double free on pending snapshots.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When committing a transaction, we may bail out of running delayed refs
due to ENOSPC, and then abort the current transaction to flip into readonly.
But we'll hit a deadlock on ref head's lock since we forget to release
its lock and other cleanup stuff.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Just use WARN_ON rather than an if containing only WARN_ON(1).
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression e;
@@
- if (e) WARN_ON(1);
+ WARN_ON(e);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Use WARN rather than printk followed by WARN_ON(1), for conciseness.
A simplified version of the semantic patch that makes this transformation
is as follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression list es;
@@
-printk(
+WARN(1,
es);
-WARN_ON(1);
// </smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we set BTRFS_INODE_NEEDS_FULL_SYNC, we should log all the extent,
but now we forget to take it into account, and set a wrong max key,
if so, we will skip the file extent metadata when doing logging. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
We forget to protect the modified_extents list, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There are two types of the file extent - inline extent and regular extent,
When we log file extents, we didn't take inline extent into account, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Consider the following case:
Task1 Task2
start_transaction
commit_transaction
check pending snapshots list and the
list is empty.
add pending snapshot into list
skip the delalloc flush
end_transaction
...
And then the problem that the snapshot is different with the source subvolume
happen.
This patch fixes the above problem by flush all pending stuffs when all the
other tasks end the transaction.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If we flush inodes with pending delalloc in a transaction, we may join
the same transaction handler more than 2 times.
The reason is:
Task use_count of trans handle
commit_transaction 1
|-> btrfs_start_delalloc_inodes 1
|-> run_delalloc_nocow 1
|-> join_transaction 2
|-> cow_file_range 2
|-> join_transaction 3
In fact, cow_file_range needn't join the transaction again because the caller
have joined the transaction, so we fix this problem by this way.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
btrfs_wait_ordered_range expects for 'len' instead of 'end'.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we log new names, we need to log just enough to recreate the inode
during log replay, and there is no need to log extents along with it.
This actually fixes a bug revealed by xfstests 241, where it shows
that we're logging some extents that have not updated metadata,
so we don't get proper EXTENT_DATA items to be copied to log tree.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The current behavior is to allow mounting or remounting a filesystem
writeable in degraded mode if at least one writeable device is
present.
The next failed write access to a missing device which is above
the tolerance of the configured level of redundancy results in an
read-only enforcement. Even without this, the next time
barrier_all_devices() is called and more devices are missing than
tolerable, the switch to read-only mode takes place.
In order to behave predictably and to provide proper feedback to
the user at mount time, this patch compares the number of missing
devices with the number of devices that are tolerated to be missing
according to the configured RAID level. If more devices are missing
than tolerated, e.g. if two devices are missing in case of RAID1,
only a read-only mount and remount is allowed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Remove an invalid size check up from btrfs_shrink_dev().
The new size should not larger than the device->total_bytes as it was
already verified before coming to here(i.e. new_size < old_size).
Remove invalid check up for btrfs_shrink_dev().
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
There is no reason to pass the nr_pages_dirtied argument, because
nr_pages_dirtied value from the caller is unused in
balance_dirty_pages_ratelimited_nr().
Signed-off-by: Namjae Jeon <linkinjeon@gmail.com>
Signed-off-by: Vivek Trivedi <vtrivedi018@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Though the process of the ordered extents is a bit different with the delalloc inode
flush, but we can see it as a subset of the delalloc inode flush, so we also handle
them by flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The process of the ordered operations is similar to the delalloc inode flush, so
we handle them by flush workers.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
This patch introduce a new worker pool named "flush_workers", and if we
want to force all the inode with pending delalloc to the disks, we can
queue those inodes into the work queue of the worker pool, in this way,
those inodes will be flushed by multi-task.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Dave gave me an image of a very full file system that would abort the
transaction because it ran out of space while committing the transaction.
This is because we would think there was plenty of room to create a snapshot
even though the global reserve was not full. This happens because we
calculate the global reserve size before we unpin any space, so after we
unpin the space we allow reservations to occur even though we haven't
reserved all of the space for our global reserve. Fix this by adding to the
global reserve while unpinning in order to make sure we always have enough
space to do our work. With this patch we no longer end up with an aborted
transaction, we return ENOSPC properly to the person trying to create the
snapshot. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The argument 'tree_mod_log' is not necessary since all of callers enable it.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Since we don't use MOD_LOG_KEY_REMOVE_WHILE_MOVING to add nritems
during rewinding, we should insert a MOD_LOG_KEY_REMOVE operation first.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Key MOD_LOG_KEY_REMOVE_WHILE_MOVING means that we're doing memmove inside
an extent buffer node, and the node's number of items remains unchanged
(unless we are inserting a single pointer, but we have MOD_LOG_KEY_ADD for that).
So we don't need to increase node's number of items during rewinding,
otherwise we may get an node larger than leafsize and cause general protection
errors later.
Here is the details,
- If we do memory move for inserting a single pointer, we need to
add node's nritems by one, and we honor MOD_LOG_KEY_ADD for adding.
- If we do memory move for deleting a single pointer, we need to
decrease node's nritems by one, and we honor MOD_LOG_KEY_REMOVE for
deleting.
- If we do memory move for balance left/right, we need to decrease
node's nritems, and we honor MOD_LOG_KEY_REMOVE for balaning.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
When we find a bitmap free space entry, we may check the previous extent
entry covers the offset or not. But if we find this entry is also a bitmap
entry, we will continue to check the previous entry of the current one by
a while loop. It is unnecessary because it is impossible that the extent
entry which is in front of a bitmap entry can cover the offset of the entry
after that bitmap entry.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Alex reported a problem where we were writing between chunks on a rbd
device. The thing is we do bio_add_page using logical offsets, but the
physical offset may be different. So when we map the bio now check to see
if the bio is still ok with the physical offset, and if it is not split the
bio up and redo the bio_add_page with the physical sector. This fixes the
problem for Alex and doesn't affect performance in the normal case. Thanks,
Reported-and-tested-by: Alex Elder <elder@inktank.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The comment is not coincident with the code. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
div_factor{_fine} has been implemented for two times, cleanup it.
And I move them into a independent file named math.h because they are
common math functions.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
"Whether" is misspelled in various comments across the tree; this
fixes them. No code changes.
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull btrfs fixes from Chris Mason:
"This has our series of fixes for the next rc. The biggest batch is
from Jan Schmidt, fixing up some problems in our subvolume quota code
and fixing btrfs send/receive to work with the new extended inode
refs."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: do not bug when we fail to commit the transaction
Btrfs: fix memory leak when cloning root's node
Btrfs: Use btrfs_update_inode_fallback when creating a snapshot
Btrfs: Send: preserve ownership (uid and gid) also for symlinks.
Btrfs: fix deadlock caused by the nested chunk allocation
btrfs: Return EINVAL when length to trim is less than FSB
Btrfs: fix memory leak in btrfs_quota_enable()
Btrfs: send correct rdev and mode in btrfs-send
Btrfs: extended inode refs support for send mechanism
Btrfs: Fix wrong error handling code
Fix a sign bug causing invalid memory access in the ino_paths ioctl.
Btrfs: comment for loop in tree_mod_log_insert_move
Btrfs: fix extent buffer reference for tree mod log roots
Btrfs: determine level of old roots
Btrfs: tree mod log's old roots could still be part of the tree
Btrfs: fix a tree mod logging issue for root replacement operations
Btrfs: don't put removals from push_node_left into tree mod log twice
We BUG if we fail to commit the transaction when creating a snapshot, which
is just obnoxious. Remove the BUG_ON(). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
After cloning root's node, we forgot to dec the src's ref
which can lead to a memory leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
On a really full file system I was getting ENOSPC back from
btrfs_update_inode when trying to update the parent inode when creating a
snapshot. Just use the fallback method so we can update the inode and not
have to worry about having a delayed ref. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
This patch also requires a change in the user-space part of "receive".
We need to use "lchown" instead of "chown". We will do this in the
following patch.
Signed-off-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
if (S_ISREG(sctx->cur_inode_mode)) {
Steps to reproduce:
# mkfs.btrfs -m raid1 <disk1> <disk2>
# btrfstune -S 1 <disk1>
# mount <disk1> <mnt>
# btrfs device add <disk3> <disk4> <mnt>
# mount -o remount,rw <mnt>
# dd if=/dev/zero of=<mnt>/tmpfile bs=1M count=1
Deadlock happened.
It is because of the nested chunk allocation. When we wrote the data
into the filesystem, we would allocate the data chunk because there was
no data chunk in the filesystem. At the end of the data chunk allocation,
we should insert the metadata of the data chunk into the extent tree, but
there was no raid1 chunk, so we tried to lock the chunk allocation mutex to
allocate the new chunk, but we had held the mutex, the deadlock happened.
By rights, we would allocate the raid1 chunk when we added the second device
because the profile of the seed filesystem is raid1 and we had two devices.
But we didn't do that in fact. It is because the last step of the first device
insertion didn't commit the transaction. So when we added the second device,
we didn't cow the tree, and just inserted the relative metadata into the leaves
which were generated by the first device insertion, and its profile was dup.
So, I fix this problem by commiting the transaction at the end of the first
device insertion.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Currently if len argument in btrfs_ioctl_fitrim() is smaller than
one FSB we will continue and finally return 0 bytes discarded.
However if the length to discard is smaller then file system block
we should really return EINVAL.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
gcc says "warning: comparison of unsigned expression >= 0 is always
true" because i is an unsigned long. And gcc is right this time.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
To see the problem, create many hardlinks to the same file (120 should do it),
then look up paths by inode with:
ls -i
btrfs inspect inode-resolve -v $ino /mnt/btrfs
I noticed the memory layout of the fspath->val data had some irregularities
(some unnecessary gaps that stop appearing about halfway),
so I'm not sure there aren't any bugs left in it.
Emphasis the way tree_mod_log_insert_move avoids adding
MOD_LOG_KEY_REMOVE_WHILE_MOVING operations, depending on the direction of
the move operation.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In get_old_root we grab a lock on the extent buffer before we obtain a
reference on that buffer. That order is changed now.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In btrfs_find_all_roots' termination condition, we compare the level of the
old buffer we got from btrfs_search_old_slot to the level of the current
root node. We'd better compare it to the level of the rewinded root node.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Tree mod log treated old root buffers as always empty buffers when starting
the rewind operations. However, the old root may still be part of the
current tree at a lower level, with still some valid entries.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Avoid the implicit free by tree_mod_log_set_root_pointer, which is wrong in
two places. Where needed, we call tree_mod_log_free_eb explicitly now.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Independant of the check (push_items < src_items) tree_mod_log_eb_copy did
log the removal of the old data entries from the source buffer. Therefore,
we must not call tree_mod_log_eb_move if the check evaluates to true, as
that would log the removal twice, finally resulting in (rewinded) buffers
with wrong values for header_nritems.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Pull user namespace compile fixes from Eric W Biederman:
"This tree contains three trivial fixes. One compiler warning, one
thinko fix, and one build fix"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
btrfs: Fix compilation with user namespace support enabled
userns: Fix posix_acl_file_xattr_userns gid conversion
userns: Properly print bluetooth socket uids
When compiling with user namespace support btrfs fails like:
fs/btrfs/tree-log.c: In function ‘fill_inode_item’:
fs/btrfs/tree-log.c:2955:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_uid’
fs/btrfs/ctree.h:2026:1: note: expected ‘u32’ but argument is of type ‘kuid_t’
fs/btrfs/tree-log.c:2956:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_gid’
fs/btrfs/ctree.h:2027:1: note: expected ‘u32’ but argument is of type ‘kgid_t’
Fix this by using i_uid_read and i_gid_read in
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
In order to accomodate retrying path-based syscalls, we need to add a
new "type" argument to audit_inode_child. This will tell us whether
we're looking for a child entry that represents a create or a delete.
If we find a parent, don't automatically assume that we need to create a
new entry. Instead, use the information we have to try to find an
existing entry first. Update it if one is found and create a new one if
not.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Most of the callers get called with an inode and dentry in the reverse
order. The compiler then has to reshuffle the arg registers and/or
stack in order to pass them on to audit_inode_child.
Reverse those arguments for a micro-optimization.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull btrfs update from Chris Mason:
"This is a large pull, with the bulk of the updates coming from:
- Hole punching
- send/receive fixes
- fsync performance
- Disk format extension allowing more hardlinks inside a single
directory (btrfs-progs patch required to enable the compat bit for
this one)
I'm cooking more unrelated RAID code, but I wanted to make sure this
original batch makes it in. The largest updates here are relatively
old and have been in testing for some time."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (121 commits)
btrfs: init ref_index to zero in add_inode_ref
Btrfs: remove repeated eb->pages check in, disk-io.c/csum_dirty_buffer
Btrfs: fix page leakage
Btrfs: do not warn_on when we cannot alloc a page for an extent buffer
Btrfs: don't bug on enomem in readpage
Btrfs: cleanup pages properly when ENOMEM in compression
Btrfs: make filesystem read-only when submitting barrier fails
Btrfs: detect corrupted filesystem after write I/O errors
Btrfs: make compress and nodatacow mount options mutually exclusive
btrfs: fix message printing
Btrfs: don't bother committing delayed inode updates when fsyncing
btrfs: move inline function code to header file
Btrfs: remove unnecessary IS_ERR in bio_readpage_error()
btrfs: remove unused function btrfs_insert_some_items()
Btrfs: don't commit instead of overcommitting
Btrfs: confirmation of value is added before trace_btrfs_get_extent() is called
Btrfs: be smarter about dropping things from the tree log
Btrfs: don't lookup csums for prealloc extents
Btrfs: cache extent state when writing out dirty metadata pages
Btrfs: do not hold the file extent leaf locked when adding extent item
...
In csum_dirty_buffer, we first get eb from page->private.
Then we check if the page is the first page of eb. Later
we check it again. Remove the repeated check here.
Signed-off-by: Wang Sheng-Hui <shhuiw@gmail.com>
Alloc_dummy_extent_buffer will not free the first page in the eb array if we
fail to allocate a page, fix this. Thanks,
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
It's just annoying and the user will have gotten a nice OOM killer message
so they are already fully aware they are screwed :). Thanks,
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Get rid of the BUG_ON(ret == -ENOMEM) in __extent_read_full_page. Thanks,
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We were freeing non-existent pages which was causing a panic for a user who
was suffering from ENOMEM. This patch fixes the problem. Thanks,
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So far the return code of barrier_all_devices() is ignored, which
means that errors are ignored. The result can be a corrupt
filesystem which is not consistent.
This commit adds code to evaluate the return code of
barrier_all_devices(). The normal btrfs_error() mechanism is used to
switch the filesystem into read-only mode when errors are detected.
In order to decide whether barrier_all_devices() should return
error or success, the number of disks that are allowed to fail the
barrier submission is calculated. This calculation accounts for the
worst RAID level of metadata, system and data. If single, dup or
RAID0 is in use, a single disk error is already considered to be
fatal. Otherwise a single disk error is tolerated.
The calculation of the number of disks that are tolerated to fail
the barrier operation is performed when the filesystem gets mounted,
when a balance operation is started and finished, and when devices
are added or removed.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
In check-integrity, detect when a superblock is written that points
to blocks that have not been written to disk due to I/O write errors.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
If a filesystem is mounted with compression and then remounted by adding nodatacow,
the compression is disabled but the compress flag is still visible.
Also, if a filesystem is mounted with nodatacow and then remounted with compression,
nodatacow flag is still present but it's not active.
This patch:
- removes compress flags and notifies that the compression has been disabled if the
filesystem is mounted with nodatacow
- removes nodatacow and nodatasum flags if mounted with compress.
Signed-off-by: Andrei Popa <andrei.popa@i-neo.ro>
We can just copy the in memory inode into the tree log directly, no sense in
updating the fs tree so we can copy it into the tree log tree. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
When building btrfs from kernel code, it will report:
fs/btrfs/extent_io.h:281: warning: 'extent_buffer_page' declared inline after being called
fs/btrfs/extent_io.h:281: warning: previous declaration of 'extent_buffer_page' was here
fs/btrfs/extent_io.h:280: warning: 'num_extent_pages' declared inline after being called
fs/btrfs/extent_io.h:280: warning: previous declaration of 'num_extent_pages' was here
because of the wrong declaration of inline functions.
Signed-off-by: Robin Dong <sanbai@taobao.com>
I don't think we have the same problem that this was supposed to fix
originally since we can allocate chunks in the enospc path now. This code
is causing us to constantly commit the transaction as we get close to using
all of our available space in our currently allocated chunks, instead of
allocating another chunk and carrying on with life, which is not nice for
performance. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We should confirm the value of extent_map before calling
trace_btrfs_get_extent() because the value of extent_map has the
possibility of NULL.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
When we truncate existing items in the tree log we've been searching for
each individual item and removing them. This is unnecessary churn and
searching, just keep track of the slot we are on and how many items we need
to delete and delete them all at once. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The tree logging stuff was looking up csums to copy over for prealloc
extents which is just work we don't need to be doing. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Everytime we write out dirty pages we search for an offset in the tree,
convert the bits in the state, and then when we wait we search for the
offset again and clear the bits. So for every dirty range in the io tree we
are doing 4 rb searches, which is suboptimal. With this patch we are only
doing 2 searches for every cycle (modulo weird things happening). Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
For some reason we unlock everything except the leaf we are on, set the path
blocking and then add the extent item for the extent we just finished
writing. I can't for the life of me figure out why we would want to do
this, and the history doesn't really indicate that there was a real reason
for it, so just remove it. This will reduce our tree lock contention on
heavy writes. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
There are a coule scenarios where farming metadata csumming off to an async
thread doesn't help. The first is if our processor supports crc32c, in
which case the csumming will be fast and so the overhead of the async model
is not worth the cost. The other case is for our tree log. We will be
making that stuff dirty and writing it out and waiting for it immediately.
Even with software crc32c this gives me a ~15% increase in speed with O_SYNC
workloads. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
commit 7ca4be45a0 limited csum items to
PAGE_CACHE_SIZE. It used min() with incompatible types in 32bit which
generates warnings:
fs/btrfs/file-item.c: In function ‘btrfs_csum_file_blocks’:
fs/btrfs/file-item.c:717: warning: comparison of distinct pointer types lacks a cast
This uses min_t(u32,) to fix the warnings. u32 seemed reasonable
because btrfs_root->leafsize is u32 and PAGE_CACHE_SIZE is unsigned
long.
Signed-off-by: Zach Brown <zab@zabbo.net>
Running delayed refs is faster than running delalloc, so lets do that first
to try and reclaim space. This makes my fs_mark test about 20% faster.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
With the following debug patch:
static int btrfs_freeze(struct super_block *sb)
{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+ struct btrfs_transaction *trans;
+
+ spin_lock(&fs_info->trans_lock);
+ trans = fs_info->running_transaction;
+ if (trans) {
+ printk("Transid %llu, use_count %d, num_writer %d\n",
+ trans->transid, atomic_read(&trans->use_count),
+ atomic_read(&trans->num_writers));
+ }
+ spin_unlock(&fs_info->trans_lock);
return 0;
}
I found there was a orphan transaction after the freeze operation was done.
It is because the transaction may not be committed when the transaction handle
end even though it is the last handle of the current transaction. This design
avoid committing the transaction frequently, but also introduce the above
problem.
So I add btrfs_attach_transaction() which can catch the current transaction
and commit it. If there is no transaction, it will return ENOENT, and do not
anything.
This function also can be used to instead of btrfs_join_transaction_freeze()
because it don't increase the writer counter and don't start a new transaction,
so it also can fix the deadlock between sync and freeze.
Besides that, it is used to instead of btrfs_join_transaction() in
transaction_kthread(), because if there is no transaction, the transaction
kthread needn't anything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This patch add a type field into the transaction handle structure,
in this way, we needn't implement various end-transaction functions
and can make the code more simple and readable.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
This patch fixes memory leak of the transaction handle which happened
when starting transaction failed on a freezed fs.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
The iterate_irefs in backref.c is used to build path components from inode
refs. This patch adds code to iterate extended refs as well.
I had modify the callback function signature to abstract out some of the
differences between ref structures. iref_to_path() also needed similar
changes.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
This patch adds basic support for extended inode refs. This includes support
for link and unlink of the refs, which basically gets us support for rename
as well.
Inode creation does not need changing - extended refs are only added after
the ref array is full.
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().
Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.
Now device drivers can implement this method and obtain nonlinear vma support.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Moved part of the code into a sub function and replaced most of the gotos
by ifs, hoping that it will be easier to read now.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Mark Fasheh <mfasheh@suse.de>
I started hitting warnings when running xfstest 68 in a loop because there
were EM's that were not lined up properly with the physical extents. This
is ok, if we do something like punch a hole or write to a preallocated space
or something like that we can have an EM that doesn't cover the entire
physical extent. So fix the tree logging stuff to cope with this case so we
don't just commit the transaction. With this patch I no longer see the
warnings from the tree logging code. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Call btrfs_abort_transaction as early as possible when an error
condition is detected, that way the line number reported is useful
and we're not clueless anymore which error path led to the abort.
Signed-off-by: David Sterba <dsterba@suse.cz>
The macro btrfs_abort_transaction() can get the line number of the code
where the problem happens, so we should invoke it in the place that the
error occurs, or we will lose the line number.
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Btrfs uses inclusive range end for lock_extent(), unlock_extent() and
related functions, so we made off-by-one errors in file clone.
This fixes it and also fixes some style problems.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Hi,
the patch si simple, but it has user visible impact and I'm not quite sure how
to resolve it.
In short, $subj says it, chattr -C supports it and we want to use it.
The conditions that acutally allow to change the NOCOW flag are clear. What if
I try to set the flag on a file that is not empty? Options:
1) whole ioctl will fail, EINVAL
2.1) ioctl will succeed, the NOCOW flag will be silently removed, but the file
will stay COW-ed and checksummed
2.2) ioctl will succeed, flag will not be removed and a syslog message will
warn that the COW flag has not been changed
2.2.1) dtto, no syslog message
Man page of chattr states that
"If it is set on a file which already has data blocks, it is undefined when
the blocks assigned to the file will be fully stable."
Yes, it's undefined and with current implementation it'll never happen. So from
this end, the user cannot expect anything. I'm trying to find a reasonable
behaviour, so that a command like 'chattr -R -aijS +C' to tweak a broad set of
flags in a deep directory does not fail unnecessarily and does not pollute the
log.
My personal preference is 2.2.1, but my dev's oppinion is skewed, not counting
the fact that I know the code and otherwise would look there before consulting
the documentation.
The patch implements 2.2.1.
david
-------------8<-------------------
From: David Sterba <dsterba@suse.cz>
It's safe to turn off checksums for a zero sized file.
http://thread.gmane.org/gmane.comp.file-systems.btrfs/18030
"We cannot switch on NODATASUM for a file that already has extents that
are checksummed. The invariant here is that either all the extents or
none are checksummed.
Theoretically it's possible to add/remove all checksums from a given
file, but it's a potentially longtime operation, the file has to be in
some intermediate state where the checksums partially exist but have to
be ignored (for the csum->nocsum) until the file is fully converted,
this brings more special cases to extent handling, it has to survive
power failure and remain consistent, and probably needs to be restarted
after next mount."
Signed-off-by: David Sterba <dsterba@suse.cz>
I saw the warning in btrfs_drop_extent_cache where our end is less than our
start while running xfstests 68 in a loop. This is because we
unconditionally do drop_end = min(end, extent_end) in
__btrfs_drop_extents(), even though we may not have found an extent in the
range we were looking to drop. So keep track of wether or not we found
something, and if we didn't just use our end. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We do not need to do anything special to freeze or unfreeze, it's all taken
care of by the generic work, and what we currently have is wrong anyway
since we shouldn't be returnning to userspace with mutexes held anyway.
Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
The btree inode has it's own write cache pages so we can remove this write
cache pages hook as it's not used. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We can race when checking wether PagePrivate is set on a page and we
actually have an eb saved in the pages private pointer. We could have
easily written out this page and released it in the time that we did the
pagevec lookup and actually got around to looking at this page. So use
mapping->private_lock to ensure we get a consistent view of the
page->private pointer. This is inline with the alloc and releasepage paths
which use private_lock when manipulating page->private. Thanks,
Reported-by: David Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Dave Sterba pointed out a sleeping while atomic bug while doing fsync. This
is because I'm an idiot and didn't realize that rwlock's were spin locks, so
we've been holding this thing while doing allocations and such which is not
good. This patch fixes this by dropping the write lock before we do
anything heavy and re-acquire it when it is done. We also need to take a
ref on the em's in case their corresponding pages are evicted and mark them
as being logged so that releasepage does not remove them and doesn't remove
them from our local list. Thanks,
Reported-by: Dave Sterba <dave@jikos.cz>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
So we start our freeze, somebody comes in and does an fsync() on a file
where we have to commit a transaction for whatever reason, and we will
deadlock because the freeze is waiting on FS_FREEZE people to stop writing
to the file system, but the transaction is waiting for its free space inodes
to be written out, which are in turn waiting on sb_start_intwrite while
trying to write the file extents. To fix this we'll just skip the
sb_start_intwrite() if we TRANS_JOIN_NOLOCK since we're being waited on by a
transaction commit so we're safe wrt to freeze and this will keep us from
deadlocking. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>