This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;
if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The correct lock order is uuid_mutex -> volume_mutex -> chunk_mutex,
but when we mount a filesystem which has backing seed devices, we have
this lock chain:
open_ctree()
lock(chunk_mutex);
read_chunk_tree();
read_one_dev();
open_seed_devices();
lock(uuid_mutex);
and then we hit a lockdep splat.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
A bug was triggered while using seed device:
# mkfs.btrfs /dev/loop1
# btrfstune -S 1 /dev/loop1
# mount -o /dev/loop1 /mnt
# btrfs dev add /dev/loop2 /mnt
btrfs: block rsv returned -28
------------[ cut here ]------------
WARNING: at fs/btrfs/extent-tree.c:5969 btrfs_alloc_free_block+0x166/0x396 [btrfs]()
...
Call Trace:
...
[<f7b7c31c>] btrfs_cow_block+0x101/0x147 [btrfs]
[<f7b7eaa6>] btrfs_search_slot+0x1b8/0x55f [btrfs]
[<f7b7f844>] btrfs_insert_empty_items+0x42/0x7f [btrfs]
[<f7b7f8c1>] btrfs_insert_item+0x40/0x7e [btrfs]
[<f7b8ac02>] btrfs_make_block_group+0x243/0x2aa [btrfs]
[<f7bb3f53>] __btrfs_alloc_chunk+0x672/0x70e [btrfs]
[<f7bb41ff>] init_first_rw_device+0x77/0x13c [btrfs]
[<f7bb5a62>] btrfs_init_new_device+0x664/0x9fd [btrfs]
[<f7bbb65a>] btrfs_ioctl+0x694/0xdbe [btrfs]
[<c04f55f7>] do_vfs_ioctl+0x496/0x4cc
[<c04f5660>] sys_ioctl+0x33/0x4f
[<c07b9edf>] sysenter_do_call+0x12/0x38
---[ end trace 906adac595facc7d ]---
Since seed device is readonly, there's no usable space in the filesystem.
Afterwards we add a sprout device to it, and the kernel creates a METADATA
block group and a SYSTEM block group where comes free space we can reserve,
but we still get revervation failure because the global block_rsv hasn't
been updated accordingly.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
There are various bugs in block group trimming:
- It may trim from offset smaller than user-specified offset.
- It may trim beyond user-specified range.
- It may leak free space for extents smaller than specified minlen.
- It may truncate the last trimmed extent thus leak free space.
- With mixed extents+bitmaps, some extents may not be trimmed.
- With mixed extents+bitmaps, some bitmaps may not be trimmed (even
none will be trimmed). Even for those trimmed, not all the free space
in the bitmaps will be trimmed.
I rewrite btrfs_trim_block_group() and break it into two functions.
One is to trim extents only, and the other is to trim bitmaps only.
Before patching:
# fstrim -v /mnt/
/mnt/: 1496465408 bytes were trimmed
After patching:
# fstrim -v /mnt/
/mnt/: 2193768448 bytes were trimmed
And this matches the total free space:
# btrfs fi df /mnt
Data: total=3.58GB, used=1.79GB
System, DUP: total=8.00MB, used=4.00KB
System: total=4.00MB, used=0.00
Metadata, DUP: total=205.12MB, used=97.14MB
Metadata: total=8.00MB, used=0.00
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
For btrfs raid, while discarding a range of space, we'll need to know
the start offset and length to discard for each device, and it's done
in btrfs_map_block().
However the calculation is a bit complex for raid0 and raid10, so I
reimplement it based on a fact that:
dev1 dev2 dev3 (raid0)
-----------------------------------
s0 s3 s6 s1 s4 s7 s2 s5
Each device has (total_stripes / nr_dev) stripes, or plus one.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We pre-allocate a btrfs bio with fixed size, and then may re-allocate
memory if we find stripes are bigger than the fixed size. But this
pre-allocation is not necessary.
Also we don't have to calcuate the stripe number twice.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
If we run into some failure path in io_ctl_prepare_pages(),
io_ctl->pages[] array may have some NULL pointers.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
I got this while running xfstests:
[24256.836098] block group 317849600 has an wrong amount of free space
[24256.836100] btrfs: failed to load free space cache for block group 317849600
We should clamp the extent returned by find_first_extent_bit(),
so the start of the extent won't smaller than the start of the
block group.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Tell the page allocator that pages allocated for a buffered write are
expected to become dirty soon.
Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
the latter can be obtained from the former (by looking as ->tree_root)
just as cheaply as we currently are doing the other way round.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and don't bother with it after btrfs_fill_super() failure -
->kill_sb() (unlike ->put_super()) will be called even if we
have not got non-NULL ->s_root.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We do not (fortunately) modify ->s_fs_info of superblock on the fly in
btrfs_fill_super(); apparent assignment is a no-op.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It returns either ERR_PTR(-ve) or sb->s_fs_info. The latter can
be found by caller just as well, TYVM, no need to return it. Just
return -ve or 0...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
close_ctree() uses a weird mix of accesses to root->fs_info and
its value at the beginning of function stored in local variable.
Since ->fs_info *never* changes, let's just use the local variable
to avoid confusion.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A new helper: btrfs_alloc_root(fs_info); allocates btrfs_root
and sets ->fs_info. All places allocating the suckers converted
to it. At that point we *never* reassign ->fs_info of btrfs_root;
it's set before anyone sees the address of newly allocated
struct btrfs_root and never assigned anywhere else.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
move assignments to ->fs_info in open_ctree() up, to the place
just after the original allocations. Assignment for tree_root
becomes a no-op - we'd obtained fs_info from tree_root->fs_info
in the first place.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
pathname resolution under a global mutex, taken on some paths in ->mount()
is a Bad Idea(tm) - think what happens if said pathname resolution triggers
automount of some btrfs instance and walks into attempt to grab the same
mutex. Deadlock - we are waiting for daemon to finish walking the path,
daemon is waiting for us to release the mutex...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Do *NOT* skip doomed superblocks in btrfs_test_super(); we want
sget() to wait for their shutdown to complete. Since we don't
mutilate ->s_fs_info in ->put_super() anymore (or free what it
used to point to until the superblock is past being findable
by sget()), we can just DTRT there and report a match.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and move free_fs_info() to that, out of ->put_super(). Do NOT
set ->s_fs_info to NULL in the latter; we need it for sget() to
be able to see and wait for fs in the middle of umount if we get a
mount/umount race.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We need fs_info and root to live until the moment when the victim
superblock leaves the list, so we need to postpone free_fs_info()
until after ->put_super(). The call is buried in close_ctree(),
though, so we need to lift it into the callers (including
btrfs_put_super()) first.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (53 commits)
Kconfig: acpi: Fix typo in comment.
misc latin1 to utf8 conversions
devres: Fix a typo in devm_kfree comment
btrfs: free-space-cache.c: remove extra semicolon.
fat: Spelling s/obsolate/obsolete/g
SCSI, pmcraid: Fix spelling error in a pmcraid_err() call
tools/power turbostat: update fields in manpage
mac80211: drop spelling fix
types.h: fix comment spelling for 'architectures'
typo fixes: aera -> area, exntension -> extension
devices.txt: Fix typo of 'VMware'.
sis900: Fix enum typo 'sis900_rx_bufer_status'
decompress_bunzip2: remove invalid vi modeline
treewide: Fix comment and string typo 'bufer'
hyper-v: Update MAINTAINERS
treewide: Fix typos in various parts of the kernel, and fix some comments.
clockevents: drop unknown Kconfig symbol GENERIC_CLOCKEVENTS_MIGR
gpio: Kconfig: drop unknown symbol 'CS5535_GPIO'
leds: Kconfig: Fix typo 'D2NET_V2'
sound: Kconfig: drop unknown symbol ARCH_CLPS7500
...
Fix up trivial conflicts in arch/powerpc/platforms/40x/Kconfig (some new
kconfig additions, close to removed commented-out old ones)
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (76 commits)
PM / Hibernate: Implement compat_ioctl for /dev/snapshot
PM / Freezer: fix return value of freezable_schedule_timeout_killable()
PM / shmobile: Allow the A4R domain to be turned off at run time
PM / input / touchscreen: Make st1232 use device PM QoS constraints
PM / QoS: Introduce dev_pm_qos_add_ancestor_request()
PM / shmobile: Remove the stay_on flag from SH7372's PM domains
PM / shmobile: Don't include SH7372's INTCS in syscore suspend/resume
PM / shmobile: Add support for the sh7372 A4S power domain / sleep mode
PM: Drop generic_subsys_pm_ops
PM / Sleep: Remove forward-only callbacks from AMBA bus type
PM / Sleep: Remove forward-only callbacks from platform bus type
PM: Run the driver callback directly if the subsystem one is not there
PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
PM/Devfreq: Add Exynos4-bus device DVFS driver for Exynos4210/4212/4412.
PM / Sleep: Merge internal functions in generic_ops.c
PM / Sleep: Simplify generic system suspend callbacks
PM / Hibernate: Remove deprecated hibernation snapshot ioctls
PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
ARM: S3C64XX: Implement basic power domain support
PM / shmobile: Use common always on power domain governor
...
Fix up trivial conflict in fs/xfs/xfs_buf.c due to removal of unused
XBT_FORCE_SLEEP bit
Parameterize clusters on minimum total size, minimum chunk size and
minimum contiguous size for at least one chunk, without limits on
cluster, window or gap sizes. Don't tolerate any fragmentation for
SSD_SPREAD; accept it for metadata, but try to keep data dense.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We store the allocation start and length twice in ins, once right
after the other, but with intervening calls that may prevent the
duplicate from being optimized out by the compiler. Remove one of the
assignments.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since the clustered allocation may be taking extents from a different
block group, there's no point in spin-locking and testing the current
block group free space before attempting to allocate space from a
cluster, even more so when we might refrain from even trying the
cluster in the current block group because, after the cluster was set
up, not enough free space remained. Furthermore, cluster creation
attempts fail fast when the block group doesn't have enough free
space, so the test was completely superfluous.
I've move the free space test past the cluster allocation attempt,
where it is more useful, and arranged for a cluster in the current
block group to be released before trying an unclustered allocation,
when we reach the LOOP_NO_EMPTY_SIZE stage, so that the free space in
the cluster stands a chance of being combined with additional free
space in the block group so as to succeed in the allocation attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The chunk allocation code has tried to keep a pretty tight lid on creating new
metadata chunks. This is partially because in the past the reservation
code didn't give us an accurate idea of how much space was being used.
The new code is much more accurate, so we're able to get rid of some of these
checks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs tries to batch extent allocation tree changes to improve performance
and reduce metadata trashing. But it doesn't allocate new metadata chunks
while it is doing allocations for the extent allocation tree.
This commit changes the delayed refence code to do chunk allocations if we're
getting low on room. It prevents crashes and improves performance.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There's code in btrfs_get_extent that should never be used. This patch turns
a WARN_ON(1) into a BUG(), hoping we can remove the transaction code from
btrfs_get_extent soon.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The old backref iteration code could only safely be used on commit roots.
Besides this limitation, it had bugs in finding the roots for these
references. This commit replaces large parts of it by btrfs_find_all_roots()
which a) really finds all roots and the correct roots, b) works correctly
under heavy file system load, c) considers delayed refs.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This function gets a byte number (a data extent), collects all the leafs
pointing to it and walks up the trees to find all fs roots pointing to those
leafs. It also returns the list of all leafs pointing to that extent.
It does proper locking for the involved trees, can be used on busy file
systems and honors delayed refs.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Now that we may be holding back delayed refs for a limited period, we
might end up having no runnable delayed refs. Without this commit, we'd
do busy waiting in that thread until another (runnable) ref arives.
Instead, we're detecting this situation and use a waitqueue, such that
we only try to run more refs after
a) another runnable ref was added or
b) delayed refs are no longer held back
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
When processing a delayed ref, first check if there are still old refs in
the process of being added. If so, put this ref back to the tree. To avoid
looping on this ref, choose a newer one in the next loop.
btrfs_find_ref_cluster has to take care of that.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Sequence numbers are needed to reconstruct the backrefs of a given extent to
a certain point in time. The total set of backrefs consist of the set of
backrefs recorded on disk plus the enqueued delayed refs for it that existed
at that moment.
This patch also adds a list that records all delayed refs which are
currently in the process of being added.
When walking all refs of an extent in btrfs_find_all_roots(), we freeze the
current state of delayed refs, honor anythinh up to this point and prevent
processing newer delayed refs to assert consistency.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This patch adds the possibilty to read-lock an extent even if it is already
write-locked from the same thread. btrfs_find_all_roots() needs this
capability.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
vfs_create() ignores everything outside of 16bit subset of its
mode argument; switching it to umode_t is obviously equivalent
and it's the only caller of the method
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
vfs_mkdir() gets int, but immediately drops everything that might not
fit into umode_t and that's the only caller of ->mkdir()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Seeing that just about every destructor got that INIT_LIST_HEAD() copied into
it, there is no point whatsoever keeping this INIT_LIST_HEAD in inode_init_once();
the cost of taking it into inode_init_always() will be negligible for pipes
and sockets and negative for everything else. Not to mention the removal of
boilerplate code from ->destroy_inode() instances...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
it's not needed anymore; we used to, back when we had to do
mount_subtree() by hand, complete with put_mnt_ns() in it.
No more... Apparmor didn't need it since the __d_path() fix.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* pm-sleep: (51 commits)
PM: Drop generic_subsys_pm_ops
PM / Sleep: Remove forward-only callbacks from AMBA bus type
PM / Sleep: Remove forward-only callbacks from platform bus type
PM: Run the driver callback directly if the subsystem one is not there
PM / Sleep: Make pm_op() and pm_noirq_op() return callback pointers
PM / Sleep: Merge internal functions in generic_ops.c
PM / Sleep: Simplify generic system suspend callbacks
PM / Hibernate: Remove deprecated hibernation snapshot ioctls
PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()
PM / Sleep: Recommend [un]lock_system_sleep() over using pm_mutex directly
PM / Sleep: Replace mutex_[un]lock(&pm_mutex) with [un]lock_system_sleep()
PM / Sleep: Make [un]lock_system_sleep() generic
PM / Sleep: Use the freezer_count() functions in [un]lock_system_sleep() APIs
PM / Freezer: Remove the "userspace only" constraint from freezer[_do_not]_count()
PM / Hibernate: Replace unintuitive 'if' condition in kernel/power/user.c with 'else'
Freezer / sunrpc / NFS: don't allow TASK_KILLABLE sleeps to block the freezer
PM / Sleep: Unify diagnostic messages from device suspend/resume
ACPI / PM: Do not save/restore NVS on Asus K54C/K54HR
PM / Hibernate: Remove deprecated hibernation test modes
PM / Hibernate: Thaw processes in SNAPSHOT_CREATE_IMAGE ioctl test path
...
Conflicts:
kernel/kmod.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: call d_instantiate after all ops are setup
Btrfs: fix worker lock misuse in find_worker
This closes races where btrfs is calling d_instantiate too soon during
inode creation. All of the callers of btrfs_add_nondir are updated to
instantiate after the inode is fully setup in memory.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Dan Carpenter noticed that we were doing a double unlock on the worker
lock, and sometimes picking a worker thread without the lock held.
This fixes both errors.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
For consistent backref walking and (later) qgroup calculation the
information to which root a delayed ref belongs is useful even for shared
refs.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
from every call site. The for_cow parameter will later on be used to
determine if a ref will change anything with respect to qgroups.
Delayed refs coming from relocation are always counted as for_cow, as they
don't change subvol quota.
Also pass in the fs_info for later use.
btrfs_find_all_roots() will use this as an optimization, as changes that are
for_cow will not change anything with respect to which root points to a
certain leaf. Thus, we don't need to add the current sequence number to
those delayed refs.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
btrfs_next_item() makes the btrfs path point to the next item, crossing leaf
boundaries if needed.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
ulist is a generic data structures to hold a collection of unique u64
values. The only operations it supports is adding to the list and
enumerating it.
It is possible to store an auxiliary value along with the key. The
implementation is preliminary and can probably be sped up significantly.
It is used by btrfs_find_all_roots() quota to translate recursions into
iterative loops.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
* master: (848 commits)
SELinux: Fix RCU deref check warning in sel_netport_insert()
binary_sysctl(): fix memory leak
mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
ipmi_watchdog: restore settings when BMC reset
oom: fix integer overflow of points in oom_badness
memcg: keep root group unchanged if creation fails
nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
nilfs2: unbreak compat ioctl
cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
evm: prevent racing during tfm allocation
evm: key must be set once during initialization
mmc: vub300: fix type of firmware_rom_wait_states module parameter
Revert "mmc: enable runtime PM by default"
mmc: sdhci: remove "state" argument from sdhci_suspend_host
x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
IB/qib: Correct sense on freectxts increment and decrement
RDMA/cma: Verify private data length
cgroups: fix a css_set not found bug in cgroup_attach_proc
oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
...
Conflicts:
kernel/cgroup_freezer.c
This is the last part of the patch series. It modifies the btrfs
code to use the integrity check module if configured to do so
with the define BTRFS_FS_CHECK_INTEGRITY. If this define is not set,
the only effective change is that code is added that handles the
mount option to activate the integrity check. If the mount option is
set and the define BTRFS_FS_CHECK_INTEGRITY is not set, that code
complains in the log and the mount fails with EINVAL.
Add the mount option to activate the usage of the integrity check
code.
Add invocation of btrfs integrity check code init and cleanup
function on mount and umount, respectively.
Add hook to call btrfs integrity check code version of
submit_bh/submit_bio.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
If the btrfs integrity check is enabled, the files required to
implement the checks are included in the build.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
The two files added in this patch contain all the code that is
required to implement the integrity checks.
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
When doing 1KB sequential writes to the same page,
balance_dirty_pages_ratelimited_nr() should be called once instead of 4
times, the latter makes the dirtier tasks be throttled much too heavy.
Fix it with proper de-accounting on clear_page_dirty_for_io().
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: unplug every once and a while
Btrfs: deal with NULL srv_rsv in the delalloc inode reservation code
Btrfs: only set cache_generation if we setup the block group
Btrfs: don't panic if orphan item already exists
Btrfs: fix leaked space in truncate
Btrfs: fix how we do delalloc reservations and how we free reservations on error
Btrfs: deal with enospc from dirtying inodes properly
Btrfs: fix num_workers_starting bug and other bugs in async thread
BTRFS: Establish i_ops before calling d_instantiate
Btrfs: add a cond_resched() into the worker loop
Btrfs: fix ctime update of on-disk inode
btrfs: keep orphans for subvolume deletion
Btrfs: fix inaccurate available space on raid0 profile
Btrfs: fix wrong disk space information of the files
Btrfs: fix wrong i_size when truncating a file to a larger size
Btrfs: fix btrfs_end_bio to deal with write errors to a single mirror
* 'for-linus-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: lower the dirty balance poll interval
Tests show that the original large intervals can easily make the dirty
limit exceeded on 100 concurrent dd's. So adapt to as large as the
next check point selected by the dirty throttling algorithm.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs io submission threads can build up massive plug lists. This
keeps things more reasonable so we don't hand over huge dumps of IO at
once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported a problem booting into a new kernel with the old format inodes.
He was panicing in cow_file_range while writing out the inode cache. This is
because if the block group is not cached we'll just skip writing out the cache,
however if it gets dirtied again in the same transaction and it finished caching
we'd go ahead and write it out, but since we set cache_generation to the transid
we think we've already truncated it and will just carry on, running into
cow_file_range and blowing up. We need to make sure we only set
cache_generation if we've done the truncate. The user tested this patch and
verified that the panic no longer occured. Thanks,
Reported-and-Tested-by: Klaus Bitto <klaus.bitto@gmail.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
a loop. This is because we will add an orphan item, do the truncate, the
truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
left with an orphan item still in the fs. Then we come back later to do another
truncate and it blows up because we already have an orphan item. This is ok so
just fix the BUG_ON() to only BUG() if ret is not EEXIST. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We were occasionaly leaking space when running xfstest 269. This is because if
we failed to start the transaction in the truncate loop we'd just goto out, but
we need to break so that the inode is removed from the orphan list and the space
is properly freed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Running xfstests 269 with some tracing my scripts kept spitting out errors about
releasing bytes that we didn't actually have reserved. This took me down a huge
rabbit hole and it turns out the way we deal with reserved_extents is wrong,
we need to only be setting it if the reservation succeeds, otherwise the free()
method will come in and unreserve space that isn't actually reserved yet, which
can lead to other warnings and such. The math was all working out right in the
end, but it caused all sorts of other issues in addition to making my scripts
yell and scream and generally make it impossible for me to track down the
original issue I was looking for. The other problem is with our error handling
in the reservation code. There are two cases that we need to deal with
1) We raced with free. In this case free won't free anything because csum_bytes
is modified before we dro the lock in our reservation path, so free rightly
doesn't release any space because the reservation code may be depending on that
reservation. However if we fail, we need the reservation side to do the free at
that point since that space is no longer in use. So as it stands the code was
doing this fine and it worked out, except in case #2
2) We don't race with free. Nobody comes in and changes anything, and our
reservation fails. In this case we didn't reserve anything anyway and we just
need to clean up csum_bytes but not free anything. So we keep track of
csum_bytes before we drop the lock and if it hasn't changed we know we can just
decrement csum_bytes and carry on.
Because of the case where we can race with free()'s since we have to drop our
spin_lock to do the reservation, I'm going to serialize all reservations with
the i_mutex. We already get this for free in the heavy use paths, truncate and
file write all hold the i_mutex, just needed to add it to page_mkwrite and
various ioctl/balance things. With this patch my space leak scripts no longer
scream bloody murder. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Now that we're properly keeping track of delayed inode space we've been getting
a lot of warnings out of btrfs_dirty_inode() when running xfstest 83. This is
because a bunch of people call mark_inode_dirty, which is void so we can't
return ENOSPC. This needs to be fixed in a few areas
1) file_update_time - this updates the mtime and such when writing to a file,
which will call mark_inode_dirty. So copy file_update_time into btrfs so we can
call btrfs_dirty_inode directly and return an error if we get one appropriately.
2) fix symlinks to use btrfs_setattr for ->setattr. For some reason we weren't
setting ->setattr for symlinks, even though we should have been. This catches
one of the cases where we were getting errors in mark_inode_dirty.
3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
instead of mark_inode_dirty. This lets us return errors properly for truncate
and chown/anything related to setattr.
4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
print an error if we have one. The only remaining user we can't control for
this is touch_atime(), but we don't really want to keep people from walking
down the tree if we don't have space to save the atime update, so just complain
but don't worry about it.
With this patch xfstests 83 complains a handful of times instead of hundreds of
times. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Al pointed out we have some random problems with the way we account for
num_workers_starting in the async thread stuff. First of all we need to make
sure to decrement num_workers_starting if we fail to start the worker, so make
__btrfs_start_workers do this. Also fix __btrfs_start_workers so that it
doesn't call btrfs_stop_workers(), there is no point in stopping everybody if we
failed to create a worker. Also check_pending_worker_creates needs to call
__btrfs_start_work in it's work function since it already increments
num_workers_starting.
People only start one worker at a time, so get rid of the num_workers argument
everywhere, and make btrfs_queue_worker a void since it will always succeed.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The Smack LSM hook for security_d_instantiate checks
the inode's i_op->getxattr value to determine if the
containing filesystem supports extended attributes.
The BTRFS filesystem sets the inode's i_op value only
after it has instantiated the inode. This results in
Smack incorrectly giving new BTRFS inodes attributes
from the filesystem defaults on the assumption that
values can't be stored on the filesystem. This patch
moves the assignment of inode operation vectors ahead
of the calls to d_instantiate, letting Smack know that
the filesystem supports extended attributes. There
should be no impact on the performance or behavior of
BTRFS.
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we have a constant stream of end_io completions or crc work,
we can hit softlockup messages from the async helper threads. This
adds a cond_resched() into the loop to avoid them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since we have the free space caches, btrfs_orphan_cleanup also runs for
the tree_root. Unfortunately this also cleans up the orphans used to mark
subvol deletions in progress.
Currently if a subvol deletion gets interrupted twice by umount/mount, the
deletion will not be continued and the space permanently lost, though it
would be possible to write a tool to recover those lost subvol deletions.
This patch checks if the orphan belongs to a subvol (dead root) and skips
the deletion.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we use raid0 as the data profile, df command may show us a very
inaccurate value of the available space, which may be much less than the
real one. It may make the users puzzled. Fix it by changing the calculation
of the available space, and making it be more similar to a fake chunk
allocation.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfsck report errors after the 83th case of xfstests was run, The error
number is 400, it means the used disk space of the file is wrong.
The reason of this bug is that:
The file truncation may fail when the space of the file system is not enough,
and leave some file extents, whose offset are beyond the end of the files.
When we want to expand those files, we will drop those file extents, and
put in dummy file extents, and then we should update the i-node. But btrfs
forgets to do it.
This patch adds the forgotten i-node update.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfsck report error 100 after the 83th case of xfstests was run, it means
the i_size of the file is wrong.
The reason of this bug is that:
Btrfs increased i_size of the file at the beginning, but it failed to expand
the file, and failed to update the i_size to the old size because there is no
enough space in the file system, so we found a wrong i_size.
This patch fixes this bug by updating the i_size just when we pass the file
expanding and get enough space to update i-node.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The patch below removes an extra semicolon.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
btrfs_end_bio checks the number of errors on a bio against the max
number of errors allowed before sending any EIOs up to the higher
levels.
If we got enough copies of the bio done for a given raid level, it is
supposed to clear the bio error flag and return success.
We have pointers to the original bio sent down by the higher layers and
pointers to any cloned bios we made for raid purposes. If the original
bio happens to be the one that got an io error, but not the last one to
finish, it might not have the BIO_UPTODATE bit set.
Then, when the last bio does finish, we'll call bio_end_io on the
original bio. It won't have the uptodate bit set and we'll end up
sending EIO to the higher layers.
We already had a check for this, it just was conditional on getting the
IO error on the very last bio. Make the check unconditional so we eat
the EIOs properly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: drop spin lock when memory alloc fails
Btrfs: check if the to-be-added device is writable
Btrfs: try cluster but don't advance in search list
Btrfs: try to allocate from cluster even at LOOP_NO_EMPTY_SIZE
Drop spin lock in convert_extent_bit() when memory alloc fails,
otherwise, it will be a deadlock.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we call ioctl(BTRFS_IOC_ADD_DEV) directly, we'll succeed in adding
a readonly device to a btrfs filesystem, and btrfs will write to
that device, emitting kernel errors:
[ 3109.833692] lost page write due to I/O error on loop2
[ 3109.833720] lost page write due to I/O error on loop2
...
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we find an existing cluster, we switch to its block group as the
current block group, possibly skipping multiple blocks in the process.
Furthermore, under heavy contention, multiple threads may fail to
allocate from a cluster and then release just-created clusters just to
proceed to create new ones in a different block group.
This patch tries to allocate from an existing cluster regardless of its
block group, and doesn't switch to that group, instead proceeding to
try to allocate a cluster from the group it was iterating before the
attempt.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we reach LOOP_NO_EMPTY_SIZE, we won't even try to use a cluster that
others might have set up. Odds are that there won't be one, but if
someone else succeeded in setting it up, we might as well use it, even
if we don't try to set up a cluster again.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The below patch fixes some typos in various parts of the kernel, as well as fixes some comments.
Please let me know if I missed anything, and I will try to get it changed and resent.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: fix meta data raid-repair merge problem
Btrfs: skip allocation attempt from empty cluster
Btrfs: skip block groups without enough space for a cluster
Btrfs: start search for new cluster at the beginning
Btrfs: reset cluster's max_size when creating bitmap
Btrfs: initialize new bitmaps' list
Btrfs: fix oops when calling statfs on readonly device
Btrfs: Don't error on resizing FS to same size
Btrfs: fix deadlock on metadata reservation when evicting a inode
Fix URL of btrfs-progs git repository in docs
btrfs scrub: handle -ENOMEM from init_ipath()
Commit 4a54c8c16 introduced raid-repair, killing the individual
readpage_io_failed_hook entries from inode.c and disk-io.c. Commit
4bb31e92 introduced new readahead code, adding a readpage_io_failed_hook to
disk-io.c.
The raid-repair commit had logic to disable raid-repair, if
readpage_io_failed_hook is set. Thus, the readahead commit effectively
disabled raid-repair for meta data.
This commit changes the logic to always attempt raid-repair when needed and
call the readpage_io_failed_hook in case raid-repair fails. This is much
more straight forward and should have been like that from the beginning.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we don't have a cluster, don't bother trying to allocate from it,
jumping right away to the attempt to allocate a new cluster.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We test whether a block group has enough free space to hold the
requested block, but when we're doing clustered allocation, we can
save some cycles by testing whether it has enough room for the cluster
upfront, otherwise we end up attempting to set up a cluster and
failing. Only in the NO_EMPTY_SIZE loop do we attempt an unclustered
allocation, and by then we'll have zeroed the cluster size, so this
patch won't stop us from using the block group as a last resort.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Instead of starting at zero (offset is always zero), request a cluster
starting at search_start, that denotes the beginning of the current
block group.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The field that indicates the size of the largest contiguous chunk of
free space in the cluster is not initialized when setting up bitmaps,
it's only increased when we find a larger contiguous chunk. We end up
retaining a larger value than appropriate for highly-fragmented
clusters, which may cause pointless searches for large contiguous
groups, and even cause clusters that do not meet the density
requirements to be set up.
Signed-off-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We're failing to create clusters with bitmaps because
setup_cluster_no_bitmap checks that the list is empty before inserting
the bitmap entry in the list for setup_cluster_bitmap, but the list
field is only initialized when it is restored from the on-disk free
space cache, or when it is written out to disk.
Besides a potential race condition due to the multiple use of the list
field, filesystem performance severely degrades over time: as we use
up all non-bitmap free extents, the try-to-set-up-cluster dance is
done at every metadata block allocation. For every block group, we
fail to set up a cluster, and after failing on them all up to twice,
we fall back to the much slower unclustered allocation.
To make matters worse, before the unclustered allocation, we try to
create new block groups until we reach the 1% threshold, which
introduces additional bitmaps and thus block groups that we'll iterate
over at each metadata block request.
To reproduce this bug:
# dd if=/dev/zero of=img bs=1M count=256
# mkfs.btrfs img
# losetup -r /dev/loop1 img
# mount /dev/loop1 /mnt
OOPS!!
It triggered BUG_ON(!nr_devices) in btrfs_calc_avail_data_space().
To fix this, instead of checking write-only devices, we check all open
deivces:
# df -h /dev/loop1
Filesystem Size Used Avail Use% Mounted on
/dev/loop1 250M 28K 238M 1% /mnt
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
It seems overly harsh to fail a resize of a btrfs file system to the
same size when a shrink or grow would succeed. User app GParted trips
over this error. Allow it by bypassing the shrink or grow operation.
Signed-off-by: Mike Fleetwood <mike.fleetwood@googlemail.com>
When I ran the xfstests, I found the test tasks was blocked on meta-data
reservation.
By debugging, I found the reason of this bug:
start transaction
|
v
reserve meta-data space
|
v
flush delay allocation -> iput inode -> evict inode
^ |
| v
wait for delay allocation flush <- reserve meta-data space
And besides that, the flush on evicting inode will block the thread, which
is reclaiming the memory, and make oom happen easily.
Fix this bug by skipping the flush step when evicting inode.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
* 'pm-freezer' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: (24 commits)
freezer: fix wait_event_freezable/__thaw_task races
freezer: kill unused set_freezable_with_signal()
dmatest: don't use set_freezable_with_signal()
usb_storage: don't use set_freezable_with_signal()
freezer: remove unused @sig_only from freeze_task()
freezer: use lock_task_sighand() in fake_signal_wake_up()
freezer: restructure __refrigerator()
freezer: fix set_freezable[_with_signal]() race
freezer: remove should_send_signal() and update frozen()
freezer: remove now unused TIF_FREEZE
freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE
cgroup_freezer: prepare for removal of TIF_FREEZE
freezer: clean up freeze_processes() failure path
freezer: kill PF_FREEZING
freezer: test freezable conditions while holding freezer_lock
freezer: make freezing indicate freeze condition in effect
freezer: use dedicated lock instead of task_lock() + memory barrier
freezer: don't distinguish nosig tasks on thaw
freezer: remove racy clear_freeze_flag() and set PF_NOFREEZE on dead tasks
freezer: rename thaw_process() to __thaw_task() and simplify the implementation
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: remove free-space-cache.c WARN during log replay
Btrfs: sectorsize align offsets in fiemap
Btrfs: clear pages dirty for io and set them extent mapped
Btrfs: wait on caching if we're loading the free space cache
Btrfs: prefix resize related printks with btrfs:
btrfs: fix stat blocks accounting
Btrfs: avoid unnecessary bitmap search for cluster setup
Btrfs: fix to search one more bitmap for cluster setup
btrfs: mirror_num should be int, not u64
btrfs: Fix up 32/64-bit compatibility for new ioctls
Btrfs: fix barrier flushes
Btrfs: fix tree corruption after multi-thread snapshots and inode_cache flush
There is no reason to export two functions for entering the
refrigerator. Calling refrigerator() instead of try_to_freeze()
doesn't save anything noticeable or removes any race condition.
* Rename refrigerator() to __refrigerator() and make it return bool
indicating whether it scheduled out for freezing.
* Update try_to_freeze() to return bool and relay the return value of
__refrigerator() if freezing().
* Convert all refrigerator() users to try_to_freeze().
* Update documentation accordingly.
* While at it, add might_sleep() to try_to_freeze().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Samuel Ortiz <samuel@sortiz.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: Christoph Hellwig <hch@infradead.org>
The log replay code only partially loads block groups, since
the block group caching code is able to detect and deal with
extents the logging code has pinned down.
While the logging code is pinning down block groups, there is
a bogus WARN_ON we're hitting if the code wasn't able to find
an extent in the cache. This commit removes the warning because
it can happen any time there isn't a valid free space cache
for that block group.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've been hitting BUG()'s in btrfs_cont_expand and btrfs_fallocate and anywhere
else that calls btrfs_get_extent while running xfstests 13 in a loop. This is
because fiemap is calling btrfs_get_extent with non-sectorsize aligned offsets,
which will end up adding mappings that are not sectorsize aligned, which will
cause problems in some cases for subsequent calls to btrfs_get_extent for
similar areas that are sectorsize aligned. With this patch I ran xfstests 13 in
a loop for a couple of hours and didn't hit the problem that I could previously
hit in at most 20 minutes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When doing the io_ctl helpers to clean up the free space cache stuff I stopped
using our normal prepare_pages stuff, which means I of course forgot to do
things like set the pages extent mapped, which will cause us all sorts of
wonderful propblems. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We've been hitting panics when running xfstest 13 in a loop for long periods of
time. And actually this problem has always existed so we've been hitting these
things randomly for a while. Basically what happens is we get a thread coming
into the allocator and reading the space cache off of disk and adding the
entries to the free space cache as we go. Then we get another thread that comes
in and tries to allocate from that block group. Since block_group->cached !=
BTRFS_CACHE_NO it goes ahead and tries to do the allocation. We do this because
if we're doing the old slow way of caching we don't want to hold people up and
wait for everything to finish. The problem with this is we could end up
discarding the space cache at some arbitrary point in the future, which means we
could very well end up allocating space that is either bad, or when the real
caching happens it could end up thinking the space isn't in use when it really
is and cause all sorts of other problems.
The solution is to add a new flag to indicate we are loading the free space
cache from disk, and always try to cache the block group if cache->cached !=
BTRFS_CACHE_FINISHED. That way if we are loading the space cache anybody else
who tries to allocate from the block group will have to wait until it's finished
to make sure it completes successfully. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
For the user it is confusing to find something like:
[10197.627710] new size for /dev/mapper/vg0-usr_share is 3221225472
in kernel log, because it doesn't point directly to btrfs.
This patch prefixes those messages with "btrfs:" like other btrfs
related printks.
Signed-off-by: Arnd Hannemann <arnd@arndnet.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Round inode bytes and delalloc bytes up to real blocksize before
converting to sector size. Otherwise eg. files smaller than 512
are reported with zero blocks due to incorrect rounding.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
setup_cluster_no_bitmap() searches all the extents and bitmaps starting
from offset. Therefore if it returns -ENOSPC, all the bitmaps starting
from offset are in the bitmaps list, so it's sufficient to search from
this list in setup_cluser_bitmap().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Suppose there are two bitmaps [0, 256], [256, 512] and one extent
[100, 120] in the free space cache, and we want to setup a cluster
with offset=100, bytes=50.
In this case, there will be only one bitmap [256, 512] in the temporary
bitmaps list, and then setup_cluster_bitmap() won't search bitmap [0, 256].
The cause is, the list is constructed in setup_cluster_no_bitmap(),
and only bitmaps with bitmap_entry->offset >= offset will be added
into the list, and the very bitmap that convers offset has
bitmap_entry->offset <= offset.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
My previous patch introduced some u64 for failed_mirror variables, this one
makes it consistent again.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch casts to unsigned long before casting to a pointer and fixes
the following warnings:
fs/btrfs/extent_io.c:2289:20: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2933:37: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
fs/btrfs/ioctl.c:2937:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/ioctl.c:3020:21: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/scrub.c:275:4: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
fs/btrfs/backref.c:686:27: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs is writing the super blocks, it send barrier flushes to make
sure writeback caching drives get all the metadata on disk in the
right order.
But, we have two bugs in the way these are sent down. When doing
full commits (not via the tree log), we are sending the barrier down
before the last super when it should be going down before the first.
In multi-device setups, we should be waiting for the barriers to
complete on all devices before writing any of the supers.
Both of these bugs can cause corruptions on power failures. We fix it
with some new code to send down empty barriers to all devices before
writing the first super.
Alexandre Oliva found the multi-device bug. Arne Jansen did the async
barrier loop.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Alexandre Oliva <oliva@lsd.ic.unicamp.br>
takes vfsmount and relative path, does lookup within that vfsmount
(possibly triggering automounts) and returns the result as root
of subtree suitable for return by ->mount() (i.e. a reference to
dentry and an active reference to its superblock grabbed, superblock
locked exclusive).
btrfs and nfs switched to it instead of open-coding the sucker.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Life is much saner if create_mnt_ns(mnt) drops mnt in case of error...
Switch it to such calling conventions, switch callers, fix double mntput() in
fs/nfs/super.c one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The btrfs snapshotting code requires that once a root has been
snapshotted, we don't change it during a commit.
But there are two cases to lead to tree corruptions:
1) multi-thread snapshots can commit serveral snapshots in a transaction,
and this may change the src root when processing the following pending
snapshots, which lead to the former snapshots corruptions;
2) the free inode cache was changing the roots when it root the cache,
which lead to corruptions.
This fixes things by making sure we force COW the block after we create a
snapshot during commiting a transaction, then any changes to the roots
will result in COW, and we get all the fs roots and snapshot roots to be
consistent.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: rename the option to nospace_cache
Btrfs: handle bio_add_page failure gracefully in scrub
Btrfs: fix deadlock caused by the race between relocation
Btrfs: only map pages if we know we need them when reading the space cache
Btrfs: fix orphan backref nodes
Btrfs: Abstract similar code for btrfs_block_rsv_add{, _noflush}
Btrfs: fix unreleased path in btrfs_orphan_cleanup()
Btrfs: fix no reserved space for writing out inode cache
Btrfs: fix nocow when deleting the item
Btrfs: tweak the delayed inode reservations again
Btrfs: rework error handling in btrfs_mount()
Btrfs: close devices on all error paths in open_ctree()
Btrfs: avoid null dereference and leaks when bailing from open_ctree()
Btrfs: fix subvol_name leak on error in btrfs_mount()
Btrfs: fix memory leak in btrfs_parse_early_options()
Btrfs: fix our reservations for updating an inode when completing io
Btrfs: fix oops on NULL trans handle in btrfs_truncate
btrfs: fix double-free 'tree_root' in 'btrfs_mount()'
Rename no_space_cache option to nospace_cache to be more consistent with
the rest, where the simple prefix 'no' is used to negate an option.
The option has been introduced during the -rc1 cycle and there are has not been
widely used, so it's safe.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently scrub fails with ENOMEM when bio_add_page fails. Unfortunately
dm based targets accept only one page per bio, thus making scrub always
fails. This patch just submits the current bio when an error is encountered
and starts a new one.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can not do flushable reservation for the relocation when we create snapshot,
because it may make the transaction commit task and the flush task wait for
each other and the deadlock happens.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
People have been running into a warning when loading space cache because the
page is already mapped when trying to read in a bitmap. The way we read in
entries and pages is kind of convoluted, so fix it so that io_ctl_read_entry
maps the entries if it needs to, and if it hits the end of the page it simply
unmaps the page. That way we can unconditionally unmap the io_ctl before
reading in the bitmap and we should stop hitting these warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If the root node of a fs/file tree is in the block group that is
being relocated, but the others are not in the other block groups.
when we create a snapshot for this tree between the relocation tree
creation ends and ->create_reloc_tree is set to 0, Btrfs will create
some backref nodes that are the lowest nodes of the backrefs cache.
But we forget to add them into ->leaves list of the backref cache
and deal with them, and at last, they will triggered BUG_ON().
kernel BUG at fs/btrfs/relocation.c:239!
This patch fixes it by adding them into ->leaves list of backref cache.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_block_rsv_add{, _noflush}() have similar code, so abstract that code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we did stress test for the space relocation, the deadlock happened.
By debugging, We found it was caused by the carelessness that we forgot
to unlock the read lock of the extent buffers in btrfs_orphan_cleanup()
before we end the transaction handle, so the transaction commit task waited
the task, which called btrfs_orphan_cleanup(), to unlock the extent buffer,
but that task waited the commit task to end the transaction commit, and
the deadlock happened. Fix it.
Signed-ff-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I-node cache forgets to reserve the space when writing out it. And when
we do some stress test, such as synctest, it will trigger WARN_ON() in
use_block_rsv().
WARNING: at fs/btrfs/extent-tree.c:5718 btrfs_alloc_free_block+0xbf/0x281 [btrfs]()
...
Call Trace:
[<ffffffff8104df86>] warn_slowpath_common+0x80/0x98
[<ffffffff8104dfb3>] warn_slowpath_null+0x15/0x17
[<ffffffffa0369c60>] btrfs_alloc_free_block+0xbf/0x281 [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa035c040>] __btrfs_cow_block+0x118/0x3b5 [btrfs]
[<ffffffffa035c7ba>] btrfs_cow_block+0x103/0x14e [btrfs]
[<ffffffffa035e4c4>] btrfs_search_slot+0x249/0x6a4 [btrfs]
[<ffffffffa036d086>] btrfs_lookup_inode+0x2a/0x8a [btrfs]
[<ffffffffa03788b7>] btrfs_update_inode+0xaa/0x141 [btrfs]
[<ffffffffa036d7ec>] btrfs_save_ino_cache+0xea/0x202 [btrfs]
[<ffffffffa03a761e>] ? btrfs_update_reloc_root+0x17e/0x197 [btrfs]
[<ffffffffa0373867>] commit_fs_roots+0xaa/0x158 [btrfs]
[<ffffffffa03746a6>] btrfs_commit_transaction+0x405/0x731 [btrfs]
[<ffffffff810690df>] ? wake_up_bit+0x25/0x25
[<ffffffffa039d652>] ? btrfs_log_dentry_safe+0x43/0x51 [btrfs]
[<ffffffffa0381c5f>] btrfs_sync_file+0x16a/0x198 [btrfs]
[<ffffffff81122806>] ? mntput+0x21/0x23
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
Sometimes it causes BUG_ON() in the reservation code of the delayed inode
is triggered.
So we must reserve enough space for inode cache.
Note: If we can not reserve the enough space for inode cache, we will
give up writing out it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_previous_item() just search the b+ tree, do not COW the nodes or leaves,
if we modify the result of it, the meta-data will be broken. fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef sent along an incremental to the inode reservation
code to make sure we try and fall back to directly updating
the inode item if things go horribly wrong.
This reworks that patch slightly, adding a fallback function
that will always try to update the inode item directly without
going through the delayed_inode code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commits 6c41761f and 45ea6095 introduced the possibility of NULL pointer
dereference on error paths, also we would leave all devices busy and
leak fs_info with all sub-structures on error when trying to mount an
already mounted fs to a different directory.
Fix this by doing all allocations before trying to open any of the
devices, adjust error path for mount-already-mounted-fs case.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix a bug introduced by 7e662854 where we would leave devices busy on
certain error paths in open_ctree(). fs_info is guaranteed to be
non-NULL now so it's safe to dereference it on all error paths.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Fix bugs introduced by 6c41761f. Firstly, after failing to allocate any
of the tree roots (first 'goto fail' in open_ctree()) we would
dereference a NULL fs_info pointer in free_fs_info(). Secondly, after
failures from init_srcu_struct(), setup_bdi() and new_inode() we would
leak all earlier allocated roots: fs_info fields haven't been
initialized yet so free_fs_info() is rendered useless.
Fix this by initializing fs_info pointer and fs_info fields before any
allocations happen.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
btrfs_parse_early_options() can fail due to error while scanning devices
(-o device= option), but still strdup() subvol_name string:
mount -o subvol=SUBV,device=BAD_DEVICE <dev> <mnt>
So free subvol_name string on error.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Don't leak subvol_name string in case multiple subvol= options are
given. "The lastest option is effective" behavior (consistent with
subvolid= and subvolrootid= options) is preserved.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
People have been reporting ENOSPC crashes in finish_ordered_io. This is because
we try to steal from the delalloc block rsv to satisfy a reservation to update
the inode. The problem with this is we don't explicitly save space for updating
the inode when doing delalloc. This is kind of a problem and we've gotten away
with this because way back when we just stole from the delalloc reserve without
any questions, and this worked out fine because generally speaking the leaf had
been modified either by the mtime update when we did the original write or
because we just updated the leaf when we inserted the file extent item, only on
rare occasions had the leaf not actually been modified, and that was still ok
because we'd just use a block or two out of the over-reservation that is
delalloc.
Then came the delayed inode stuff. This is amazing, except it wants a full
reservation for updating the inode since it may do it at some point down the
road after we've written the blocks and we have to recow everything again. This
worked out because the delayed inode stuff just stole from the global reserve,
that is until recently when I changed that because it caused other problems.
So here we are, we're doing everything right and being screwed for it. So take
an extra reservation for the inode at delalloc reservation time and carry it
through the life of the delalloc reservation. If we need it we can steal it in
the delayed inode stuff. If we have already stolen it try and do a normal
metadata reservation. If that fails try to steal from the delalloc reservation.
If _that_ fails we'll get a WARN_ON() so I can start thinking of a better way to
solve this and in the meantime we'll steal from the global reserve.
With this patch I ran xfstests 13 in a loop for a couple of hours and didn't see
any problems.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we fail to reserve space in the transaction during truncate, we can
error out with a NULL trans handle. The cleanup code needs an extra
check to make sure we aren't trying to use the bad handle.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On error path 'tree_root' is treed in 'free_fs_info()'.
No need to free it explicitely. Noticed by SLUB in debug mode:
Complete reproducer under usermode linux (discovered on real
machine):
bdev=/dev/ubda
btr_root=/btr
/mkfs.btrfs $bdev
mount $bdev $btr_root
mkdir $btr_root/subvols/
cd $btr_root/subvols/
/btrfs su cr foo
/btrfs su cr bar
mount $bdev -osubvol=subvols/foo $btr_root/subvols/bar
umount $btr_root/subvols/bar
which gives
device fsid 4d55aa28-45b1-474b-b4ec-da912322195e devid 1 transid 7 /dev/ubda
=============================================================================
BUG kmalloc-2048: Object already free
-----------------------------------------------------------------------------
INFO: Allocated in btrfs_mount+0x389/0x7f0 age=0 cpu=0 pid=277
INFO: Freed in btrfs_mount+0x51c/0x7f0 age=0 cpu=0 pid=277
INFO: Slab 0x0000000062886200 objects=15 used=9 fp=0x0000000070b4d2d0 flags=0x4081
INFO: Object 0x0000000070b4d2d0 @offset=21200 fp=0x0000000070b4a968
...
Call Trace:
70b31948: [<6008c522>] print_trailer+0xe2/0x130
70b31978: [<6008c5aa>] object_err+0x3a/0x50
70b319a8: [<6008e242>] free_debug_processing+0x142/0x2a0
70b319e0: [<600ebf6f>] btrfs_mount+0x55f/0x7f0
70b319f8: [<6008e5c1>] __slab_free+0x221/0x2d0
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Arne Jansen <sensille@gmx.net>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (114 commits)
Btrfs: check for a null fs root when writing to the backup root log
Btrfs: fix race during transaction joins
Btrfs: fix a potential btrfs_bio leak on scrub fixups
Btrfs: rename btrfs_bio multi -> bbio for consistency
Btrfs: stop leaking btrfs_bios on readahead
Btrfs: stop the readahead threads on failed mount
Btrfs: fix extent_buffer leak in the metadata IO error handling
Btrfs: fix the new inspection ioctls for 32 bit compat
Btrfs: fix delayed insertion reservation
Btrfs: ClearPageError during writepage and clean_tree_block
Btrfs: be smarter about committing the transaction in reserve_metadata_bytes
Btrfs: make a delayed_block_rsv for the delayed item insertion
Btrfs: add a log of past tree roots
btrfs: separate superblock items out of fs_info
Btrfs: use the global reserve when truncating the free space cache inode
Btrfs: release metadata from global reserve if we have to fallback for unlink
Btrfs: make sure to flush queued bios if write_cache_pages waits
Btrfs: fix extent pinning bugs in the tree log
Btrfs: make sure btrfs_remove_free_space doesn't leak EAGAIN
Btrfs: don't wait as long for more batches during SSD log commit
...
* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: Add a 'reason' to wb_writeback_work
writeback: send work item to queue_io, move_expired_inodes
writeback: trace event balance_dirty_pages
writeback: trace event bdi_dirty_ratelimit
writeback: fix ppc compile warnings on do_div(long long, unsigned long)
writeback: per-bdi background threshold
writeback: dirty position control - bdi reserve area
writeback: control dirty pause time
writeback: limit max dirty pause time
writeback: IO-less balance_dirty_pages()
writeback: per task dirty rate limit
writeback: stabilize bdi->dirty_ratelimit
writeback: dirty rate control
writeback: add bg_threshold parameter to __bdi_update_bandwidth()
writeback: dirty position control
writeback: account per-bdi accumulated dirtied pages
During log replay, can commit the transaction before the fs_root
pointers are setup, so we have to make sure they are not null before
trying to use them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While we're allocating ram for a new transaction, we drop our spinlock.
When we get the lock back, we do check to see if a transaction started
while we slept, but we don't check to make sure it isn't blocked
because a commit has already started.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In case we were able to map less than we wanted (length < PAGE_SIZE
clause is true) btrfs_bio is still allocated and we have to free it.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The scrub readahead branch brought in a new error handling hook,
but it was leaking extent_buffer references.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The new ioctls to follow backrefs are not clean for 32/64 bit
compat. This reworks them for u64s everywhere. They are brand new, so
there are no problems with changing the interface now.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We all keep getting those stupid warnings from use_block_rsv when running
stress.sh, and it's because the delayed insertion stuff is being stupid. It's
not the delayed insertion stuffs fault, it's all just stupid. When marking an
inode dirty for oh say updating the time on it, we just do a
btrfs_join_transaction, which doesn't reserve any space. This is stupid because
we're going to have to have space reserve to make this change, but we do it
because it's fast because chances are we're going to call it over and over again
and it doesn't matter. Well thanks to the delayed insertion stuff this is
mostly the case, so we do actually need to make this reservation. So if
trans->bytes_reserved is 0 then try to do a normal reservation. If not return
ENOSPC which will make the btrfs_dirty_inode start a proper transaction which
will let it do the whole ENOSPC dance and reserve enough space for the delayed
insertion to steal the reservation from the transaction.
The other stupid thing we do is not reserve space for the inode when writing to
the thing. Usually this is ok since we have to update the time so we'd have
already done all this work before we get to the endio stuff, so it doesn't
matter. But this is stupid because we could write the data after the
transaction commits where we changed the mtime of the inode so we have to cow
all the way down to the inode anyway. This used to be masked by the delalloc
reservation stuff, but because we delay the update it doesn't get masked in this
case. So again the delayed insertion stuff bites us in the ass. So if our
trans->block_rsv is delalloc, just steal the reservation from the delalloc
reserve. Hopefully this won't bite us in the ass, but I've said that before.
With this patch stress.sh no longer spits out those stupid warnings (famous last
words). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Failure testing was tripping up over stale PageError bits in
metadata pages. If we have an io error on a block, and later on
end up reusing it, nobody ever clears PageError on those pages.
During commit, we'll find PageError and think we had trouble writing
the block, which will lead to aborts and other problems.
This changes clean_tree_block and the btrfs writepage code to
clear the PageError bit. In both cases we're either completely
done with the page or the page has good stuff and the error bit
is no longer valid.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because of the overcommit stuff I had to make it so that we committed the
transaction all the time in reserve_metadata_bytes in case we had overcommitted
because of delayed items. This was because previously we had no way of knowing
how much space was reserved for delayed items. Now that we have the
delayed_block_rsv we can check it to see if committing the transaction would get
us anywhere. This patch breaks out the committing logic into a helper function
that will check to see if committing the transaction would free enough space for
us to get anything done. With this patch xfstests 83 goes from taking 445
seconds to taking 28 seconds on my box. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I've been hitting warnings in use_block_rsv when running the delayed insertion
stuff. It's because we will readjust global block rsv based on what is in use,
which means we could end up discarding reservations that are for the delayed
insertion stuff. So instead create a seperate block rsv for the delayed
insertion stuff. This will also make it easier to debug problems with the
delayed insertion reservations since we will know that only the delayed
insertion code touches this block_rsv. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This takes some of the free space in the btrfs super block
to record information about most of the roots in the last four
commits.
It also adds a -o recovery to use the root history log when
we're not able to read the tree of tree roots, the extent
tree root, the device tree root or the csum root.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs_info has now ~9kb, more than fits into one page. This will cause
mount failure when memory is too fragmented. Top space consumers are
super block structures super_copy and super_for_commit, ~2.8kb each.
Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)
Add a wrapper for freeing fs_info and all of it's dynamically allocated
members.
Signed-off-by: David Sterba <dsterba@suse.cz>
We no longer use the orphan block rsv for holding the reservation for truncating
the inode, so instead use the global block rsv and check to make sure it has
enough space for us to truncate the space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I fixed a problem where we weren't reserving space for an orphan item when we
had to fallback to using the global reserve for an unlink, but I introduced
another problem. I was migrating the bytes from the transaction reserve to the
global reserve and then releasing from the global reserve in
btrfs_end_transaction(). The problem with this is that a migrate will jack up
the size for the destination, but leave the size alone for the source, with the
idea that you can do a release normally on the source and it all washes out, and
then you can do a release again on the destination and it works out right. My
way was skipping the release on the trans_block_rsv which still had the jacked
up size from our original reservation. So instead release manually from the
global reserve if this transaction was using it, and then set the
trans->block_rsv back to the trans_block_rsv so that btrfs_end_transaction
cleans everything up properly. With this patch xfstest 83 doesn't emit warnings
about leaking space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
write_cache_pages tries to build up a large bio to stuff down the pipe.
But if it needs to wait for a page lock, it needs to make sure and send
down any pending writes so we don't deadlock with anyone who has the
page lock and is waiting for writeback of things inside the bio.
Dave Sterba triggered this as a deadlock between the autodefrag code and
the extent write_cache_pages
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The tree log had two important bugs that could cause corruptions after a
crash. Sometimes we were allowing tree log blocks to be reused after
the tree log was committed but before the transaction commit was done.
This allowed a future metadata write to overwrite the tree log data. It
is fixed by adding a new variant of freeing reserved extents that always
pins them. Credit goes to Stefan Behrens and Arne Jansen for many many
hours spent tracking this bug down.
During tree log replay, we do a pass through the tree log and pin all
the extents we find. This makes sure the replay code won't go in and
use any of those blocks for new allocations during replay. The problem
is the free space cache isn't honoring these pinned extents. So the
allocator can end up handing them out, leading to all kinds of problems
during replay.
The fix here is to force any free space cache to load while we pin the
extents, and then to make sure we remove the pinned extents from the
free space rbtree.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Stefan Behrens <sbehrens@giantdisaster.de>
btrfs_remove_free_space needs to make sure to set ret back to a
valid return value after setting it to EAGAIN, otherwise we return
it to the callers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we're doing log commits, we try to wait for more writers to come in
and make the commit bigger. This helps improve performance on rotating
disks, but on SSDs it adds latencies.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Replace remaining direct i_nlink updates with a new set_nlink()
updater function.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
This creates a new 'reason' field in a wb_writeback_work
structure, which unambiguously identifies who initiates
writeback activity. A 'wb_reason' enumeration has been
added to writeback.h, to enumerate the possible reasons.
The 'writeback_work_class' and tracepoint event class and
'writeback_queue_io' tracepoints are updated to include the
symbolic 'reason' in all trace events.
And the 'writeback_inodes_sbXXX' family of routines has had
a wb_stats parameter added to them, so callers can specify
why writeback is being started.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Curt Wohlgemuth <curtw@google.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue: (21 commits)
leases: fix write-open/read-lease race
nfs: drop unnecessary locking in llseek
ext4: replace cut'n'pasted llseek code with generic_file_llseek_size
vfs: add generic_file_llseek_size
vfs: do (nearly) lockless generic_file_llseek
direct-io: merge direct_io_walker into __blockdev_direct_IO
direct-io: inline the complete submission path
direct-io: separate map_bh from dio
direct-io: use a slab cache for struct dio
direct-io: rearrange fields in dio/dio_submit to avoid holes
direct-io: fix a wrong comment
direct-io: separate fields only used in the submission path from struct dio
vfs: fix spinning prevention in prune_icache_sb
vfs: add a comment to inode_permission()
vfs: pass all mask flags check_acl and posix_acl_permission
vfs: add hex format for MAY_* flag values
vfs: indicate that the permission functions take all the MAY_* flags
compat: sync compat_stats with statfs.
vfs: add "device" tag to /proc/self/mountstats
cleanup: vfs: small comment fix for block_invalidatepage
...
Fix up trivial conflict in fs/gfs2/file.c (llseek changes)
The i_mutex lock use of generic _file_llseek hurts. Independent processes
accessing the same file synchronize over a single lock, even though
they have no need for synchronization at all.
Under high utilization this can cause llseek to scale very poorly on larger
systems.
This patch does some rethinking of the llseek locking model:
First the 64bit f_pos is not necessarily atomic without locks
on 32bit systems. This can already cause races with read() today.
This was discussed on linux-kernel in the past and deemed acceptable.
The patch does not change that.
Let's look at the different seek variants:
SEEK_SET: Doesn't really need any locking.
If there's a race one writer wins, the other loses.
For 32bit the non atomic update races against read()
stay the same. Without a lock they can also happen
against write() now. The read() race was deemed
acceptable in past discussions, and I think if it's
ok for read it's ok for write too.
=> Don't need a lock.
SEEK_END: This behaves like SEEK_SET plus it reads
the maximum size too. Reading the maximum size would have the
32bit atomic problem. But luckily we already have a way to read
the maximum size without locking (i_size_read), so we
can just use that instead.
Without i_mutex there is no synchronization with write() anymore,
however since the write() update is atomic on 64bit it just behaves
like another racy SEEK_SET. On non atomic 32bit it's the same
as SEEK_SET.
=> Don't need a lock, but need to use i_size_read()
SEEK_CUR: This has a read-modify-write race window
on the same file. One could argue that any application
doing unsynchronized seeks on the same file is already broken.
But for the sake of not adding a regression here I'm
using the file->f_lock to synchronize this. Using this
lock is much better than the inode mutex because it doesn't
synchronize between processes.
=> So still need a lock, but can use a f_lock.
This patch implements this new scheme in generic_file_llseek.
I dropped generic_file_llseek_unlocked and changed all callers.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
* 'next' of git://selinuxproject.org/~jmorris/linux-security: (95 commits)
TOMOYO: Fix incomplete read after seek.
Smack: allow to access /smack/access as normal user
TOMOYO: Fix unused kernel config option.
Smack: fix: invalid length set for the result of /smack/access
Smack: compilation fix
Smack: fix for /smack/access output, use string instead of byte
Smack: domain transition protections (v3)
Smack: Provide information for UDS getsockopt(SO_PEERCRED)
Smack: Clean up comments
Smack: Repair processing of fcntl
Smack: Rule list lookup performance
Smack: check permissions from user space (v2)
TOMOYO: Fix quota and garbage collector.
TOMOYO: Remove redundant tasklist_lock.
TOMOYO: Fix domain transition failure warning.
TOMOYO: Remove tomoyo_policy_memory_lock spinlock.
TOMOYO: Simplify garbage collector.
TOMOYO: Fix make namespacecheck warnings.
target: check hex2bin result
encrypted-keys: check hex2bin result
...
The WARN_ON under some circumstances heavily polute log and slow down
the machine. This is just a safety, as the warning should be fixed by
another patch, nevertheless, it still pops up during testing.
Signed-off-by: David Sterba <dsterba@suse.cz>
There's a missing test whether the path passed to subvol=path option
during mount is a real subvolume, allowing any directory located in
default subovlume to be passed and accepted for mount.
(current btrfs progs prevent this early)
$ btrfs subvol snapshot . p1-snap
ERROR: '.' is not a subvolume
(with "is subvolume?" test bypassed)
$ btrfs subvol snapshot . p1-snap
Create a snapshot of '.' in './p1-snap'
$ btrfs subvol list -p .
ID 258 parent 5 top level 5 path subvol
ID 259 parent 5 top level 5 path subvol1
ID 260 parent 5 top level 5 path default-subvol1
ID 262 parent 5 top level 5 path p1/p1-snapshot
ID 263 parent 259 top level 5 path subvol1/subvol1-snap
The problem I see is that this makes a false impression of snapshotting the
given subvolume but in fact snapshots the default one: a user expects outcome
like ID 263 but in fact gets ID 262 .
This patch makes mount fail with EINVAL with a message in syslog.
Signed-off-by: David Sterba <dsterba@suse.cz>
Fix a bug introduced by 20b45077. We have to return EINVAL on mount
failure, but doing that too early in the sequence leaves all of the
devices opened exclusively. This also fixes an issue where under some
scenarios only a second mount -o degraded <devices> command would
succeed.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Initialize fs_info->bdev_holder a bit earlier to be able to pass a
correct holder id to blkdev_get() when opening seed devices with O_EXCL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
If lookup_extent_backref fails, path->nodes[0] reasonably could be
null along with other callers of btrfs_print_leaf, so ensure we have a
valid extent buffer before dereferencing.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
The task may fail to get free space though it is enough when multi-task
space allocation and caching space happen at the same time.
Task1 Caching Thread Task2
------------------------------------------------------------------------
find_free_extent
The space has not
be cached, and start
caching thread. And
wait for it.
cache space, if
the space is > 2MB
wake up Task1
find_free_extent
get all the space that
is cached.
try to allocate space,
but there is no space
now.
trigger BUG_ON()
The message is following:
btrfs allocation failed flags 1, wanted 4096
space_info has 1040187392 free, is not full
space_info total=1082130432, used=4096, pinned=41938944, reserved=0, may_use=40828928, readonly=0
block group 12582912 has 8388608 bytes, 0 used 8388608 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
block group 1103101952 has 1073741824 bytes, 4096 used 33550336 pinned 0 reserved
block group has cluster?: no
0 blocks of free space at or bigger than bytes is
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:835!
[<ffffffffa031261b>] __extent_writepage+0x1bf/0x5ce [btrfs]
[<ffffffff810cbcb8>] ? __set_page_dirty_nobuffers+0xfe/0x108
[<ffffffffa02f8ada>] ? wait_current_trans+0x23/0xec [btrfs]
[<ffffffff810c3fbf>] ? find_get_pages_tag+0x73/0xe2
[<ffffffffa0312d12>] extent_write_cache_pages.clone.0+0x176/0x29a [btrfs]
[<ffffffffa0312e74>] extent_writepages+0x3e/0x53 [btrfs]
[<ffffffff8110ad2c>] ? do_sync_write+0xc6/0x103
[<ffffffffa0302d6e>] ? btrfs_submit_direct+0x414/0x414 [btrfs]
[<ffffffff811380fa>] ? fsnotify+0x236/0x266
[<ffffffffa02fc930>] btrfs_writepages+0x22/0x24 [btrfs]
[<ffffffff810cc215>] do_writepages+0x1c/0x25
[<ffffffff810c4958>] __filemap_fdatawrite_range+0x4e/0x50
[<ffffffff810c4982>] filemap_write_and_wait_range+0x28/0x51
[<ffffffffa0306b2e>] btrfs_sync_file+0x7d/0x198 [btrfs]
[<ffffffff8110aa26>] ? fsnotify_modify+0x5d/0x65
[<ffffffff8112d150>] vfs_fsync_range+0x18/0x21
[<ffffffff8112d170>] vfs_fsync+0x17/0x19
[<ffffffff8112d316>] do_fsync+0x29/0x3e
[<ffffffff8112d348>] sys_fsync+0xb/0xf
[<ffffffff81468352>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa02fe08c>] cow_file_range+0x1c4/0x32b [btrfs]
We fix this bug by trying to allocate the space again if there are block groups
in caching.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
In btrfs_get_acl(), when the second __btrfs_getxattr() call fails,
acl is not correctly set.
Therefore, a wrong value might return to the caller.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Free space items are located in tree of tree roots, not in the extent
tree. It didn't pop up because lookup_free_space_inode() grabs the
inode all the time instead of actually searching the tree.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
To reproduce the bug:
# mount -o nodatacow /dev/sda7 /mnt/
# dd if=/dev/zero of=/mnt/tmp bs=4K count=1
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.000136115 s, 30.1 MB/s
# dd if=/dev/zero of=/mnt/tmp bs=4K count=1 conv=notrunc oflag=direct
dd: writing `/mnt/tmp': Input/output error
1+0 records in
0+0 records out
btrfs_ordered_update_i_size() may return 1, but btrfs_endio_direct_write()
mistakenly takes it as an error.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
It's not a big deal if we fail to allocate the array, and instead of
panic we can just give up compressing.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We should retirn EINVAL if the start is beyond the end of the file
system in the btrfs_ioctl_fitrim(). Fix that by adding the appropriate
check for it.
Also in the btrfs_trim_fs() it is possible that len+start might overflow
if big values are passed. Fix it by decrementing the len so that start+len
is equal to the file system size in the worst case.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
We won't defrag an extent, if it's bigger than the threshold we
specified and there's no small extent before it, but actually
the code doesn't work this way.
There are three bugs:
- When should_defrag_range() decides we should keep on defragmenting
an extent, last_len is not incremented. (old bug)
- The length that passes to should_defrag_range() is not the length
we're going to defrag. (new bug)
- We always defrag 256K bytes data, and a big extent can be part of
this range. (new bug)
For a file with 4 extents:
| 4K | 4K | 256K | 256K |
The result of defrag with (the default) 256K extent thresh should be:
| 264K | 256K |
but with those bugs, we'll get:
| 520K |
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
It's off-by-one, and thus we may skip the last page while defragmenting.
An example case:
# create /mnt/file with 2 4K file extents
# btrfs fi defrag /mnt/file
# sync
# filefrag /mnt/file
/mnt/file: 2 extents found
So it's not defragmented.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Don't use inode->i_size directly, since we're not holding i_mutex.
This also fixes another bug, that i_size can change after it's checked
against 0 and then (i_size - 1) can be negative.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Offset field in data extent backref can underflow if clone range ioctl
is used. We can reliably detect the underflow because max file size is
limited to 2^63 and max data extent size is limited by block group size.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
I noticed we had a little bit of latency when writing out the space cache
inodes. It's because we flush it before we write anything in case we have dirty
pages already there. This doesn't matter though since we're just going to
overwrite the space, and there really shouldn't be any dirty pages anyway. This
makes some of my tests run a little bit faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Mitch kept hitting a panic because he was getting ENOSPC. One of my previous
patches makes it so we are much better at not allocating new metadata chunks.
Unfortunately coupled with the overcommit patch this works us into a bit of a
problem if we are removing a bunch of space and end up chewing up all of our
space with pinned extents. We can allocate chunks fine and overflow is ok, but
the only way to reclaim this space is to commit the transaction. So if we go to
overcommit, first check and see how much pinned space we have. If we have more
than 80% of the free space chewed up with pinned extents, just commit the
transaction, this will free up enough space for our reservation and we won't
have this problem anymore. With this patch Mitch's test doesn't blow up
anymore. Thanks,
Reported-and-tested-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently btrfs_block_rsv_check does 2 things, it will either refill a block
reserve like in the truncate or refill case, or it will check to see if there is
enough space in the global reserve and possibly refill it. However because of
overcommit we could be well overcommitting ourselves just to try and refill the
global reserve, when really we should just be committing the transaction. So
breack this out into btrfs_block_rsv_refill and btrfs_block_rsv_check. Refill
will try to reserve more metadata if it can and btrfs_block_rsv_check will not,
it will only tell you if the factor of the total space is still reserved.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In __unlink_start_trans() if we don't have enough room for a reservation we will
check to see if the unlink will free up space. If it does that's great, but we
will still could add an orphan item, so we need to reserve enough space to add
the orphan item. Do this and migrate the space the global reserve so it all
works out right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We started setting trans->block_rsv = NULL to allow the delayed refs flushing
stuff to use the right block_rsv and then just made
btrfs_trans_release_metadata() unconditionally use the trans block rsv. The
problem with this is we need to reserve some space in the transaction and then
migrate it to the global block rsv, so we need to be able to free that out
properly. So instead just move btrfs_trans_release_metadata() before the
delayed ref flushing and use trans->block_rsv for the freeing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we only allow a maximum of 2 megabytes of pages to be flushed at a
time. This was ok before, but now we have overcommit which will screw us in a
heartbeat if we are quickly filling the disk. So instead pick either 2
megabytes or the number of pages we need to reclaim to be safe again, which ever
is larger. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The only way we actually reclaim delalloc space is waiting for the IO to
completely finish. Usually we kick off a bunch of IO and wait for a little bit
and hope we can make our reservation, and usually this works out pretty well.
With overcommit however we can get seriously underwater if we're filling up the
disk quickly, so we need to be able to force the delalloc shrinker to wait for
the ordered IO to finish to give us a better chance of actually reclaiming
enough space to get our reservation. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Before the only reason to commit the transaction to recover space in
reserve_metadata_bytes() was if there were enough pinned_bytes to satisfy our
reservation. But now we have the delayed inode stuff which will hold it's
reservations until we commit the transaction. So say we max out our reservation
by creating a bunch of files but don't have any pinned bytes we will ENOSPC out
early even though we could commit the transaction and get that space back. So
now just unconditionally commit the transaction since currently there is no way
to know how much metadata space is being reserved by delayed inode stuff.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Recently I changed the xattr stuff to unconditionally set the xattr first in
case the xattr didn't exist yet. This has introduced a regression when setting
an xattr that already exists with a large value. If we find the key we are
looking for split_leaf will assume that we're extending that item. The problem
is the size we pass down to btrfs_search_slot includes the size of the item
already, so if we have the largest xattr we can possibly have plus the size of
the xattr item plus the xattr item that btrfs_search_slot we'd overflow the
leaf. Thankfully this is not what we're doing, but split_leaf doesn't know this
so it just returns EOVERFLOW. So in the xattr code we need to check and see if
we got back EOVERFLOW and treat it like EEXIST since that's really what
happened. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Our unlink reservations were a bit much, we were reserving 10 and I only count 8
possible items we're touching, so comment what we're reserving for and fix the
count value. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed recently that my overcommit patch was causing one of my enospc tests
to fail 25% of the time with early ENOSPC. This is because my overcommit patch
was letting us go way over board, but it wasn't waiting long enough to let the
delalloc shrinker do it's job. The problem is we just start writeback and wait
a little bit hoping we flush enough, but we only free up delalloc space by
having the writes complete all the way. We do this by waiting for ordered
extents, which we do but only if we already free'd enough for the reservation,
which isn't right, we should flush ordered extents if we didn't reclaim enough
in case that will push us over the edge. With this patch I've not seen a
failure in this enospc test after running it in a loop for an hour. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Yeah yeah I know this is how we used to do it and then I changed it, but damnit
I'm changing it back. The fact is that writing out checksums will modify
metadata, which could cause us to dirty a block group we've already written out,
so we have to truncate it and all of it's checksums and re-write it which will
write new checksums which could dirty a blockg roup that has already been
written and you see where I'm going with this? This can cause unmount or really
anything that depends on a transaction to commit to take it's sweet damned time
to happen. So go back to the way it was, only this time we're specifically
setting NODATACOW because we can't go through the COW pathway anyway and we're
doing our own built-in cow'ing by truncating the free space cache. The other
new thing is once we truncate the old cache and preallocate the new space, we
don't need to do that song and dance at all for the rest of the transaction, we
can just overwrite the existing space with the new cache if the block group
changes for whatever reason, and the NODATACOW will let us do this fine. So
keep track of which transaction we last cleared our cache in and if we cleared
it in this transaction just say we're all setup and carry on. This survives
xfstests and stress.sh.
The inode cache will continue to use the normal csum infrastructure since it
only gets written once and there will be no more modifications to the fs tree in
a transaction commit.
Signed-off-by: Josef Bacik <josef@redhat.com>
My overcommit stuff can be a little racy when we're filling up the disk with
fs_mark and we overcommit into things that quickly get used up for data. So use
num_bytes to see if we have enough available space so we're less likely to
overcommit ourselves out of the ability to make reservations. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We need to check the return value of filemap_write_and_wait in the space cache
writeout code. Also don't set the inode's generation until we're sure nothing
else is going to fail. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In writing and reading the space cache we have one big loop that keeps track of
which page we are on and then a bunch of sizeable loops underneath this big loop
to try and read/write out properly. Especially in the write case this makes
things hugely complicated and hard to follow, and makes our error checking and
recovery equally as complex. So add a io_ctl struct with a bunch of helpers to
keep track of the pages we have, where we are, if we have enough space etc.
This unifies how we deal with the pages we're writing and keeps all the messy
tracking internal. This allows us to kill the big loops in both the read and
write case and makes reviewing and chaning the write and read paths much
simpler. I've run xfstests and stress.sh on this code and it survives. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed a slight bug where we will not bother writing out the block group
cache's space cache if it's space tree is empty. Since it could have a cluster
or pinned extents that need to be written out this is just not a valid test.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Some users have requested this and I've found I needed a way to disable cache
loading without actually clearing the cache, so introduce the no_space_cache
option. Before we check the super blocks cache generation field and if it was
populated we always turned space caching on. Now we check this and set the
space cache option on, and then parse the mount options so that if we want it
off it get's turned off. Then we check the mount option all the places we do
the caching work instead of checking the super's cache generation. This makes
things more consistent and lets us turn space caching off. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Xfstests 79 was failing because we were inheriting the S_APPEND flag when we
weren't supposed to. There isn't any specific documentation on this so I'm
taking the test as the standard of how things work, and having S_APPEND set on a
directory doesn't mean that S_APPEND gets inherited by its children according to
this test. So only inherit btrfs specific things. This will let us set
compress/nocompress on specific directories and everything in the directories
will inherit this flag, same with nodatacow. With this patch test 79 passes.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
One of the things that kills us is the fact that our ENOSPC reservations are
horribly over the top in most normal cases. There isn't too much that can be
done about this because when we are completely full we really need them to work
like this so we don't under reserve. However if there is plenty of unallocated
chunks on the disk we can use that to gauge how much we can overcommit. So this
patch adds chunk free space accounting so we always know how much unallocated
space we have. Then if we fail to make a reservation within our allocated
space, check to see if we can overcommit. In the normal flushing case (like
with delalloc metadata reservations) we'll take the free space and divide it by
2 if our metadata profile is setup for DUP or any of those, and then divide it
by 8 to make sure we don't overcommit too much. Then if we're in a non-flushing
case (we really need this reservation now!) we only limit ourselves to half of
the free space. This makes this fio test
[torrent]
filename=torrent-test
rw=randwrite
size=4g
ioengine=sync
directory=/mnt/btrfs-test
go from taking around 45 minutes to 10 seconds on my freshly formatted 3 TiB
file system. This doesn't seem to break my other enospc tests, but could really
use some more testing as this is a super scary change. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed while running xfstests 83 that if we didn't have enough space to
delete our inode the orphan cleanup would just loop. This is because it keeps
finding the same orphan item and keeps trying to kill it but can't because we
don't get an error back from iput for deleting the inode. So keep track of the
last guy we tried to kill, if it's the same as the one we're trying to kill
currently we know we are having problems and can just error out. I don't have a
way to test this so look hard and make sure it's right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Xfstests 83 really stresses our ENOSPC since it uses a 100mb fs which ends up
with the mixed block group stuff. Because of this we can run into a situation
where we don't have enough space to delete inodes, or even worse we can't free
the inodes when we next mount the fs which causes the orphan code to lose its
mind. So if we fail to make our reservation, steal from the global reserve.
The global reserve will end up taking up the entire rest of the free space on
the fs in this worst case so there really is no other option. With this patch
test 83 doesn't freak out. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
While looking for a performance regression a user was complaining about, I
noticed that we had a regression with the varmail test of filebench. This was
introduced by
0d10ee2e6d
which keeps us from calling writepages in writepage. This is a correct change,
however it happens to help the varmail test because we write out in larger
chunks. This is largly to do with how we write out dirty pages for each
transaction. If you run filebench with
load varmail
set $dir=/mnt/btrfs-test
run 60
prior to this patch you would get ~1420 ops/second, but with the patch you get
~1200 ops/second. This is a 16% decrease. So since we know the range of dirty
pages we want to write out, don't write out in one page chunks, write out in
ranges. So to do this we call filemap_fdatawrite_range() on the range of bytes.
Then we convert the DIRTY extents to NEED_WAIT extents. When we then call
btrfs_wait_marked_extents() we only have to filemap_fdatawait_range() on that
range and clear the NEED_WAIT extents. This doesn't get us back to our original
speeds, but I've been seeing ~1380 ops/second, which is a <5% regression as
opposed to a >15% regression. That is acceptable given that the original commit
greatly reduces our latency to begin with. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If I have a range where I know a certain bit is and I want to set it to another
bit the only option I have is to call set and then clear bit, which will result
in 2 tree searches. This is inefficient, so introduce convert_extent_bit which
will go through and set the bit I want and clear the old bit I don't want.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
There is a bug that may lead to early ENOSPC in our reservation code. We've
been checking against num_bytes which may be above and beyond what we want to
actually reserve, which could give us a false ENOSPC. Fix this by making sure
the unused space is above how much we want to reserve and not how much we're
trying to flush. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In fixing how we deal with bad inodes, we had a regression in the orphan cleanup
code, since it expects to get a bad inode back. So fix it to deal with getting
-ESTALE back by deleting the orphan item manually and moving on. Thanks,
Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Josef Bacik <josef@redhat.com>
Johannes pointed out we were allocating only kernel pages for doing writes,
which is kind of a big deal if you are on 32bit and have more than a gig of ram.
So fix our allocations to use the mapping's gfp but still clear __GFP_FS so we
don't re-enter. Thanks,
Reported-by: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
I kept getting warnings from evict because we were calling
btrfs_start_transaction() with a transaction already started when doing a
balance. This is because we remove a block group which requires a transaction,
and the put the last reference on the cache inode. Instead of doing this we
need to delay the iput so it is done not within a transaction having started.
This gets rid of our warnings. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Checksums are charged in 2 different ways. The first case is when we're writing
to the disk, we account for the new checksums with the delalloc block rsv. In
order for this to work we check if we're allocating a block for the csum root
and if trans->block_rsv == the delalloc block rsv. But when we're deleting the
csums because of cow, this is charged to the global block rsv, and is done when
we run the delayed refs. So we need to make sure that trans->block_rsv == NULL
when running the delayed refs. So set it to NULL and reset it in
should_end_transaction, and set it to NULL in commit_transaction. This got rid
of the ridiculous amount of warnings I was seeing when trying to do a balance.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The only thing that we need to have a trans handle for is in
reserve_metadata_bytes and thats to know how much flushing we can do. So
instead of passing it around, just check current->journal_info for a
trans_handle so we know if we can commit a transaction to try and free up space
or not. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since the durable block rsv stuff has been killed there is no need to get the
block_rsv in btrfs_free_tree_block anymore.
Signed-off-by: Josef Bacik <josef@redhat.com>
The alloc warnings everybody has been seeing is because we have been reserving
space for csums, but we weren't actually using that space. So make
get_block_rsv() return the trans->block_rsv if we're modifying the csum root.
Also set the trans->block_rsv to NULL so that if we modify the csum root when
running delayed ref's that comes out of the global reserve like it's supposed
to. With this patch I'm not seeing those alloc warnings anymore. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since free space inodes now use normal checksumming we need to make sure to
account for their metadata use. So reserve metadata space, and then if we fail
to write out the metadata we can just release it, otherwise it will be freed up
when the io completes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In moving some enospc stuff around I noticed that when we unmount we are often
evicting the free space cache inodes before we do our last commit. This isn't
bad, but it makes us constantly have to re-read the inodes back. So instead
don't evict the cache until after we do our last commit, this will make things a
little less crappy and makes a future enospc change work properly. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
While debugging a different issue I noticed that we were always reserving space
when we tried to use our truncate block rsv's. This is because they didn't have
a ->size value, so use_block_rsv just assumes there is nothing reserved and it
does a reserve_metadata_bytes. This is because btrfs_check_block_rsv() doesn't
actually add to the size of the block rsv. That seems to be the right thing to
do so set ->size to the minimum truncate size we need, since we will always only
refill to that size anyway, and this way everything works out correctly.
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have to emergency reserve space we need to not increase the block_rsv
size, otherwise we'll leak space. Take for instance delalloc, say we reserve
4k, and we use that 4k, and then we have to emergency allocate another 4k, we
bump the size up to 8k, however we've only accounted for 4k in reservations in
all of our supporting logic, so we'll go to free the 4k and end up having a size
of 4k, which will cause us to later not free as much space. I saw this doing
testing where I wasn't reserving enough space for something but was still
leaking space, very frustrating. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When changing back to using a spin_lock to protect the extent counters I decided
that since we would only be dropping our original extent, it was ok to just drop
the extent and return. However since somebody else could have come in and done
a reservation, we need to do the normal song and dance to clear the reservation
out properly. So calculate how much space we need to free, and then subtract
what we just attempted to reserve. If it's more then we know we need to drop
those bytes from the delalloc block rsv. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We are setting ins_len to 1 even tho we are just modifying an item that should
be there already. This may cause the search stuff to split nodes on the way
down needelessly. Set this to 0 since we aren't inserting anything. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If you run xfstest 224 it you will get lots of messages about not being able to
delete inodes and that they will be cleaned up next mount. This is because
btrfs_block_rsv_check was not calling reserve_metadata_bytes with the ability to
flush, so if there was not enough space, it simply failed. But in truncate and
evict case we could easily flush space to try and get enough space to do our
work, so make btrfs_block_rsv_check take a flush argument to pass down to
reserve_metadata_bytes. Now xfstests 224 runs fine without all those
complaints. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With btrfs_truncate_inode_items we always return if we have to go to another
leaf, which makes us do our reservation again. This means we will only ever
modify one leaf at a time, so we only need 1 items worth of slack space. Also,
since we are deleting we will not be creating nodes as we go down, if anything
we'll be free'ing them as we merge them together, so make a different
calculation for truncate which will only have the worst case useage of COW'ing
the entire path down to the leaf. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Lukas found a problem where if he tries to fallocate over the same region twice
and the first fallocate took up all the space we would fail with ENOSPC. This
is because we reserve the total space we want to use for fallocate, regardless
of wether or not we will have to actually preallocate. So instead move the
check into the loop where we actually have to do the preallocate. Thanks,
Tested-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we're starting and stopping a transaction for no real reason, so kill
that and just reserve enough space as if we can truncate all in one transaction.
Also use btrfs_block_rsv_check() for our reserve to minimize the amount of space
we may have to allocate for our slack space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We will try and reserve metadata bytes in btrfs_block_rsv_check and if we cannot
because we have a transaction open it will return EAGAIN, so we do not need to
try and commit the transaction again.
Signed-off-by: Josef Bacik <josef@redhat.com>
The priority and refill_used flags are not used anymore, and neither is the
usage counter, so just remove them from btrfs_block_rsv.
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported getting spammed when moving to 3.0 by this message. Since we
switched to the normal checksumming infrastructure all old free space caches
will be wrong and need to be regenerated so people are likely to see this
message a lot, so ratelimit it so it doesn't fill up their logs and freak them
out. Thanks,
Reported-by: Andrew Lutomirski <luto@mit.edu>
Signed-off-by: Josef Bacik <josef@redhat.com>
I converted btrfs_truncate to do sane reservations for truncate, but didn't
convert btrfs_evict_inode. Basically we need to save the orphan_rsv for
deleting the orphan item, and do normal reservations for our truncate. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch kills off the calculation for the amount of space needed for the
orphan operations during a snapshot. The thing is we only do snapshots on
commit, so any space that is in the block_rsv->freed[] isn't going to be in the
new snapshot anyway, so there isn't any reason to require that space to be
reserved for the snapshot to occur. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have not been reserving enough space for checksums. We were just reserving
bytes for the checksum items themselves, we were not taking into account having
to cow the tree and such. This patch adds a csum_bytes counter to the inode for
keeping track of the number of bytes outstanding we have for checksums. Then we
calculate how many leaves would be required for the checksums we are given and
use that to reserve space. This adds a significant amount of bytes to our
reservations, but we will handle this later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We always look for delalloc bytes in our io_tree so we can fill in delalloc.
This is fine in most cases, but if we're writing out the btree_inode this is
just a superfluous tree search on the io_tree, and if we have a lot of metadata
dirty this could be an expensive check. So instead check to see if our io_tree
has a ->fill_delalloc op, and if not don't even bother doing the lookup.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have been using bytes_reserved for metadata reservations, which is wrong
since we use that to keep track of outstanding reservations from the allocator.
This resulted in us doing a lot of silly things to make sure we don't allocate a
bunch of metadata chunks since we never had a real view of how much space was
actually in use by metadata.
This passes Arne's enospc test and xfstests as well as my own enospc tests.
Hopefully this will get us moving in the right direction. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We've only been able to mount with subvol=<whatever> where whatever was a subvol
within whatever root we had as the default. This allows us to mount -o
subvol=path/to/subvol/you/want relative from the normal fs_tree root. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently what we do is just wrong. We either
1) Alloc a new "root" dentry with sb->s_root as it's parent which is just wrong
as we could walk into this subvol later on via another path and hilarity could
ensue. Also we don't check the return value of d_splice_alias which isn't good
either.
or
2) Do a d_find_alias() which we could have lost our dentry from cache at this
point and found nothing.
So use d_obtain_alias(). In the case that we already have the inode/dentry in
cache we will get the correct dentry. If not we will get a disconnected dentry
tree so if we walk into it later on everything will be connected up properly.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Moving things around to give us better packing in the btrfs_inode. This reduces
the size of our inode by 8 bytes. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The btrfs file defrag code will loop through the extents and
force COW on them. But there is a concurrent truncate in the middle of
the defrag, it might end up defragging the same range over and over
again.
The problem is that writepage won't go through and do anything on pages
past i_size, so the cow won't happen, so the file will appear to still
be fragmented. defrag will end up hitting the same extents again and
again.
In the worst case, the truncate can actually live lock with the defrag
because the defrag keeps creating new ordered extents which the truncate
code keeps waiting on.
The fix here is to make defrag check for i_size inside the main loop,
instead of just once before the looping starts.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Follow those steps:
# mount -o autodefrag /dev/sda7 /mnt
# dd if=/dev/urandom of=/mnt/tmp bs=200K count=1
# sync
# dd if=/dev/urandom of=/mnt/tmp bs=8K count=1 conv=notrunc
and then it'll go into a loop: writeback -> defrag -> writeback ...
It's because writeback writes [8K, 200K] and then writes [0, 8K].
I tried to make writeback know if the pages are dirtied by defrag,
but the patch was a bit intrusive. Here I simply set writeback_index
when we defrag a file.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Scrub uses a simple tree-enumeration to bring the relevant portions
of the extent- and csum-tree into the page cache before starting the
scrub-I/O. This is now replaced by using the new readahead-API.
During readahead the scrub is being accounted as paused, so it won't
hold off transaction commits.
This change raises the average disk bandwith utilisation on my test
volume from 70% to 90%. On another volume, the time for a test run
went down from 89s to 43s.
Changes v5:
- reada1/2 are now of type struct reada_control *
Signed-off-by: Arne Jansen <sensille@gmx.net>
This adds the hooks needed for readahead. In the readpage_end_io_hook,
the extent state is checked for the EXTENT_READAHEAD flag. Only in this
case the readahead hook is called, to keep the impact on non-ra as low
as possible.
Additionally, a hook for a failed IO is added, otherwise readahead would
wait indefinitely for the extent to finish.
Changes for v2:
- eliminate race condition
Signed-off-by: Arne Jansen <sensille@gmx.net>
This is the implementation for the generic read ahead framework.
To trigger a readahead, btrfs_reada_add must be called. It will start
a read ahead for the given range [start, end) on tree root. The returned
handle can either be used to wait on the readahead to finish
(btrfs_reada_wait), or to send it to the background (btrfs_reada_detach).
The read ahead works as follows:
On btrfs_reada_add, the root of the tree is inserted into a radix_tree.
reada_start_machine will then search for extents to prefetch and trigger
some reads. When a read finishes for a node, all contained node/leaf
pointers that lie in the given range will also be enqueued. The reads will
be triggered in sequential order, thus giving a big win over a naive
enumeration. It will also make use of multi-device layouts. Each disk
will have its on read pointer and all disks will by utilized in parallel.
Also will no two disks read both sides of a mirror simultaneously, as this
would waste seeking capacity. Instead both disks will read different parts
of the filesystem.
Any number of readaheads can be started in parallel. The read order will be
determined globally, i.e. 2 parallel readaheads will normally finish faster
than the 2 started one after another.
Changes v2:
- protect root->node by transaction instead of node_lock
- fix missed branches:
The readahead had a too simple check to determine if a branch from
a node should be checked or not. It now also records the upper bound
of each node to see if the requested RA range lies within.
- use KERN_CONT to debug output, to avoid line breaks
- defer reada_start_machine to worker to avoid deadlock
Changes v3:
- protect root->node by rcu
Changes v5:
- changed EIO-semantics of reada_tree_block_flagged
- remove spin_lock from reada_control and make elems an atomic_t
- remove unused read_total from reada_control
- kill reada_key_cmp, use btrfs_comp_cpu_keys instead
- use kref-style release functions where possible
- return struct reada_control * instead of void * from btrfs_reada_add
Signed-off-by: Arne Jansen <sensille@gmx.net>
Add state information for readahead to btrfs_fs_info and btrfs_device
Changes v2:
- don't wait in radix_trees
- add own set of workers for readahead
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Arne Jansen <sensille@gmx.net>
Add a READAHEAD extent buffer flag.
Add a function to trigger a read with this flag set.
Changes v2:
- use extent buffer flags instead of extent state flags
Changes v5:
- adapt to changed read_extent_buffer_pages interface
- don't return eb from reada_tree_block_flagged if it has CORRUPT flag set
Signed-off-by: Arne Jansen <sensille@gmx.net>
read_extent_buffer_pages currently has two modes, either trigger a read
without waiting for anything, or wait for the I/O to finish. The former
also bails when it's unable to lock the page. This patch now adds an
additional parameter to allow it to block on page lock, but don't wait
for completion.
Changes v5:
- merge the 2 wait parameters into one and define WAIT_NONE, WAIT_COMPLETE and
WAIT_PAGE_LOCK
Change v6:
- fix bug introduced in v5
Signed-off-by: Arne Jansen <sensille@gmx.net>
A user reported a problem where ceph was getting into 100% cpu usage while doing
some writing. It turns out it's because we were doing a short write on a not
uptodate page, which means we'd fall back at one page at a time and fault the
page in. The problem is our position is on the page boundary, so our fault in
logic wasn't actually reading the page, so we'd just spin forever or until the
page got read in by somebody else. This will force a readpage if we end up
doing a short copy. Alexandre could reproduce this easily with ceph and reports
it fixes his problem. I also wrote a reproducer that no longer hangs my box
with this patch. Thanks,
Reported-and-tested-by: Alexandre Oliva <aoliva@redhat.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This ties nodatasum fixup in scrub together with raid repair patches. While
both series are working fine alone, scrub will report uncorrectable errors
if they occur in a nodatasum extent *and* the page is in the page cache.
Previously, we would have triggered readpage to find good data and do the
repair. However, readpage wouldn't read anything in the case where the page
is up to date in the cache. So, we simply take that good data we have and
call repair_io_failure directly (unless the page in the cache is dirty).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The raid-retry code in inode.c can be generalized so that it works for
metadata as well. Thus, this patch moves it to extent_io.c and makes the
raid-retry code a raid-repair code.
Repair works that way: Whenever a read error occurs and we have more
mirrors to try, note the failed mirror, and retry another. If we find a
good one, check if we did note a failure earlier and if so, do not allow
the read to complete until after the bad sector was written with the good
data we just fetched. As we have the extent locked while reading, no one
can change the data in between.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The error correction code wants to make sure that only the bad mirror is
rewritten. Thus, we need to know which mirror is the bad one. I did not
find a more apropriate field than bi_bdev. But I think using this is fine,
because it is modified by the block layer, anyway, and should not be read
after the bio returned.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
The block layer modifies bio->bi_bdev and bio->bi_sector while working on
the bio, they do _not_ come back unmodified in the completion callback.
To call add_page, we need at least some bi_bdev set, which is why the code
was working, previously. With this patch, we use the latest_bdev from
fsinfo instead of the leftover in the bio. This gives us the possibility to
use the bi_bdev field for another purpose.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
btrfs_bio is a bio abstraction able to split and not complete after the last
bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio
tracks the mirror_num used to read data which can be used for error
correction purposes.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
these ioctls make use of the new functions initially added for scrub. they
return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and
all paths belonging to an inode (BTRFS_IOC_INO_PATHS).
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
This removes a FIXME comment and introduces the first part of nodatasum
fixup: It gets the corresponding inode for a logical address and triggers a
regular readpage for the corrupted sector.
Once we have on-the-fly error correction our error will be automatically
corrected. The correction code is expected to clear the newly introduced
EXTENT_DAMAGED flag, making scrub report that error as "corrected" instead
of "uncorrectable" eventually.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Fix the mirror_num determination in scrub_stripe. The rest of the scrub code
did not use mirror_num for anything important and that error went unnoticed.
The nodatasum fixup patch of this set depends on a correct mirror_num.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
While scrubbing, we may encounter various errors. Previously, a logical
address was printed to the log only. Now, all paths belonging to that
address are resolved and printed separately. That should work for hardlinks
as well as reflinks.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
In normal operation, scrub is reading data sequentially in large portions.
In case of an i/o error, we try to find the corrupted area(s) by issuing
page sized read requests. With this commit we increment the
unverified_errors counter if all of the small size requests succeed.
Userland patches carrying such conspicous events to the administrator should
already be around.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
These helper functions iterate back references and call a function for each
backref. There is also a function to resolve an inode to a path in the
file system.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Fix a crash/BUG_ON in the clone ioctl due to insufficient reservation. We
need to reserve space for:
- adjusting the old extent (possibly splitting it)
- adding the new extent
- updating the inode
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can race with readdir and the RCU path walking stuff. This is because we
clear the need lookup flag before actually instantiating the inode. This will
lead the RCU path walk stuff to find a dentry it thinks is valid without a
d_inode attached. So instead unhash the dentry when we first start the lookup,
and then clear the flag after we've instantiated the dentry so we're garunteed
to either try the slow lookup, or have the d_inode set properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent reworking of btrfs' lseek lead to incorrect
values being returned. This adds checks for seeking
beyond EOF in SEEK_HOLE and makes sure the error
values come back correct.
Andi Kleen also sent in similar patches.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The dst file will have the same inode flags with dst file after
file clone, and I think it's unexpected.
For example, the dst file will suddenly become immutable after
getting some share of data with src file, if the src is immutable.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To reproduce the bug:
# mount /dev/sda7 /mnt
# dd if=/dev/zero of=/mnt/src bs=4K count=1
# umount /mnt
# mount -o nodatasum /dev/sda7 /mnt
# dd if=/dev/zero of=/mnt/dst bs=4K count=1
# clone_range -s 4K -l 4K /mnt/src /mnt/dst
# echo 3 > /proc/sys/vm/drop_caches
# cat /mnt/dst
# dmesg
...
btrfs no csum found for inode 258 start 0
btrfs csum failed ino 258 off 0 csum 2566472073 private 0
It's because part of the file is checksummed and the other part is not,
and then btrfs will complain checksum is not found when we read the file.
Disallow file clone if src and dst file have different checksum flag,
so we ensure a file is completely checksummed or unchecksummed.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It's a bug in commit f81c9cdc56
(Btrfs: truncate pages from clone ioctl target range)
We should pass the dest range to the truncate function, but not the
src range.
Also move the function before locking extent state.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since the d_off in the first dirent for "." (that originates from
the 4th argument "offset" of filldir() for the 2nd dirent for "..")
is wrongly assigned in btrfs_real_readdir(), telldir returns same
offset for different locations.
| # mkfs.btrfs /dev/sdb1
| # mount /dev/sdb1 fs0
| # cd fs0
| # touch file0 file1
| # ../test
| telldir: 0
| readdir: d_off = 2, d_name = "."
| telldir: 2
| readdir: d_off = 2, d_name = ".."
| telldir: 2
| readdir: d_off = 3, d_name = "file0"
| telldir: 3
| readdir: d_off = 2147483647, d_name = "file1"
| telldir: 2147483647
To fix this problem, pass filp->f_pos (which is loff_t) instead.
| # ../test
| telldir: 0
| readdir: d_off = 1, d_name = "."
| telldir: 1
| readdir: d_off = 2, d_name = ".."
| telldir: 2
| readdir: d_off = 3, d_name = "file0"
:
At the moment the "offset" for "." is unused because there is no
preceding dirent, however it is better to pass filp->f_pos to follow
grammatical usage.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://github.com/chrismason/linux:
Btrfs: add dummy extent if dst offset excceeds file end in
Btrfs: calc file extent num_bytes correctly in file clone
btrfs: xattr: fix attribute removal
Btrfs: fix wrong nbytes information of the inode
Btrfs: fix the file extent gap when doing direct IO
Btrfs: fix unclosed transaction handle in btrfs_cont_expand
Btrfs: fix misuse of trans block rsv
Btrfs: reset to appropriate block rsv after orphan operations
Btrfs: skip locking if searching the commit root in csum lookup
btrfs: fix warning in iput for bad-inode
Btrfs: fix an oops when deleting snapshots
You can see there's no file extent with range [0, 4096]. Check this by
btrfsck:
# btrfsck /dev/sda7
root 5 inode 258 errors 100
...
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
An attribute is not removed by 'setfattr -x attr file' and remains
visible in attr list. This makes xfstests/062 pass again.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we write some data into the data hole of the file(no preallocation for this
hole), Btrfs will allocate some disk space, and update nbytes of the inode, but
the other element--disk_i_size needn't be updated. At this condition, we must
update inode metadata though disk_i_size is not changed(btrfs_ordered_update_i_size()
return 1).
# mkfs.btrfs /dev/sdb1
# mount /dev/sdb1 /mnt
# touch /mnt/a
# truncate -s 856002 /mnt/a
# dd if=/dev/zero of=/mnt/a bs=4K count=1 conv=nocreat,notrunc
# umount /mnt
# btrfsck /dev/sdb1
root 5 inode 257 errors 400
found 32768 bytes used err is 1
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we write some data to the place that is beyond the end of the file
in direct I/O mode, a data hole will be created. And Btrfs should insert
a file extent item that point to this hole into the fs tree. But unfortunately
Btrfs forgets doing it.
The following is a simple way to reproduce it:
# mkfs.btrfs /dev/sdc2
# mount /dev/sdc2 /test4
# touch /test4/a
# dd if=/dev/zero of=/test4/a seek=8 count=1 bs=4K oflag=direct conv=nocreat,notrunc
# umount /test4
# btrfsck /dev/sdc2
root 5 inode 257 errors 100
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The function - btrfs_cont_expand() forgot to close the transaction handle before
it jump out the while loop. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
At the beginning of create_pending_snapshot, trans->block_rsv is set
to pending->block_rsv and is used for snapshot things, however, when
it is done, we do not recover it as will.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While truncating free space cache, we forget to change trans->block_rsv
back to the original one, but leave it with the orphan_block_rsv, and
then with option inode_cache enable, it leads to countless warnings of
btrfs_alloc_free_block and btrfs_orphan_commit_root:
WARNING: at fs/btrfs/extent-tree.c:5711 btrfs_alloc_free_block+0x180/0x350 [btrfs]()
...
WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It's not enough to just search the commit root, since we could be cow'ing the
very block we need to search through, which would mean that its locked and we'll
still deadlock. So use path->skip_locking as well. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
iput() shouldn't be called for inodes in I_NEW state.
We need to mark inode as constructed first.
WARNING: at fs/inode.c:1309 iput+0x20b/0x210()
Call Trace:
[<ffffffff8103e7ba>] warn_slowpath_common+0x7a/0xb0
[<ffffffff8103e805>] warn_slowpath_null+0x15/0x20
[<ffffffff810eaf0b>] iput+0x20b/0x210
[<ffffffff811b96fb>] btrfs_iget+0x1eb/0x4a0
[<ffffffff811c3ad6>] btrfs_run_defrag_inodes+0x136/0x210
[<ffffffff811ad55f>] cleaner_kthread+0x17f/0x1a0
[<ffffffff81035b7d>] ? sub_preempt_count+0x9d/0xd0
[<ffffffff811ad3e0>] ? transaction_kthread+0x280/0x280
[<ffffffff8105af86>] kthread+0x96/0xa0
[<ffffffff814336d4>] kernel_thread_helper+0x4/0x10
[<ffffffff8105aef0>] ? kthread_worker_fn+0x190/0x190
[<ffffffff814336d0>] ? gs_change+0xb/0xb
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
CC: Konstantin Khlebnikov <khlebnikov@openvz.org>
Tested-by: David Sterba <dsterba@suse.cz>
CC: Josef Bacik <josef@redhat.com>
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can reproduce this oops via the following steps:
$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs
$ for ((i=0; i<3; i++)); do btrfs sub snap /mnt/btrfs /mnt/btrfs/s_$i; done
$ rm -fr /mnt/btrfs/*
$ rm -fr /mnt/btrfs/*
then we'll get
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:2264!
[...]
Call Trace:
[<ffffffffa05578c7>] btrfs_rmdir+0xf7/0x1b0 [btrfs]
[<ffffffff81150b95>] vfs_rmdir+0xa5/0xf0
[<ffffffff81153cc3>] do_rmdir+0x123/0x140
[<ffffffff81145ac7>] ? fput+0x197/0x260
[<ffffffff810aecff>] ? audit_syscall_entry+0x1bf/0x1f0
[<ffffffff81153d0d>] sys_unlinkat+0x2d/0x40
[<ffffffff8147896b>] system_call_fastpath+0x16/0x1b
RIP [<ffffffffa054f7b9>] btrfs_orphan_add+0x179/0x1a0 [btrfs]
When it comes to btrfs_lookup_dentry, we may set a snapshot's inode->i_ino
to BTRFS_EMPTY_SUBVOL_DIR_OBJECTID instead of BTRFS_FIRST_FREE_OBJECTID,
while the snapshot's location.objectid remains unchanged.
However, btrfs_ino() does not take this into account, and returns a wrong ino,
and causes the oops.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a regression introduced by commit cdcb725c05 ("Btrfs: check
if there is enough space for balancing smarter"). We can't do 64-bit
divides on 32-bit architectures.
In cases where we need to divide/multiply by 2 we should just left/right
shift respectively, and in cases where theres N number of devices use
do_div. Also make the counters u64 to match up with rw_devices.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Acked-and-tested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
xfstests exposed a problem with preallocate when it fallocates a range that
already has an extent. We don't set the new i_size properly because we see that
we already have an extent. This isn't right and we should update i_size if the
space already exists. With this patch we now pass xfstests 075. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There were some unlocks on error missing in a recent patch to
btrfs_file_llseek().
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch tightens the read-only access checks in btrfs_permission to
match the constraints in inode_permission. Currently, even though the
device node itself will be unmodified, read-write access to device nodes
is denied to when the device node resides on a read-only subvolume or a
is a file that has been marked read-only by the btrfs conversion utility.
With this patch applied, the check only affects regular files,
directories, and symlinks. It also restructures the code a bit so that
we don't duplicate the MAY_WRITE check for both tests.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We need to truncate page cache pages for the clone ioctl target range or
else we'll confuse ourselves to no end. If the old data was cached, we
used to still see it (until remount). If the page was partially updated
we used to get a mix of old and new data.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
sync_pending is uninitialized before it be used, fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs subtracted the size of the allocated space twice when it allocated
the space from the bitmap in the cluster, it broke the free space information
and led to oops finally.
And this patch also fixes the bug that ctl->free_space was subtracted
without lock.
Reported-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The filesystem turns readonly instead of returning the error to the
caller when detected error in btrfs_drop_snapshot().
and, because the caller doesn't check the error, the function type is
changed to 'void'.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When checking if there is enough space for balancing a block group,
since we do not take raid types into consideration, we do not account
corrent amounts of space that we needed. This makes us do some extra
work before we get ENOSPC.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When balancing, we'll first try to shrink devices for some space,
but if it is working on a full multi-disk partition with raid protection,
we may encounter a bug, that is, while shrinking, total_bytes may be less
than bytes_used, and btrfs may allocate a dev extent that accesses out of
device's bounds.
Then we will not be able to write or read the data which stores at the end
of the device, and get the followings:
device fsid 0939f071-7ea3-46c8-95df-f176d773bfb6 devid 1 transid 10 /dev/sdb5
Btrfs detected SSD devices, enabling SSD mode
btrfs: relocating block group 476315648 flags 9
btrfs: found 4 extents
attempt to access beyond end of device
sdb5: rw=145, want=546176, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546304, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546432, limit=546147
attempt to access beyond end of device
sdb5: rw=145, want=546560, limit=546147
attempt to access beyond end of device
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs recovers from a crash, it may hit the oops below:
------------[ cut here ]------------
kernel BUG at fs/btrfs/inode.c:4580!
[...]
RIP: 0010:[<ffffffffa03df251>] [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
[...]
Call Trace:
[<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
[<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
[<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
[<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
[<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
[<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
[<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
[<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
[...]
This comes from that while replaying an inode ref item, we forget to
check those old conflicting DIR_ITEM and DIR_INDEX items in fs/file tree,
then we will come to conflict corners which lead to BUG_ON().
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Tested-by: Andy Lutomirski <luto@mit.edu>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have a problem where if a user specifies discard but doesn't actually support
it we will return EOPNOTSUPP from btrfs_discard_extent. This is a problem
because this gets called (in a fashion) from the tree log recovery code, which
has a nice little BUG_ON(ret) after it, which causes us to fail the tree log
replay. So instead detect wether our devices support discard when we're adding
them and then don't issue discards if we know that the device doesn't support
it. And just for good measure set ret = 0 in btrfs_issue_discard just in case
we still get EOPNOTSUPP so we don't screw anybody up like this again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs does bio submissions from a worker thread, and each device
has a list of high priority bios and regular priority bios.
Synchronous writes go to the high priority thread while async writes
go to regular list. This commit brings back an explicit unplug
any time we switch from high to regular priority, which makes it
easier for the block layer to give us low latencies.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (31 commits)
Btrfs: don't call writepages from within write_full_page
Btrfs: Remove unused variable 'last_index' in file.c
Btrfs: clean up for find_first_extent_bit()
Btrfs: clean up for wait_extent_bit()
Btrfs: clean up for insert_state()
Btrfs: remove unused members from struct extent_state
Btrfs: clean up code for merging extent maps
Btrfs: clean up code for extent_map lookup
Btrfs: clean up search_extent_mapping()
Btrfs: remove redundant code for dir item lookup
Btrfs: make acl functions really no-op if acl is not enabled
Btrfs: remove remaining ref-cache code
Btrfs: remove a BUG_ON() in btrfs_commit_transaction()
Btrfs: use wait_event()
Btrfs: check the nodatasum flag when writing compressed files
Btrfs: copy string correctly in INO_LOOKUP ioctl
Btrfs: don't print the leaf if we had an error
btrfs: make btrfs_set_root_node void
Btrfs: fix oops while writing data to SSD partitions
Btrfs: Protect the readonly flag of block group
...
Fix up trivial conflicts (due to acl and writeback cleanups) in
- fs/btrfs/acl.c
- fs/btrfs/ctree.h
- fs/btrfs/extent_io.c
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
xfs: Fix build breakage in xfs_iops.c when CONFIG_FS_POSIX_ACL is not set
VFS: Reorganise shrink_dcache_for_umount_subtree() after demise of dcache_lock
VFS: Remove dentry->d_lock locking from shrink_dcache_for_umount_subtree()
VFS: Remove detached-dentry counter from shrink_dcache_for_umount_subtree()
switch posix_acl_chmod() to umode_t
switch posix_acl_from_mode() to umode_t
switch posix_acl_equiv_mode() to umode_t *
switch posix_acl_create() to umode_t *
block: initialise bd_super in bdget()
vfs: avoid call to inode_lru_list_del() if possible
vfs: avoid taking inode_hash_lock on pipes and sockets
vfs: conditionally call inode_wb_list_del()
VFS: Fix automount for negative autofs dentries
Btrfs: load the key from the dir item in readdir into a fake dentry
devtmpfs: missing initialialization in never-hit case
hppfs: missing include
When doing a writepage we call writepages to try and write out any other dirty
pages in the area. This could cause problems where we commit a transaction and
then have somebody else dirtying metadata in the area as we could end up writing
out a lot more than we care about, which could cause latency on anybody who is
waiting for the transaction to completely finish committing. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The variable 'last_index' is calculated in the __btrfs_buffered_write
function and passed as a parameter to the prepare_pages function,
but is not used anywhere in the prepare_pages function.
Remove instances of 'last_index' in these functions.
Signed-off-by: Mitch Harder <mitch.harder@sabayonlinux.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_first_extent_bit() and find_first_extent_bit_state() share
most of the code, and we can just make the former call the latter.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can just use cond_resched_lock().
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These members are not used at all.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
unpin_extent_cache() and add_extent_mapping() shares the same code
that merges extent maps.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
lookup_extent_map() and search_extent_map() can share most of code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
rb_node returned by __tree_search() can be a valid pointer or NULL,
but won't be some errno.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we search a dir item with a specific hash code, we can
just return NULL without further checking if btrfs_search_slot()
returns 1.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since commit f2a97a9dbd
("btrfs: remove all unused functions"), there's no extern functions
at all in ref-cache.c, so just remove the remaining dead code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use wait_event() when possible to avoid code duplication.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If mounting with nodatasum option, we won't csum file data for
general write or direct-io write, and this rule should also be
applied when writing compressed files.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Memory areas [ptr, ptr+total_len] and [name, name+total_len]
may overlap, so it's wrong to use memcpy().
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In __btrfs_free_extent we will print the leaf if we fail to find the extent we
wanted, but the problem is if we get an error we won't have a leaf so often this
leads to a NULL pointer dereference and we lose the error that actually
occurred. So only print the leaf if ret > 0, which means we didn't find the
item we were looking for but we didn't error either. This way the error is
preserved.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is fairly trivial - btrfs_set_root_node() - always returns zero so we
can just make it void. All callers ignore the return code now anyway. I
also made sure to check that none of the functions that
btrfs_set_root_node() calls returns an error that we might have needed to
catch and pass back.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Here I have a two SSD-partitions btrfs, and they are defaultly set to
"data=raid0, metadata=raid1", then I try to fill my btrfs partition
till "No space left on device", via "dd if=/dev/zero of=/mnt/btrfs/tmp".
I get an oops panic from kernel BUG at fs/btrfs/extent-tree.c:5199!, which
refers to find_free_extent's
BUG_ON(index != get_block_group_index(block_group));
In SSD mode, in order to find enough space to alloc, we may check the
block_group cache which has been checked sometime before, but the index is not
updated, where it hits the BUG_ON.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Acked-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The access for ro in btrfs_block_group_cache should be protected
because of the racy lock in relocation.
Signed-off-by: Wu Bo <wu.bo@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The set/clear bit and the extent split/merge hooks only ever return 0.
Changing them to return void simplifies the error handling cases later.
This patch changes the hook prototypes, the single implementation of each,
and the functions that call them to return void instead.
Since all four of these hooks execute under a spinlock, they're necessarily
simple.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We passed the wrong value to btrfs_force_ra(). Fix this by changing
the argument of btrfs_force_ra() from last_index to nr_page.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs_unlink_inode() and btrfs_orphan_add() in btrfs_unlink()
are error, the error code is returned to the caller instead of
BUG_ON().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Don't need to check the return value of __btrfs_add_inode_defrag(),
since it will always return 0.
Signed-off-by: Wanlong Gao <gaowanlong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs we have 2 indexes for inodes. One is for readdir, it's in this nice
sequential order and works out brilliantly for readdir. However if you use ls,
it usually stat's each file it gets from readdir. This is where the second
index comes in, which is based on a hash of the name of the file. So then the
lookup has to lookup this index, and then lookup the inode. The index lookup is
going to be in random order (since its based on the name hash), which gives us
less than stellar performance. Since we know the inode location from the
readdir index, I create a dummy dentry and copy the location key into
dentry->d_fsdata. Then on lookup if we have d_fsdata we use that location to
lookup the inode, avoiding looking up the other directory index. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure reserve_metadata_bytes doesn't leak out strange errors
Btrfs: use the commit_root for reading free_space_inode crcs
Btrfs: reduce extent_state lock contention for metadata
Btrfs: remove lockdep magic from btrfs_next_leaf
Btrfs: make a lockdep class for each root
Btrfs: switch the btrfs tree locks to reader/writer
Btrfs: fix deadlock when throttling transactions
Btrfs: stop using highmem for extent_buffers
Btrfs: fix BUG_ON() caused by ENOSPC when relocating space
Btrfs: tag pages for writeback in sync
Btrfs: fix enospc problems with delalloc
Btrfs: don't flush delalloc arbitrarily
Btrfs: use find_or_create_page instead of grab_cache_page
Btrfs: use a worker thread to do caching
Btrfs: fix how we merge extent states and deal with cached states
Btrfs: use the normal checksumming infrastructure for free space cache
Btrfs: serialize flushers in reserve_metadata_bytes
Btrfs: do transaction space reservation before joining the transaction
Btrfs: try to only do one btrfs_search_slot in do_setxattr
The btrfs transaction code will return any errors that come from
reserve_metadata_bytes. We need to make sure we don't return funny
things like 1 or EAGAIN.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Now that we are using regular file crcs for the free space cache,
we can deadlock if we try to read the free_space_inode while we are
updating the crc tree.
This commit fixes things by using the commit_root to read the crcs. This is
safe because we the free space cache file would already be loaded if
that block group had been changed in the current transaction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
For metadata buffers that don't straddle pages (all of them), btrfs
can safely use the page uptodate bits and extent_buffer uptodate bit
instead of needing to use the extent_state tree.
This greatly reduces contention on the state tree lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Before the reader/writer locks, btrfs_next_leaf needed to keep
the path blocking to avoid making lockdep upset.
Now that btrfs_next_leaf only takes read locks, this isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch was originally from Tejun Heo. lockdep complains about the btrfs
locking because we sometimes take btree locks from two different trees at the
same time. The current classes are based only on level in the btree, which
isn't enough information for lockdep to figure out if the lock is safe.
This patch makes a class for each type of tree, and lumps all the FS trees that
actually have files and directories into the same class.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs metadata btree is the source of significant
lock contention, especially in the root node. This
commit changes our locking to use a reader/writer
lock.
The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking. Atomics
count the number of blocking readers or writers at any
given time.
It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.
In read heavy workloads this is dramatically faster. In write
heavy workloads we're still faster because of less contention
on the root node lock.
We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hit this nice little deadlock. What happens is this
__btrfs_end_transaction with throttle set, --use_count so it equals 0
btrfs_commit_transaction
<somebody else actually manages to start the commit>
btrfs_end_transaction --use_count so now its -1 <== BAD
we just return and wait on the transaction
This is bad because we just return after our use_count is -1 and don't let go
of our num_writer count on the transaction, so the guy committing the
transaction just sits there forever. Fix this by inc'ing our use_count if we're
going to call commit_transaction so that if we call btrfs_end_transaction it's
valid. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.
The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.
This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we balanced the chunks across the devices, BUG_ON() in
__finish_chunk_alloc() was triggered.
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:2568!
[SNIP]
Call Trace:
[<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
[<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
[<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
[<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
[<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
[<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
[<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
[<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
[<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
[<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
[<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
[<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
[<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
[<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
[<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
[<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
[<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
[<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
[<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
[<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
[<ffffffff81159c63>] ? generic_permission+0x23/0xb0
[<ffffffff8115b415>] vfs_create+0xa5/0xc0
[<ffffffff8115ce6e>] do_last+0x5fe/0x880
[<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
[<ffffffff8115e029>] do_filp_open+0x49/0xa0
[<ffffffff8116a965>] ? alloc_fd+0x95/0x160
[<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
[<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
[<ffffffff8114f1e0>] sys_open+0x20/0x30
[<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
The reason is:
Task1 Space balance task
do_chunk_alloc()
__finish_chunk_alloc()
update device info
in the chunk tree
alloc system metadata block
relocate system metadata block group
set system metadata block group
readonly, This block group is the
only one that can allocate space. So
there is no free space that can be
allocated now.
find no space and don't try
to alloc new chunk, and then
return ENOSPC
BUG_ON() in __finish_chunk_alloc()
was triggered.
Fix this bug by allocating a new system metadata chunk before relocating the
old one if we find there is no free space which can be allocated after setting
the old block group to be read-only.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everybody else does this, we need to do it too. If we're syncing, we need to
tag the pages we're going to write for writeback so we don't end up writing the
same stuff over and over again if somebody is constantly redirtying our file.
This will keep us from having latencies with heavy sync workloads. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
So I had this brilliant idea to use atomic counters for outstanding and reserved
extents, but this turned out to be a bad idea. Consider this where we have 1
outstanding extent and 1 reserved extent
Reserver Releaser
atomic_dec(outstanding) now 0
atomic_read(outstanding)+1 get 1
atomic_read(reserved) get 1
don't actually reserve anything because
they are the same
atomic_cmpxchg(reserved, 1, 0)
atomic_inc(outstanding)
atomic_add(0, reserved)
free reserved space for 1 extent
Then the reserver now has no actual space reserved for it, and when it goes to
finish the ordered IO it won't have enough space to do it's allocation and you
get those lovely warnings.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Kill the check to see if we have 512mb of reserved space in delalloc and
shrink_delalloc if we do. This causes unexpected latencies and we have other
logic to see if we need to throttle. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we
need GFP_NOFS so we don't deadlock. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported a deadlock when copying a bunch of files. This is because they
were low on memory and kthreadd got hung up trying to migrate pages for an
allocation when starting the caching kthread. The page was locked by the person
starting the caching kthread. To fix this we just need to use the async thread
stuff so that the threads are already created and we don't have to worry about
deadlocks. Thanks,
Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
merge fchmod() and fchmodat() guts, kill ancient broken kludge
xfs: fix misspelled S_IS...()
xfs: get rid of open-coded S_ISREG(), etc.
vfs: document locking requirements for d_move, __d_move and d_materialise_unique
omfs: fix (mode & S_IFDIR) abuse
btrfs: S_ISREG(mode) is not mode & S_IFREG...
ima: fmode_t misspelled as mode_t...
pci-label.c: size_t misspelled as mode_t
jffs2: S_ISLNK(mode & S_IFMT) is pointless
snd_msnd ->mode is fmode_t, not mode_t
v9fs_iop_get_acl: get rid of unused variable
vfs: dont chain pipe/anon/socket on superblock s_inodes list
Documentation: Exporting: update description of d_splice_alias
fs: add missing unlock in default_llseek()
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In addition to properly handling allocation failure from btrfs_alloc_path, I
also fixed up the kzalloc error handling code immediately below it.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I also removed the BUG_ON from error return of find_next_chunk in
init_first_rw_device(). It turns out that the only caller of
init_first_rw_device() also BUGS on any nonzero return so no actual behavior
change has occurred here.
do_chunk_alloc() also needed an update since it calls btrfs_alloc_chunk()
which can now return -ENOMEM. Instead of setting space_info->full on any
error from btrfs_alloc_chunk() I catch and return every error value _except_
-ENOSPC. Thanks goes to Tsutomu Itoh for pointing that issue out.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: posix_acl_create(&acl, gfp, mode_p). Replaces acl with
modified clone, on failure releases acl and replaces with NULL.
Returns 0 or -ve on error. All callers of posix_acl_create_masq()
switched.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: posix_acl_chmod(&acl, gfp, mode). Replaces acl with modified
clone or with NULL if that has failed; returns 0 or -ve on error. All
callers of posix_acl_chmod_masq() switched to that - they'd been doing
exactly the same thing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This moves logic for checking the cached ACL values from low-level
filesystems into generic code. The end result is a streamlined ACL
check that doesn't need to load the inode->i_op->check_acl pointer at
all for the common cached case.
The filesystems also don't need to check for a non-blocking RCU walk
case in their acl_check() functions, because that is all handled at a
VFS layer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
both callers there have dentry->d_parent stabilized by the fact that
their caller had obtained dentry from lookup_one_len() and had not
dropped ->i_mutex on parent since then.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs needs to be able to control how filemap_write_and_wait_range() is called
in fsync to make it less of a painful operation, so push down taking i_mutex and
the calling of filemap_write_and_wait() down into the ->fsync() handlers. Some
file systems can drop taking the i_mutex altogether it seems, like ext3 and
ocfs2. For correctness sake I just pushed everything down in all cases to make
sure that we keep the current behavior the same for everybody, and then each
individual fs maintainer can make up their mind about what to do from there.
Thanks,
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data. Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch changes the security_inode_init_security API by adding a
filesystem specific callback to write security extended attributes.
This change is in preparation for supporting the initialization of
multiple LSM xattrs and the EVM xattr. Initially the callback function
walks an array of xattrs, writing each xattr separately, but could be
optimized to write multiple xattrs at once.
For existing security_inode_init_security() calls, which have not yet
been converted to use the new callback function, such as those in
reiserfs and ocfs2, this patch defines security_old_inode_init_security().
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Dealing with this seems trivial - the only caller of btrfs_balance() is
btrfs_ioctl() which passes the error code directly back to userspace. There
also isn't much state to unwind (if I'm wrong about this point, we can
always safely move the allocation to the top of btrfs_balance() anyway).
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
btrfs_iget() also needed an update so that errors from btrfs_locked_inode()
are caught and bubbled back up.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
I moved the path allocation up a few lines to the top of the function so
that we couldn't get into the state where we've dropped delayed items and
the extent cache but fail due to -ENOMEM.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The two ->process_func call sites in tree-log.c which were ignoring a return
code have also been updated to gracefully exit as well.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch fixes many callers of btrfs_alloc_path() which BUG_ON allocation
failure. All the sites that are fixed in this patch were checked by me to
be fairly trivial to fix because of at least one of two criteria:
- Callers of the function catch errors from it already so bubbling the
error up will be handled.
- Callers of the function might BUG_ON any nonzero return code in which
case there is no behavior changed (but we still got to remove a BUG_ON)
The following functions were updated:
btrfs_lookup_extent, alloc_reserved_tree_block, btrfs_remove_block_group,
btrfs_lookup_csums_range, btrfs_csum_file_blocks, btrfs_mark_extent_written,
btrfs_inode_by_name, btrfs_new_inode, btrfs_symlink,
insert_reserved_file_extent, and run_delalloc_nocow
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
First, we can sometimes free the state we're merging, which means anybody who
calls merge_state() may have the state it passed in free'ed. This is
problematic because we could end up caching the state, which makes caching
useless as the state will no longer be part of the tree. So instead of free'ing
the state we passed into merge_state(), set it's end to the other->end and free
the other state. This way we are sure to cache the correct state. Also because
we can merge states together, instead of only using the cache'd state if it's
start == the start we are looking for, go ahead and use it if the start we are
looking for is within the range of the cached state. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We used to store the checksums of the space cache directly in the space cache,
however that doesn't work out too well if we have more space than we can fit the
checksums into the first page. So instead use the normal checksumming
infrastructure. There were problems with doing this originally but those
problems don't exist now so this works out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We keep having problems with early enospc, and that's because our method of
making space is inherently racy. The problem is we can have one guy trying to
make space for himself, and in the meantime people come in and steal his
reservation. In order to stop this we make a waitqueue and put anybody who
comes into reserve_metadata_bytes on that waitqueue if somebody is trying to
make more space. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have to do weird things when handling enospc in the transaction joining code.
Because we've already joined the transaction we cannot commit the transaction
within the reservation code since it will deadlock, so we have to return EAGAIN
and then make sure we don't retry too many times. Instead of doing this, just
do the reservation the normal way before we join the transaction, that way we
can do whatever we want to try and reclaim space, and then if it fails we know
for sure we are out of space and we can return ENOSPC. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I've been watching how many btrfs_search_slot()'s we do and I noticed that when
we create a file with selinux enabled we were doing 2 each time we initialize
the security context. That's because we lookup the xattr first so we can delete
it if we're setting a new value to an existing xattr. But in the create case we
don't have any xattrs, so it is completely useless to have the extra lookup. So
re-arrange things so that we only lookup first if we specifically have
XATTR_REPLACE. That way in the basic case we only do 1 search, and in the more
complicated case we do the normal 2 lookups. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Pass struct wb_writeback_work all the way down to writeback_sb_inodes(),
and initialize the struct writeback_control there.
struct writeback_control is basically designed to control writeback of a
single file, but we keep abuse it for writing multiple files in
writeback_sb_inodes() and its callers.
It immediately clean things up, e.g. suddenly wbc.nr_to_write vs
work->nr_pages starts to make sense, and instead of saving and restoring
pages_skipped in writeback_sb_inodes it can always start with a clean
zero value.
It also makes a neat IO pattern change: large dirty files are now
written in the full 4MB writeback chunk size, rather than whatever
remained quota in wbc->nr_to_write.
Acked-by: Jan Kara <jack@suse.cz>
Proposed-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
btrfs: fix oops when doing space balance
Btrfs: don't panic if we get an error while balancing V2
btrfs: add missing options displayed in mount output
A user reported an error where if we try to balance an fs after a device has
been removed it will blow up. This is because we get an EIO back and this is
where BUG_ON(ret) bites us in the ass. To fix we just exit. Thanks,
Reported-by: Anand Jain <Anand.Jain@oracle.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are three missed mount options settable by user which are not
currently displayed in mount output.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
btrfs: fix inconsonant inode information
Btrfs: make sure to update total_bitmaps when freeing cache V3
Btrfs: fix type mismatch in find_free_extent()
Btrfs: make sure to record the transid in new inodes
When iputting the inode, We may leave the delayed nodes if they have some
delayed items that have not been dealt with. So when the inode is read again,
we must look up the relative delayed node, and use the information in it to
initialize the inode. Or we will get inconsonant inode information, it may
cause that the same directory index number is allocated again, and hit the
following oops:
[ 5447.554187] err add delayed dir index item(name: pglog_0.965_0) into the
insertion tree of the delayed node(root id: 262, inode id: 258, errno: -17)
[ 5447.569766] ------------[ cut here ]------------
[ 5447.575361] kernel BUG at fs/btrfs/delayed-inode.c:1301!
[SNIP]
[ 5447.790721] Call Trace:
[ 5447.793191] [<ffffffffa0641c4e>] btrfs_insert_dir_item+0x189/0x1bb [btrfs]
[ 5447.800156] [<ffffffffa0651a45>] btrfs_add_link+0x12b/0x191 [btrfs]
[ 5447.806517] [<ffffffffa0651adc>] btrfs_add_nondir+0x31/0x58 [btrfs]
[ 5447.812876] [<ffffffffa0651d6a>] btrfs_create+0xf9/0x197 [btrfs]
[ 5447.818961] [<ffffffff8111f840>] vfs_create+0x72/0x92
[ 5447.824090] [<ffffffff8111fa8c>] do_last+0x22c/0x40b
[ 5447.829133] [<ffffffff8112076a>] path_openat+0xc0/0x2ef
[ 5447.834438] [<ffffffff810c58e2>] ? __perf_event_task_sched_out+0x24/0x44
[ 5447.841216] [<ffffffff8103ecdd>] ? perf_event_task_sched_out+0x59/0x67
[ 5447.847846] [<ffffffff81121a79>] do_filp_open+0x3d/0x87
[ 5447.853156] [<ffffffff811e126c>] ? strncpy_from_user+0x43/0x4d
[ 5447.859072] [<ffffffff8111f1f5>] ? getname_flags+0x2e/0x80
[ 5447.864636] [<ffffffff8111f179>] ? do_getname+0x14b/0x173
[ 5447.870112] [<ffffffff8111f1b7>] ? audit_getname+0x16/0x26
[ 5447.875682] [<ffffffff8112b1ab>] ? spin_lock+0xe/0x10
[ 5447.880882] [<ffffffff81112d39>] do_sys_open+0x69/0xae
[ 5447.886153] [<ffffffff81112db1>] sys_open+0x20/0x22
[ 5447.891114] [<ffffffff813b9aab>] system_call_fastpath+0x16/0x1b
Fix it by reusing the old delayed node.
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
A user reported this bug again where we have more bitmaps than we are supposed
to. This is because we failed to load the free space cache, but don't update
the ctl->total_bitmaps counter when we remove entries from the tree. This patch
fixes this problem and we should be good to go again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
data parameter should be u64 because a full-sized chunk flags field is
passed instead of 0/1 for distinguishing data from metadata. All
underlying functions expect u64.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we create a new inode, we aren't filling in the
field that records the transaction that last changed this
inode.
If we then go to fsync that inode, it will be skipped because the field
isn't filled in.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It was pointed out by 'make versioncheck' that some includes of
linux/version.h were not needed in fs/ (fs/btrfs/ctree.h and
fs/omfs/file.c).
This patch removes them.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: avoid delayed metadata items during commits
btrfs: fix uninitialized return value
btrfs: fix wrong reservation when doing delayed inode operations
btrfs: Remove unused sysfs code
btrfs: fix dereference of ERR_PTR value
Btrfs: fix relocation races
Btrfs: set no_trans_join after trying to expand the transaction
Btrfs: protect the pending_snapshots list with trans_lock
Btrfs: fix path leakage on subvol deletion
Btrfs: drop the delalloc_bytes check in shrink_delalloc
Btrfs: check the return value from set_anon_super
Snapshot creation has two phases. One is the initial snapshot setup,
and the second is done during commit, while nobody is allowed to modify
the root we are snapshotting.
The delayed metadata insertion code can break that rule, it does a
delayed inode update on the inode of the parent of the snapshot,
and delayed directory item insertion.
This makes sure to run the pending delayed operations before we
record the snapshot root, which avoids corruptions.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When allocation fails in btrfs_read_fs_root_no_name, ret is not set
although it is returned, holding a garbage value.
Signed-off-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Removes code no longer used. The sysfs file itself is kept, because the
btrfs developers expressed interest in putting new entries to sysfs.
Signed-off-by: Maarten Lankhorst <m.b.lankhorst@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The recent commit to get rid of our trans_mutex introduced
some races with block group relocation. The problem is that relocation
needs to do some record keeping about each root, and it was relying
on the transaction mutex to coordinate things in subtle ways.
This fix adds a mutex just for the relocation code and makes sure
it doesn't have a big impact on normal operations. The race is
really fixed in btrfs_record_root_in_trans, which is where we
step back and wait for the relocation code to finish accounting
setup.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can lockup if we try to allow new writers join the transaction and we have
flushoncommit set or have a pending snapshot. This is because we set
no_trans_join and then loop around and try to wait for ordered extents again.
The problem is the ordered endio stuff needs to join the transaction, which it
can't do because no_trans_join is set. So instead wait until after this loop to
set no_trans_join and then make sure to wait for num_writers == 1 in case
anybody got started in between us exiting the loop and setting no_trans_join.
This could easily be reproduced by mounting -o flushoncommit and running xfstest
13. It cannot be reproduced with this patch. Thanks,
Reported-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently there is nothing protecting the pending_snapshots list on the
transaction. We only hold the directory mutex that we are snapshotting and a
read lock on the subvol_sem, so we could race with somebody else creating a
snapshot in a different directory and end up with list corruption. So protect
this list with the trans_lock. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The delayed ref patch accidently removed the btrfs_free_path in
btrfs_unlink_subvol, this puts it back and means we don't leak a path. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: use join_transaction in btrfs_evict_inode()
Btrfs - use %pU to print fsid
Btrfs: fix extent state leak on failed nodatasum reads
btrfs: fix unlocked access of delalloc_inodes
Btrfs: avoid stack bloat in btrfs_ioctl_fs_info()
btrfs: remove 64bit alignment padding to allow extent_buffer to fit into one fewer cacheline
Btrfs: clear current->journal_info on async transaction commit
Btrfs: make sure to recheck for bitmaps in clusters
btrfs: remove unneeded includes from scrub.c
btrfs: reinitialize scrub workers
btrfs: scrub: errors in tree enumeration
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: unlock the trans lock properly
Btrfs: don't map extent buffer if path->skip_locking is set
Btrfs: fix duplicate checking logic
Btrfs: fix the allocator loop logic
Btrfs: fix bitmap regression
Btrfs: don't commit the transaction if we dont have enough pinned bytes
Btrfs: noinline the cluster searching functions
Btrfs: cache bitmaps when searching for a cluster
The WARN_ON() in start_transaction() was triggered while balancing.
The cause is btrfs_relocate_chunk() started a transaction and
then called iput() on the inode that stores free space cache,
and iput() called btrfs_start_transaction() again.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Get rid of FIXME comment. Uuids from dmesg are now the same as uuids
given by btrfs-progs.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When encountering an EIO while reading from a nodatasum extent, we
insert an error record into the inode's failure tree.
btrfs_readpage_end_io_hook returns early for nodatasum inodes. We'd
better clear the failure tree in that case, otherwise the kernel
complains about
BUG extent_state: Objects remaining on kmem_cache_close()
on rmmod.
Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
list_splice_init will make delalloc_inodes empty, but without a spinlock
around, this may produce corrupted list head, accessed in many placess,
The race window is very tight and nobody seems to have hit it so far.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of struct btrfs_ioctl_fs_info_args is as big as 1KB, so
don't declare the variable on stack.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reorder extent_buffer to remove 8 bytes of alignment padding on 64 bit
builds. This shrinks its size to 128 bytes allowing it to fit into one
fewer cache lines and allows more objects per slab in its kmem_cache.
slabinfo extent_buffer reports :-
before:-
Sizes (bytes) Slabs
----------------------------------
Object : 136 Total : 123
SlabObj: 136 Full : 121
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 30
after :-
Object : 128 Total : 4
SlabObj: 128 Full : 2
SlabSiz: 4096 Partial: 0
Loss : 0 CpuSlab: 2
Align : 8 Objects: 32
Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Normally current->jouranl_info is cleared by commit_transaction. For an
async snap or subvol creation, though, it runs in a work queue. Clear
it in btrfs_commit_transaction_async() to avoid leaking a non-NULL
journal_info when we return to userspace. When the actual commit runs in
the other thread it won't care that it's current->journal_info is already
NULL.
Signed-off-by: Sage Weil <sage@newdream.net>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef recently changed the free extent cache to look in
the block group cluster for any bitmaps before trying to
add a new bitmap for the same offset. This avoids BUG_ON()s due
covering duplicate ranges.
But it didn't go quite far enough. A given free range might span
between one or more bitmaps or free space entries. The code has
looping to cover this, but it doesn't check for clustered bitmaps
every time.
This shuffles our gotos to check for a bitmap in the cluster
for every new bitmap entry we try to add.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Scrub starts the workers each time a scrub starts and stops them after it
finished. This patch adds an initialization for the workers before each
start, otherwise the workers behave strangely.
Signed-off-by: Arne Jansen <sensille@gmx.net>
due to the semantics of btrfs_search_slot the path can point to an
invalid slot when ret > 0. This condition went unnoticed, which in
turn could have led to an incomplete scrubbing.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
In btrfs_wait_for_commit if we came upon a transaction that had committed we
just exited, but that's bad since we are holding the trans_lock. So break
instead so that the lock is dropped. Thanks,
Reported-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Josef Bacik <josef@redhat.com>
Arne's scrub stuff exposed a problem with mapping the extent buffer in
reada_for_search. He searches the commit root with multiple threads and with
skip_locking set, so we can race and overwrite node->map_token since node isn't
locked. So fix this so that we only map the extent buffer if we don't already
have a map_token and skip_locking isn't set. Without this patch scrub would
panic almost immediately, with the patch it doesn't panic anymore. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
When merging my code into the integration test the second check for duplicate
entries got screwed up. This patch fixes it by dropping ret2 and just using ret
for the return value, and checking if we got an error before adding the bitmap
to the local list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I was testing with empty_cluster = 0 to try and reproduce a problem and kept
hitting early enospc panics. This was because our loop logic was a little
confused. So this is what I did
1) Make the loop variable the ultimate decider on wether we should loop again
isntead of checking to see if we had an uncached bg, empty size or empty
cluster.
2) Increment loop before checking to see what we are on to make the loop
definitions make more sense.
3) If we are on the chunk alloc loop don't set empty_size/empty_cluster to 0
unless we didn't actually allocate a chunk. If we did allocate a chunk we
should be able to easily setup a new cluster so clearing
empty_size/empty_cluster makes us less efficient.
This kept me from hitting panics while trying to reproduce the other problem.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In cleaning up the clustering code I accidently introduced a regression by
adding bitmap entries to the cluster rb tree. The problem is if we've maxed out
the number of bitmaps we can have for the block group we can only add free space
to the bitmaps, but since the bitmap is on the cluster we can't find it and we
try to create another one. This would result in a panic because the total
bitmaps was bigger than the max bitmaps that were allowed. This patch fixes
this by checking to see if we have a cluster, and then looking at the cluster rb
tree to see if it has a bitmap entry and if it does and that space belongs to
that bitmap, go ahead and add it to that bitmap.
I could hit this panic every time with an fs_mark test within a couple of
minutes. With this patch I no longer hit the panic and fs_mark goes to
completion. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I noticed when running an enospc test that we would get stuck committing the
transaction in check_data_space even though we truly didn't have enough space.
So check to see if bytes_pinned is bigger than num_bytes, if it's not don't
commit the transaction. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When profiling the find cluster code it's hard to tell where we are spending our
time because the bitmap and non-bitmap functions get inlined by the compiler, so
make that not happen. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we are looking for a cluster in a particularly sparse or fragmented block
group, we will do a lot of looping through the free space tree looking for
various things, and if we need to look at bitmaps we will endup doing the whole
dance twice. So instead add the bitmap entries to a temporary list so if we
have to do the bitmap search we can just look through the list of entries we've
found quickly instead of having to loop through the entire tree again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
vfs: make unlink() and rmdir() return ENOENT in preference to EROFS
lmLogOpen() broken failure exit
usb: remove bad dput after dentry_unhash
more conservative S_NOSEC handling
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
btrfs: fix uninitialized variable warning
btrfs: add helper for fs_info->closing
Btrfs: add mount -o inode_cache
btrfs: scrub: add explicit plugging
btrfs: use btrfs_ino to access inode number
Btrfs: don't save the inode cache if we are deleting this root
btrfs: false BUG_ON when degraded
Btrfs: don't save the inode cache in non-FS roots
Btrfs: make sure we don't overflow the free space cache crc page
Btrfs: fix uninit variable in the delayed inode code
btrfs: scrub: don't reuse bios and pages
Btrfs: leave spinning on lookup and map the leaf
Btrfs: check for duplicate entries in the free space cache
Btrfs: don't try to allocate from a block group that doesn't have enough space
Btrfs: don't always do readahead
Btrfs: try not to sleep as much when doing slow caching
Btrfs: kill BTRFS_I(inode)->block_group
Btrfs: don't look at the extent buffer level 3 times in a row
Btrfs: map the node block when looking for readahead targets
Btrfs: set range_start to the right start in count_range_bits
...
With Linus' tree, today's linux-next build (powercp ppc64_defconfig)
produced this warning:
fs/btrfs/delayed-inode.c: In function 'btrfs_delayed_update_inode':
fs/btrfs/delayed-inode.c:1598:6: warning: 'ret' may be used
uninitialized in this function
Introduced by commit 16cdcec736 ("btrfs: implement delayed inode items
operation").
This fixes a bug in btrfs_update_inode(): if the returned value from
btrfs_delayed_update_inode is a nonzero garbage, inode stat data are not
updated and several call paths may hit a BUG_ON or fail with strange
code.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David Sterba <dsterba@suse.cz>
This makes the inode map cache default to off until we
fix the overflow problem when the free space crcs don't fit
inside a single page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the removal of the implicit plugging scrub ends up doing more and
smaller I/O than necessary. This patch adds explicit plugging per chunk.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
commit 4cb5300bc ("Btrfs: add mount -o auto_defrag") accesses inode
number directly while it should use the helper with the new inode
number allocator.
Signed-off-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With xfstest 254 I can panic the box every time with the inode number caching
stuff on. This is because we clean the inodes out when we delete the subvolume,
but then we write out the inode cache which adds an inode to the subvolume inode
tree, and then when it gets evicted again the root gets added back on the dead
roots list and is deleted again, so we have a double free. To stop this from
happening just return 0 if refs is 0 (and we're not the tree root since tree
root always has refs of 0). With this fix 254 no longer panics. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Tested-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In degraded mode the struct btrfs_device of missing devs don't have
device->name set. A kstrdup of NULL correctly returns NULL. Don't
BUG in this case.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds extra checks to make sure the inode map we are caching really
belongs to a FS root instead of a special relocation tree. It
prevents crashes during balancing operations.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The free space cache uses only one page for crcs right now,
which means we can't have a cache file bigger than the
crcs we can fit in the first page. This adds a check to
enforce that restriction.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The current scrub implementation reuses bios and pages as often as possible,
allocating them only on start and releasing them when finished. This leads
to more problems with the block layer than it's worth. The elevator gets
confused when there are more pages added to the bio than bi_size suggests.
This patch completely rips out the reuse of bios and pages and allocates
them freshly for each submit.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Maosn <chris.mason@oracle.com>
Caching "we have already removed suid/caps" was overenthusiastic as merged.
On network filesystems we might have had suid/caps set on another client,
silently picked by this client on revalidate, all of that *without* clearing
the S_NOSEC flag.
AFAICS, the only reasonably sane way to deal with that is
* new superblock flag; unless set, S_NOSEC is not going to be set.
* local block filesystems set it in their ->mount() (more accurately,
mount_bdev() does, so does btrfs ->mount(), users of mount_bdev() other than
local block ones clear it)
* if any network filesystem (or a cluster one) wants to use S_NOSEC,
it'll need to set MS_NOSEC in sb->s_flags *AND* take care to clear S_NOSEC when
inode attribute changes are picked from other clients.
It's not an earth-shattering hole (anybody that can set suid on another client
will almost certainly be able to write to the file before doing that anyway),
but it's a bug that needs fixing.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (36 commits)
Cache xattr security drop check for write v2
fs: block_page_mkwrite should wait for writeback to finish
mm: Wait for writeback when grabbing pages to begin a write
configfs: remove unnecessary dentry_unhash on rmdir, dir rename
fat: remove unnecessary dentry_unhash on rmdir, dir rename
hpfs: remove unnecessary dentry_unhash on rmdir, dir rename
minix: remove unnecessary dentry_unhash on rmdir, dir rename
fuse: remove unnecessary dentry_unhash on rmdir, dir rename
coda: remove unnecessary dentry_unhash on rmdir, dir rename
afs: remove unnecessary dentry_unhash on rmdir, dir rename
affs: remove unnecessary dentry_unhash on rmdir, dir rename
9p: remove unnecessary dentry_unhash on rmdir, dir rename
ncpfs: fix rename over directory with dangling references
ncpfs: document dentry_unhash usage
ecryptfs: remove unnecessary dentry_unhash on rmdir, dir rename
hostfs: remove unnecessary dentry_unhash on rmdir, dir rename
hfsplus: remove unnecessary dentry_unhash on rmdir, dir rename
hfs: remove unnecessary dentry_unhash on rmdir, dir rename
omfs: remove unnecessary dentry_unhash on rmdir, dir rneame
udf: remove unnecessary dentry_unhash from rmdir, dir rename
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (58 commits)
Btrfs: use the device_list_mutex during write_dev_supers
Btrfs: setup free ino caching in a more asynchronous way
btrfs scrub: don't coalesce pages that are logically discontiguous
Btrfs: return -ENOMEM in clear_extent_bit
Btrfs: add mount -o auto_defrag
Btrfs: using rcu lock in the reader side of devices list
Btrfs: drop unnecessary device lock
Btrfs: fix the race between remove dev and alloc chunk
Btrfs: fix the race between reading and updating devices
Btrfs: fix bh leak on __btrfs_open_devices path
Btrfs: fix unsafe usage of merge_state
Btrfs: allocate extent state and check the result properly
fs/btrfs: Add missing btrfs_free_path
Btrfs: check return value of btrfs_inc_extent_ref()
Btrfs: return error to caller if read_one_inode() fails
Btrfs: BUG_ON is deleted from the caller of btrfs_truncate_item & btrfs_extend_item
Btrfs: return error code to caller when btrfs_del_item fails
Btrfs: return error code to caller when btrfs_previous_item fails
btrfs: fix typo 'testeing' -> 'testing'
btrfs: typo: 'btrfS' -> 'btrfs'
...
write_dev_supers was changed to use RCU to protect the list of
devices, but it was then sleeping while it actually wrote the supers.
This fixes it to just use the mutex, since we really don't any
concurrency in write_dev_supers anyway.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
anything else, so that the filesystem can track internally if it
needs to push out a transaction for fdatasync or not.
This is just the prototype change with no user for it yet. I plan
to push large XFS changes for the next merge window, and getting
this trivial infrastructure in this window would help a lot to avoid
tree interdependencies.
Also remove incorrect comments that ->dirty_inode can't block. That
has been changed a long time ago, and many implementations rely on it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For a filesystem that has lots of files in it, the first time we mount
it with free ino caching support, it can take quite a long time to
setup the caching before we can create new files.
Here we fill the cache with [highest_ino, BTRFS_LAST_FREE_OBJECTID]
before we start the caching thread to search through the extent tree.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
scrub_page collects several pages into one bio as long as they are physically
contiguous. As we only save one logical address for the whole bio, don't
collect pages that are physically contiguous but logically discontiguous.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This will detect small random writes into files and
queue the up for an auto defrag process. It isn't well suited to
database workloads yet, but works for smaller files such as rpm, sqlite
or bdb databases.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/djm/tmem:
xen: cleancache shim to Xen Transcendent Memory
ocfs2: add cleancache support
ext4: add cleancache support
btrfs: add cleancache support
ext3: add cleancache support
mm/fs: add hooks to support cleancache
mm: cleancache core ops functions and config
fs: add field to superblock to support cleancache
mm/fs: cleancache documentation
Fix up trivial conflict in fs/btrfs/extent_io.c due to includes
This sixth patch of eight in this cleancache series "opts-in"
cleancache for btrfs. Filesystems must explicitly enable
cleancache by calling cleancache_init_fs anytime an instance
of the filesystem is mounted. Btrfs uses its own readpage
which must be hooked, but all other cleancache hooks are in
the VFS layer including the matching cleancache_flush_fs hook
which must be called on unmount.
Details and a FAQ can be found in Documentation/vm/cleancache.txt
[v6-v8: no changes]
[v5: jeremy@goop.org: simplify init hook and any future fs init changes]
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Jeremy Fitzhardinge <jeremy@goop.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik Van Riel <riel@redhat.com>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andreas Dilger <adilger@sun.com>
Cc: Ted Ts'o <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Nitin Gupta <ngupta@vflare.org>
fs_devices->devices is only updated on remove and add device paths, so we can
use rcu to protect it in the reader side
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Drop device_list_mutex for the reader side on clone_fs_devices and
btrfs_rm_device pathes since the fs_info->volume_mutex can ensure the device
list is not updated
btrfs_close_extra_devices is the initialized path, we can not add or remove
device at this time, so we can simply drop the mutex safely, like other
initialized function does(add_missing_dev, __find_device, __btrfs_open_devices
...).
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On remove device path, it updates device->dev_alloc_list but does not hold
chunk lock
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs_congested_fn and __unplug_io_fn paths, we should hold
device_list_mutex to avoid remove/add device path to
update fs_devices->devices
On __btrfs_close_devices and btrfs_prepare_sprout paths, the devices in
fs_devices->devices or fs_devices->devices is updated, so we should hold
the mutex to avoid the reader side to reach them
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'bh' is forgot to release if no error is detected
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
merge_state can free the current state if it can be merged with the next node,
but in set_extent_bit(), after merge_state, we still use the current extent to
get the next node and cache it into cached_state
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It doesn't allocate extent_state and check the result properly:
- in set_extent_bit, it doesn't allocate extent_state if the path is not
allowed wait
- in clear_extent_bit, it doesn't check the result after atomic-ly allocate,
we trigger BUG_ON() if it's fail
- if allocate fail, we trigger BUG_ON instead of returning -ENOMEM since
the return value of clear_extent_bit() is ignored by many callers
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs_alloc_path should be matched with btrfs_free_path in error-handling code.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@r exists@
local idexpression struct btrfs_path * x;
expression ra,rb;
position p1,p2;
@@
x = btrfs_alloc_path@p1(...)
... when != btrfs_free_path(x,...)
when != if (...) { ... btrfs_free_path(x,...) ...}
when != x = ra
if(...) { ... when != x = rb
when forall
when != btrfs_free_path(x,...)
\(return <+...x...+>; \| return@p2...; \) }
@script:python@
p1 << r.p1;
p2 << r.p2;
@@
cocci.print_main("alloc",p1)
cocci.print_secs("return",p2)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If return value of btrfs_inc_extent_ref() is not 0, BUG() is called.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When read_one_inode() fails, error code is returned to caller instead
of BUG_ON().
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently, btrfs_truncate_item and btrfs_extend_item returns only 0.
So, the check by BUG_ON in the caller is unnecessary.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error code is returned instead of calling BUG_ON when
btrfs_del_item returns the error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error code is returned instead of calling BUG_ON when
btrfs_previous_item returns the error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Observed as a large delay when --mixed filesystem is filled up.
Test example:
1. create tiny --mixed FS:
$ dd if=/dev/zero of=2G.img seek=$((2048 * 1024 * 1024 - 1)) count=1 bs=1
$ mkfs.btrfs --mixed 2G.img
$ mount -oloop 2G.img /mnt/ut/
2. Try to fill it up:
$ dd if=/dev/urandom of=10M.file bs=10240 count=1024
$ seq 1 256 | while read file_no; do echo $file_no; time cp 10M.file ${file_no}.copy; done
Up to '200.copy' it goes fast, but when disk fills-up each -ENOSPC
message takes 3 seconds to pop-up _every_ ENOSPC (and in usermode linux
it's even more: 30-60 seconds!). (Maybe, time depends on kernel's timer resolution).
No IO, no CPU load, just rescheduling. Some debugging revealed busy spinning
in shrink_delalloc.
Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In 2008, commit b4f6c45dfb dropped the use
of fs/btrfs/version.sh, but left the script behind. Kill it.
Commit by Jamey Sharp and Josh Triplett.
Signed-off-by: Jamey Sharp <jamey@minilop.net>
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs's tree search ioctl has a field to indicate that no more than a
given number of records should be returned. The ioctl doesn't honour
this, as the tested value is not incremented until the end of the
copy_to_sk function. This patch removes an unnecessary local variable,
and updates the num_found counter as each key is found in the tree.
Signed-off-by: Hugo Mills <hugo@carfax.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
240f62c875 replaced the node_lock with rcu_read_lock, but forgot
to remove the actual lock in the data structure. Remove it here.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On lookup we only want to read the inode item, so leave the path spinning. Also
we're just wholesale reading the leaf off, so map the leaf so we don't do a
bunch of kmap/kunmaps. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If there are duplicate entries in the free space cache, discard the entire cache
and load it the old fashioned way. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have a very large filesystem, we can spend a lot of time in
find_free_extent just trying to allocate from empty block groups. So instead
check to see if the block group even has enough space for the allocation, and if
not go on to the next block group.
Signed-off-by: Josef Bacik <josef@redhat.com>
Our readahead is sort of sloppy, and really isn't always needed. For example if
ls is doing a stating ls (which is the default) it's going to stat in non-disk
order, so if say you have a directory with a stupid amount of files, readahead
is going to do nothing but waste time in the case of doing the stat. Taking the
unconditional readahead out made my test go from 57 minutes to 36 minutes. This
means that everywhere we do loop through the tree we want to make sure we do set
path->reada properly, so I went through and found all of the places where we
loop through the path and set reada to 1. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When the fs is super full and we unmount the fs, we could get stuck in this
thing where unmount is waiting for the caching kthread to make progress and the
caching kthread keeps scheduling because we're in the middle of a commit. So
instead just let the caching kthread keep going and only yeild if
need_resched(). This makes my horrible umount case go from taking up to 10
minutes to taking less than 20 seconds. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Originally this was going to be used as a way to give hints to the allocator,
but frankly we can get much better hints elsewhere and it's not even used at all
for anything usefull. In addition to be completely useless, when we initialize
an inode we try and find a freeish block group to set as the inodes block group,
and with a completely full 40gb fs this takes _forever_, so I imagine with say
1tb fs this is just unbearable. So just axe the thing altoghether, we don't
need it and it saves us 8 bytes in the inode and saves us 500 microseconds per
inode lookup in my testcase. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have a bit of debugging in btrfs_search_slot to make sure the level of the
cow block is the same as the original block we were cow'ing. I don't think I've
ever seen this tripped, so kill it. This saves us 2 kmap's per level in our
search. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
If we have particularly full nodes, we could call btrfs_node_blockptr up to 32
times, which is 32 pairs of kmap/kunmap, which _sucks_. So go ahead and map the
extent buffer while we look for readahead targets. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In count_range_bits we are adjusting total_bytes based on the range we are
searching for, but we don't adjust the range start according to the range we are
searching for, which makes for weird results. For example, if the range
[0-8192]
is set DELALLOC, but I search for 4096-8192, I will get back 4096 for the number
of bytes found, but the range_start will be 0, which makes it look like the
range is [0-4096]. So instead set range_start = max(cur_start, state->start).
This makes everything come out right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The ceph guys keep running into problems where we have space reserved in our
orphan block rsv when freeing it up. This is because they tend to do snapshots
alot, so their truncates tend to use a bunch of space, so when we go to do
things like update the inode we have to steal reservation space in order to make
the reservation happen. This happens because truncate can use as much space as
it freaking feels like, but we still have to hold space for removing the orphan
item and updating the inode, which will definitely always happen. So in order
to fix this we need to split all of the reservation stuf up. So with this patch
we have
1) The orphan block reserve which only holds the space for deleting our orphan
item when everything is over.
2) The truncate block reserve which gets allocated and used specifically for the
space that the truncate will use on a per truncate basis.
3) The transaction will always have 1 item's worth of data reserved so we can
update the inode normally.
Hopefully this will make the ceph problem go away. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We use trans_mutex for lots of things, here's a basic list
1) To serialize trans_handles joining the currently running transaction
2) To make sure that no new trans handles are started while we are committing
3) To protect the dead_roots list and the transaction lists
Really the serializing trans_handles joining is not too hard, and can really get
bogged down in acquiring a reference to the transaction. So replace the
trans_mutex with a trans_lock spinlock and use it to do the following
1) Protect fs_info->running_transaction. All trans handles have to do is check
this, and then take a reference of the transaction and keep on going.
2) Protect the fs_info->trans_list. This doesn't get used too much, basically
it just holds the current transactions, which will usually just be the currently
committing transaction and the currently running transaction at most.
3) Protect the dead roots list. This is only ever processed by splicing the
list so this is relatively simple.
4) Protect the fs_info->reloc_ctl stuff. This is very lightweight and was using
the trans_mutex before, so this is a pretty straightforward change.
5) Protect fs_info->no_trans_join. Because we don't hold the trans_lock over
the entirety of the commit we need to have a way to block new people from
creating a new transaction while we're doing our work. So we set no_trans_join
and in join_transaction we test to see if that is set, and if it is we do a
wait_on_commit.
6) Make the transaction use count atomic so we don't need to take locks to
modify it when we're dropping references.
7) Add a commit_lock to the transaction to make sure multiple people trying to
commit the same transaction don't race and commit at the same time.
8) Make open_ioctl_trans an atomic so we don't have to take any locks for ioctl
trans.
I have tested this with xfstests, but obviously it is a pretty hairy change so
lots of testing is greatly appreciated. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We currently track trans handles in current->journal_info, but we don't actually
use it. This patch fixes it. This will cover the case where we have multiple
people starting transactions down the call chain. This keeps us from having to
allocate a new handle and all of that, we just increase the use count of the
current handle, save the old block_rsv, and return. I tested this with xfstests
and it worked out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I keep forgetting that btrfs_join_transaction() just ignores the num_items
argument, which leads me to sending pointless patches and looking stupid :). So
just kill the num_items argument from btrfs_join_transaction and
btrfs_start_ioctl_transaction, since neither of them use it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In the prealloc filling code and compressed code we don't set trans->block_rsv
to the delalloc block reserve properly, which is going to make us use metadata
from the wrong pool, this patch fixes that. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits)
b43: fix comment typo reqest -> request
Haavard Skinnemoen has left Atmel
cris: typo in mach-fs Makefile
Kconfig: fix copy/paste-ism for dell-wmi-aio driver
doc: timers-howto: fix a typo ("unsgined")
perf: Only include annotate.h once in tools/perf/util/ui/browsers/annotate.c
md, raid5: Fix spelling error in comment ('Ofcourse' --> 'Of course').
treewide: fix a few typos in comments
regulator: change debug statement be consistent with the style of the rest
Revert "arm: mach-u300/gpio: Fix mem_region resource size miscalculations"
audit: acquire creds selectively to reduce atomic op overhead
rtlwifi: don't touch with treewide double semicolon removal
treewide: cleanup continuations and remove logging message whitespace
ath9k_hw: don't touch with treewide double semicolon removal
include/linux/leds-regulator.h: fix syntax in example code
tty: fix typo in descripton of tty_termios_encode_baud_rate
xtensa: remove obsolete BKL kernel option from defconfig
m68k: fix comment typo 'occcured'
arch:Kconfig.locks Remove unused config option.
treewide: remove extra semicolons
...
The current code relogs the entire inode every time during fsync log,
and it is much better suited to small files rather than large ones.
During my performance test, the fsync performace of large files sucks,
and we can ascribe this to the tremendous amount of csum infos of the
large ones, cause we have to flush all of these csum infos into log trees
even when there are only _one_ change in the whole file data. Apparently,
to optimize fsync, we need to create a filter to skip the unnecessary csum
ones, that is, the corresponding file data remains unchanged before this fsync.
Here I have some test results to show, I use sysbench to do "random write + fsync".
===
sysbench --test=fileio --num-threads=1 --file-num=2 --file-block-size=4K --file-total-size=8G --file-test-mode=rndwr --file-io-mode=sync --file-extra-flags= [prepare, run]
===
Sysbench args:
- Number of threads: 1
- Extra file open flags: 0
- 2 files, 4Gb each
- Block size 4Kb
- Number of random requests for random IO: 10000
- Read/Write ratio for combined random IO test: 1.50
- Periodic FSYNC enabled, calling fsync() each 100 requests.
- Calling fsync() at the end of test, Enabled.
- Using synchronous I/O mode
- Doing random write test
Sysbench results:
===
Operations performed: 0 Read, 10000 Write, 200 Other = 10200 Total
Read 0b Written 39.062Mb Total transferred 39.062Mb
===
a) without patch: (*SPEED* : 451.01Kb/sec)
112.75 Requests/sec executed
b) with patch: (*SPEED* : 4.7533Mb/sec)
1216.84 Requests/sec executed
PS: I've made a _sub transid_ stuff patch, but it does not perform as effectively as this patch,
and I'm wanderring where the problem is and trying to improve it more.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Changelog V5 -> V6:
- Fix oom when the memory load is high, by storing the delayed nodes into the
root's radix tree, and letting btrfs inodes go.
Changelog V4 -> V5:
- Fix the race on adding the delayed node to the inode, which is spotted by
Chris Mason.
- Merge Chris Mason's incremental patch into this patch.
- Fix deadlock between readdir() and memory fault, which is reported by
Itaru Kitayama.
Changelog V3 -> V4:
- Fix nested lock, which is reported by Itaru Kitayama, by updating space cache
inode in time.
Changelog V2 -> V3:
- Fix the race between the delayed worker and the task which does delayed items
balance, which is reported by Tsutomu Itoh.
- Modify the patch address David Sterba's comment.
- Fix the bug of the cpu recursion spinlock, reported by Chris Mason
Changelog V1 -> V2:
- break up the global rb-tree, use a list to manage the delayed nodes,
which is created for every directory and file, and used to manage the
delayed directory name index items and the delayed inode item.
- introduce a worker to deal with the delayed nodes.
Compare with Ext3/4, the performance of file creation and deletion on btrfs
is very poor. the reason is that btrfs must do a lot of b+ tree insertions,
such as inode item, directory name item, directory name index and so on.
If we can do some delayed b+ tree insertion or deletion, we can improve the
performance, so we made this patch which implemented delayed directory name
index insertion/deletion and delayed inode update.
Implementation:
- introduce a delayed root object into the filesystem, that use two lists to
manage the delayed nodes which are created for every file/directory.
One is used to manage all the delayed nodes that have delayed items. And the
other is used to manage the delayed nodes which is waiting to be dealt with
by the work thread.
- Every delayed node has two rb-tree, one is used to manage the directory name
index which is going to be inserted into b+ tree, and the other is used to
manage the directory name index which is going to be deleted from b+ tree.
- introduce a worker to deal with the delayed operation. This worker is used
to deal with the works of the delayed directory name index items insertion
and deletion and the delayed inode update.
When the delayed items is beyond the lower limit, we create works for some
delayed nodes and insert them into the work queue of the worker, and then
go back.
When the delayed items is beyond the upper bound, we create works for all
the delayed nodes that haven't been dealt with, and insert them into the work
queue of the worker, and then wait for that the untreated items is below some
threshold value.
- When we want to insert a directory name index into b+ tree, we just add the
information into the delayed inserting rb-tree.
And then we check the number of the delayed items and do delayed items
balance. (The balance policy is above.)
- When we want to delete a directory name index from the b+ tree, we search it
in the inserting rb-tree at first. If we look it up, just drop it. If not,
add the key of it into the delayed deleting rb-tree.
Similar to the delayed inserting rb-tree, we also check the number of the
delayed items and do delayed items balance.
(The same to inserting manipulation)
- When we want to update the metadata of some inode, we cached the data of the
inode into the delayed node. the worker will flush it into the b+ tree after
dealing with the delayed insertion and deletion.
- We will move the delayed node to the tail of the list after we access the
delayed node, By this way, we can cache more delayed items and merge more
inode updates.
- If we want to commit transaction, we will deal with all the delayed node.
- the delayed node will be freed when we free the btrfs inode.
- Before we log the inode items, we commit all the directory name index items
and the delayed inode update.
I did a quick test by the benchmark tool[1] and found we can improve the
performance of file creation by ~15%, and file deletion by ~20%.
Before applying this patch:
Create files:
Total files: 50000
Total time: 1.096108
Average time: 0.000022
Delete files:
Total files: 50000
Total time: 1.510403
Average time: 0.000030
After applying this patch:
Create files:
Total files: 50000
Total time: 0.932899
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.215732
Average time: 0.000024
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
Many thanks for Kitayama-san's help!
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: David Sterba <dave@jikos.cz>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Tested-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.
So this fixes things up a bit, using
grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.
There are more of them around (mostly network drivers), but this gets
many core ones.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Steps to reproduce the bug:
- Call FS_IOC_SETLFAGS ioctl with flags=FS_COMPR_FL
- Call FS_IOC_SETFLAGS ioctl with flags=0
- Call FS_IOC_GETFLAGS ioctl, and you'll see FS_COMPR_FL is still set!
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
FS_COW_FL and FS_NOCOW_FL were newly introduced to control per file
COW in btrfs, but FS_NOCOW_FL is sufficient.
The fact is we don't have corresponding BTRFS_INODE_COW flag.
COW is default, and FS_NOCOW_FL can be used to switch off COW for
a single file.
If we mount btrfs with nodatacow, a newly created file will be set with
the FS_NOCOW_FL flag. So to turn on COW for it, we can just clear the
FS_NOCOW_FL flag.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a btrfs disk is created by mixed data & metadata option, it will have no
pure data or pure metadata space info.
In btrfs's for-linus branch, commit 78b1ea13838039cd88afdd62519b40b344d6c920
(Btrfs: fix OOPS of empty filesystem after balance) initializes space infos at
the very beginning. The problem is this initialization does not take the mixed
case into account, which will cause btrfs will easily get into ENOSPC in mixed
case.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If posix_acl_from_xattr() returns an error code, a negative address is
dereferenced causing an oops; fix by checking for error code first.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In a multi device setup, the chunk allocator currently always allocates
chunks on the devices in the same order. This leads to a very uneven
distribution, especially with RAID1 or RAID10 and an uneven number of
devices.
This patch always sorts the devices before allocating, and allocates the
stripes on the devices with the most available space, as long as there
is enough space available. In a low space situation, it first tries to
maximize striping.
The patch also simplifies the allocator and reduces the checks for
corner cases.
The simplification is done by several means. First, it defines the
properties of each RAID type upfront. These properties are used afterwards
instead of differentiating cases in several places.
Second, the old allocator defined a minimum stripe size for each block
group type, tried to find a large enough chunk, and if this fails just
allocates a smaller one. This is now done in one step. The largest possible
chunk (up to max_chunk_size) is searched and allocated.
Because we now have only one pass, the allocation of the map (struct
map_lookup) is moved down to the point where the number of stripes is
already known. This way we avoid reallocation of the map.
We still avoid allocating stripes that are not a multiple of STRIPE_SIZE.
currently alloc_start is disregarded if the requested
chunk size is bigger than (device size - alloc_start),
but smaller than the device size.
The only situation where I see this could have made sense
was when a chunk equal the size of the device has been
requested. This was possible as the allocator failed to
take alloc_start into account when calculating the request
chunk size. As this gets fixed by this patch, the workaround
is not necessary anymore.
btrfs scrub - make fixups sync, don't reuse fixup bios
Fixups are already sync for csum failures, this patch makes them sync
for EIO case as well.
Fixups are now sharing pages with the parent sbio - instead of
allocating a separate page to do a fixup we grab the page from the sbio
buffer.
Fixup bios are no longer reused.
struct fixup is no longer needed, instead pass [sbio pointer, index].
Originally this was added to look at the possibility of sharing the code
between drive swap and scrub, but it actually fixes a serious bug in
scrub code where errors that could be corrected were ignored and
reported as uncorrectable.
btrfs scrub - restore bios properly after media errors
The current code reallocates a bio after a media error. This is a
temporary measure introduced in v3 after a serious problem related to
bio reuse was found in v2 of scrub patchset.
Basically we did not reset bv_offset and bv_len fields of the bio_vec
structure. They are changed in case I/O error happens, for example, at
offset 512 or 1024 into the page. Also bi_flags field wasn't properly
setup before reusing the bio.
Signed-off-by: Arne Jansen <sensille@gmx.net>
adds ioctls necessary to start and cancel scrubs, to get current
progress and to get info about devices to be scrubbed.
Note that the scrub is done per-device and that the ioctl only
returns after the scrub for this devices is finished or has been
canceled.
Signed-off-by: Arne Jansen <sensille@gmx.net>
This adds an initial implementation for scrub. It works quite
straightforward. The usermode issues an ioctl for each device in the
fs. For each device, it enumerates the allocated device chunks. For
each chunk, the contained extents are enumerated and the data checksums
fetched. The extents are read sequentially and the checksums verified.
If an error occurs (checksum or EIO), a good copy is searched for. If
one is found, the bad copy will be rewritten.
All enumerations happen from the commit roots. During a transaction
commit, the scrubs get paused and afterwards continue from the new
roots.
This commit is based on the series originally posted to linux-btrfs
with some improvements that resulted from comments from David Sterba,
Ilya Dryomov and Jan Schmidt.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Remove code which has been #if0-ed out for a very long time and does not
seem to be related to current codebase anymore.
Signed-off-by: David Sterba <dsterba@suse.cz>
Remove static and global declarations and/or definitions. Reduces size
of btrfs.ko by ~3.4kB.
text data bss dec hex filename
402081 7464 200 409745 64091 btrfs.ko.base
398620 7144 200 405964 631cc btrfs.ko.remove-all
Signed-off-by: David Sterba <dsterba@suse.cz>
parameter tree root it's not used since commit
5f39d397df ("Btrfs: Create extent_buffer
interface for large blocksizes")
Signed-off-by: David Sterba <dsterba@suse.cz>
all callers pass GFP_NOFS, but the GFP mask argument is not used in the
function; GFP_ATOMIC is passed to radix tree initialization and it's the
only correct one, since we're using the preload/insert mechanism of
radix tree.
Let's drop the gfp mask from btrfs function, this will not change
behaviour.
Signed-off-by: David Sterba <dsterba@suse.cz>
The superblock's ->s_fs_info is properly set in btrfs_fill_super, after
a call to open_ctree, which derefereces it before check. Although
tree_root is set via btrfs_set_super, let's be defensive and leave the
check in place.
Signed-off-by: David Sterba <dsterba@suse.cz>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: cleanup error handling in inode.c
Btrfs: put the right bio if we have an error
Btrfs: free bitmaps properly when evicting the cache
Btrfs: Free free_space item properly in btrfs_trim_block_group()
btrfs: add missing spin_unlock to a rare exit path
Btrfs: check return value of kmalloc()
btrfs: fix wrong allocating flag when reading page
Btrfs: fix missing mutex_unlock in btrfs_del_dir_entries_in_log()
The error processing of several places is changed like setting the
error number only at the error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_submit_direct_hook if the first btrfs_map_block fails we need to put
the orig_bio, not bio.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If our space cache is wrong, we do the right thing and free up everything that
we loaded, however we don't reset the total_bitmaps counter or the thresholds or
anything. So in btrfs_remove_free_space_cache make sure to call free_bitmap()
if it's a bitmap, this will keep us from panicing when we check to make sure we
don't have too many bitmaps. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since commit dc89e98244, we've changed
to use a specific slab for alocation of free_space items.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The check on the return value of kmalloc() is added to some places.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
the space cache use extent_readpages() to read free space information,
so we can not use GFP_KERNEL flag to allocate memory, or it may lead
to deadlock.
Signed-off-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It is necessary to unlock mutex_lock before it return an error when
btrfs_alloc_path() fails.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This is similar to block group caching.
We dedicate a special inode in fs tree to save free ino cache.
At the very first time we create/delete a file after mount, the free ino
cache will be loaded from disk into memory. When the fs tree is commited,
the cache will be written back to disk.
To keep compatibility, we check the root generation against the generation
of the special inode when loading the cache, so the loading will fail
if the btrfs filesystem was mounted in an older kernel before.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
There's a potential problem in 32bit system when we exhaust 32bit inode
numbers and start to allocate big inode numbers, because btrfs uses
inode->i_ino in many places.
So here we always use BTRFS_I(inode)->location.objectid, which is an
u64 variable.
There are 2 exceptions that BTRFS_I(inode)->location.objectid !=
inode->i_ino: the btree inode (0 vs 1) and empty subvol dirs (256 vs 2),
and inode->i_ino will be used in those cases.
Another reason to make this change is I'm going to use a special inode
to save free ino cache, and the inode number must be > (u64)-256.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Extract out block group specific code from lookup_free_space_inode(),
create_free_space_inode(), load_free_space_cache() and
btrfs_write_out_cache(), so the code can be used to read/write
free ino cache.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Currently btrfs stores the highest objectid of the fs tree, and it always
returns (highest+1) inode number when we create a file, so inode numbers
won't be reclaimed when we delete files, so we'll run out of inode numbers
as we keep create/delete files in 32bits machines.
This fixes it, and it works similarly to how we cache free space in block
cgroups.
We start a kernel thread to read the file tree. By scanning inode items,
we know which chunks of inode numbers are free, and we cache them in
an rb-tree.
Because we are searching the commit root, we have to carefully handle the
cross-transaction case.
The rb-tree is a hybrid extent+bitmap tree, so if we have too many small
chunks of inode numbers, we'll use bitmaps. Initially we allow 16K ram
of extents, and a bitmap will be used if we exceed this threshold. The
extents threshold is adjusted in runtime.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
So we can re-use the code to cache free inode numbers.
The change is quite straightforward. Two new structures are introduced.
- struct btrfs_free_space_ctl
We move those variables that are used for caching free space from
struct btrfs_block_group_cache to this new struct.
- struct btrfs_free_space_op
We do block group specific work (e.g. calculation of extents threshold)
through functions registered in this struct.
And then we can remove references to struct btrfs_block_group_cache.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
The Btrfs submit bio threads have a small number of
threads responsible for pushing down bios we've collected
for a large number of devices.
Since we do all the bios for a single device at once,
we want to make sure we unplug and send down the bios
for each device as we're done processing them.
The new plugging API removed the btrfs code to
unplug while processing bios, this adds it back with
the new API.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: fix free space cache leak
Btrfs: avoid taking the chunk_mutex in do_chunk_alloc
Btrfs end_bio_extent_readpage should look for locked bits
Btrfs: don't force chunk allocation in find_free_extent
Btrfs: Check validity before setting an acl
Btrfs: Fix incorrect inode nlink in btrfs_link()
Btrfs: Check if btrfs_next_leaf() returns error in btrfs_real_readdir()
Btrfs: Check if btrfs_next_leaf() returns error in btrfs_listxattr()
Btrfs: make uncache_state unconditional
btrfs: using cached extent_state in set/unlock combinations
Btrfs: avoid taking the trans_mutex in btrfs_end_transaction
Btrfs: fix subvolume mount by name problem when default mount subvolume is set
fix user annotation in ioctl.c
Btrfs: check for duplicate iov_base's when doing dio reads
btrfs: properly handle overlapping areas in memmove_extent_buffer
Btrfs: fix memory leaks in btrfs_new_inode()
Btrfs: check for duplicate iov_base's when doing dio reads
Btrfs: reuse the extent_map we found when calling btrfs_get_extent
Btrfs: do not use async submit for small DIO io's
Btrfs: don't split dio bios if we don't have to
...
The free space caching code was recently reworked to
cache all the pages it needed instead of using find_get_page everywhere.
One loop was missed though, so it ended up leaking pages. This fixes
it to use our page array instead of find_get_page.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everytime we try to allocate disk space we try and see if we can pre-emptively
allocate a chunk, but in the common case we don't allocate anything, so there is
no sense in taking the chunk_mutex at all. So instead if we are allocating a
chunk, mark it in the space_info so we don't get two people trying to allocate
at the same time. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Liu Bo <liubo2009@cn.fujitsu.com>
A recent commit caches the extent state in end_bio_extent_readpage,
but the search it does should look for locked extents. This
fixes things to make it more effective.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_free_extent likes to allocate in contiguous clusters,
which makes writeback faster, especially on SSD storage. As
the FS fragments, these clusters become harder to find and we have
to decide between allocating a new chunk to make more clusters
or giving up on the cluster to allocate from the free space
we have.
Right now it creates too many chunks, and you can end up with
a whole FS that is mostly empty metadata chunks. This commit
changes the allocation code to be more strict and only
allocate new chunks when we've made good use of the chunks we
already have.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Call posix_acl_valid() to check if an acl is valid or not.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Link count of the inode is not decreased if btrfs_set_inode_index()
fails.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Singed-off-by: Li Zefan <lizf@cn.fujitsu.com>
btrfs_next_leaf() can return -errno, and we should propagate
it to userspace.
This also simplifies how we walk the btree path.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
btrfs_next_leaf() can return -errno, and we should propagate
it to userspace.
This also simplifies how we walk the btree path.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
The extent_io code can take cached pointers into the extent state trees,
and these can make lookups much faster in common operations. The
caching only happens when specific bits are set that prevent merging
and splitting of the extent state.
A help function was added to uncache the state, and it was testing
the same set of conditionals. This can leak in very strange corner
cases where the lock bit goes away unexpectedly.
The uncaching should be unconditional. Once we have a ref on the
extent we should always give it up.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In several places the sequence (set_extent_uptodate, unlock_extent) is used.
This leads to a duplicate lookup of the extent state. This patch lets
set_extent_uptodate return a cached extent_state which can be passed to
unlock_extent_cached.
The occurences of the above sequences are updated to use the cache. Only
end_bio_extent_readpage is updated that it first gets a cached state to
pass it to the readpage_end_io_hook as the prototype requested and is later
on being used for set/unlock.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I've been working on making our O_DIRECT latency not suck and I noticed we were
taking the trans_mutex in btrfs_end_transaction. So to do this we convert
num_writers and use_count to atomic_t's and just decrement them in
btrfs_end_transaction. Instead of deleting the transaction from the trans list
in put_transaction we do that in btrfs_commit_transaction() since that's the
only time it actually needs to be removed from the list. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We create two subvolumes (meego_root and meego_home) in
btrfs root directory. And set meego_root as default mount
subvolume. After we remount btrfs, meego_root is mounted
to top directory by default. Then when we try to mount
meego_home (subvol=meego_home) to a subdirectory, it failed.
The problem is when default mount subvolume is set to
meego_root, we search meego_home in meego_root but can not find
it. So the solution is to add a new mount option (subvolrootid)
to specify subvol id of root and search subvol name in it. For
our case, now we can use "-o subvolrootid=0,subvol=meego_home)
to mount meego_home.
Detail information can be found in meego bugzilla:
https://bugs.meego.com/show_bug.cgi?id=15055
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets. This is what Windows does under qemu. The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine. So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch fixes memory leaks in btrfs_new_inode().
Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Apparently it is ok to submit a read to an IDE device with the same target page
for different offsets. This is what Windows does under qemu. The problem is
under DIO we expect them to be different buffers for checksumming reasons, and
so this sort of thing will result in checksum errors, when in reality the file
is fine. So when reading, check to make sure that all iov bases are different,
and if they aren't fall back to buffered mode, since that will work out right.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In btrfs_get_block_direct we call btrfs_get_extent to lookup the extent for the
range that we are looking for. If we don't find an extent, btrfs_get_extent
will insert a extent_map for that area and mark it as a hole. So it does the
job of allocating a new extent map and inserting it into the io tree. But if
we're creating a new extent we free it up and redo all of that work. So instead
pass the em to btrfs_new_extent_direct(), and if it will work just allocate the
disk space and set it up properly and bypass the freeing/allocating of a new
extent map and the expensive operation of inserting the thing into the io_tree.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When looking at our DIO performance Chris said that for small IO's doing the
async submit stuff tends to be more overhead than it's worth. With this on top
of my other fixes I get about a 17-20% speedup doing a sequential dd with 4k
IO's. Basically if we don't have to split the bio for the map length it's small
enough to be directly submitted, otherwise go back to the async submit. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We have been unconditionally allocating a new bio and re-adding all pages from
our original bio to the new bio. This is needed if our original bio is larger
than our stripe size, but if it is smaller than the stripe size then there is no
need to do this. So check the map length and if we are under that then go ahead
and submit the original bio. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In the DIO code we often don't update the i_disk_size because the i_size isn't
updated until after the DIO is completed, so basically we are allocating a path,
doing a search, and updating the inode item for no reason since nothing changed.
btrfs_ordered_update_i_size will return 1 if it didn't update i_disk_size, so
only run btrfs_update_inode if btrfs_ordered_update_i_size returns 0. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Instead of calling kmap_atomic for every thing we set in the inode item, map the
entire inode item at the start and unmap it at the end. This makes a sequential
dd of 400mb O_DIRECT something like 1% faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
I saw a lockup where we kept getting into this start transaction->commit
transaction loop because of enospce. The fact is if we fail to make our
reservation, we've tried _everything_ several times, so we only need to try and
commit the transaction once, and if that doesn't work then we really are out of
space and need to just exit. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we don't handle running out of space in the cache, so to fix this we
keep track of how far in the cache we are. Then we only dirty the pages if we
successfully modify all of them, otherwise if we have an error or run out of
space we can just drop them and not worry about the vm writing them out.
Thanks,
Tested-by Johannes Hirte <johannes.hirte@fem.tu-ilmenau.de>
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: don't warn in btrfs_add_orphan
Btrfs: fix free space cache when there are pinned extents and clusters V2
Btrfs: Fix uninitialized root flags for subvolumes
btrfs: clear __GFP_FS flag in the space cache inode
Btrfs: fix memory leak in start_transaction()
Btrfs: fix memory leak in btrfs_ioctl_start_sync()
Btrfs: fix subvol_sem leak in btrfs_rename()
Btrfs: Fix oops for defrag with compression turned on
Btrfs: fix /proc/mounts info.
Btrfs: fix compiler warning in file.c
When I moved the orphan adding to btrfs_truncate I missed the fact that during
orphan cleanup we just add the orphan items to the orphan list without going
through btrfs_orphan_add, which results in lots of warnings on mount if you have
any orphan items that need to be truncated. Just remove this warning since it's
ok, this will allow all of the normal space accounting take place. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I noticed a huge problem with the free space cache that was presenting
as an early ENOSPC. Turns out when writing the free space cache out I
forgot to take into account pinned extents and more importantly
clusters. This would result in us leaking free space everytime we
unmounted the filesystem and remounted it.
I fix this by making sure to check and see if the current block group
has a cluster and writing out any entries that are in the cluster to the
cache, as well as writing any pinned extents we currently have to the
cache since those will be available for us to use the next time the fs
mounts.
This patch also adds a check to the end of load_free_space_cache to make
sure we got the right amount of free space cache, and if not make sure
to clear the cache and re-cache the old fashioned way.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
root_item->flags and root_item->byte_limit are not initialized when
a subvolume is created. This bug is not revealed until we added
readonly snapshot support - now you mount a btrfs filesystem and you
may find the subvolumes in it are readonly.
To work around this problem, we steal a bit from root_item->inode_item->flags,
and use it to indicate if those fields have been properly initialized.
When we read a tree root from disk, we check if the bit is set, and if
not we'll set the flag and initialize the two fields of the root item.
Reported-by: Andreas Philipp <philipp.andreas@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Andreas Philipp <philipp.andreas@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
the object id of the space cache inode's key is allocated from the relative
root, just like the regular file. So we can't identify space cache inode by
checking the object id of the inode's key, and we have to clear __GFP_FS flag
at the time we look up the space cache inode.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Free btrfs_trans_handle when join_transaction() fails
in start_transaction()
Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_rename() does not release the subvol_sem if the transaction failed to start.
Signed-off-by: Johann Lombardi <johann@whamcloud.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we defrag a file, whose size can be fit into an inline extent,
with compression enabled, the compress type is set to be
fs_info->compress_type, which is 0 if the btrfs filesystem is mounted
without compress option. This leads to oops.
Reported-by: Daniel Blueman <daniel.blueman@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Some mount options are not displayed by /proc/mounts.
This patch displays the option such as compress_type by /proc/mounts.
Ex.
[before]
$ mount | grep sdc2
/dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo)
$ cat /proc/mounts | grep sdc2
/dev/sdc2 /test12 btrfs rw,relatime,compress 0 0
[after]
$ mount | grep sdc2
/dev/sdc2 on /test12 type btrfs (rw,space_cache,compress=lzo)
$ cat /proc/mounts | grep sdc2
/dev/sdc2 /test12 btrfs rw,relatime,compress=lzo,space_cache 0 0
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
While compiling Btrfs, I got following messages:
CC [M] fs/btrfs/file.o
fs/btrfs/file.c: In function '__btrfs_buffered_write':
fs/btrfs/file.c:909: warning: 'ret' may be used uninitialized in this function
CC [M] fs/btrfs/tree-defrag.o
This patch fixes compiler warning.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus-unmerged' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (45 commits)
Btrfs: fix __btrfs_map_block on 32 bit machines
btrfs: fix possible deadlock by clearing __GFP_FS flag
btrfs: check link counter overflow in link(2)
btrfs: don't mess with i_nlink of unlocked inode in rename()
Btrfs: check return value of btrfs_alloc_path()
Btrfs: fix OOPS of empty filesystem after balance
Btrfs: fix memory leak of empty filesystem after balance
Btrfs: fix return value of setflags ioctl
Btrfs: fix uncheck memory allocations
btrfs: make inode ref log recovery faster
Btrfs: add btrfs_trim_fs() to handle FITRIM
Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes
Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP
Btrfs: make update_reserved_bytes() public
btrfs: return EXDEV when linking from different subvolumes
Btrfs: Per file/directory controls for COW and compression
Btrfs: add datacow flag in inode flag
btrfs: use GFP_NOFS instead of GFP_KERNEL
Btrfs: check return value of read_tree_block()
btrfs: properly access unaligned checksum buffer
...
Fix up trivial conflicts in fs/btrfs/volumes.c due to plug removal in
the block layer.
Recent changes for discard support didn't compile,
this fixes them not to try and % 64 bit numbers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Using the GFP_HIGHUSER_MOVABLE flag to allocate the metadata's page may cause
deadlock.
Task1
open()
...
btrfs_search_slot()
...
btrfs_cow_block()
...
alloc_page()
wait for reclaiming
shrink_slab()
...
shrink_icache_memory()
...
btrfs_evict_inode()
...
btrfs_search_slot()
If the path is locked by task1, the deadlock happens.
So the btree's page cache is different with the file's page cache, it can not
allocate pages by GFP_HIGHUSER_MOVABLE flag, we must clear __GFP_FS flag in
GFP_HIGHUSER_MOVABLE flag.
Reported-by: Itaru Kitayama <kitayama@cl.bb4u.ne.jp>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
old_inode is not locked; it's not safe to play with its link
count. Instead of bumping it and calling btrfs_unlink_inode(),
add a variant of the latter that does not do btrfs_drop_nlink()/
btrfs_update_inode(), call it instead of btrfs_inc_nlink()/
btrfs_unlink_inode() and do btrfs_update_inode() ourselves.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Adding the check on the return value of btrfs_alloc_path() to several places.
And, some of callers are modified by this change.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs will remove unused block groups after balance.
When a empty filesystem is balanced, the block group with tag "DATA" may be
dropped, and after umount and mount again, it will not find "DATA" space_info
and lead to OOPS.
So we initial the necessary space_infos(DATA, SYSTEM, METADATA) to avoid OOPS.
Reported-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After Josef's patch(commit 3c14874acc),
btrfs will exclude super bytes when reading block groups(by marking a extent
state UPTODATE). However, these bytes do not get freed while balance remove
unused block groups, and we won't process those removed ones any more, when
we do umount and unload the btrfs module, btrfs hits a memory leak.
This patch add the missing free operation.
Reproduce steps:
$ mkfs.btrfs disk
$ mount disk /mnt/btrfs -o loop
$ btrfs filesystem balance /mnt/btrfs
$ umount /mnt/btrfs
$ rmmod btrfs
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
setflags ioctl should return error when any checks fail.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To make Btrfs code more robust, several return value checks where memory
allocation can fail are introduced. I use BUG_ON where I don't know how
to handle the error properly, which increases the number of using the
notorious BUG_ON, though.
Signed-off-by: Yoshinori Sano <yoshinori.sano@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we recover from crash via write-ahead log tree and process
the inode refs, for each btrfs_inode_ref item, we will
1) check if we already have a perfect match in fs/file tree, if
we have, then we're done.
2) search the corresponding back reference in fs/file tree, and
check all the names in this back reference to see if they are
also in the log to avoid conflict corners.
3) recover the logged inode refs to fs/file tree.
In current btrfs, however,
- for 2)'s check, once is enough, since the checked back reference
will remain unchanged after processing all the inode refs belonged
to the key.
- it has no need to do another 1) between 2) and 3).
I've made a small test to show how it improves,
$dd if=/dev/zero of=foobar bs=4K count=1
$sync
$make 100 hard links continuously, like ln foobar link_i
$fsync foobar
$echo b > /proc/sysrq-trigger
after reboot
$time mount DEV PATH
without patch:
real 0m0.285s
user 0m0.001s
sys 0m0.009s
with patch:
real 0m0.123s
user 0m0.000s
sys 0m0.010s
Changelog v1->v2:
- fix double free - pointed by David Sterba
Changelog v2->v3:
- adjust free order
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We take an free extent out from allocator, trim it, then put it back,
but before we trim the block group, we should make sure the block group is
cached, so plus a little change to make cache_block_group() run without a
transaction.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Callers of btrfs_discard_extent() should check if we are mounted with -o discard,
as we want to make fitrim to work even the fs is not mounted with -o discard.
Also we should use REQ_DISCARD to map the free extent to get a full mapping,
last we only return errors if
1. the error is not a EOPNOTSUPP
2. no device supports discard
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_map_block() will only return a single stripe length, but we want the
full extent be mapped to each disk when we are trimming the extent,
so we add length to btrfs_bio_stripe and fill it if we are mapping for REQ_DISCARD.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Make the function public as we should update the reserved extents calculations
after taking out an extent for trimming.
Signed-off-by: Li Dongyang <lidongyang@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_link returns EPERM if a cross-subvolume link is attempted.
However, in this case I believe EXDEV to be the more appropriate value.
>From the link(2) man page:
EXDEV oldpath and newpath are not on the same mounted file system. (Linux
permits a file system to be mounted at multiple points, but link()
does not work across different mount points, even if the same file
system is mounted on both.)
This matters because an application may have different behaviors based on
return codes.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Data compression and data cow are controlled across the entire FS by mount
options right now. ioctls are needed to set this on a per file or per
directory basis. This has been proposed previously, but VFS developers
wanted us to use generic ioctls rather than btrfs-specific ones.
According to Chris's comment, there should be just one true compression
method(probably LZO) stored in the super. However, before this, we would
wait for that one method is stable enough to be adopted into the super.
So I list it as a long term goal, and just store it in ram today.
After applying this patch, we can use the generic "FS_IOC_SETFLAGS" ioctl to
control file and directory's datacow and compression attribute.
NOTE:
- The compression type is selected by such rules:
If we mount btrfs with compress options, ie, zlib/lzo, the type is it.
Otherwise, we'll use the default compress type (zlib today).
v1->v2:
- rebase to the latest btrfs.
v2->v3:
- fix a problem, i.e. when a file is set NOCOW via mount option, then this NOCOW
will be screwed by inheritance from parent directory.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In the filesystem context, we must allocate memory by GFP_NOFS,
or we may start another filesystem operation and make kswap thread hang up.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch is checking return value of read_tree_block(),
and if it is NULL, error processing.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On Fri, Mar 18, 2011 at 11:56:53AM -0400, Chris Mason wrote:
> Thanks for fielding this one. Does put_unaligned_le32 optimize away on
> platforms with efficient access? It would be great if we didn't need
> the #ifdef.
(quicktest: assembly output is same for put_unaligned_le32 and direct
assignment on my x86_64)
I was originally following examples in
Documentation/unaligned-memory-access.txt. From other code it seems to me that
the define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is intended for larger
portions of code. Macros/wrappers for {put,get}_unaligned* are chosen via
arch/<arch>/include/asm/unaligned.h accordingly, therefore it's safe to use
put_unaligned_le32 without the ifdef.
dave
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch changes some BUG_ON() to the error return.
(but, most callers still use BUG_ON())
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tracepoints can provide insight into why btrfs hits bugs and be greatly
helpful for debugging, e.g
dd-7822 [000] 2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
dd-7822 [000] 2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
btrfs-transacti-7804 [001] 2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
btrfs-transacti-7804 [001] 2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
flush-btrfs-2-7821 [001] 2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
flush-btrfs-2-7821 [001] 2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
flush-btrfs-2-7821 [001] 2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
flush-btrfs-2-7821 [000] 2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
btrfs-endio-wri-7800 [001] 2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
Here is what I have added:
1) ordere_extent:
btrfs_ordered_extent_add
btrfs_ordered_extent_remove
btrfs_ordered_extent_start
btrfs_ordered_extent_put
These provide critical information to understand how ordered_extents are
updated.
2) extent_map:
btrfs_get_extent
extent_map is used in both read and write cases, and it is useful for tracking
how btrfs specific IO is running.
3) writepage:
__extent_writepage
btrfs_writepage_end_io_hook
Pages are cirtical resourses and produce a lot of corner cases during writeback,
so it is valuable to know how page is written to disk.
4) inode:
btrfs_inode_new
btrfs_inode_request
btrfs_inode_evict
These can show where and when a inode is created, when a inode is evicted.
5) sync:
btrfs_sync_file
btrfs_sync_fs
These show sync arguments.
6) transaction:
btrfs_transaction_commit
In transaction based filesystem, it will be useful to know the generation and
who does commit.
7) back reference and cow:
btrfs_delayed_tree_ref
btrfs_delayed_data_ref
btrfs_delayed_ref_head
btrfs_cow_block
Btrfs natively supports back references, these tracepoints are helpful on
understanding btrfs's COW mechanism.
8) chunk:
btrfs_chunk_alloc
btrfs_chunk_free
Chunk is a link between physical offset and logical offset, and stands for space
infomation in btrfs, and these are helpful on tracing space things.
9) reserved_extent:
btrfs_reserved_extent_alloc
btrfs_reserved_extent_free
These can show how btrfs uses its space.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The pointer to the extent buffer for the root of each tree
is protected by a spinlock so that we can safely read the pointer
and take a reference on the extent buffer.
But now that the extent buffers are freed via RCU, we can safely
use rcu_read_lock instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I noticed that dio_end_io calls the appropriate endio function with an error,
but the endio functions don't actually do anything with that error, they assume
that if there was an error then the bio will not be uptodate. So if we had
checksum failures we would never pass back EIO. So if there is an error in our
endio functions make sure to clear the uptodate flag on the bio. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
When doing direct writes we store the checksums in the ordered sum stuff in the
ordered extent for writing them when the write completes, so we don't even use
the dip->csums array. So if we're writing, don't bother allocating dip->csums
since we won't use it anyway. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch makes the free space cluster refilling code a little easier to
understand, and fixes some things with the bitmap part of it. Currently we
either want to refill a cluster with
1) All normal extent entries (those without bitmaps)
2) A bitmap entry with enough space
The current code has this ugly jump around logic that will first try and fill up
the cluster with extent entries and then if it can't do that it will try and
find a bitmap to use. So instead split this out into two functions, one that
tries to find only normal entries, and one that tries to find bitmaps.
This also fixes a suboptimal thing we would do with bitmaps. If we used a
bitmap we would just tell the cluster that we were pointing at a bitmap and it
would do the tree search in the block group for that entry every time we tried
to make an allocation. Instead of doing that now we just add it to the clusters
group.
I tested this with my ENOSPC tests and xfstests and it survived.
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
Documentation/iostats.txt: bit-size reference etc.
cfq-iosched: removing unnecessary think time checking
cfq-iosched: Don't clear queue stats when preempt.
blk-throttle: Reset group slice when limits are changed
blk-cgroup: Only give unaccounted_time under debug
cfq-iosched: Don't set active queue in preempt
block: fix non-atomic access to genhd inflight structures
block: attempt to merge with existing requests on plug flush
block: NULL dereference on error path in __blkdev_get()
cfq-iosched: Don't update group weights when on service tree
fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
block: Require subsystems to explicitly allocate bio_set integrity mempool
jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
fs: make fsync_buffers_list() plug
mm: make generic_writepages() use plugging
blk-cgroup: Add unaccounted time to timeslice_used.
block: fixup plugging stubs for !CONFIG_BLOCK
block: remove obsolete comments for blkdev_issue_zeroout.
blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
...
Fix up conflicts in fs/{aio.c,super.c}
And give it a kernel-doc comment.
[akpm@linux-foundation.org: btrfs changed in linux-next]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of always creating a huge (268K) deflate_workspace with the
maximum compression parameters (windowBits=15, memLevel=8), allow the
caller to obtain a smaller workspace by specifying smaller parameter
values.
For example, when capturing oops and panic reports to a medium with
limited capacity, such as NVRAM, compression may be the only way to
capture the whole report. In this case, a small workspace (24K works
fine) is a win, whether you allocate the workspace when you need it (i.e.,
during an oops or panic) or at boot time.
I've verified that this patch works with all accepted values of windowBits
(positive and negative), memLevel, and compression level.
Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: David Miller <davem@davemloft.net>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have been creating bitmaps for small extents unconditionally forever. This
was great when testing to make sure the bitmap stuff was working, but is
overkill normally. So instead of always adding small chunks of free space to
bitmaps, only start doing it if we go past half of our extent threshold. This
will keeps us from creating a bitmap for just one small free extent at the front
of the block group, and will make the allocator a little faster as a result.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We do all this fun stuff with min_bytes, but either don't use it in the case of
just normal extents, or use it completely wrong in the case of bitmaps. So fix
this for both cases
1) In the extent case, stop looking for space with window_free >= min_bytes
instead of bytes + empty_size.
2) In the bitmap case, we were looking for streches of free space that was at
least min_bytes in size, which was not right at all. So instead search for
stretches of free space that are at least bytes in size (this will make a
difference when we have > page size blocks) and then only search for min_bytes
amount of free space.
Thanks,
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
The free space cluster stuff is heavy duty, so there is no sense in going
through the entire song and dance if there isn't enough space in the block group
to begin with. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits)
doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore
Update cpuset info & webiste for cgroups
dcdbas: force SMI to happen when expected
arch/arm/Kconfig: remove one to many l's in the word.
asm-generic/user.h: Fix spelling in comment
drm: fix printk typo 'sracth'
Remove one to many n's in a word
Documentation/filesystems/romfs.txt: fixing link to genromfs
drivers:scsi Change printk typo initate -> initiate
serial, pch uart: Remove duplicate inclusion of linux/pci.h header
fs/eventpoll.c: fix spelling
mm: Fix out-of-date comments which refers non-existent functions
drm: Fix printk typo 'failled'
coh901318.c: Change initate to initiate.
mbox-db5500.c Change initate to initiate.
edac: correct i82975x error-info reported
edac: correct i82975x mci initialisation
edac: correct commented info
fs: update comments to point correct document
target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c
...
Trivial conflict in fs/eventpoll.c (spelling vs addition)
We need to make sure the dir items we get are valid dir items. So any time we
try and read one check it with verify_dir_item, which will do various sanity
checks to make sure it looks sane. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Doing an audit of where we use btrfs_search_slot only showed one place where we
don't check the return value of btrfs_search_slot properly. Just fix
mark_extent_written to see if btrfs_search_slot failed and act accordingly.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently if we have corrupted items things will blow up in spectacular ways.
So as we read in blocks and they are leaves, check the entire leaf to make sure
all of the items are correct and point to valid parts in the leaf for the item
data the are responsible for. If the item is corrupt we will kick back EIO and
not read any of the copies since they are likely to not be correct either. This
will catch generic corruptions, it will be up to the individual callers of
btrfs_search_slot to make sure their items are right. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently if we have corrupt metadata map_extent_buffer will complain about it,
but not return an error so the caller has no idea a problem was hit. Fix this.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Everytime I have to deal with btrfs_cont_expand I stare at it for 20 minutes
trying to remember what exactly it does and why the hell we need it. So add a
comment to save future-Josef some time. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Mark_inode_dirty will call btrfs_dirty_inode which will take care of updating
the inode. This makes setsize a little cleaner since we don't have to start a
transaction and update the inode in there, we can just call mark_inode_dirty.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We don't need an orphan item when expanding files, we just need them for
truncating them, so only add the orphan item in btrfs_truncate instead of in
btrfs_setsize. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This fixes a problem where if truncate fails the inode will still be on the in
memory orphan list. This is will make us complain when the inode gets destroyed
because it's still on the orphan list. So if we fail just remove us from the in
memory list and carry on.
Signed-off-by: Josef Bacik <josef@redhat.com>
If we cannot truncate an inode for some reason we will never delete the orphan
item associated with that inode, which means that we will loop forever in
btrfs_orphan_cleanup. Instead of doing this just return error so we fail to
mount. It sucks, but hey it's better than hanging. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Now that we can handle having errors in the truncate path lets make sure we
return errors instead of doing BUG_ON() and such. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
->truncate() is going away, instead all of the work needs to be done in
->setattr(). So this converts us over to do this. It's fairly straightforward,
just get rid of our .truncate inode operation and call btrfs_truncate() directly
from btrfs_setsize. This works out better for us since truncate can technically
return ENOSPC, and before we had no way of letting anybody know. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since we alloc/free free space entries a whole lot, lets use a slab to keep
track of them. This makes some of my tests slightly faster. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
We track delayed allocation per inodes via 2 counters, one is
outstanding_extents and reserved_extents. Outstanding_extents is already an
atomic_t, but reserved_extents is not and is protected by a spinlock. So
convert this to an atomic_t and instead of using a spinlock, use atomic_cmpxchg
when releasing delalloc bytes. This makes our inode 72 bytes smaller, and
reduces locking overhead (albiet it was minimal to begin with). Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Really we don't need to memset the pages array at all, since we know how many
pages we're going to use in the array and pass that around. So don't memset,
just trust we're not idiots and we pass num_pages around properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Our aio_write function is huge and kind of hard to follow at times. So this
patch fixes this by breaking out the buffered and direct write paths out into
seperate functions so it's a little clearer what's going on. I've also fixed
some wrong typing that we had and added the ability to handle getting an error
back from btrfs_set_extent_delalloc. Tested this with xfstests and everything
came out fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (33 commits)
AppArmor: kill unused macros in lsm.c
AppArmor: cleanup generated files correctly
KEYS: Add an iovec version of KEYCTL_INSTANTIATE
KEYS: Add a new keyctl op to reject a key with a specified error code
KEYS: Add a key type op to permit the key description to be vetted
KEYS: Add an RCU payload dereference macro
AppArmor: Cleanup make file to remove cruft and make it easier to read
SELinux: implement the new sb_remount LSM hook
LSM: Pass -o remount options to the LSM
SELinux: Compute SID for the newly created socket
SELinux: Socket retains creator role and MLS attribute
SELinux: Auto-generate security_is_socket_class
TOMOYO: Fix memory leak upon file open.
Revert "selinux: simplify ioctl checking"
selinux: drop unused packet flow permissions
selinux: Fix packet forwarding checks on postrouting
selinux: Fix wrong checks for selinux_policycap_netpeer
selinux: Fix check for xfrm selinux context algorithm
ima: remove unnecessary call to ima_must_measure
IMA: remove IMA imbalance checking
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (57 commits)
tidy the trailing symlinks traversal up
Turn resolution of trailing symlinks iterative everywhere
simplify link_path_walk() tail
Make trailing symlink resolution in path_lookupat() iterative
update nd->inode in __do_follow_link() instead of after do_follow_link()
pull handling of one pathname component into a helper
fs: allow AT_EMPTY_PATH in linkat(), limit that to CAP_DAC_READ_SEARCH
Allow passing O_PATH descriptors via SCM_RIGHTS datagrams
readlinkat(), fchownat() and fstatat() with empty relative pathnames
Allow O_PATH for symlinks
New kind of open files - "location only".
ext4: Copy fs UUID to superblock
ext3: Copy fs UUID to superblock.
vfs: Export file system uuid via /proc/<pid>/mountinfo
unistd.h: Add new syscalls numbers to asm-generic
x86: Add new syscalls for x86_64
x86: Add new syscalls for x86_32
fs: Remove i_nlink check from file system link callback
fs: Don't allow to create hardlink for deleted file
vfs: Add open by file handle support
...
Now that VFS check for inode->i_nlink == 0 and returns proper
error, remove similar check from file system
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: break out of shrink_delalloc earlier
btrfs: fix not enough reserved space
btrfs: fix dip leak
Btrfs: make sure not to return overlapping extents to fiemap
Btrfs: deal with short returns from copy_from_user
Btrfs: fix regressions in copy_from_user handling
Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.
But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations. The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made. This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.
The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.
This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down. This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.
If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.
Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full. The
other writers are able to continue until we get 100%.
This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_link() will insert 3 items(inode ref, dir name item and dir index item)
into the b+ tree and update 2 items(its inode, and parent's inode) in the b+
tree. So we should reserve space for these 5 items, not 3 items.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs DIO code leaks dip structs when dip->csums allocation
fails; bio->bi_end_io isn't set at the point where the free_ordered
branch is consequently taken, thus bio_endio doesn't call the function
which would free it in the normal case. Fix.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With the plugging now being explicitly controlled by the
submitter, callers need not pass down unplugging hints
to the block layer. If they want to unplug, it's because they
manually plugged on their own - in which case, they should just
unplug at will.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The btrfs fiemap code was incorrectly returning duplicate or overlapping
extents in some cases. cp was blindly trusting this result and we would
end up with a destination file that was bigger than the original because
some bytes were copied twice.
The fix here adjusts our offsets to make sure we're always moving
forward in the fiemap results.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When copy_from_user is only able to copy some of the bytes we requested,
we may end up creating a partially up to date page. To avoid garbage in
the page, we need to treat a partial copy as a zero length copy.
This makes the rest of the file_write code drop the page and
retry the whole copy instead of marking the partially up to
date page as dirty.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
cc: stable@kernel.org
Commit 914ee295af fixed deadlocks in
btrfs_file_write where we would catch page faults on pages we had
locked.
But, there were a few problems:
1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
data when the amount to copy is more than 4K and the offset to start
copying from is not page aligned. The result was btrfs_file_write
looping forever retrying the iov_iter_copy_from_user_atomic
We deal with this by changing btrfs_file_write to drop down to single
page copies when iov_iter_copy_from_user_atomic starts returning failure.
2) The btrfs_file_write code was leaking delalloc reservations when
iov_iter_copy_from_user_atomic returned zero. The looping above would
result in the entire filesystem running out of delalloc reservations and
constantly trying to flush things to disk.
3) btrfs_file_write will lock down page cache pages, make sure
any writeback is finished, do the copy_from_user and then release them.
Before the loop runs we check the first and last pages in the write to
see if they are only being partially modified. If the start or end of
the write isn't aligned, we make sure the corresponding pages are
up to date so that we don't introduce garbage into the file.
With the copy_from_user changes, we're allowing the VM to reclaim the
pages after a partial update from copy_from_user, but we're not
making sure the page cache page is up to date when we loop around to
resume the write.
We deal with this by pushing the up to date checks down into the page
prep code. This fits better with how the rest of file_write works.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
cc: stable@kernel.org
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: fix fiemap bugs with delalloc
Btrfs: set FMODE_EXCL in btrfs_device->mode
Btrfs: make btrfs_rm_device() fail gracefully
Btrfs: Avoid accessing unmapped kernel address
Btrfs: Fix BTRFS_IOC_SUBVOL_SETFLAGS ioctl
Btrfs: allow balance to explicitly allocate chunks as it relocates
Btrfs: put ENOSPC debugging under a mount option
The Btrfs fiemap code wasn't properly returning delalloc extents,
so applications that trust fiemap to decide if there are holes in the
file see holes instead of delalloc.
This reworks the btrfs fiemap code, adding a get_extent helper that
searches for delalloc ranges and also adding a helper for extent_fiemap
that skips past holes in the file.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This fixes a bug introduced in d4d77629, where the device added online
(and therefore initialized via btrfs_init_new_device()) would be left
with the positive bdev->bd_holders after unmount. Since d4d77629 we no
longer OR FMODE_EXCL explicitly on blkdev_put(), set it in
btrfs_device->mode.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If shrinking done as part of the online device removal fails add that
device back to the allocation list and increment the rw_devices counter.
This fixes two bugs:
1) we could have a perfectly good device out of alloc list for no good
reason;
2) in the btrfs consisting of two devices, failure in btrfs_rm_device()
could lead to a situation where it was impossible to remove any of the
devices because of the "unable to remove the only writeable device"
error.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When decompressing a chunk of data, we'll copy the data out to
a working buffer if the data is stored in more than one page,
otherwise we'll use the mapped page directly to avoid memory
copy.
In the latter case, we'll end up accessing the kernel address
after we've unmapped the page in a corner case.
Reported-by: Juan Francisco Cantero Hurtado <iam@juanfra.info>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- Check user-specified flags correctly
- Check the inode owership
- Search root item in root tree but not fs tree
Reported-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs device shrinking and balancing ends up reallocating all the blocks
in order to allow COW to move them to new destinations. It is somewhat
awkward in terms of ENOSPC because most of the enospc code is built
around the idea that some operation on a reference counted tree triggers
allocations in the non-reference counted trees.
This commit changes the balancing code to deal with enospc by trying to
allocate a new chunk. If that allocation succeeds, we go ahead and
retry whatever failed due to enospc.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
ENOSPC in btrfs is getting to the point where the extra debugging isn't
required. I've put it under mount -o enospc_debug just in case someone
is having difficult problems.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I add the check on the return value of alloc_extent_map() to several places.
In addition, alloc_extent_map() returns only the address or NULL.
Therefore, check by IS_ERR() is unnecessary. So, I remove IS_ERR() checking.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Memory allocated by calling kstrdup() should be freed.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Commit bf5fc093c5 refactored
btrfs_ioctl_space_info() and introduced several security issues.
space_args.space_slots is an unsigned 64-bit type controlled by a
possibly unprivileged caller. The comparison as a signed int type
allows providing values that are treated as negative and cause the
subsequent allocation size calculation to wrap, or be truncated to 0.
By providing a size that's truncated to 0, kmalloc() will return
ZERO_SIZE_PTR. It's also possible to provide a value smaller than the
slot count. The subsequent loop ignores the allocation size when
copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR.
The fix changes the slot count type and comparison typecast to u64,
which prevents truncation or signedness errors, and also ensures that we
don't copy more data than we've allocated in the subsequent loop. Note
that zero-size allocations are no longer possible since there is already
an explicit check for space_args.space_slots being 0 and truncation of
this value is no longer an issue.
Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
Signed-off-by: Josef Bacik <josef@redhat.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Mark the cloned backref_node as checked in clone_backref_node()
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs tracks uptodate state in an rbtree as well as in the
page bits. This is supposed to enable us to use block sizes other than
the page size, but there are a few parts still missing before that
completely works.
But, our readpage routine trusts this additional range based tracking
of uptodateness, much in the same way the buffer head up to date bits
are trusted for the other filesystems.
The problem is that sometimes we need to allocate memory in order to
split records in the rbtree, even when we are just clearing bits. This
can be difficult when our clearing function is called GFP_ATOMIC, which
can happen in the releasepage path.
So, what happens today looks like this:
releasepage called with GFP_ATOMIC
btrfs_releasepage calls clear_extent_bit
clear_extent_bit fails to allocate ram, leaving the up to date bit set
btrfs_releasepage returns success
The end result is the page being gone, but btrfs thinking the range is
up to date. Later on if someone tries to read that same page, the
btrfs readpage code will return immediately thinking the page is already
up to date.
This commit fixes things to fail the releasepage when we can't clear the
extent state bits. It covers both data pages and metadata tree blocks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a race where btrfs_releasepage can drop the
page->private contents just as alloc_extent_buffer is setting
up pages for metadata. Because of how the Btrfs page flags work,
this results in us skipping the crc on the page during IO.
This patch sovles the race by waiting until after the extent buffer
is inserted into the radix tree before it sets page private.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (33 commits)
Btrfs: Fix page count calculation
btrfs: Drop __exit attribute on btrfs_exit_compress
btrfs: cleanup error handling in btrfs_unlink_inode()
Btrfs: exclude super blocks when we read in block groups
Btrfs: make sure search_bitmap finds something in remove_from_bitmap
btrfs: fix return value check of btrfs_start_transaction()
btrfs: checking NULL or not in some functions
Btrfs: avoid uninit variable warnings in ordered-data.c
Btrfs: catch errors from btrfs_sync_log
Btrfs: make shrink_delalloc a little friendlier
Btrfs: handle no memory properly in prepare_pages
Btrfs: do error checking in btrfs_del_csums
Btrfs: use the global block reserve if we cannot reserve space
Btrfs: do not release more reserved bytes to the global_block_rsv than we need
Btrfs: fix check_path_shared so it returns the right value
btrfs: check return value of btrfs_start_ioctl_transaction() properly
btrfs: fix return value check of btrfs_join_transaction()
fs/btrfs/inode.c: Add missing IS_ERR test
btrfs: fix missing break in switch phrase
btrfs: fix several uncheck memory allocations
...
take offset of start position into account when calculating page count.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
As this function is called in some error paths while not
removing the module, the __exit attribute prevents the kernel
image from linking when btrfs is compiled in statically.
Signed-off-by: Alexey Charkov <alchark@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs_alloc_path() fails, btrfs_free_path() need not be called.
Therefore, it changes the branch ahead.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This has been resulting in a BUT_ON(ret) after btrfs_reserve_extent in
btrfs_cow_file_range. The reason is we don't actually calculate the bytes_super
for a block group until we go to cache it, which means that the space_info can
hand out reservations for space that it doesn't actually have, and we can run
out of data space. This is also a problem if you are using space caching since
we don't ever calculate bytes_super for the block groups. So instead everytime
we read a block group call exclude_super_stripes, which calculates the
bytes_super for the block group so it can be left out of the space_info. Then
whenever caching completes we just call free_excluded_extents so that the super
excluded extents are freed up. Also if we are unmounting and we hit any block
groups that haven't been cached we still need to call free_excluded_extents to
make sure things are cleaned up properly. Thanks,
Reported-by: Arne Jansen <sensille@gmx.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we're cleaning up the tree log we need to be able to remove free space from
the block group. The problem is if that free space spans bitmaps we would not
find the space since we're looking for too many bytes. So make sure the amount
of bytes we search for is limited to either the number of bytes we want, or the
number of bytes left in the bitmap. This was tested by a user who was hitting
the BUG() after search_bitmap. With this patch he can now mount his fs.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
SELinux would like to implement a new labeling behavior of newly created
inodes. We currently label new inodes based on the parent and the creating
process. This new behavior would also take into account the name of the
new object when deciding the new label. This is not the (supposed) full path,
just the last component of the path.
This is very useful because creating /etc/shadow is different than creating
/etc/passwd but the kernel hooks are unable to differentiate these
operations. We currently require that userspace realize it is doing some
difficult operation like that and than userspace jumps through SELinux hoops
to get things set up correctly. This patch does not implement new
behavior, that is obviously contained in a seperate SELinux patch, but it
does pass the needed name down to the correct LSM hook. If no such name
exists it is fine to pass NULL.
Signed-off-by: Eric Paris <eparis@redhat.com>
The error check of btrfs_start_transaction() is added, and the mistake
of the error check on several places is corrected.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Because NULL is returned when the memory allocation fails,
it is checked whether it is NULL.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This one isn't really an uninit variable, but for pretty
obscure reasons. Let's make it clearly correct.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_sync_log returns -EAGAIN when we need full transaction commits
instead of small log commits, but sometimes we were dropping the return
value.
In practice, we check for this a few different ways, but this is still a
bug that can leave off full log commits when we really need them.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Xfstests 224 will just sit there and spin for ever until eventually we give up
flushing delalloc and exit. On my box this took several hours. I could not
interrupt this process either, even though we use INTERRUPTIBLE. So do 2 things
1) Keep us from looping over and over again without reclaiming anything
2) If we get interrupted exit the loop
I tested this and the test now exits in a reasonable amount of time, and can be
interrupted with ctrl+c. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Instead of doing a BUG_ON(1) in prepare_pages if grab_cache_page() fails, just
loop through the pages we've already grabbed and unlock and release them, then
return -ENOMEM like we should. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Got a report of a box panicing because we got a NULL eb in read_extent_buffer.
His fs was borked and btrfs_search_path returned EIO, but we don't check for
errors so the box paniced. Yes I know this will just make something higher up
the stack panic, but that's a problem for future Josef. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We call use_block_rsv right before we make an allocation in order to make sure
we have enough space. Now normally people have called btrfs_start_transaction()
with the appropriate amount of space that we need, so we just use some of that
pre-reserved space and move along happily. The problem is where people use
btrfs_join_transaction(), which doesn't actually reserve any space. So we try
and reserve space here, but we cannot flush delalloc, so this forces us to
return -ENOSPC when in reality we have plenty of space. The most common symptom
is seeing a bunch of "couldn't dirty inode" messages in syslog. With
xfstests 224 we end up falling back to start_transaction and then doing all the
flush delalloc stuff which causes to hang for a very long time.
So instead steal from the global reserve, which is what this is meant for
anyway. With this patch and the other 2 I have sent xfstests 224 now passes
successfully. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we do btrfs_block_rsv_release, if global_block_rsv is not full we will
release all the extra bytes to global_block_rsv, even if it's only a little
short of the amount of space that we need to reserve. This causes us to starve
ourselves of reservable space during the transaction which will force us to
shrink delalloc bytes and commit the transaction more often than we should. So
instead just add the amount of bytes we need to add to the global reserve so
reserved == size, and then add the rest back into the space_info for general
use. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When running xfstests 224 I kept getting ENOSPC when trying to remove the files,
and this is because we were returning ret from check_path_shared while it was
uninitalized, which isn't right. Fix this to return 0 properly, and now
xfstests 224 doesn't freak out when it tries to clean itself up. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_start_ioctl_transaction() returns ERR_PTR(), not NULL.
So, it is necessary to use IS_ERR() to check the return value.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The error check of btrfs_join_transaction()/btrfs_join_transaction_nolock()
is added, and the mistake of the error check in several places is
corrected.
For more stable Btrfs, I think that we should reduce BUG_ON().
But, I think that long time is necessary for this.
So, I propose this patch as a short-term solution.
With this patch:
- To more stable Btrfs, the part that should be corrected is clarified.
- The panic isn't done by the NULL pointer reference etc. (even if
BUG_ON() is increased temporarily)
- The error code is returned in the place where the error can be easily
returned.
As a long-term plan:
- BUG_ON() is reduced by using the forced-readonly framework, etc.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After the conditional that precedes the following code, inode may be an
ERR_PTR value. This can eg result from a memory allocation failure via the
call to btrfs_iget, and thus does not imply that root is different than
sub_root. Thus, an IS_ERR check is added to ensure that there is no
dereference of inode in this case.
The semantic match that finds this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@r@
identifier f;
@@
f(...) { ... return ERR_PTR(...); }
@@
identifier r.f, fld;
expression x;
statement S1,S2;
@@
x = f(...)
... when != IS_ERR(x)
(
if (IS_ERR(x) ||...) S1 else S2
|
*x->fld
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
To make btrfs more stable, add several missing necessary memory allocation
checks, and when no memory, return proper errno.
We've checked that some of those -ENOMEM errors will be returned to
userspace, and some will be catched by BUG_ON() in the upper callers,
and none will be ignored silently.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_submit_compressed_read() is lack of memory allocation checks and
corresponding error route.
After this fix, if it comes to "no memory" case, errno will be returned
to userland step by step, and tell users this operation cannot go on.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Suppose:
- the source extent is: [0, 100]
- the src offset is 10
- the clone length is 90
- the dest offset is 0
This statement:
new_key.offset = key.offset + destoff - off
will produce such an extent for the dest file:
[ino, BTRFS_EXTENT_DATA_KEY, -10]
, which is obviously wrong.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
fixup, which is allocated when starting page write to fix up the
extent without ORDERED bit set, should be freed after this work
is done.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We must save and free the original kstrdup()'ed pointer
because strsep() modifies its first argument.
Signed-off-by: Tero Roponen <tero.roponen@gmail.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
We missed a memory deallocation in commit 450ba0ea.
If an existing super block is found at mount and there is no
error condition then the pre-allocated tree_root and fs_info
are no not used and are not freeded.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
After returing extents from a cluster to the block group, some
extents in the block group may be mergeable.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
When adding a new extent, we'll firstly see if we can merge
this extent to the left or/and right extent. Extract this as
a helper try_merge_free_space().
As a side effect, we fix a small bug that if the new extent
has non-bitmap left entry but is unmergeble, we'll directly
link the extent without trying to drop it into bitmap.
This also prepares for the next patch.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
When allocating extent entry from a cluster, we should update
the free_space and free_extents fields of the block group.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
If there's no more free space in a bitmap, we should free it.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Remove some duplicated code.
This prepares for the next patch.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
If a block group is smaller than 1GB, the extent entry threadhold
calculation will always set the threshold to 0.
So as free space gets fragmented, btrfs will switch to use bitmap
to manage free space, but then will never switch back to extents
due to this bug.
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (25 commits)
Btrfs: forced readonly mounts on errors
btrfs: Require CAP_SYS_ADMIN for filesystem rebalance
Btrfs: don't warn if we get ENOSPC in btrfs_block_rsv_check
btrfs: Fix memory leak in btrfs_read_fs_root_no_radix()
btrfs: check NULL or not
btrfs: Don't pass NULL ptr to func that may deref it.
btrfs: mount failure return value fix
btrfs: Mem leak in btrfs_get_acl()
btrfs: fix wrong free space information of btrfs
btrfs: make the chunk allocator utilize the devices better
btrfs: restructure find_free_dev_extent()
btrfs: fix wrong calculation of stripe size
btrfs: try to reclaim some space when chunk allocation fails
btrfs: fix wrong data space statistics
fs/btrfs: Fix build of ctree
Btrfs: fix off by one while setting block groups readonly
Btrfs: Add BTRFS_IOC_SUBVOL_GETFLAGS/SETFLAGS ioctls
Btrfs: Add readonly snapshots support
Btrfs: Refactor btrfs_ioctl_snap_create()
btrfs: Extract duplicate decompress code
...
This patch comes from "Forced readonly mounts on errors" ideas.
As we know, this is the first step in being more fault tolerant of disk
corruptions instead of just using BUG() statements.
The major content:
- add a framework for generating errors that should result in filesystems
going readonly.
- keep FS state in disk super block.
- make sure that all of resource will be freed and released at umount time.
- make sure that fter FS is forced readonly on error, there will be no more
disk change before FS is corrected. For this, we should stop write operation.
After this patch is applied, the conversion from BUG() to such a framework can
happen incrementally.
Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes. On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions. Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.
This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of various home grown checks that might need updates for new
flags just check for any bit outside the mask of the features supported
by the filesystem. This makes the check future proof for any newly
added flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix the build failure in some configurations:
CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field 'super_kobj' has incomplete type
fs/btrfs/ctree.h:1074:17: error: field 'root_kobj' has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2
caused by commit 57cc7215b7 ("headers: kobject.h redux")
We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Filesystem rebalancing (BTRFS_IOC_BALANCE) affects the entire
filesystem and may run uninterruptibly for a long time. This does not
seem to be something that an unprivileged user should be able to do.
Reported-by: Aron Xu <happyaron.xu@gmail.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we run low on space we could get a bunch of warnings out of
btrfs_block_rsv_check, but this is mostly just called via the transaction code
to see if we need to end the transaction, it expects to see failures, so let's
not WARN and freak everybody out for no reason. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In btrfs_read_fs_root_no_radix(), 'root' is not freed if
btrfs_search_slot() returns error.
Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hi,
In fs/btrfs/inode.c::fixup_tree_root_location() we have this code:
...
if (!path) {
err = -ENOMEM;
goto out;
}
...
out:
btrfs_free_path(path);
return err;
btrfs_free_path() passes its argument on to other functions and some of
them end up dereferencing the pointer.
In the code above that pointer is clearly NULL, so btrfs_free_path() will
eventually cause a NULL dereference.
There are many ways to cut this cake (fix the bug). The one I chose was to
make btrfs_free_path() deal gracefully with NULL pointers. If you
disagree, feel free to come up with an alternative patch.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I happened to pass swap partition as root partition in cmdline,
then kernel panic and tell me about "Cannot open root device".
It is not correct, in fact it is a fs type mismatch instead of 'no device'.
Eventually I found btrfs mounting failed with -EIO, it should be -EINVAL.
The logic in init/do_mounts.c:
for (p = fs_names; *p; p += strlen(p)+1) {
int err = do_mount_root(name, p, flags, root_mount_data);
switch (err) {
case 0:
goto out;
case -EACCES:
flags |= MS_RDONLY;
goto retry;
case -EINVAL:
continue;
}
print "Cannot open root device"
panic
}
SO fs type after btrfs will have no chance to mount
Here fix the return value as -EINVAL
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It seems to me that we leak the memory allocated to 'value' in
btrfs_get_acl() if the call to posix_acl_from_xattr() fails.
Here's a patch that attempts to correct that problem.
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we store data by raid profile in btrfs with two or more different size
disks, df command shows there is some free space in the filesystem, but the
user can not write any data in fact, df command shows the wrong free space
information of btrfs.
# mkfs.btrfs -d raid1 /dev/sda9 /dev/sda10
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 28.00KB
devid 1 size 5.01GB used 2.03GB path /dev/sda9
devid 2 size 10.00GB used 2.01GB path /dev/sda10
# btrfs device scan /dev/sda9 /dev/sda10
# mount /dev/sda9 /mnt
# dd if=/dev/zero of=tmpfile0 bs=4K count=9999999999
(fill the filesystem)
# sync
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 5.4G 62% /mnt
# btrfs-show
Label: none uuid: a95cd49e-6e33-45b8-8741-a36153ce4b64
Total devices 2 FS bytes used 3.99GB
devid 1 size 5.01GB used 5.01GB path /dev/sda9
devid 2 size 10.00GB used 4.99GB path /dev/sda10
It is because btrfs cannot allocate chunks when one of the pairing disks has
no space, the free space on the other disks can not be used for ever, and should
be subtracted from the total space, but btrfs doesn't subtract this space from
the total. It is strange to the user.
This patch fixes it by calcing the free space that can be used to allocate
chunks.
Implementation:
1. get all the devices free space, and align them by stripe length.
2. sort the devices by the free space.
3. check the free space of the devices,
3.1. if it is not zero, and then check the number of the devices that has
more free space than this device,
if the number of the devices is beyond the min stripe number, the free
space can be used, and add into total free space.
if the number of the devices is below the min stripe number, we can not
use the free space, the check ends.
3.2. if the free space is zero, check the next devices, goto 3.1
This implementation is just likely fake chunk allocation.
After appling this patch, df can show correct space information:
# df -TH
Filesystem Type Size Used Avail Use% Mounted on
/dev/sda9 btrfs 17G 8.6G 0 100% /mnt
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
With this patch, we change the handling method when we can not get enough free
extents with default size.
Implementation:
1. Look up the suitable free extent on each device and keep the search result.
If not find a suitable free extent, keep the max free extent
2. If we get enough suitable free extents with default size, chunk allocation
succeeds.
3. If we can not get enough free extents, but the number of the extent with
default size is >= min_stripes, we just change the mapping information
(reduce the number of stripes in the extent map), and chunk allocation
succeeds.
4. If the number of the extent with default size is < min_stripes, sort the
devices by its max free extent's size descending
5. Use the size of the max free extent on the (num_stripes - 1)th device as the
stripe size to allocate the device space
By this way, the chunk allocator can allocate chunks as large as possible when
the devices' space is not enough and make full use of the devices.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
- make it return the start position and length of the max free space when it can
not find a suitable free space.
- make it more readability
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two tiny problem:
- One is When we check the chunk size is greater than the max chunk size or not,
we should take mirrors into account, but the original code didn't.
- The other is btrfs shouldn't use the size of the residual free space as the
length of of a dup chunk when doing chunk allocation. It is because the device
space that a dup chunk needs is twice as large as the chunk size, if we use
the size of the residual free space as the length of a dup chunk, we can not
get enough free space. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot write data into files when when there is tiny space in the filesystem.
Reproduce steps:
# mkfs.btrfs /dev/sda1
# mount /dev/sda1 /mnt
# dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
# dd if=/dev/zero of=/mnt/tmpfile1 bs=4K count=99999999999999
(fill the filesystem)
# umount /mnt
# mount /dev/sda1 /mnt
# rm -f /mnt/tmpfile0
# dd if=/dev/zero of=/mnt/tmpfile0 bs=4K count=1
(failed with nospec)
But if we do the last step again, we can write data successfully. The reason of
the problem is that btrfs didn't try to commit the current transaction and
reclaim some space when chunk allocation failed.
This patch fixes it by committing the current transaction to reclaim some
space when chunk allocation fails.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Josef has implemented mixed data/metadata chunks, we must add those chunks'
space just like data chunks.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
CC [M] fs/btrfs/ctree.o
In file included from fs/btrfs/ctree.c:21:0:
fs/btrfs/ctree.h:1003:17: error: field <91>super_kobj<92> has incomplete type
fs/btrfs/ctree.h:1074:17: error: field <91>root_kobj<92> has incomplete type
make[2]: *** [fs/btrfs/ctree.o] Error 1
make[1]: *** [fs/btrfs] Error 2
make: *** [fs] Error 2
We need to include kobject.h here.
Reported-by: Jeff Garzik <jeff@garzik.org>
Fix-suggested-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
block: ensure that completion error gets properly traced
blktrace: add missing probe argument to block_bio_complete
block cfq: don't use atomic_t for cfq_group
block cfq: don't use atomic_t for cfq_queue
block: trace event block fix unassigned field
block: add internal hd part table references
block: fix accounting bug on cross partition merges
kref: add kref_test_and_get
bio-integrity: mark kintegrityd_wq highpri and CPU intensive
block: make kblockd_workqueue smarter
Revert "sd: implement sd_check_events()"
block: Clean up exit_io_context() source code.
Fix compile warnings due to missing removal of a 'ret' variable
fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
cfq-iosched: don't check cfqg in choose_service_tree()
fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
cdrom: export cdrom_check_events()
sd: implement sd_check_events()
sr: implement sr_check_events()
...
Btrfs doesn't have the ability to punch holes yet, so make sure we return
EOPNOTSUPP if we try to use hole punching through fallocate. This support can
be added later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This simple implementation just checks for no ACLs on the inode, and
if so, then the rcu-walk may proceed, otherwise fail it.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
RCU free the struct inode. This will allow:
- Subsequent store-free path walking patch. The inode must be consulted for
permissions when walking, so an RCU inode reference is a must.
- sb_inode_list_lock to be moved inside i_lock because sb list walkers who want
to take i_lock no longer need to take sb_inode_list_lock to walk the list in
the first place. This will simplify and optimize locking.
- Could remove some nested trylock loops in dcache code
- Could potentially simplify things a bit in VM land. Do not need to take the
page lock to follow page->mapping.
The downsides of this is the performance cost of using RCU. In a simple
creat/unlink microbenchmark, performance drops by about 10% due to inability to
reuse cache-hot slab objects. As iterations increase and RCU freeing starts
kicking over, this increases to about 20%.
In cases where inode lifetimes are longer (ie. many inodes may be allocated
during the average life span of a single inode), a lot of this cache reuse is
not applicable, so the regression caused by this patch is smaller.
The cache-hot regression could largely be avoided by using SLAB_DESTROY_BY_RCU,
however this adds some complexity to list walking and store-free path walking,
so I prefer to implement this at a later date, if it is shown to be a win in
real situations. I haven't found a regression in any non-micro benchmark so I
doubt it will be a problem.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_delete from a dentry deletion notification to a dentry caching
advise, more like ->drop_inode. Require it to be constant and idempotent,
and not take d_lock. This is how all existing filesystems use the callback
anyway.
This makes fine grained dentry locking of dput and dentry lru scanning
much simpler.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
When we read in block groups, we'll set non-redundant groups
readonly if we find a raid1, DUP or raid10 group. But the
ro code has an off by one bug in the math around testing to
make sure out accounting doesn't go wrong.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This allows us to set a snapshot or a subvolume readonly or writable
on the fly.
Usage:
Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and then
call ioctl(BTRFS_IOCTL_SUBVOL_SETFLAGS);
Changelog for v3:
- Change to pass __u64 as ioctl parameter.
Changelog for v2:
- Add _GETFLAGS ioctl.
- Check if the passed fd is the root of a subvolume.
- Change the name from _SNAP_SETFLAGS to _SUBVOL_SETFLAGS.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Usage:
Set BTRFS_SUBVOL_RDONLY of btrfs_ioctl_vol_arg_v2->flags, and call
ioctl(BTRFS_I0CTL_SNAP_CREATE_V2).
Implementation:
- Set readonly bit of btrfs_root_item->flags.
- Add readonly checks in btrfs_permission (inode_permission),
btrfs_setattr, btrfs_set/remove_xattr and some ioctls.
Changelog for v3:
- Eliminate btrfs_root->readonly, but check btrfs_root->root_item.flags.
- Rename BTRFS_ROOT_SNAP_RDONLY to BTRFS_ROOT_SUBVOL_RDONLY.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Update defrag ioctl, so one can choose lzo or zlib when turning
on compression in defrag operation.
Changelog:
v1 -> v2
- Add incompability flag.
- Fix to check invalid compress type.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Lzo is a much faster compression algorithm than gzib, so would allow
more users to enable transparent compression, and some users can
choose from compression ratio and speed for different applications
Usage:
# mount -t btrfs -o compress[=<zlib,lzo>] dev /mnt
or
# mount -t btrfs -o compress-force[=<zlib,lzo>] dev /mnt
"-o compress" without argument is still allowed for compatability.
Compatibility:
If we mount a filesystem with lzo compression, it will not be able be
mounted in old kernels. One reason is, otherwise btrfs will directly
dump compressed data, which sits in inline extent, to user.
Performance:
The test copied a linux source tarball (~400M) from an ext4 partition
to the btrfs partition, and then extracted it.
(time in second)
lzo zlib nocompress
copy: 10.6 21.7 14.9
extract: 70.1 94.4 66.6
(data size in MB)
lzo zlib nocompress
copy: 185.87 108.69 394.49
extract: 193.80 132.36 381.21
Changelog:
v1 -> v2:
- Select LZO_COMPRESS and LZO_DECOMPRESS in btrfs Kconfig.
- Add incompability flag.
- Fix error handling in compress code.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Make the code aware of compression type, instead of always assuming
zlib compression.
Also make the zlib workspace function as common code for all
compression types.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Return failure if alloc_page() fails to allocate memory,
and the upper code will just give up compression.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: prevent RAID level downgrades when space is low
Btrfs: account for missing devices in RAID allocation profiles
Btrfs: EIO when we fail to read tree roots
Btrfs: fix compiler warnings
Btrfs: Make async snapshot ioctl more generic
Btrfs: pwrite blocked when writing from the mmaped buffer of the same page
Btrfs: Fix a crash when mounting a subvolume
Btrfs: fix sync subvol/snapshot creation
Btrfs: Fix page leak in compressed writeback path
Btrfs: do not BUG if we fail to remove the orphan item for dead snapshots
Btrfs: fixup return code for btrfs_del_orphan_item
Btrfs: do not do fast caching if we are allocating blocks for tree_root
Btrfs: deal with space cache errors better
Btrfs: fix use after free in O_DIRECT
The extent allocator has code that allows us to fill
allocations from any available block group, even if it doesn't
match the raid level we've requested.
This was put in because adding a new drive to a filesystem
made with the default mkfs options actually upgrades the metadata from
single spindle dup to full RAID1.
But, the code also allows us to allocate from a raid0 chunk when we
really want a raid1 or raid10 chunk. This can cause big trouble because
mkfs creates a small (4MB) raid0 chunk for data and metadata which then
goes unused for raid1/raid10 installs.
The allocator will happily wander in and allocate from that chunk when
things get tight, which is not correct.
The fix here is to make sure that we provide duplication when the
caller has asked for it. It does all the dups to be any raid level,
which preserves the dup->raid1 upgrade abilities.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we mount in RAID degraded mode without adding a new device to
replace the failed one, we can end up using the wrong RAID flags for
allocations.
This results in strange combinations of block groups (raid1 in a raid10
filesystem) and corruptions when we try to allocate blocks from single
spindle chunks on drives that are actually missing.
The first device has two small 4MB chunks in it that mkfs creates and
these are usually unused in a raid1 or raid10 setup. But, in -o degraded,
the allocator will fall back to these because the mask of desired raid groups
isn't correct.
The fix here is to count the missing devices as we build up the list
of devices in the system. This count is used when picking the
raid level to make sure we continue using the same levels that were
in place before we lost a drive.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we just get a plain IO error when we read tree roots, the code
wasn't properly sending that error up the chain. This allowed mounts to
continue when they should failed, and allowed operations
on partially setup root structs. The end result was usually oopsen
on spinlocks that hadn't been spun up correctly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
... regarding an unused function when !MIGRATION, and regarding a
printk() format string vs argument mismatch.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If we had reserved some bytes in struct btrfs_ioctl_vol_args, we
wouldn't have to create a new structure for async snapshot creation.
Here we convert async snapshot ioctl to use a more generic ABI, as
we'll add more ioctls for snapshots/subvolumes in the future, readonly
snapshots for example.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This problem is found in meego testing:
http://bugs.meego.com/show_bug.cgi?id=6672
A file in btrfs is mmaped and the mmaped buffer is passed to pwrite to write to the same page
of the same file. In btrfs_file_aio_write(), the pages is locked by prepare_pages(). So when
btrfs_copy_from_user() is called, page fault happens and the same page needs to be locked again
in filemap_fault(). The fix is to move iov_iter_fault_in_readable() before prepage_pages() to make page
fault happen before pages are locked. And also disable page fault in critical region in
btrfs_copy_from_user().
Reviewed-by: Yan, Zheng<zheng.z.yan@intel.com>
Signed-off-by: Zhong, Xin <xin.zhong@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We should drop dentry before deactivating the superblock, otherwise
we can hit this bug:
BUG: Dentry f349a690{i=100,n=/} still in use (1) [unmount of btrfs loop1]
...
Steps to reproduce the bug:
# mount /dev/loop1 /mnt
# mkdir save
# btrfs subvolume snapshot /mnt save/snap1
# umount /mnt
# mount -o subvol=save/snap1 /dev/loop1 /mnt
(crash)
Reported-by: Michael Niederle <mniederle@gmx.at>
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We were incorrectly taking the async path even for the sync ioctls by
passing in &transid unconditionally.
There's ample room for further cleanup here, but this keeps the fix simple.
Signed-off-by: Sage Weil <sage@newdream.net>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
"start + num_bytes >= actual_end" can happen when compressed page writeback races
with file truncation. In that case we need unlock and release pages past the end
of file.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Not being able to delete an orphan item isn't a horrible thing. The worst that
happens is the next time around we try and do the orphan cleanup and we can't
find the referenced object and just delete the item and move on.
Signed-off-by: Josef Bacik <josef@redhat.com>
If the orphan item doesn't exist, we return 1, which doesn't make any sense to
the callers. Instead return -ENOENT if we didn't find the item. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Since the fast caching uses normal tree locking, we can possibly deadlock if we
get to the caching via a btrfs_search_slot() on the tree_root. So just check to
see if the root we are on is the tree root, and just don't do the fast caching.
Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently if the space cache inode generation number doesn't match the
generation number in the space cache header we will just fail to load the space
cache, but we won't mark the space cache as an error, so we'll keep getting that
error each time somebody tries to cache that block group until we actually clear
the thing. Fix this by marking the space cache as having an error so we only
get the message once. This patch also makes it so that we don't try and setup
space cache for a block group that isn't cached, since we won't be able to write
it out anyway. None of these problems are actual problems, they are just
annoying and sub-optimal. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This fixes a bug where we use dip after we have freed it. Instead just use the
file_offset that was passed to the function. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (24 commits)
Btrfs: don't use migrate page without CONFIG_MIGRATION
Btrfs: deal with DIO bios that span more than one ordered extent
Btrfs: setup blank root and fs_info for mount time
Btrfs: fix fiemap
Btrfs - fix race between btrfs_get_sb() and umount
Btrfs: update inode ctime when using links
Btrfs: make sure new inode size is ok in fallocate
Btrfs: fix typo in fallocate to make it honor actual size
Btrfs: avoid NULL pointer deref in try_release_extent_buffer
Btrfs: make btrfs_add_nondir take parent inode as an argument
Btrfs: hold i_mutex when calling btrfs_log_dentry_safe
Btrfs: use dget_parent where we can UPDATED
Btrfs: fix more ESTALE problems with NFS
Btrfs: handle NFS lookups properly
btrfs: make 1-bit signed fileds unsigned
btrfs: Show device attr correctly for symlinks
btrfs: Set file size correctly in file clone
btrfs: Check if dest_offset is block-size aligned before cloning file
Btrfs: handle the space_cache option properly
btrfs: Fix early enospc because 'unused' calculated with wrong sign.
...
The new DIO bio splitting code has problems when the bio
spans more than one ordered extent. This will happen as the
generic DIO code merges our get_blocks calls together into
a bigger single bio.
This fixes things by walking forward in the ordered extent
code finding all the overlapping ordered extents and completing them
all at once.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a problem with how we use sget, it searches through the list of supers
attached to the fs_type looking for a super with the same fs_devices as what
we're trying to mount. This depends on sb->s_fs_info being filled, but we don't
fill that in until we get to btrfs_fill_super, so we could hit supers on the
fs_type super list that have a null s_fs_info. In order to fix that we need to
go ahead and setup a blank root with a blank fs_info to hold fs_devices, that
way our test will work out right and then we can set s_fs_info in
btrfs_set_super, and then open_ctree will simply use our pre-allocated root and
fs_info when setting everything up. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two big problems currently with FIEMAP
1) We return extents for holes. This isn't supposed to happen, we just don't
return extents for holes and then userspace interprets the lack of an extent as
a hole.
2) We sometimes don't set FIEMAP_EXTENT_LAST properly. This is because we wait
to see a EXTENT_FLAG_VACANCY flag on the em, but this won't happen if say we ask
fiemap to map up to the last extent in a file, and there is nothing but holes up
to the i_size. To fix this we need to lookup the last extent in this file and
save the logical offset, so if we happen to try and map that extent we can be
sure to set FIEMAP_EXTENT_LAST.
With this patch we now pass xfstest 225, which we never have before.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When mounting a btrfs file system btrfs_test_super() may attempt to
use sb->s_fs_info, the btrfs root, of a super block that is going away
and that has had the btrfs root set to NULL in its ->put_super(). But
if the super block is going away it cannot be an existing super block
so we can return false in this case.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently we fail xfstest 236 because we're not updating the inode ctime on
link. This is a simple fix, and makes it so we pass 236 now.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have been failing xfstest 228 forever, because we don't check to make sure
the new inode size is acceptable as far as RLIMIT is concerned. Just check to
make sure it's ok to create a inode with this new size and error out if not.
With this patch we now pass 228.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a typo in __btrfs_prealloc_file_range() where we set the i_size to
actual_len/cur_offset, and then just set it to cur_offset again, and do the same
with btrfs_ordered_update_i_size(). This fixes it back to keeping i_size in a
local variable and then updating i_size properly. Tested this with
xfs_io -F -f -c "falloc 0 1" -c "pwrite 0 1" foo
stat'ing foo gives us a size of 1 instead of 4096 like it was. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everybody who calls btrfs_add_nondir just passes in the dentry of the new file
and then dereference dentry->d_parent->d_inode, but everybody who calls
btrfs_add_nondir() are already passed the parent's inode. So instead of
dereferencing dentry->d_parent, just make btrfs_add_nondir take the dir inode as
an argument and pass that along so we don't have to worry about d_parent.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Since we walk up the path logging all of the parts of the inode's path, we need
to hold i_mutex to make sure that the inode is not renamed while we're logging
everything. btrfs_log_dentry_safe does dget_parent and all of that jazz, but we
may get unexpected results if the rename changes the inode's location while
we're higher up the path logging those dentries, so do this for safety reasons.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are lots of places where we do dentry->d_parent->d_inode without holding
the dentry->d_lock. This could cause problems with rename. So instead we need
to use dget_parent() and hold the reference to the parent as long as we are
going to use it's inode and then dput it at the end.
Signed-off-by: Josef Bacik <josef@redhat.com>
Cc: raven@themaw.net
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When creating new inodes we don't setup inode->i_generation. So if we generate
an fh with a newly created inode we save the generation of 0, but if we flush
the inode to disk and have to read it back when getting the inode on the server
we'll have the right i_generation, so gens wont match and we get ESTALE. This
patch properly sets inode->i_generation when we create the new inode and now I'm
no longer getting ESTALE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
People kept reporting NFS issues, specifically getting ESTALE alot. I figured
out how to reproduce the problem
SERVER
mkfs.btrfs /dev/sda1
mount /dev/sda1 /mnt/btrfs-test
<add /mnt/btrfs-test to /etc/exports>
btrfs subvol create /mnt/btrfs-test/foo
service nfs start
CLIENT
mount server:/mnt/btrfs /mnt/test
cd /mnt/test/foo
ls
SERVER
echo 3 > /proc/sys/vm/drop_caches
CLIENT
ls <-- get an ESTALE here
This is because the standard way to lookup a name in nfsd is to use readdir, and
what it does is do a readdir on the parent directory looking for the inode of
the child. So in this case the parent being / and the child being foo. Well
subvols all have the same inode number, so doing a readdir of / looking for
inode 256 will return '.', which obviously doesn't match foo. So instead we
need to have our own .get_name so that we can find the right name.
Our .get_name will either lookup the inode backref or the root backref,
whichever we're looking for, and return the name we find. Running the above
reproducer with this patch results in everything acting the way its supposed to.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Set src_offset = 0, src_length = 20K, dest_offset = 20K. And the
original filesize of the dest file 'file2' is 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Now clone file1 to file2, the dest file should be 40K, but it
still shows 30K:
# ls -l /mnt/file2
-rw-r--r-- 1 root root 30720 Nov 18 16:42 /mnt/file2
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We've done the check for src_offset and src_length, and We should
also check dest_offset, otherwise we'll corrupt the destination
file:
(After cloning file1 to file2 with unaligned dest_offset)
# cat /mnt/file2
cat: /mnt/file2: Input/output error
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When I added the clear_cache option I screwed up and took the break out of
the space_cache case statement, so whenever you mount with space_cache you also
get clear_cache, which does you no good if you say set space_cache in fstab so
it always gets set. This patch adds the break back in properly.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
'unused' calculated with wrong sign in reserve_metadata_bytes().
This might have lead to unwanted over-reservations.
Signed-off-by: Arne Jansen <sensille@gmx.net>
Reviewed-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
extent_bio_alloc() and compressed_bio_alloc() are similar, cleanup
similar source code.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
bio_endio() will free dip and dip->csums, so dip and dip->csums twice will
be freed twice. Fix it.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Migrate page will directly call the btrfs btree writepage function,
which isn't actually allowed.
Our writepage assumes that you have locked the extent_buffer and
flagged the block as written. Without doing these steps, we can
corrupt metadata blocks.
A later commit will remove the btree writepage function since
it is really only safely used internally by btrfs. We
use writepages for everything else.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After recent blkdev_get() modifications, open_by_devnum() and
open_bdev_exclusive() are simple wrappers around blkdev_get().
Replace them with blkdev_get_by_dev() and blkdev_get_by_path().
blkdev_get_by_dev() is identical to open_by_devnum().
blkdev_get_by_path() is slightly different in that it doesn't
automatically add %FMODE_EXCL to @mode.
All users are converted. Most conversions are mechanical and don't
introduce any behavior difference. There are several exceptions.
* btrfs now sets FMODE_EXCL in btrfs_device->mode, so there's no
reason to OR it explicitly on blkdev_put().
* gfs2, nilfs2 and the generic mount_bdev() now set FMODE_EXCL in
sb->s_mode.
* With the above changes, sb->s_mode now always should contain
FMODE_EXCL. WARN_ON_ONCE() added to kill_block_super() to detect
errors.
The new blkdev_get_*() functions are with proper docbook comments.
While at it, add function description to blkdev_get() too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Joern Engel <joern@lazybastard.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Cc: reiserfs-devel@vger.kernel.org
Cc: xfs-masters@oss.sgi.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Over time, block layer has accumulated a set of APIs dealing with bdev
open, close, claim and release.
* blkdev_get/put() are the primary open and close functions.
* bd_claim/release() deal with exclusive open.
* open/close_bdev_exclusive() are combination of open and claim and
the other way around, respectively.
* bd_link/unlink_disk_holder() to create and remove holder/slave
symlinks.
* open_by_devnum() wraps bdget() + blkdev_get().
The interface is a bit confusing and the decoupling of open and claim
makes it impossible to properly guarantee exclusive access as
in-kernel open + claim sequence can disturb the existing exclusive
open even before the block layer knows the current open if for another
exclusive access. Reorganize the interface such that,
* blkdev_get() is extended to include exclusive access management.
@holder argument is added and, if is @FMODE_EXCL specified, it will
gain exclusive access atomically w.r.t. other exclusive accesses.
* blkdev_put() is similarly extended. It now takes @mode argument and
if @FMODE_EXCL is set, it releases an exclusive access. Also, when
the last exclusive claim is released, the holder/slave symlinks are
removed automatically.
* bd_claim/release() and close_bdev_exclusive() are no longer
necessary and either made static or removed.
* bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
is no longer necessary and removed.
* open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
and blkdev_get(). It also has an unexpected extra bdev_read_only()
test which probably should be moved into blkdev_get().
* open_by_devnum() is modified to take @holder argument and pass it to
blkdev_get().
Most of bdev open/close operations are unified into blkdev_get/put()
and most exclusive accesses are tested atomically at the open time (as
it should). This cleans up code and removes some, both valid and
invalid, but unnecessary all the same, corner cases.
open_bdev_exclusive() and open_by_devnum() can use further cleanup -
rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
special features. Well, let's leave them for another day.
Most conversions are straight-forward. drbd conversion is a bit more
involved as there was some reordering, but the logic should stay the
same.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Brown <neilb@suse.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Acked-by: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Peter Osterlund <petero2@telia.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <joel.becker@oracle.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: dm-devel@redhat.com
Cc: drbd-dev@lists.linbit.com
Cc: Leo Chen <leochen@broadcom.com>
Cc: Scott Branden <sbranden@broadcom.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: Joern Engel <joern@logfs.org>
Cc: reiserfs-devel@vger.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
In the failure path of __btrfs_open_devices(), close_bdev_exclusive()
is called with @flags which doesn't match the one used during
open_bdev_exclusive(). Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (39 commits)
Btrfs: deal with errors from updating the tree log
Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Btrfs: make SNAP_DESTROY async
Btrfs: add SNAP_CREATE_ASYNC ioctl
Btrfs: add START_SYNC, WAIT_SYNC ioctls
Btrfs: async transaction commit
Btrfs: fix deadlock in btrfs_commit_transaction
Btrfs: fix lockdep warning on clone ioctl
Btrfs: fix clone ioctl where range is adjacent to extent
Btrfs: fix delalloc checks in clone ioctl
Btrfs: drop unused variable in block_alloc_rsv
Btrfs: cleanup warnings from gcc 4.6 (nonbugs)
Btrfs: Fix variables set but not read (bugs found by gcc 4.6)
Btrfs: Use ERR_CAST helpers
Btrfs: use memdup_user helpers
Btrfs: fix raid code for removing missing drives
Btrfs: Switch the extent buffer rbtree into a radix tree
Btrfs: restructure try_release_extent_buffer()
Btrfs: use the flusher threads for delalloc throttling
Btrfs: tune the chunk allocation to 5% of the FS as metadata
...
Fix up trivial conflicts in fs/btrfs/super.c and fs/fs-writeback.c, and
remove use of INIT_RCU_HEAD in fs/btrfs/extent_io.c (that init macro was
useless and removed in commit 5e8067adfd: "rcu head remove init")
During unlink we remove any references to the inode from
the tree log. It can return -ENOENT and other errors,
and this changes the unlink code to deal with it.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2). We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).
We additionally require that the user has write+exec permission on the
subvol root inode.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.
If users _do_ want the deletion to be durable, they can call 'sync'.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Create a snap without waiting for it to commit to disk. The ioctl is
ordered such that subsequent operations will not be contained by the
created snapshot, and the commit is initiated, but the ioctl does not
wait for the snapshot to commit to disk.
We return the specific transid to userspace so that an application can wait
for this specific snapshot creation to commit via the WAIT_SYNC ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk. This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.
The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers > 1 or should_grow at the top of the loop. Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true. However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.
Fix this by deciding how long to wait when we wait. Include a smp_mb()
before checking if the waitqueue is active to ensure the num_writers
is visible.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
I'm no lockdep expert, but this appears to make the lockdep warning go
away for the i_mutex locking in the clone ioctl.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We had an edge case issue where the requested range was just
following an existing extent. Instead of skipping to the next
extent, we used the previous one which lead to having zero
sized extents.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The lookup_first_ordered_extent() was done on the wrong inode, and the
->delalloc_bytes test was wrong, as the following
btrfs_wait_ordered_range() would only invoke a range write and wouldn't
write the entire file data range. Also, a bad parameter was passed to
btrfs_wait_ordered_range().
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not read which are
not bugs as far as I can see, but simply leftovers.
Still needs more review.
Found by gcc 4.6's new warnings
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
These are all the cases where a variable is set, but not
read which are really bugs.
- Couple of incorrect error handling fixed.
- One incorrect use of a allocation policy
- Some other things
Still needs more review.
Found by gcc 4.6's new warnings.
[akpm@linux-foundation.org: fix build. Might have been bitrot]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs is mounted in degraded mode, it has some internal structures
to track the missing devices. This missing device is setup as readonly,
but the mapping code can get upset when we try to write to it.
This changes the mapping code to return -EIO instead of oops when we try
to write to the readonly device.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch reduces the CPU time spent in the extent buffer search by using the
radix tree instead of the rbtree and using the rcu lock instead of the spin
lock.
I did a quick test by the benchmark tool[1] and found the patch improve the
file creation/deletion performance problem that I have reported[2].
Before applying this patch:
Create files:
Total files: 50000
Total time: 0.971531
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.366761
Average time: 0.000027
After applying this patch:
Create files:
Total files: 50000
Total time: 0.927455
Average time: 0.000019
Delete files:
Total files: 50000
Total time: 1.292280
Average time: 0.000026
[1] http://marc.info/?l=linux-btrfs&m=128212635122920&q=p3
[2] http://marc.info/?l=linux-btrfs&m=128212635122920&w=2
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
restructure try_release_extent_buffer() and write a function to release the
extent buffer. It will be used later.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We have a fairly complex set of loops around walking our list of
delalloc inodes when we find metadata delalloc space running low.
It doesn't work very well, can use large amounts of CPU and doesn't
do very efficient writeback.
This switches us to kick the bdi flusher threads instead. All dirty
data in btrfs is accounted as delalloc data, so this is very similar
in terms of what it writes, but we're able to just kick off the IO
and wait for progress.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
An earlier commit tried to keep us from allocating too many
empty metadata chunks. It was somewhat too restrictive and could
lead to ENOSPC errors on empty filesystems.
This increases the limits to about 5% of the FS size, allowing more
metadata chunks to be preallocated.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs discovers the generation number in a btree block is
incorrect, it can loop forever without forcing the RAID
code to try a valid mirror, and without returning EIO.
This changes things to properly kick out the EIO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If you mount -o space_cache, the option will be persistent across mounts, but to
make sure the user knows that they did this, emit a message telling them if they
didn't mount with -o space_cache but the feature is still used.
Signed-off-by: Josef Bacik <josef@redhat.com>
If something goes wrong with the free space cache we need a way to make sure
it's not loaded on mount and that it's cleared for everybody. When you pass the
clear_cache option it will make it so all block groups are setup to be cleared,
which keeps them from being loaded and then they will be truncated when the
transaction is committed. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
There are just a few things that need to be fixed in the kernel to support mixed
data+metadata block groups. Mostly we just need to make sure that if we are
using mixed block groups that we continue to allocate mixed block groups as we
need them. Also we need to make sure __find_space_info will find our space info
if we search for DATA or METADATA only. Tested this with xfstests and it works
nicely. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With the free space disk caching we can mark the block group as started with the
caching, but we don't have a caching ctl. This can race with anybody else who
tries to get the caching ctl before we cache (this is very hard to do btw). So
instead check to see if cache->caching_ctl is set, and if not return NULL.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
This patch actually loads the free space cache if it exists. The only thing
that really changes here is that we need to cache the block group if we're going
to remove an extent from it. Previously we did not do this since the caching
kthread would pick it up. With the on disk cache we don't have this luxury so
we need to make sure we read the on disk cache in first, and then remove the
extent, that way when the extent is unpinned the free space is added to the
block group. This has been tested with all sorts of things.
Signed-off-by: Josef Bacik <josef@redhat.com>
This is a simple bit, just dump the free space cache out to our preallocated
inode when we're writing out dirty block groups. There are a bunch of changes
in inode.c in order to account for special cases. Mostly when we're doing the
writeout we're holding trans_mutex, so we need to use the nolock transacation
functions. Also we can't do asynchronous completions since the async thread
could be blocked on already completed IO waiting for the transaction lock. This
has been tested with xfstests and btrfs filesystem balance, as well as my ENOSPC
tests. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In order to save free space cache, we need an inode to hold the data, and we
need a special item to point at the right inode for the right block group. So
first, create a special item that will point to the right inode, and the number
of extent entries we will have and the number of bitmaps we will have. We
truncate and pre-allocate space everytime to make sure it's uptodate.
This feature will be turned on as soon as you mount with -o space_cache, however
it is safe to boot into old kernels, they will just generate the cache the old
fashion way. When you boot back into a newer kernel we will notice that we
modified and not the cache and automatically discard the cache.
Signed-off-by: Josef Bacik <josef@redhat.com>
Because btrfs_dirty_inode does a btrfs_join_transaction, it doesn't actually
reserve space. It does this so we can try and dirty the inode quickly without
having to deal with the ENOSPC problems. But if it does get back ENOSPC it
handles it properly. The problem is use_block_rsv does a WARN_ON whenever this
case happens, even tho btrfs_dirty_inode takes it into account and actually
expects to get -ENOSPC if things are particularly tight. So instead just remove
the warning. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
btrfs_commit_transaction will free our trans, but because we pass trans to
shrink_delalloc we could possibly have a use after free situation. So instead
if we commit the transaction, set trans to null and set committed to true so we
don't keep trying to commit a transaction. This fixes a panic I could reproduce
at will. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-2.6.37/barrier' of git://git.kernel.dk/linux-2.6-block: (46 commits)
xen-blkfront: disable barrier/flush write support
Added blk-lib.c and blk-barrier.c was renamed to blk-flush.c
block: remove BLKDEV_IFL_WAIT
aic7xxx_old: removed unused 'req' variable
block: remove the BH_Eopnotsupp flag
block: remove the BLKDEV_IFL_BARRIER flag
block: remove the WRITE_BARRIER flag
swap: do not send discards as barriers
fat: do not send discards as barriers
ext4: do not send discards as barriers
jbd2: replace barriers with explicit flush / FUA usage
jbd2: Modify ASYNC_COMMIT code to not rely on queue draining on barrier
jbd: replace barriers with explicit flush / FUA usage
nilfs2: replace barriers with explicit flush / FUA usage
reiserfs: replace barriers with explicit flush / FUA usage
gfs2: replace barriers with explicit flush / FUA usage
btrfs: replace barriers with explicit flush / FUA usage
xfs: replace barriers with explicit flush / FUA usage
block: pass gfp_mask and flags to sb_issue_discard
dm: convey that all flushes are processed as empty
...
If we failed to find the root subvol id, or the subvol=<name>, we would
deactivate the locked super and close the devices. The problem is at this point
we have gotten the SB all setup, which includes setting super_operations, so
when we'd deactiveate the super, we'd do a close_ctree() which closes the
devices, so we'd end up closing the devices twice. So if you do something like
this
mount /dev/sda1 /mnt/test1
mount /dev/sda1 /mnt/test2 -o subvol=xxx
umount /mnt/test1
it would blow up (if subvol xxx doesn't exist). This patch fixes that problem.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
With multi-threaded writes we were getting ENOSPC early because somebody would
come in, start flushing delalloc because they couldn't make their reservation,
and in the meantime other threads would come in and use the space that was
getting freed up, so when the original thread went to check to see if they had
space they didn't and they'd return ENOSPC. So instead if we have some free
space but not enough for our reservation, take the reservation and then start
doing the flushing. The only time we don't take reservations is when we've
already overcommitted our space, that way we don't have people who come late to
the party way overcommitting ourselves. This also moves all of the retrying and
flushing code into reserve_metdata_bytes so it's all uniform. This keeps my
fs_mark test from returning -ENOSPC as soon as it starts and actually lets me
fill up the disk. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Because the ENOSPC code over reserves super aggressively we end up allocating
chunks way more often than we should. For example with my fs_mark tests on a
2gb fs I can end up reserved 1gb just for metadata, when only 34mb of that is
being used. So instead check to see if the amount of space actually used is
less than 30% of the total space, and if so don't allocate a chunk, but only if
we have at least 256mb of free space to make sure we don't put too much pressure
on free space.
Signed-off-by: Josef Bacik <josef@redhat.com>
Currently we try and flush delalloc, but we only do that in a sort of weak way,
which works fine in most cases but if we're under heavy pressure we need to be
able to wait for flushing to happen. Also instead of checking the bytes
reserved in the block_rsv, check the space info since it is more accurate. The
sync option will be used in a future patch.
Signed-off-by: Josef Bacik <josef@redhat.com>
The global reservation stuff tries to add together DATA and METADATA used in
order to figure out how much to reserve for everything, but this doesn't work
right for mixed block groups. Instead if we have mixed block groups just set
data used to 0. Also with mixed block groups we will use bytes_may_use for
keeping track of delalloc bytes, so we need to take that into account in our
reservation calculations.
Signed-off-by: Josef Bacik <josef@redhat.com>
The new ENOSPC stuff breaks out the raid types which breaks the way we were
reporting df to the system. This fixes it back so that Available is the total
space available to data and used is the actual bytes used by the filesystem.
This means that Available is Total - data used - all of the metadata space.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
The new ENOSPC stuff broke the df ioctl since we no longer create seperate space
info's for each RAID type. So instead, loop through each space info's raid
lists so we can get the right RAID information which will allow the df ioctl to
tell us RAID types again. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
In very severe ENOSPC cases we can run out of inodes to do delalloc on, which
means we'll just keep looping trying to shrink delalloc. Instead, if we fail to
shrink delalloc 3 times in a row break out since we're not likely to make any
progress. Tested this with a 100mb fs an xfstests test 13. Before the patch it
would hang the box, with the patch we get -ENOSPC like we should. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
All file_operations should get a .llseek operation so we can make
nonseekable_open the default for future file operations without a
.llseek pointer.
The three cases that we can automatically detect are no_llseek, seq_lseek
and default_llseek. For cases where we can we can automatically prove that
the file offset is always ignored, we use noop_llseek, which maintains
the current behavior of not returning an error from a seek.
New drivers should normally not use noop_llseek but instead use no_llseek
and call nonseekable_open at open time. Existing drivers can be converted
to do the same when the maintainer knows for certain that no user code
relies on calling seek on the device file.
The generated code is often incorrectly indented and right now contains
comments that clarify for each added line why a specific variant was
chosen. In the version that gets submitted upstream, the comments will
be gone and I will manually fix the indentation, because there does not
seem to be a way to do that using coccinelle.
Some amount of new code is currently sitting in linux-next that should get
the same modifications, which I will do at the end of the merge window.
Many thanks to Julia Lawall for helping me learn to write a semantic
patch that does all this.
===== begin semantic patch =====
// This adds an llseek= method to all file operations,
// as a preparation for making no_llseek the default.
//
// The rules are
// - use no_llseek explicitly if we do nonseekable_open
// - use seq_lseek for sequential files
// - use default_llseek if we know we access f_pos
// - use noop_llseek if we know we don't access f_pos,
// but we still want to allow users to call lseek
//
@ open1 exists @
identifier nested_open;
@@
nested_open(...)
{
<+...
nonseekable_open(...)
...+>
}
@ open exists@
identifier open_f;
identifier i, f;
identifier open1.nested_open;
@@
int open_f(struct inode *i, struct file *f)
{
<+...
(
nonseekable_open(...)
|
nested_open(...)
)
...+>
}
@ read disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ read_no_fpos disable optional_qualifier exists @
identifier read_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t read_f(struct file *f, char *p, size_t s, loff_t *off)
{
... when != off
}
@ write @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
expression E;
identifier func;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
<+...
(
*off = E
|
*off += E
|
func(..., off, ...)
|
E = *off
)
...+>
}
@ write_no_fpos @
identifier write_f;
identifier f, p, s, off;
type ssize_t, size_t, loff_t;
@@
ssize_t write_f(struct file *f, const char *p, size_t s, loff_t *off)
{
... when != off
}
@ fops0 @
identifier fops;
@@
struct file_operations fops = {
...
};
@ has_llseek depends on fops0 @
identifier fops0.fops;
identifier llseek_f;
@@
struct file_operations fops = {
...
.llseek = llseek_f,
...
};
@ has_read depends on fops0 @
identifier fops0.fops;
identifier read_f;
@@
struct file_operations fops = {
...
.read = read_f,
...
};
@ has_write depends on fops0 @
identifier fops0.fops;
identifier write_f;
@@
struct file_operations fops = {
...
.write = write_f,
...
};
@ has_open depends on fops0 @
identifier fops0.fops;
identifier open_f;
@@
struct file_operations fops = {
...
.open = open_f,
...
};
// use no_llseek if we call nonseekable_open
////////////////////////////////////////////
@ nonseekable1 depends on !has_llseek && has_open @
identifier fops0.fops;
identifier nso ~= "nonseekable_open";
@@
struct file_operations fops = {
... .open = nso, ...
+.llseek = no_llseek, /* nonseekable */
};
@ nonseekable2 depends on !has_llseek @
identifier fops0.fops;
identifier open.open_f;
@@
struct file_operations fops = {
... .open = open_f, ...
+.llseek = no_llseek, /* open uses nonseekable */
};
// use seq_lseek for sequential files
/////////////////////////////////////
@ seq depends on !has_llseek @
identifier fops0.fops;
identifier sr ~= "seq_read";
@@
struct file_operations fops = {
... .read = sr, ...
+.llseek = seq_lseek, /* we have seq_read */
};
// use default_llseek if there is a readdir
///////////////////////////////////////////
@ fops1 depends on !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier readdir_e;
@@
// any other fop is used that changes pos
struct file_operations fops = {
... .readdir = readdir_e, ...
+.llseek = default_llseek, /* readdir is present */
};
// use default_llseek if at least one of read/write touches f_pos
/////////////////////////////////////////////////////////////////
@ fops2 depends on !fops1 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read.read_f;
@@
// read fops use offset
struct file_operations fops = {
... .read = read_f, ...
+.llseek = default_llseek, /* read accesses f_pos */
};
@ fops3 depends on !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write.write_f;
@@
// write fops use offset
struct file_operations fops = {
... .write = write_f, ...
+ .llseek = default_llseek, /* write accesses f_pos */
};
// Use noop_llseek if neither read nor write accesses f_pos
///////////////////////////////////////////////////////////
@ fops4 depends on !fops1 && !fops2 && !fops3 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
identifier write_no_fpos.write_f;
@@
// write fops use offset
struct file_operations fops = {
...
.write = write_f,
.read = read_f,
...
+.llseek = noop_llseek, /* read and write both use no f_pos */
};
@ depends on has_write && !has_read && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier write_no_fpos.write_f;
@@
struct file_operations fops = {
... .write = write_f, ...
+.llseek = noop_llseek, /* write uses no f_pos */
};
@ depends on has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
identifier read_no_fpos.read_f;
@@
struct file_operations fops = {
... .read = read_f, ...
+.llseek = noop_llseek, /* read uses no f_pos */
};
@ depends on !has_read && !has_write && !fops1 && !fops2 && !has_llseek && !nonseekable1 && !nonseekable2 && !seq @
identifier fops0.fops;
@@
struct file_operations fops = {
...
+.llseek = noop_llseek, /* no read or write fn */
};
===== End semantic patch =====
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Julia Lawall <julia@diku.dk>
Cc: Christoph Hellwig <hch@infradead.org>
All the blkdev_issue_* helpers can only sanely be used for synchronous
caller. To issue cache flushes or barriers asynchronously the caller needs
to set up a bio by itself with a completion callback to move the asynchronous
state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
specified when calling blkdev_issue_* and also remove the now unused flags
argument to blkdev_issue_flush and blkdev_issue_zeroout. For
blkdev_issue_discard we need to keep it for the secure discard flag, which
gains a more descriptive name and loses the bitops vs flag confusion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Switch to the WRITE_FLUSH_FUA flag for log writes, remove the EOPNOTSUPP
detection for barriers and stop setting the barrier flag for discards.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block: (149 commits)
block: make sure that REQ_* types are seen even with CONFIG_BLOCK=n
xen-blkfront: fix missing out label
blkdev: fix blkdev_issue_zeroout return value
block: update request stacking methods to support discards
block: fix missing export of blk_types.h
writeback: fix bad _bh spinlock nesting
drbd: revert "delay probes", feature is being re-implemented differently
drbd: Initialize all members of sync_conf to their defaults [Bugz 315]
drbd: Disable delay probes for the upcomming release
writeback: cleanup bdi_register
writeback: add new tracepoints
writeback: remove unnecessary init_timer call
writeback: optimize periodic bdi thread wakeups
writeback: prevent unnecessary bdi threads wakeups
writeback: move bdi threads exiting logic to the forker thread
writeback: restructure bdi forker loop a little
writeback: move last_active to bdi
writeback: do not remove bdi from bdi_list
writeback: simplify bdi code a little
writeback: do not lose wake-ups in bdi threads
...
Fixed up pretty trivial conflicts in drivers/block/virtio_blk.c and
drivers/scsi/scsi_error.c as per Jens.
BTRFS does not define a '->write_super()' method, so it should
not mark its superblock as dirty. This looks like some left-over.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
NB: do we want btrfs_wait_ordered_range() on eviction of
inodes with positive i_nlink on subvolume with zero root_refs?
If not, btrfs_evict_inode() can be simplified by unconditionally
bailing out in case of i_nlink > 0 in the very beginning...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
add I_CLEAR instead of replacing I_FREEING with it. I_CLEAR is
equivalent to I_FREEING for almost all code looking at either;
it's there to keep track of having called clear_inode() exactly
once per inode lifetime, at some point after having set I_FREEING.
I_CLEAR and I_FREEING never get set at the same time with the
current code, so we can switch to setting i_flags to I_FREEING | I_CLEAR
instead of I_CLEAR without loss of information. As the result of
such change, checks become simpler and the amount of code that needs
to know about I_CLEAR shrinks a lot.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace inode_setattr with opencoded variants of it in all callers. This
moves the remaining call to vmtruncate into the filesystem methods where it
can be replaced with the proper truncate sequence.
In a few cases it was obvious that we would never end up calling vmtruncate
so it was left out in the opencoded variant:
spufs: explicitly checks for ATTR_SIZE earlier
btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier
ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above
In addition to that ncpfs called inode_setattr with handcrafted iattrs,
which allowed to trim down the opencoded variant.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
1. The BTRFS_IOC_CLONE and BTRFS_IOC_CLONE_RANGE ioctls should check
whether the donor file is append-only before writing to it.
2. The BTRFS_IOC_CLONE_RANGE ioctl appears to have an integer
overflow that allows a user to specify an out-of-bounds range to copy
from the source file (if off + len wraps around). I haven't been able
to successfully exploit this, but I'd imagine that a clever attacker
could use this to read things he shouldn't. Even if it's not
exploitable, it couldn't hurt to be safe.
Signed-off-by: Dan Rosenberg <dan.j.rosenberg@gmail.com>
cc: stable@kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The CLONE and CLONE_RANGE ioctls round up the range of extents being
cloned to the block size when the range to clone extends to the end of file
(this is always the case with CLONE). It was then using that offset when
extending the destination file's i_size. Fix this by not setting i_size
beyond the originally requested ending offset.
This bug was introduced by a22285a6 (2.6.35-rc1).
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
split_leaf was not properly balancing leaves when it was forced to
split a leaf twice. This commit adds an extra push left and right
before forcing the double split in hopes of getting the slot where
we want to insert at either the start or end of the leaf.
If the extra pushes do work, then we are able to avoid splitting twice
and we keep the tree properly balanced.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was just an odd wrapper around writeback_inodes_wb. Removing this
also allows to get rid of the bdi member of struct writeback_control
which was rather out of place there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: The file argument for fsync() is never null
Btrfs: handle ERR_PTR from posix_acl_from_xattr()
Btrfs: avoid BUG when dropping root and reference in same transaction
Btrfs: prohibit a operation of changing acl's mask when noacl mount option used
Btrfs: should add a permission check for setfacl
Btrfs: btrfs_lookup_dir_item() can return ERR_PTR
Btrfs: btrfs_read_fs_root_no_name() returns ERR_PTRs
Btrfs: unwind after btrfs_start_transaction() errors
Btrfs: btrfs_iget() returns ERR_PTR
Btrfs: handle kzalloc() failure in open_ctree()
Btrfs: handle error returns from btrfs_lookup_dir_item()
Btrfs: Fix BUG_ON for fs converted from extN
Btrfs: Fix null dereference in relocation.c
Btrfs: fix remap_file_pages error
Btrfs: uninitialized data is check_path_shared()
Btrfs: fix fallocate regression
Btrfs: fix loop device on top of btrfs
The "file" argument for fsync is never null so we can remove this check.
What drew my attention here is that 7ea8085910: "drop unused dentry
argument to ->fsync" introduced an unconditional dereference at the
start of the function and that generated a smatch warning.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
posix_acl_from_xattr() returns both ERR_PTRs and null, but it's OK to
pass null values to set_cached_acl()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_ioctl_snap_destroy() deletes a snapshot but finishes
with end_transaction(), the cleaner kthread may come in and
drop the root in the same transaction. If that's the case, the
root's refs still == 1 in the tree when btrfs_del_root() deletes
the item, because commit_fs_roots() hasn't updated it yet (that
happens during the commit).
This wasn't a problem before only because
btrfs_ioctl_snap_destroy() would commit the transaction before dropping
the dentry reference, so the dead root wouldn't get queued up until
after the fs root item was updated in the btree.
Since it is not an error to drop the root reference and the root in the
same transaction, just drop the BUG_ON() in btrfs_del_root().
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when used Posix File System Test Suite(pjd-fstest) to test btrfs,
some cases about setfacl failed when noacl mount option used.
I simplified used commands in pjd-fstest, and the following steps
can reproduce it.
------------------------
# cd btrfs-part/
# mkdir aaa
# setfacl -m m::rw aaa <- successed, but not expected by pjd-fstest.
------------------------
I checked ext3, a warning message occured, like as:
setfacl: aaa/: Operation not supported
Certainly, it's expected by pjd-fstest.
So, i compared acl.c of btrfs and ext3. Based on that, a patch created.
Fortunately, it works.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
On btrfs, do the following
------------------
# su user1
# cd btrfs-part/
# touch aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rw-
group::rw-
other::r--
# su user2
# cd btrfs-part/
# setfacl -m u::rwx aaa
# getfacl aaa
# file: aaa
# owner: user1
# group: user1
user::rwx <- successed to setfacl
group::rw-
other::r--
------------------
but we should prohibit it that user2 changing user1's acl.
In fact, on ext3 and other fs, a message occurs:
setfacl: aaa: Operation not permitted
This patch fixed it.
Signed-off-by: Shi Weihua <shiwh@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_lookup_dir_item() can return either ERR_PTRs or null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_read_fs_root_no_name() returns ERR_PTRs on error so I added a
check for that. It's not clear to me if it can also return NULL
pointers or not so I left the original NULL pointer check as is.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This was added by a22285a6a3: "Btrfs: Integrate metadata reservation
with start_transaction". If we goto out here then we skip all the
unwinding and there are locks still held etc.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_iget() returns an ERR_PTR() on failure and not null.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Unwind and return -ENOMEM if the allocation fails here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If btrfs_lookup_dir_item() fails, we should can just let the mount fail
with an error.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Tree blocks can live in data block groups in FS converted from extN.
So it's easy to trigger the BUG_ON.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Fix a potential null dereference in relocation.c
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Acked-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
when we use remap_file_pages() to remap a file, remap_file_pages always return
error. It is because btrfs didn't set VM_CAN_NONLINEAR for vma.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
refs can be used with uninitialized data if btrfs_lookup_extent_info()
fails on the first pass through the loop. In the original code if that
happens then check_path_shared() probably returns 1, this patch
changes it to return 1 for safety.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Seems that when btrfs_fallocate was converted to use the new ENOSPC stuff we
dropped passing the mode to the function that actually does the preallocation.
This breaks anybody who wants to use FALLOC_FL_KEEP_SIZE. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We cannot use the loop device which has been connected to a file in the btrf
The reproduce steps is following:
# dd if=/dev/zero of=vdev0 bs=1M count=1024
# losetup /dev/loop0 vdev0
# mkfs.btrfs /dev/loop0
...
failed to zero device start -5
The reason is that the btrfs don't implement either ->write_begin or ->write
the VFS API, so we fix it by setting ->write to do_sync_write().
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits)
Btrfs: add more error checking to btrfs_dirty_inode
Btrfs: allow unaligned DIO
Btrfs: drop verbose enospc printk
Btrfs: Fix block generation verification race
Btrfs: fix preallocation and nodatacow checks in O_DIRECT
Btrfs: avoid ENOSPC errors in btrfs_dirty_inode
Btrfs: move O_DIRECT space reservation to btrfs_direct_IO
Btrfs: rework O_DIRECT enospc handling
Btrfs: use async helpers for DIO write checksumming
Btrfs: don't walk around with task->state != TASK_RUNNING
Btrfs: do aio_write instead of write
Btrfs: add basic DIO read/write support
direct-io: do not merge logically non-contiguous requests
direct-io: add a hook for the fs to provide its own submit_bio function
fs: allow short direct-io reads to be completed via buffered IO
Btrfs: Metadata ENOSPC handling for balance
Btrfs: Pre-allocate space for data relocation
Btrfs: Metadata ENOSPC handling for tree log
Btrfs: Metadata reservation for orphan inodes
Btrfs: Introduce global metadata reservation
...
The ENOSPC code will now return ENOSPC to btrfs_start_transaction.
btrfs_dirty_inode needs to check for this and error out appropriately.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order to support DIO that isn't aligned to the filesystem blocksize,
we fall back to buffered for any unaligned DIOs.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After the path is released, the generation number got from block
pointer is no long valid. The race may cause disk corruption, because
verify_parent_transid() calls clear_extent_buffer_uptodate() when
generation numbers mismatch.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The O_DIRECT code wasn't checking for multiple references
on preallocated or nodatacow extents. This means it
wasn't honoring snapshots properly.
The fix here is to add an explicit check for multiple references
This also fixes the math for selecting the correct disk block,
making sure not to go past the end of the extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_dirty_inode tries to sneak in without much waiting or
space reservation, mostly for performance reasons. This
usually works well but can cause problems when there are
many many writers.
When btrfs_update_inode fails with ENOSPC, we fallback
to a slower btrfs_start_transaction call that will reserve
some space.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This moves the delalloc space reservation done for O_DIRECT
into btrfs_direct_IO. This way we don't leak reserved space
if the generic O_DIRECT write code errors out before it
calls into btrfs_direct_IO.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes O_DIRECT write code to mark extents as delalloc
while it is processing them. Yan Zheng has reworked the
enospc accounting based on tracking delalloc extents and
this makes it much easier to track enospc in the O_DIRECT code.
There are a few space cases with the O_DIRECT code though,
it only sets the EXTENT_DELALLOC bits, instead of doing
EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because
we don't want to mess with clearing the dirty and uptodate
bits when things go wrong. This is important because there
are no pages in the page cache, so any extent state structs
that we put in the tree won't get freed by releasepage. We have
to clear them ourselves as the DIO ends.
With this commit, we reserve space at in btrfs_file_aio_write,
and then as each btrfs_direct_IO call progresses it sets
EXTENT_DELALLOC on the range.
btrfs_get_blocks_direct is responsible for clearing the delalloc
at the same time it drops the extent lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This adds:
alias: devname:<name>
to some common kernel modules, which will allow the on-demand loading
of the kernel module when the device node is accessed.
Ideally all these modules would be compiled-in, but distros seems too
much in love with their modularization that we need to cover the common
cases with this new facility. It will allow us to remove a bunch of pretty
useless init scripts and modprobes from init scripts.
The static device node aliases will be carried in the module itself. The
program depmod will extract this information to a file in the module directory:
$ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname
# Device nodes to trigger on-demand module loading.
microcode cpu/microcode c10:184
fuse fuse c10:229
ppp_generic ppp c108:0
tun net/tun c10:200
dm_mod mapper/control c10:235
Udev will pick up the depmod created file on startup and create all the
static device nodes which the kernel modules specify, so that these modules
get automatically loaded when the device node is accessed:
$ /sbin/udevd --debug
...
static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184
static_dev_create_from_modules: mknod '/dev/fuse' c10:229
static_dev_create_from_modules: mknod '/dev/ppp' c108:0
static_dev_create_from_modules: mknod '/dev/net/tun' c10:200
static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235
udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666
udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666
A few device nodes are switched to statically allocated numbers, to allow
the static nodes to work. This might also useful for systems which still run
a plain static /dev, which is completely unsafe to use with any dynamic minor
numbers.
Note:
The devname aliases must be limited to the *common* and *single*instance*
device nodes, like the misc devices, and never be used for conceptually limited
systems like the loop devices, which should rather get fixed properly and get a
control node for losetup to talk to, instead of creating a random number of
device nodes in advance, regardless if they are ever used.
This facility is to hide the mess distros are creating with too modualized
kernels, and just to hide that these modules are not compiled-in, and not to
paper-over broken concepts. Thanks! :)
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The async helper threads offload crc work onto all the
CPUs, and make streaming writes much faster. This
changes the O_DIRECT write code to use them. The only
small complication was that we need to pass in the
logical offset in the file for each bio, because we can't
find it in the bio's pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng noticed two places we were doing a lot of work
without task->state set to TASK_RUNNING. This sets the state
properly after we get ready to sleep but decide not to.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
In order for AIO to work, we need to implement aio_write. This patch converts
our btrfs_file_write to btrfs_aio_write. I've tested this with xfstests and
nothing broke, and the AIO stuff magically started working. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This provides basic DIO support for reading and writing. It does not do the
work to recover from mismatching checksums, that will come later. A few design
changes have been made from Jim's code (sorry Jim!)
1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO
code in order to account for all of BTRFS's oddities, but thanks to that work it
seems like the best bet is to just ignore compression and such and just opt to
fallback on buffered IO.
2) Fallback on buffered IO for compressed or inline extents. Jim's code did
it's own buffering to make dio with compressed extents work. Now we just
fallback onto normal buffered IO.
3) Use ordered extents for the writes so that all of the
lock_extent()
lookup_ordered()
type checks continue to work.
4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
DIO writes.
I've tested this with fsx and everything works great. This patch depends on my
dio and filemap.c patches to work. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds metadata ENOSPC handling for the balance code.
It is consisted by following major changes:
1. Avoid COW tree leave in the phrase of merging tree.
2. Handle interaction with snapshot creation.
3. make the backref cache can live across transactions.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pre-allocate space for data relocation. This can detect ENOPSC
condition caused by fragmentation of free space.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Previous patches make the allocater return -ENOSPC if there is no
unreserved free metadata space. This patch updates tree log code
and various other places to propagate/handle the ENOSPC error.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reserve metadata space for extent tree, checksum tree and root tree
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Introduce metadata reservation context for delayed allocation
and update various related functions.
This patch also introduces EXTENT_FIRST_DELALLOC control bit for
set/clear_extent_bit. It tells set/clear_bit_hook whether they
are processing the first extent_state with EXTENT_DELALLOC bit
set. This change is important if set/clear_extent_bit involves
multiple extent_state.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Besides simplify the code, this change makes sure all metadata
reservation for normal metadata operations are released after
committing transaction.
Changes since V1:
Add code that check if unlink and rmdir will free space.
Add ENOSPC handling for clone ioctl.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Introducing metadata reseravtion contexts has two major advantages.
First, it makes metadata reseravtion more traceable. Second, it can
reclaim freed space and re-add them to the itself after transaction
committed.
Besides add btrfs_block_rsv structure and related helper functions,
This patch contains following changes:
Move code that decides if freed tree block should be pinned into
btrfs_free_tree_block().
Make space accounting more accurate, mainly for handling read only
block groups.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
All code in init_btrfs_i can be moved into btrfs_alloc_inode()
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Shrink delayed allocation space in a synchronized manner is more
controllable than flushing all delay allocated space in an async
thread.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We already have fs_info->chunk_mutex to avoid concurrent
chunk creation.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The size of reserved space is stored in space_info. If block groups
of different raid types are linked to separate space_info, changing
allocation profile will corrupt reserved space accounting.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (69 commits)
fix handling of offsets in cris eeprom.c, get rid of fake on-stack files
get rid of home-grown mutex in cris eeprom.c
switch ecryptfs_write() to struct inode *, kill on-stack fake files
switch ecryptfs_get_locked_page() to struct inode *
simplify access to ecryptfs inodes in ->readpage() and friends
AFS: Don't put struct file on the stack
Ban ecryptfs over ecryptfs
logfs: replace inode uid,gid,mode initialization with helper function
ufs: replace inode uid,gid,mode initialization with helper function
udf: replace inode uid,gid,mode init with helper
ubifs: replace inode uid,gid,mode initialization with helper function
sysv: replace inode uid,gid,mode initialization with helper function
reiserfs: replace inode uid,gid,mode initialization with helper function
ramfs: replace inode uid,gid,mode initialization with helper function
omfs: replace inode uid,gid,mode initialization with helper function
bfs: replace inode uid,gid,mode initialization with helper function
ocfs2: replace inode uid,gid,mode initialization with helper function
nilfs2: replace inode uid,gid,mode initialization with helper function
minix: replace inode uid,gid,mode init with helper
ext4: replace inode uid,gid,mode init with helper
...
Trivial conflict in fs/fs-writeback.c (mark bitfields unsigned)
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: make sure the chunk allocator doesn't create zero length chunks
Btrfs: fix data enospc check overflow
A recent commit allowed for smaller chunks to be created, but didn't
make sure they were always bigger than a stripe. After some divides,
this led to zero length stripes.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: add check for changed leaves in setup_leaf_for_split
Btrfs: create snapshot references in same commit as snapshot
Btrfs: fix small race with delalloc flushing waitqueue's
Btrfs: use add_to_page_cache_lru, use __page_cache_alloc
Btrfs: fix chunk allocate size calculation
Btrfs: kill max_extent mount option
Btrfs: fail to mount if we have problems reading the block groups
Btrfs: check btrfs_get_extent return for IS_ERR()
Btrfs: handle kmalloc() failure in inode lookup ioctl
Btrfs: dereferencing freed memory
Btrfs: Simplify num_stripes's calculation logical for __btrfs_alloc_chunk()
Btrfs: Add error handle for btrfs_search_slot() in btrfs_read_chunk_tree()
Btrfs: Remove unnecessary finish_wait() in wait_current_trans()
Btrfs: add NULL check for do_walk_down()
Btrfs: remove duplicate include in ioctl.c
Fix trivial conflict in fs/btrfs/compression.c due to slab.h include
cleanups.
Because we account for reserved space we get from the allocator before we
actually account for allocating delalloc space, we can have a small window where
the amount of "used" space in a space_info is more than the total amount of
space in the space_info. This will cause a overflow in our check, so it will
seem like we have _tons_ of free space, and we'll allow reservations to occur
that will end up larger than the amount of space we have. I've seen users
report ENOSPC panic's in cow_file_range a few times recently, so I tried to
reproduce this problem and found I could reproduce it if I ran one of my tests
in a loop for like 20 minutes. With this patch my test ran all night without
issues. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
setup_leaf_for_split needs to drop the path and search again, and has
checks to see if the item we want to split changed size. But, it misses
the case where the leaf changed and now has enough room for the item
we want to insert.
This adds an extra check to make sure the leaf really needs splitting
before we call btrfs_split_leaf(), which keeps us from trying to split
a leaf with a single item.
btrfs_split_leaf() will blindly split the single item leaf, leaving us
with one good leaf and one empty leaf and then a crash.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This creates the reference to a new snapshot in the same commit as the
snapshot itself. This avoids the need for a second commit in order for a
snapshot to be persistent, and also avoids the problem of "leaking" a
new snapshot tree root if the host crashes before the second commit takes
place.
It is not at all clear to me why it wasn't always done this way. If there
is still a reason for the two-stage {create,finish}_pending_snapshots()
approach I'm missing something! :)
I've been running this for a couple weeks under pretty heavy usage (a few
snapshots per minute) without obvious problems.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everytime we start a new flushing thread, we init the waitqueue if there isn't a
flushing thread running. The problem with this is we check
space_info->flushing, which we clear right before doing a wake_up on the
flushing waitqueue, which causes problems if we init the waitqueue in the middle
of clearing the flushing flagh and calling wake_up. This is hard to hit, but
the code is wrong anyway, so init the flushing/allocating waitqueue when
creating the space info and let it be. I haven't seen the panic since I've been
using this patch. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Pagecache pages should be allocated with __page_cache_alloc, so they
obey pagecache memory policies.
add_to_page_cache_lru is exported, so it should be used. Benefits over
using a private pagevec: neater code, 128 bytes fewer stack used, percpu
lru ordering is preserved, and finally don't need to flush pagevec
before returning so batching may be shared with other LRU insertions.
Signed-off-by: Nick Piggin <npiggin@suse.de>:
Signed-off-by: Chris Mason <chris.mason@oracle.com>
If the amount of free space left in a device is less than what we think should
be the minimum size, just ignore the minimum size and use the amount we have. I
ran into this running tests on a 600mb volume, the chunk allocator wouldn't let
me allocate the last 52mb of the disk for data because we want to have at least
64mb chunks for data. This patch fixes that problem. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
As Yan pointed out, theres not much reason for all this complicated math to
account for file extents being split up into max_extent chunks, since they are
likely to all end up in the same leaf anyway. Since there isn't much reason to
use max_extent, just remove the option altogether so we have one less thing we
need to test.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We don't actually check the return value of btrfs_read_block_groups, so we can
possibly succeed to mount, but then fail to say read the superblock xattr for
selinux which will cause the vfs code to deactivate the super.
This is a problem because in find_free_extent we just assume that we
will find the right space_info for the allocation we want. But if we
failed to read the block groups, we won't have setup any space_info's,
and we'll hit a NULL pointer deref in find_free_extent.
This patch fixes that problem by checking the return value of
btrfs_read_block_groups, and failing out properly. I've also added a
check in find_free_extent so if for some reason we don't find an
appropriate space_info, we just return -ENOSPC.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_get_extent() never returns NULL, only a valid pointer or ERR_PTR()
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The original code dereferenced range on the next line.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can use this simple method to make source more readable.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We need to check return value of btrfs_search_slot() in
btrfs_read_chunk_tree() and do corresponding error handing.
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We only need to call finish_wait() after wait loop.
By the way, this patch makes code of waiting loop similar to
example in wait.h(no functional change)
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_find_create_tree_block() may return NULL, so we must check the returned
value, or we will access a NULL pointer.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
fs/btrfs/ioctl.c: ctree.h is included more than once.
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (30 commits)
Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
Btrfs: allow treeid==0 in the inode lookup ioctl
Btrfs: return keys for large items to the search ioctl
Btrfs: fix key checks and advance in the search ioctl
Btrfs: buffer results in the space_info ioctl
Btrfs: use __u64 types in ioctl.h
Btrfs: fix search_ioctl key advance
Btrfs: fix gfp flags masking in the compression code
Btrfs: don't look at bio flags after submit_bio
btrfs: using btrfs_stack_device_id() get devid
btrfs: use memparse
Btrfs: add a "df" ioctl for btrfs
Btrfs: cache the extent state everywhere we possibly can V2
Btrfs: cache ordered extent when completing io
Btrfs: cache extent state in find_delalloc_range
Btrfs: change the ordered tree to use a spinlock instead of a mutex
Btrfs: finish read pages in the order they are submitted
btrfs: fix btrfs_mkdir goto for no free objectids
Btrfs: flush data on snapshot creation
Btrfs: make df be a little bit more understandable
...
This is used by the inode lookup ioctl to follow all the backrefs up
to the subvol root. But the search being done would sometimes land one
past the last item in the leaf instead of finding the backref.
This changes the search to look for the highest possible backref and hop
back one item. It also fixes a leaked path on failure to find the root.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a root id of 0 is sent to the inode lookup ioctl, it will
use the root of the file we're ioctling and pass the root id
back to userland along with the results.
This allows userland to do searches based on that root later on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl was skipping large items entirely (ones that are too
big for the results buffer). This changes things to at least copy
the item header so that we can send information about the item back to
userland.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl was working well for finding tree roots, but using it for
generic searches requires a few changes to how the keys are advanced.
This treats the search control min fields for objectid, type and offset
more like a key, where we drop the offset to zero once we bump the type,
etc.
The downside of this is that we are changing the min_type and min_offset
fields during the search, and so the ioctl caller needs extra checks to make sure
the keys in the result are the ones it wanted.
This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make
things more readable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The space_info ioctl was using copy_to_user inside rcu_read_lock. This
commit changes things to copy into a buffer first and then dump the
result down to userland.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
key->type is u8, not u64.
fs/btrfs/ioctl.c: In function 'copy_to_sk':
fs/btrfs/ioctl.c:1024: warning: comparison is always true due to limited range of data type
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After callling submit_bio, the bio can be freed at any time. The
btrfs submission thread helper was checking the bio flags too late,
which might not give the correct answer.
When CONFIG_DEBUG_PAGE_ALLOC is turned on, it can lead to oopsen.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
We can use btrfs_stack_device_id() to get dev_item->devid
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use memparse() instead of its own private implementation.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Chris Mason <chris.mason@oracle.com>
df is a very loaded question in btrfs. This gives us a way to get the per-space
usage information so we can tell exactly what is in use where. This will help
us figure out ENOSPC problems, and help users better understand where their disk
space is going.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch just goes through and fixes everybody that does
lock_extent()
blah
unlock_extent()
to use
lock_extent_bits()
blah
unlock_extent_cached()
and pass around a extent_state so we only have to do the searches once per
function. This gives me about a 3 mb/s boots on my random write test. I have
not converted some things, like the relocation and ioctl's, since they aren't
heavily used and the relocation stuff is in the middle of being re-written. I
also changed the clear_extent_bit() to only unset the cached state if we are
clearing EXTENT_LOCKED and related stuff, so we can do things like this
lock_extent_bits()
clear delalloc bits
unlock_extent_cached()
without losing our cached state. I tested this thoroughly and turned on
LEAK_DEBUG to make sure we weren't leaking extent states, everything worked out
fine.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When finishing io we run btrfs_dec_test_ordered_pending, and then immediately
run btrfs_lookup_ordered_extent, but btrfs_dec_test_ordered_pending does that
already, so we're searching twice when we don't have to. This patch lets us
pass a btrfs_ordered_extent in to btrfs_dec_test_ordered_pending so if we do
complete io on that ordered extent we can just use the one we found then instead
of having to do another btrfs_lookup_ordered_extent. This made my fio job with
the other patch go from 24 mb/s to 29 mb/s.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch makes us cache the extent state we find in find_delalloc_range since
we'll have to lock the extent later on in the function. This will keep us from
re-searching for the rang when we try to lock the extent.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The ordered tree used to need a mutex, but currently all we use it for is to
protect the rb_tree, and a spin_lock is just fine for that. Using a spin_lock
instead makes dbench run a little faster, 58 mb/s instead of 51 mb/s, and have
less latency, 3445.138 ms instead of 3820.633 ms.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The endio is done at reverse order of bio vectors.
That means for a sequential read, the page first submitted will finish
last in a bio. Considering we will do checksum (making cache hot) for
every page, this does introduce delay (and chance to squeeze cache used
soon) for pages submitted at the begining.
I don't observe obvious performance difference with below patch at my
simple test, but seems more natural to finish read in the order they are
submitted.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs_mkdir() must jump to the place of ending transaction after
btrfs_find_free_objectid() failed. Or this transaction can't end.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Flush any delalloc extents when we create a snapshot, so that recently
written file data is always included in the snapshot.
A later commit will add the ability to snapshot without the flush, but
most people expect flushing.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The way we report df usage is way confusing for everybody, including some other
utilities (bacula for one). So this patch makes df a little bit more
understandable. First we make used actually count the total amount of used
space in all space info's. This will give us a real view of how much disk space
is in use. Second, for blocks available, only count data space. This makes
things like bacula work because it says 0 when you can no longer write anymore
data to the disk. I think this is a nice compromise, since you will end up with
something like the following
[root@alpha ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
148G 30G 111G 21% /
/dev/sda1 194M 116M 68M 64% /boot
tmpfs 985M 12K 985M 1% /dev/shm
/dev/mapper/VolGroup-LogVol02
145G 140G 0 100% /mnt/btrfs-test
Compare this with btrfsctl -i output
[root@alpha btrfs-progs-unstable]# ./btrfsctl -i /mnt/btrfs-test/
Metadata, DUP: total=4.62GB, used=2.46GB
System, DUP: total=8.00MB, used=24.00KB
Data: total=134.80GB, used=134.80GB
Metadata: total=8.00MB, used=0.00
System: total=4.00MB, used=0.00
operation complete
This way we show that there is no more data space to be used, but we have
another 5GB of space left for metadata. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>