Before the reader/writer locks, btrfs_next_leaf needed to keep
the path blocking to avoid making lockdep upset.
Now that btrfs_next_leaf only takes read locks, this isn't required.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch was originally from Tejun Heo. lockdep complains about the btrfs
locking because we sometimes take btree locks from two different trees at the
same time. The current classes are based only on level in the btree, which
isn't enough information for lockdep to figure out if the lock is safe.
This patch makes a class for each type of tree, and lumps all the FS trees that
actually have files and directories into the same class.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs metadata btree is the source of significant
lock contention, especially in the root node. This
commit changes our locking to use a reader/writer
lock.
The lock is built on top of rw spinlocks, and it
extends the lock tracking to remember if we have a
read lock or a write lock when we go to blocking. Atomics
count the number of blocking readers or writers at any
given time.
It removes all of the adaptive spinning from the old code
and uses only the spinning/blocking hints inside of btrfs
to decide when it should continue spinning.
In read heavy workloads this is dramatically faster. In write
heavy workloads we're still faster because of less contention
on the root node lock.
We suffer slightly in dbench because we schedule more often
during write locks, but all other benchmarks so far are improved.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Hit this nice little deadlock. What happens is this
__btrfs_end_transaction with throttle set, --use_count so it equals 0
btrfs_commit_transaction
<somebody else actually manages to start the commit>
btrfs_end_transaction --use_count so now it's -1 <== BAD
we just return and wait on the transaction
This is bad because we just return after our use_count is -1 and don't let go
of our num_writer count on the transaction, so the guy committing the
transaction just sits there forever. Fix this by inc'ing our use_count if we're
going to call commit_transaction so that if we call btrfs_end_transaction it's
valid. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The extent_buffers have a very complex interface where
we use HIGHMEM for metadata and try to cache a kmap mapping
to access the memory.
The next commit adds reader/writer locks, and concurrent use
of this kmap cache would make it even more complex.
This commit drops the ability to use HIGHMEM with extent buffers,
and rips out all of the related code.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we balanced the chunks across the devices, BUG_ON() in
__finish_chunk_alloc() was triggered.
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:2568!
[SNIP]
Call Trace:
[<ffffffffa049525e>] btrfs_alloc_chunk+0x8e/0xa0 [btrfs]
[<ffffffffa04546b0>] do_chunk_alloc+0x330/0x3a0 [btrfs]
[<ffffffffa045c654>] btrfs_reserve_extent+0xb4/0x1f0 [btrfs]
[<ffffffffa045c86b>] btrfs_alloc_free_block+0xdb/0x350 [btrfs]
[<ffffffffa048a8d8>] ? read_extent_buffer+0xd8/0x1d0 [btrfs]
[<ffffffffa04476fd>] __btrfs_cow_block+0x14d/0x5e0 [btrfs]
[<ffffffffa044660d>] ? read_block_for_search+0x14d/0x4d0 [btrfs]
[<ffffffffa0447c9b>] btrfs_cow_block+0x10b/0x240 [btrfs]
[<ffffffffa044dd5e>] btrfs_search_slot+0x49e/0x7a0 [btrfs]
[<ffffffffa044f07d>] btrfs_insert_empty_items+0x8d/0xf0 [btrfs]
[<ffffffffa045e973>] insert_with_overflow+0x43/0x110 [btrfs]
[<ffffffffa045eb0d>] btrfs_insert_dir_item+0xcd/0x1f0 [btrfs]
[<ffffffffa0489bd0>] ? map_extent_buffer+0xb0/0xc0 [btrfs]
[<ffffffff812276ad>] ? rb_insert_color+0x9d/0x160
[<ffffffffa046cc40>] ? inode_tree_add+0xf0/0x150 [btrfs]
[<ffffffffa0474801>] btrfs_add_link+0xc1/0x1c0 [btrfs]
[<ffffffff811dacac>] ? security_inode_init_security+0x1c/0x30
[<ffffffffa04a28aa>] ? btrfs_init_acl+0x4a/0x180 [btrfs]
[<ffffffffa047492f>] btrfs_add_nondir+0x2f/0x70 [btrfs]
[<ffffffffa046af16>] ? btrfs_init_inode_security+0x46/0x60 [btrfs]
[<ffffffffa0474ac0>] btrfs_create+0x150/0x1d0 [btrfs]
[<ffffffff81159c63>] ? generic_permission+0x23/0xb0
[<ffffffff8115b415>] vfs_create+0xa5/0xc0
[<ffffffff8115ce6e>] do_last+0x5fe/0x880
[<ffffffff8115dc0d>] path_openat+0xcd/0x3d0
[<ffffffff8115e029>] do_filp_open+0x49/0xa0
[<ffffffff8116a965>] ? alloc_fd+0x95/0x160
[<ffffffff8114f0c7>] do_sys_open+0x107/0x1e0
[<ffffffff810bcc3f>] ? audit_syscall_entry+0x1bf/0x1f0
[<ffffffff8114f1e0>] sys_open+0x20/0x30
[<ffffffff81484ec2>] system_call_fastpath+0x16/0x1b
[SNIP]
RIP [<ffffffffa049444a>] __finish_chunk_alloc+0x20a/0x220 [btrfs]
The reason is:
Task1                               Space balance task
do_chunk_alloc()
 __finish_chunk_alloc()
  update device info
  in the chunk tree
   alloc system metadata block
                                    relocate system metadata block group
                                     set system metadata block group
                                     readonly, This block group is the
                                     only one that can allocate space. So
                                     there is no free space that can be
                                     allocated now.
   find no space and don't try
   to alloc new chunk, and then
   return ENOSPC
  BUG_ON() in __finish_chunk_alloc()
  was triggered.
Fix this bug by allocating a new system metadata chunk before relocating the
old one if we find there is no free space which can be allocated after setting
the old block group to be read-only.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Tested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Everybody else does this, we need to do it too. If we're syncing, we need to
tag the pages we're going to write for writeback so we don't end up writing the
same stuff over and over again if somebody is constantly redirtying our file.
This will keep us from having latencies with heavy sync workloads. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
So I had this brilliant idea to use atomic counters for outstanding and reserved
extents, but this turned out to be a bad idea. Consider this where we have 1
outstanding extent and 1 reserved extent
Reserver                                Releaser
                                        atomic_dec(outstanding) now 0
atomic_read(outstanding)+1 get 1
atomic_read(reserved) get 1
don't actually reserve anything because
they are the same
                                        atomic_cmpxchg(reserved, 1, 0)
atomic_inc(outstanding)
atomic_add(0, reserved)
                                        free reserved space for 1 extent
Then the reserver now has no actual space reserved for it, and when it goes to
finish the ordered IO it won't have enough space to do its allocation and you
get those lovely warnings.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Kill the check to see if we have 512mb of reserved space in delalloc and
shrink_delalloc if we do. This causes unexpected latencies and we have other
logic to see if we need to throttle. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
grab_cache_page will use mapping_gfp_mask(), which for all inodes is set to
GFP_HIGHUSER_MOVABLE. So instead use find_or_create_page in all cases where we
need GFP_NOFS so we don't deadlock. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
A user reported a deadlock when copying a bunch of files. This is because they
were low on memory and kthreadd got hung up trying to migrate pages for an
allocation when starting the caching kthread. The page was locked by the person
starting the caching kthread. To fix this we just need to use the async thread
stuff so that the threads are already created and we don't have to worry about
deadlocks. Thanks,
Reported-by: Roman Mamedov <rm@romanrm.ru>
Signed-off-by: Josef Bacik <josef@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (53 commits)
Input: synaptics - fix reporting of min coordinates
Input: tegra-kbc - enable key autorepeat
Input: kxtj9 - fix locking typo in kxtj9_set_poll()
Input: kxtj9 - fix bug in probe()
Input: intel-mid-touch - remove pointless checking for variable 'found'
Input: hp_sdc - staticize hp_sdc_kicker()
Input: pmic8xxx-keypad - fix a leak of the IRQ during init failure
Input: cy8ctmg110_ts - set reset_pin and irq_pin from platform data
Input: cy8ctmg110_ts - constify i2c_device_id table
Input: cy8ctmg110_ts - fix checking return value of i2c_master_send
Input: lifebook - make dmi callback functions return 1
Input: atkbd - make dmi callback functions return 1
Input: gpio_keys - switch to using SIMPLE_DEV_PM_OPS
Input: gpio_keys - add support for device-tree platform data
Input: aiptek - remove double define
Input: synaptics - set minimum coordinates as reported by firmware
Input: synaptics - process button bits in AGM packets
Input: synaptics - rename set_slot to be more descriptive
Input: synaptics - fuzz position for touchpad with reduced filtering
Input: synaptics - set resolution for MT_POSITION_X/Y axes
...
* 'next' of git://git.monstr.eu/linux-2.6-microblaze:
microblaze: Do not show error message for 32 interrupt lines
Revert "microblaze: PCI fix typo fault in of_node pointer moving into pci_bus"
microblaze: PCI fix typo fault in of_node pointer moving into pci_bus
microblaze: Add support for early console on mdm
microblaze: Simplify early console binding from DT
microblaze: Get early printk console earlier
microblaze: Standardise cpuinfo output for cache policy
microblaze: Unprivileged stream instruction awareness
microblaze: trivial: Fix typo fault
microblaze: exec: Remove redundant set_fs(USER_DS)
microblaze: Remove duplicated prototype of start_thread()
microblaze: Fix unaligned value saving to the stack for system with MMU
microblaze/irqs: Do not trace arch_local_{*,irq_*} functions
A merge with Linus' tree added a double include of linux/interrupt.h.
Fix by removing one of the includes.
Signed-off-by: Daniel Morsing <daniel.morsing@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Copying hp_pins and speaker_pins from line_out_pins may confuse the
parser, and it can lead to duplicated initializations for the same pin
with a wrong DAC assignment. The problem appears in 3.0 kernel code.
Cc: <stable@kernel.org> (for 3.0)
Signed-off-by: Takashi Iwai <tiwai@suse.de>
"adapter" is used as an array index in the adapters[] array so
the off by one would make us read past the end.
1c073b6797 "ALSA: asihpi - Remove spurious adapter index check"
reverted Dan Rosenberg's check that would have prevented the
overflow here.
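A minimal sketch of the kind of bounds check involved, not the asihpi code
itself: MAX_ADAPTERS, struct adapter and find_adapter() are hypothetical
names. It only shows why the comparison has to be ">=" and not ">" when the
value is used as an index.

```c
#include <stddef.h>

#define MAX_ADAPTERS 4	/* hypothetical size, not the driver's constant */

struct adapter {
	int in_use;
};

static struct adapter adapters[MAX_ADAPTERS];

/*
 * Return the adapter for a caller-supplied index, or NULL if the index is
 * out of range. index == MAX_ADAPTERS is already one past the last valid
 * element, so the test must be ">=" rather than ">".
 */
static struct adapter *find_adapter(unsigned int index)
{
	if (index >= MAX_ADAPTERS)
		return NULL;
	return &adapters[index];
}
```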
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Some quirk models don't set adc_nids but let the parser fill it.
But the recent code has unnecessary NULL-checks of spec->input_mux,
and it resulted in NULL dereferences.
This patch fixes that regression.
Reported-and-tested-by: Oliver Neukum <oneukum@suse.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Currently skb_gro_header_slow unconditionally resets frag0 and
frag0_len. However, when we can't pull on the skb this leaves
the GRO fields in an inconsistent state.
This patch fixes this by only resetting those fields after the
pskb_may_pull test.
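The ordering can be sketched in userspace as follows. This is an
illustration only: struct pkt, may_pull() and header_slow() are hypothetical
stand-ins, not struct sk_buff or the real GRO helpers. The point is that the
frag0 fields are cleared only once the pull is known to have succeeded.

```c
#include <stddef.h>

/* Hypothetical stand-in for the GRO bookkeeping; not struct sk_buff. */
struct pkt {
	const unsigned char *frag0;	/* fast-path pointer into the first fragment */
	unsigned int frag0_len;		/* bytes reachable through frag0 */
	const unsigned char *linear;	/* linear data area */
	unsigned int linear_len;	/* bytes available in the linear area */
};

/* Stand-in for pskb_may_pull(): succeeds only if len bytes are available. */
static int may_pull(const struct pkt *p, unsigned int len)
{
	return len <= p->linear_len;
}

static const void *header_slow(struct pkt *p, unsigned int hlen,
			       unsigned int offset)
{
	if (!may_pull(p, hlen))
		return NULL;	/* failure: frag0 and frag0_len are left untouched */

	p->frag0 = NULL;	/* success: from now on use the linear data */
	p->frag0_len = 0;
	return p->linear + offset;
}
```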
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the interrupt controller uses 32 interrupt lines, the kernel
shows an error message about a mismatch in the kind-of-intr parameter
because the value exceeds u32. A recast fixes this issue.
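A sketch of the arithmetic, assuming the check compares kind-of-intr against
a mask with one bit per interrupt line; irq_line_mask() and
kind_of_intr_ok() are hypothetical names, not the microblaze code. With 32
lines the mask needs bit 32 during the computation, so the shift is done in
64 bits and the result is cast back.

```c
#include <stdint.h>

/*
 * Bitmask with the low nr_irq bits set, for 1 <= nr_irq <= 32.
 * The shift is done in 64 bits because shifting a 32-bit value by 32 is
 * undefined in C; the result is then narrowed back to 32 bits.
 */
static uint32_t irq_line_mask(unsigned int nr_irq)
{
	return (uint32_t)((1ULL << nr_irq) - 1);
}

/* The type word is consistent if it sets no bit above the last line. */
static int kind_of_intr_ok(uint32_t kind_of_intr, unsigned int nr_irq)
{
	return kind_of_intr <= irq_line_mask(nr_irq);
}
```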
Signed-off-by: Michal Simek <monstr@monstr.eu>
Fixes bug introduced by 1c073b67.
Also declare pa local to block in which it is used.
Signed-off-by: Eliot Blennerhassett <eblennerhassett@audioscience.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
This patch also registers all necessary callbacks to support mute LED
only when such control is enabled. It also keeps the codec AFG in the D0 or D1
state all the time when aggressive power management is enabled, so that
vref-out control (and the mute LED) work correctly.
Signed-off-by: Vitaliy Kulikov <Vitaliy.Kulikov@idt.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Since the addition of file capabilities every write needs to read xattrs to
check if we have any capabilities to clear. In Linux 3.0 Andi Kleen added
a flag to cache the fact that we do not have any attributes on an inode.
Make sure to already mark a file as not having any attributes when reading
it from disk in case it doesn't even have an attribute fork. Based on an
earlier patch from Andi Kleen.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
We need to take some locks to prevent new ioends from coming in when we wait
for all existing ones to go away. Up to Linux 3.0 that was done using the
i_mutex held by the VFS fsync code, but now that we are called without
it we need to take care of it ourselves. Use the I/O lock instead of
i_mutex just like we do in other places.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Now that REQ_META bios aren't treated specially in the CFQ I/O scheduler
anymore, we can tag all buffers as metadata to make blktrace traces more
meaningful. Note that we use buffers also to zero out partial blocks
in the preallocation / hole punching code, and while they operate on
data blocks the zeros written certainly aren't data. I think this case
is borderline metadata enough to not bother special casing it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
Pull into a helper function some debug-only code that validates a
xfs_da_blkinfo structure that's been read from disk.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@infradead.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
merge fchmod() and fchmodat() guts, kill ancient broken kludge
xfs: fix misspelled S_IS...()
xfs: get rid of open-coded S_ISREG(), etc.
vfs: document locking requirements for d_move, __d_move and d_materialise_unique
omfs: fix (mode & S_IFDIR) abuse
btrfs: S_ISREG(mode) is not mode & S_IFREG...
ima: fmode_t misspelled as mode_t...
pci-label.c: size_t misspelled as mode_t
jffs2: S_ISLNK(mode & S_IFMT) is pointless
snd_msnd ->mode is fmode_t, not mode_t
v9fs_iop_get_acl: get rid of unused variable
vfs: dont chain pipe/anon/socket on superblock s_inodes list
Documentation: Exporting: update description of d_splice_alias
fs: add missing unlock in default_llseek()
This patch causes MD to generate an event (for device-mapper) when the
synchronization thread is reaped. This is expected behavior for device-mapper.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Revert most of commit e384e58549
md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log.
MD should not need to use DM's dirty log - we decided to use md's
bitmaps instead.
Keep the DIV_ROUND_UP clean-ups that were part of commit
e384e58549, however.
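For reference, the rounding idiom behind those clean-ups looks like this.
The macro body matches the kernel's DIV_ROUND_UP for non-negative operands;
chunks_needed() is a hypothetical helper used only to show a call, not a
function from md/bitmap.

```c
/* Integer division that rounds up instead of truncating. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical helper: how many chunk_size-sized chunks cover `sectors`. */
static unsigned long chunks_needed(unsigned long sectors,
				   unsigned long chunk_size)
{
	return DIV_ROUND_UP(sectors, chunk_size);
}
```

This replaces open-coded forms such as (sectors + chunk_size - 1) / chunk_size
with a name that states the intent.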
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If device-mapper creates a RAID1 array that includes devices to
be rebuilt, it will deref a NULL pointer when finished because
sysfs is not used by device-mapper instantiated RAID devices.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
While preparing to write a stripe we keep the parity block or blocks
locked (R5_LOCKED) - towards the end of schedule_reconstruction.
If the array is discovered to have failed before this write completes
we can leave those blocks LOCKED, and init_stripe will notice that a
free stripe still has a locked block and will complain.
So clear the R5_LOCKED flag in handle_failed_stripe, and demote the
'BUG' to a 'WARN_ON'.
Signed-off-by: NeilBrown <neilb@suse.de>
Read errors are considered to be corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Read errors are considered to be corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Read errors are considered to be corrected if write-back and re-read
cycle is finished without further problems. Thus moving the rdev->
corrected_errors counting after the re-reading looks more reasonable
IMHO. Also included a couple of whitespace fixes on sync_page_io().
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
page_address() returns void pointer, so the casts can be removed.
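A small userspace illustration of the language rule being relied on:
fake_page_address() is a hypothetical stand-in that, like page_address(),
returns void *. In C a void pointer converts implicitly to any object
pointer type, so both assignments below are equivalent.

```c
/* Stand-in for page_address(): returns an untyped pointer to a buffer. */
static char page_buf[4096];

static void *fake_page_address(void)
{
	return page_buf;
}

static int first_byte_after_write(void)
{
	char *with_cast = (char *)fake_page_address();	/* redundant cast */
	char *no_cast = fake_page_address();		/* implicit conversion */

	no_cast[0] = 42;	/* both pointers refer to the same buffer */
	return with_cast[0];
}
```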
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Normally we would fail a device with a READ error. However if doing
so causes the array to fail, it is better to leave the device
in place and just return the read error to the caller.
The current test to decide whether the array will fail is overly
simplistic.
We have a function 'enough' which can tell if the array is failed or
not, so use it to guide the decision.
Signed-off-by: NeilBrown <neilb@suse.de>
When we get a read error during recovery, RAID10 previously
arranged for the recovering device to appear to fail so that
the recovery stops and doesn't restart. This is misleading and wrong.
Instead, make use of the new recovery_disabled handling and mark
the target device as having recovery disabled.
Add appropriate checks in add_disk and remove_disk so that devices
are removed and not re-added when recovery is disabled.
Signed-off-by: NeilBrown <neilb@suse.de>
If we hit a read error while recovering a mirror, we want to abort the
recovery without necessarily failing the disk - as having a disk with
a read error is better than not having an array at all.
Currently this is managed with a per-array flag "recovery_disabled"
and is only implemented for RAID1. For RAID10 we will need finer
grained control as we might want to disable recovery for individual
devices separately.
So push more of the decision making into the personality.
'recovery_disabled' is now a 'cookie' which is copied when the
personality wants to disable recovery and is changed when a device is
added to the array as this is used as a trigger to 'try recovery
again'.
This will allow RAID10 to get the control that it needs.
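The cookie scheme described above can be sketched like this. It is a
userspace illustration under stated assumptions: the struct and function
names are hypothetical, and starting the personality's copy one below the
array's value is an assumption made here so that recovery is initially
allowed.

```c
/* Per-array cookie; changed whenever a device is added. */
struct array_sketch {
	unsigned int recovery_disabled;
};

/* The personality's private copy of the cookie. */
struct personality_sketch {
	unsigned int recovery_disabled;
};

/* Assumption: start with a value that differs from the array's cookie. */
static void personality_init(struct personality_sketch *p,
			     const struct array_sketch *a)
{
	p->recovery_disabled = a->recovery_disabled - 1;
}

/* Disabling recovery means copying the current cookie. */
static void disable_recovery(struct personality_sketch *p,
			     const struct array_sketch *a)
{
	p->recovery_disabled = a->recovery_disabled;
}

/* Recovery is off only while the copy still matches the array's cookie. */
static int recovery_allowed(const struct personality_sketch *p,
			    const struct array_sketch *a)
{
	return p->recovery_disabled != a->recovery_disabled;
}

/* Adding a device changes the cookie, which re-enables recovery. */
static void device_added(struct array_sketch *a)
{
	a->recovery_disabled++;
}
```

Because the decision is a comparison against a private copy, the same idea
can be kept per device instead of per array, which is the finer-grained
control RAID10 is said to need.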
Signed-off-by: NeilBrown <neilb@suse.de>
Commit c89a8eee61 ("Allow faulty devices to be removed from a
readonly array.") added some work on ro array in the function,
but it couldn't be done since we didn't allow the ro array to be
handled from the beginning. Fix it.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
There are places where sysfs links to rdev are handled
in the same way. Add helper functions to consolidate
them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As per the printk_ratelimit() comment, it should not be used.
Signed-off-by: Christian Dietrich <christian.dietrich@informatik.uni-erlangen.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Calls to __test_and_{set,clear}_bit_le() that ignore the return value
can be replaced with __{set,clear}_bit_le().
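A userspace sketch of why the replacement is safe. The two helpers below
are hypothetical, non-atomic stand-ins and ignore the little-endian handling
of the real _le variants. When the return value is discarded, the only
remaining effect of the test-and-set form is setting the bit.

```c
#define BITS_PER_LONG_SKETCH (8 * sizeof(unsigned long))

/* Set bit nr and report whether it was already set. */
static int test_and_set_bit_sketch(unsigned int nr, unsigned long *addr)
{
	unsigned long mask = 1UL << (nr % BITS_PER_LONG_SKETCH);
	unsigned long *p = addr + nr / BITS_PER_LONG_SKETCH;
	int old = (*p & mask) != 0;

	*p |= mask;
	return old;
}

/* Set bit nr without looking at the old value. */
static void set_bit_sketch(unsigned int nr, unsigned long *addr)
{
	addr[nr / BITS_PER_LONG_SKETCH] |= 1UL << (nr % BITS_PER_LONG_SKETCH);
}
```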
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: NeilBrown <neilb@suse.de>
Cc: linux-raid@vger.kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
handle_stripe5() and handle_stripe6() are now virtually identical.
So discard one and rename the other to 'analyse_stripe()'.
It always returns 0, so change it to 'void' and remove the 'done'
variable in handle_stripe().
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
The RAID6 version of this code is usable for RAID5 providing:
- we test "conf->max_degraded" rather than "2" as appropriate
- we make sure s->failed_num[1] is meaningful (and not '-1')
when s->failed > 1
The 'return 1' must become 'goto finish' in the new location.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Apart from 'prexor' which can only be set for RAID5, and
'qd_idx' which can only be meaningful for RAID6, these two
chunks of code are nearly the same.
So combine them into one adding a test to call either
handle_parity_checks5 or handle_parity_checks6 as appropriate.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
RAID6 is only allowed to choose 'reconstruct-write' while RAID5 is
also allowed 'read-modify-write'.
Apart from this difference, handle_stripe_dirtying[56] are nearly
identical. So resolve these differences and create just one function.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>
Provided that ->failed_num[1] is not a valid device number (which is
easily achieved) fetch_block6 provides all the functionality of
fetch_block5.
So remove the latter and rename the former to simply "fetch_block".
Then handle_stripe_fill5 and handle_stripe_fill6 become the same and
can similarly be united.
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Namhyung Kim <namhyung@gmail.com>