For /proc/kcore, vmalloc areas are registered per arch. But, all of them
registers same range of [VMALLOC_START...VMALLOC_END) This patch unifies
them. By this. archs which have no kclist_add() hooks can see vmalloc
area correctly.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently, kclist_add() only eats start address and size as its arguments.
Considering to make kclist dynamically reconfigulable, it's necessary to
know which kclists are for System RAM and which are not.
This patch add kclist types as
KCORE_RAM
KCORE_VMALLOC
KCORE_TEXT
KCORE_OTHER
This "type" is used in a patch following this for detecting KCORE_RAM.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset is for /proc/kcore. With this,
- many per-arch hooks are removed.
- /proc/kcore will know really valid physical memory area.
- /proc/kcore will be aware of memory hotplug.
- /proc/kcore will be architecture independent i.e.
if an arch supports CONFIG_MMU, it can use /proc/kcore.
(if the arch uses usual memory layout.)
This patch:
/proc/kcore uses its own list handling codes. It's better to use
generic list codes.
No changes in logic. just clean up.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A patch to give a better overview of the userland application stack usage,
especially for embedded linux.
Currently you are only able to dump the main process/thread stack usage
which is showed in /proc/pid/status by the "VmStk" Value. But you get no
information about the consumed stack memory of the the threads.
There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
the vm mapping where the thread stack pointer reside with "[thread stack
xxxxxxxx]". xxxxxxxx is the maximum size of stack. This is a value
information, because libpthread doesn't set the start of the stack to the
top of the mapped area, depending of the pthread usage.
A sample output of /proc/<pid>/task/<tid>/maps looks like:
08048000-08049000 r-xp 00000000 03:00 8312 /opt/z
08049000-0804a000 rw-p 00001000 03:00 8312 /opt/z
0804a000-0806b000 rw-p 00000000 00:00 0 [heap]
a7d12000-a7d13000 ---p 00000000 00:00 0
a7d13000-a7f13000 rw-p 00000000 00:00 0 [thread stack: 001ff4b4]
a7f13000-a7f14000 ---p 00000000 00:00 0
a7f14000-a7f36000 rw-p 00000000 00:00 0
a7f36000-a8069000 r-xp 00000000 03:00 4222 /lib/libc.so.6
a8069000-a806b000 r--p 00133000 03:00 4222 /lib/libc.so.6
a806b000-a806c000 rw-p 00135000 03:00 4222 /lib/libc.so.6
a806c000-a806f000 rw-p 00000000 00:00 0
a806f000-a8083000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0
a8083000-a8084000 r--p 00013000 03:00 14462 /lib/libpthread.so.0
a8084000-a8085000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0
a8085000-a8088000 rw-p 00000000 00:00 0
a8088000-a80a4000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2
a80a4000-a80a5000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2
a80a5000-a80a6000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2
afaf5000-afb0a000 rw-p 00000000 00:00 0 [stack]
ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso]
Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
which will you give the current stack usage in kb.
A sample output of /proc/self/status looks like:
Name: cat
State: R (running)
Tgid: 507
Pid: 507
.
.
.
CapBnd: fffffffffffffeff
voluntary_ctxt_switches: 0
nonvoluntary_ctxt_switches: 0
Stack usage: 12 kB
I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
address of the associated thread stack and not the one of the main
process. This makes more sense.
[akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove obfuscated zero-length input check and return -EINVAL instead of
-EIO error to make the error message clear to user. Add whitespace
stripping. No functionality changes.
The old code:
echo 1 > /proc/pid/make-it-fail (ok)
echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Input/output error)
The new code:
echo 1 > /proc/pid/make-it-fail (ok)
echo 1foo > /proc/pid/make-it-fail (-bash: echo: write error: Invalid argument)
This patch is conservative in changes to not breaking existing
scripts/applications.
Signed-off-by: Vincent Li <macli@brc.ubc.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton pointed out similar string hacking and obfuscated check for
zero-length input at the end of the function, David Rientjes suggested to
use strict_strtol to replace simple_strtol, this patch cover above
suggestions, add removing of leading and trailing whitespace from user
input. It does not change function behavious.
Signed-off-by: Vincent Li <macli@brc.ubc.ca>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Amerigo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In 9063c61fd5 ("x86, 64-bit: Clean up user address masking") Linus
fixed the wrong size of /proc/kcore problem.
But its size still looks insane, since it never equals the size of
physical memory.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Tao Ma <tao.ma@oracle.com>
Cc: <mtk.manpages@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The exiting sub-thread flushes /proc/pid only, but this doesn't buy too
much: ps and friends mostly use /proc/tid/task/pid.
Remove "if (thread_group_leader())" checks from proc_flush_task() path,
this means we always remove /proc/tid/task/pid dentry on exit, and this
actually matches the comment above proc_flush_task().
The test-case:
static void* tfunc(void *arg)
{
char name[256];
sprintf(name, "/proc/%d/task/%ld/status", getpid(), gettid());
close(open(name, O_RDONLY));
return NULL;
}
int main(void)
{
pthread_t t;
for (;;) {
if (!pthread_create(&t, NULL, &tfunc, NULL))
pthread_join(t, NULL);
}
}
slabtop shows that pid/proc_inode_cache/etc grow quickly and
"indefinitely" until the task is killed or shrink_slab() is called, not
good. And the main thread needs a lot of time to exit.
The same can happen if something like "ps -efL" runs continuously, while
some application spawns short-living threads.
Reported-by: "James M. Leddy" <jleddy@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dominic Duval <dduval@redhat.com>
Cc: Frank Hirtz <fhirtz@redhat.com>
Cc: "Fuller, Johnray" <Johnray.Fuller@gs.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Paul Batkowski <pbatkowski@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/$pid/limits should show RLIMIT_CPU as seconds, which is the unit
used in kernel/posix-cpu-timers.c:
unsigned long psecs = cputime_to_secs(ptime);
...
if (psecs >= sig->rlim[RLIMIT_CPU].rlim_max) {
...
__group_send_sig_info(SIGKILL, SEND_SIG_PRIV, tsk);
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
qnx4 wrte support has never been fully implement, is broken since the dawn
of time and hasn't been actively developed since before git history
started.
Instead of letting it further bitrot and complicate API transition (like
the new truncate code) remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Anders Larsen <al@alarsen.net>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_sync_write() does the right thing for turning the aio_writev method
into a normal non-vectored synchronous write, no need to duplicate it in
ntfs.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Split the anonfd interface into a bare file pointer creation one, and a
file pointer creation plus install one.
There are cases, like the usage of eventfds inside other kernel
interfaces, where the file pointer created by anonfd needs to be used
inside the initialization of other structures.
As it is right now, as soon as anon_inode_getfd() returns, the kenrle can
race with userspace closing the newly installed file descriptor.
This patch, while keeping the old anon_inode_getfd(), introduces a new
anon_inode_getfile() (whose services are reused in anon_inode_getfd())
that allows to split the file creation phase and the fd install one.
Once all the kernel structures are initialized, the code can call the
proper fd_install().
Gregory manifested the need for something like this inside KVM.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Morris <jmorris@namei.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As mentioned in Documentation/CodingStyle, move EXPORT* macro's
to the line immediately after the closing function brace line.
Also, move the __initcall() similarly.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to Documentation/CodingStyle the EXPORT* macro should follow
immediately after the closing function brace line.
Also, mark_buffer_async_write_endio() and do_thaw_all() are not used
elsewhere so they should be marked as static.
In addition, file_fsync() is actually in fs/sync.c so move the EXPORT* to
that file.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have had a report of bad memory allocation latency during DVD-RAM (UDF)
writing. This is causing the user's desktop session to become unusable.
Jan tracked the cause of this down to UDF inode reclaim blocking:
gnome-screens D ffff810006d1d598 0 20686 1
ffff810006d1d508 0000000000000082 ffff810037db6718 0000000000000800
ffff810006d1d488 ffffffff807e4280 ffffffff807e4280 ffff810006d1a580
ffff8100bccbc140 ffff810006d1a8c0 0000000006d1d4e8 ffff810006d1a8c0
Call Trace:
[<ffffffff804477f3>] io_schedule+0x63/0xa5
[<ffffffff802c2587>] sync_buffer+0x3b/0x3f
[<ffffffff80447d2a>] __wait_on_bit+0x47/0x79
[<ffffffff80447dc6>] out_of_line_wait_on_bit+0x6a/0x77
[<ffffffff802c24f6>] __wait_on_buffer+0x1f/0x21
[<ffffffff802c442a>] __bread+0x70/0x86
[<ffffffff88de9ec7>] :udf:udf_tread+0x38/0x3a
[<ffffffff88de0fcf>] :udf:udf_update_inode+0x4d/0x68c
[<ffffffff88de26e1>] :udf:udf_write_inode+0x1d/0x2b
[<ffffffff802bcf85>] __writeback_single_inode+0x1c0/0x394
[<ffffffff802bd205>] write_inode_now+0x7d/0xc4
[<ffffffff88de2e76>] :udf:udf_clear_inode+0x3d/0x53
[<ffffffff802b39ae>] clear_inode+0xc2/0x11b
[<ffffffff802b3ab1>] dispose_list+0x5b/0x102
[<ffffffff802b3d35>] shrink_icache_memory+0x1dd/0x213
[<ffffffff8027ede3>] shrink_slab+0xe3/0x158
[<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
[<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
[<ffffffff802951fa>] alloc_page_vma+0x176/0x189
[<ffffffff802822d8>] __do_fault+0x10c/0x417
[<ffffffff80284232>] handle_mm_fault+0x466/0x940
[<ffffffff8044b922>] do_page_fault+0x676/0xabf
This blocks with iprune_mutex held, which then blocks other reclaimers:
X D ffff81009d47c400 0 17285 14831
ffff8100844f3728 0000000000000086 0000000000000000 ffff81000000e288
ffff81000000da00 ffffffff807e4280 ffffffff807e4280 ffff81009d47c400
ffffffff805ff890 ffff81009d47c740 00000000844f3808 ffff81009d47c740
Call Trace:
[<ffffffff80447f8c>] __mutex_lock_slowpath+0x72/0xa9
[<ffffffff80447e1a>] mutex_lock+0x1e/0x22
[<ffffffff802b3ba1>] shrink_icache_memory+0x49/0x213
[<ffffffff8027ede3>] shrink_slab+0xe3/0x158
[<ffffffff8027fbab>] try_to_free_pages+0x177/0x232
[<ffffffff8027a578>] __alloc_pages+0x1fa/0x392
[<ffffffff8029507f>] alloc_pages_current+0xd1/0xd6
[<ffffffff80279ac0>] __get_free_pages+0xe/0x4d
[<ffffffff802ae1b7>] __pollwait+0x5e/0xdf
[<ffffffff8860f2b4>] :nvidia:nv_kern_poll+0x2e/0x73
[<ffffffff802ad949>] do_select+0x308/0x506
[<ffffffff802adced>] core_sys_select+0x1a6/0x254
[<ffffffff802ae0b7>] sys_select+0xb5/0x157
Now I think the main problem is having the filesystem block (and do IO) in
inode reclaim. The problem is that this doesn't get accounted well and
penalizes a random allocator with a big latency spike caused by work
generated from elsewhere.
I think the best idea would be to avoid this. By design if possible, or
by deferring the hard work to an asynchronous context. If the latter,
then the fs would probably want to throttle creation of new work with
queue size of the deferred work, but let's not get into those details.
Anyway, the other obvious thing we looked at is the iprune_mutex which is
causing the cascading blocking. We could turn this into an rwsem to
improve concurrency. It is unreasonable to totally ban all potentially
slow or blocking operations in inode reclaim, so I think this is a cheap
way to get a small improvement.
This doesn't solve the whole problem of course. The process doing inode
reclaim will still take the latency hit, and concurrent processes may end
up contending on filesystem locks. So fs developers should keep these
problems in mind.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make all seq_operations structs const, to help mitigate against
revectoring user-triggerable function pointers.
This is derived from the grsecurity patch, although generated from scratch
because it's simpler than extracting the changes from there.
Signed-off-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move various magic-number definitions into magic.h.
Signed-off-by: Nick Black <dank@qemfd.net>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
__estimate_accuracy() was prone to integer overflow, for example if *tv ==
{2147, 483648000} on a 32 bit computer (or even for delays as small as
{429, 500000000} if the task is niced).
Because the result was already forced between 0 and 100ms, the effect of
the overflow was not too problematic, but the use of the hrtimer range
feature was not optimal in overflow cases.
This patch ensures that there can not be an integer overflow in this
function.
Signed-off-by: Guillaume Knispel <gknispel@proformatique.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function uses signed integers for the unix_date and local variables -
if a negative number is supplied and the leap-year condition is not met,
month will be 0, leading to a read of day_n[-1]
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When calling vfs_unlink() on the lower dentry, d_delete() turns the
dentry into a negative dentry when the d_count is 1. This eventually
caused a NULL pointer deref when a read() or write() was done and the
negative dentry's d_inode was dereferenced in
ecryptfs_read_update_atime() or ecryptfs_getxattr().
Placing mutt's tmpdir in an eCryptfs mount is what initially triggered
the oops and I was able to reproduce it with the following sequence:
open("/tmp/upper/foo", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW, 0600) = 3
link("/tmp/upper/foo", "/tmp/upper/bar") = 0
unlink("/tmp/upper/foo") = 0
open("/tmp/upper/bar", O_RDWR|O_CREAT|O_NOFOLLOW, 0600) = 4
unlink("/tmp/upper/bar") = 0
write(4, "eCryptfs test\n"..., 14 <unfinished ...>
+++ killed by SIGKILL +++
https://bugs.launchpad.net/ecryptfs/+bug/387073
Reported-by: Loïc Minier <loic.minier@canonical.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Errors returned from vfs_read() and vfs_write() calls to the lower
filesystem were being masked as -EINVAL. This caused some confusion to
users who saw EINVAL instead of ENOSPC when the disk was full, for
instance.
Also, the actual bytes read or written were not accessible by callers to
ecryptfs_read_lower() and ecryptfs_write_lower(), which may be useful in
some cases. This patch updates the error handling logic where those
functions are called in order to accept positive return codes indicating
success.
Cc: Eric Sandeen <esandeen@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
When searching through the global authentication tokens for a given key
signature, verify that a matching key has not been revoked and has not
expired. This allows the `keyctl revoke` command to be properly used on
keys in use by eCryptfs.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Returns -ENOTSUPP when attempting to use filename encryption with
something other than a password authentication token, such as a private
token from openssl. Using filename encryption with a userspace eCryptfs
key module is a future goal. Until then, this patch handles the
situation a little better than simply using a BUG_ON().
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
If the lower inode is read-only, don't attempt to open the lower file
read/write and don't hand off the open request to the privileged
eCryptfs kthread for opening it read/write. Instead, only try an
unprivileged, read-only open of the file and give up if that fails.
This patch fixes an oops when eCryptfs is mounted on top of a read-only
mount.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Sandeen <esandeen@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Returns an error when an unrecognized cipher code is present in a tag 3
packet or an ecryptfs_crypt_stat cannot be initialized. Also sets an
crypt_stat->tfm error pointer to NULL to ensure that it will not be
incorrectly freed in ecryptfs_destroy_crypt_stat().
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
So, I compiled a 2.6.31-rc5 kernel with ecryptfs and loaded its module.
When it came time to mount my filesystem, I got this in dmesg, and it
refused to mount:
[93577.776637] Unable to allocate crypto cipher with name [aes]; rc = [-2]
[93577.783280] Error attempting to initialize key TFM cipher with name = [aes]; rc = [-2]
[93577.791183] Error attempting to initialize cipher with name = [aes] and key size = [32]; rc = [-2]
[93577.800113] Error parsing options; rc = [-22]
I figured from the error message that I'd either forgotten to load "aes"
or that my key size was bogus. Neither one of those was the case. In
fact, I was missing the CRYPTO_ECB config option and the 'ecb' module.
Unfortunately, there's no trace of 'ecb' in that error message.
I've done two things to fix this. First, I've modified ecryptfs's
Kconfig entry to select CRYPTO_ECB and CRYPTO_CBC. I also took CRYPTO
out of the dependencies since the 'select' will take care of it for us.
I've also modified the error messages to print a string that should
contain both 'ecb' and 'aes' in my error case. That will give any
future users a chance of finding the right modules and Kconfig options.
I also wonder if we should:
select CRYPTO_AES if !EMBEDDED
since I think most ecryptfs users are using AES like me.
Cc: ecryptfs-devel@lists.launchpad.net
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
[tyhicks@linux.vnet.ibm.com: Removed extra newline, 80-char violation]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Lockdep reports the following valid-looking possible AB-BA deadlock with
global_auth_tok_list_mutex and keysig_list_mutex:
ecryptfs_new_file_context() ->
ecryptfs_copy_mount_wide_sigs_to_inode_sigs() ->
mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex);
-> ecryptfs_add_keysig() ->
mutex_lock(&crypt_stat->keysig_list_mutex);
vs
ecryptfs_generate_key_packet_set() ->
mutex_lock(&crypt_stat->keysig_list_mutex);
-> ecryptfs_find_global_auth_tok_for_sig() ->
mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex);
ie the two mutexes are taken in opposite orders in the two different
code paths. I'm not sure if this is a real bug where two threads could
actually hit the two paths in parallel and deadlock, but it at least
makes lockdep impossible to use with ecryptfs since this report triggers
every time and disables future lockdep reporting.
Since ecryptfs_add_keysig() is called only from the single callsite in
ecryptfs_copy_mount_wide_sigs_to_inode_sigs(), the simplest fix seems to
be to move the lock of keysig_list_mutex back up outside of the where
global_auth_tok_list_mutex is taken. This patch does that, and fixes
the lockdep report on my system (and ecryptfs still works OK).
The full output of lockdep fixed by this patch is:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.31-2-generic #14~rbd2
-------------------------------------------------------
gdm/2640 is trying to acquire lock:
(&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}, at: [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
but task is already holding lock:
(&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&crypt_stat->keysig_list_mutex){+.+.+.}:
[<ffffffff8108c897>] check_prev_add+0x2a7/0x370
[<ffffffff8108cfc1>] validate_chain+0x661/0x750
[<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
[<ffffffff8108d585>] lock_acquire+0xa5/0x150
[<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
[<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
[<ffffffff8121526a>] ecryptfs_add_keysig+0x5a/0xb0
[<ffffffff81213299>] ecryptfs_copy_mount_wide_sigs_to_inode_sigs+0x59/0xb0
[<ffffffff81214b06>] ecryptfs_new_file_context+0xa6/0x1a0
[<ffffffff8120e42a>] ecryptfs_initialize_file+0x4a/0x140
[<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
[<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
[<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
[<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
[<ffffffff8112d8d9>] do_sys_open+0x69/0x140
[<ffffffff8112d9f0>] sys_open+0x20/0x30
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
-> #0 (&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}:
[<ffffffff8108c675>] check_prev_add+0x85/0x370
[<ffffffff8108cfc1>] validate_chain+0x661/0x750
[<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
[<ffffffff8108d585>] lock_acquire+0xa5/0x150
[<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
[<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
[<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0
[<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120
[<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200
[<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140
[<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
[<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
[<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
[<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
[<ffffffff8112d8d9>] do_sys_open+0x69/0x140
[<ffffffff8112d9f0>] sys_open+0x20/0x30
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
other info that might help us debug this:
2 locks held by gdm/2640:
#0: (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8113cb8b>] do_filp_open+0x3cb/0xae0
#1: (&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0
stack backtrace:
Pid: 2640, comm: gdm Tainted: G C 2.6.31-2-generic #14~rbd2
Call Trace:
[<ffffffff8108b988>] print_circular_bug_tail+0xa8/0xf0
[<ffffffff8108c675>] check_prev_add+0x85/0x370
[<ffffffff81094912>] ? __module_text_address+0x12/0x60
[<ffffffff8108cfc1>] validate_chain+0x661/0x750
[<ffffffff81017275>] ? print_context_stack+0x85/0x140
[<ffffffff81089c68>] ? find_usage_backwards+0x38/0x160
[<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
[<ffffffff8108d585>] lock_acquire+0xa5/0x150
[<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff8108b0b0>] ? check_usage_backwards+0x0/0xb0
[<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
[<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff8108c02c>] ? mark_held_locks+0x6c/0xa0
[<ffffffff81125b0d>] ? kmem_cache_alloc+0xfd/0x1a0
[<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190
[<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
[<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
[<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0
[<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120
[<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200
[<ffffffff81210240>] ? ecryptfs_init_persistent_file+0x60/0xe0
[<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140
[<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
[<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
[<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
[<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
[<ffffffff8129a93e>] ? _raw_spin_unlock+0x5e/0xb0
[<ffffffff8155410b>] ? _spin_unlock+0x2b/0x40
[<ffffffff81139e9b>] ? getname+0x3b/0x240
[<ffffffff81148a5a>] ? alloc_fd+0xfa/0x140
[<ffffffff8112d8d9>] do_sys_open+0x69/0x140
[<ffffffff81553b8f>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8112d9f0>] sys_open+0x20/0x30
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
In ecryptfs_destroy_inode(), inode_info->lower_file_mutex is locked,
and just after the mutex is unlocked, the code does:
kmem_cache_free(ecryptfs_inode_info_cache, inode_info);
This means that if another context could possibly try to take the same
mutex as ecryptfs_destroy_inode(), then it could end up getting the
mutex just before the data structure containing the mutex is freed.
So any such use would be an obvious use-after-free bug (catchable with
slab poisoning or mutex debugging), and therefore the locking in
ecryptfs_destroy_inode() is not needed and can be dropped.
Similarly, in ecryptfs_destroy_crypt_stat(), crypt_stat->keysig_list_mutex
is locked, and then the mutex is unlocked just before the code does:
memset(crypt_stat, 0, sizeof(struct ecryptfs_crypt_stat));
Therefore taking this mutex is similarly not necessary.
Removing this locking fixes false-positive lockdep reports such as the
following (and they are false-positives for exactly the same reason
that the locking is not needed):
=================================
[ INFO: inconsistent lock state ]
2.6.31-2-generic #14~rbd3
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/323 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&inode_info->lower_file_mutex){+.+.?.}, at: [<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100
{RECLAIM_FS-ON-W} state was registered at:
[<ffffffff8108c02c>] mark_held_locks+0x6c/0xa0
[<ffffffff8108c10f>] lockdep_trace_alloc+0xaf/0xe0
[<ffffffff81125a51>] kmem_cache_alloc+0x41/0x1a0
[<ffffffff8113117a>] get_empty_filp+0x7a/0x1a0
[<ffffffff8112dd46>] dentry_open+0x36/0xc0
[<ffffffff8121a36c>] ecryptfs_privileged_open+0x5c/0x2e0
[<ffffffff81210283>] ecryptfs_init_persistent_file+0xa3/0xe0
[<ffffffff8120e838>] ecryptfs_lookup_and_interpose_lower+0x278/0x380
[<ffffffff8120f97a>] ecryptfs_lookup+0x12a/0x250
[<ffffffff8113930a>] real_lookup+0xea/0x160
[<ffffffff8113afc8>] do_lookup+0xb8/0xf0
[<ffffffff8113b518>] __link_path_walk+0x518/0x870
[<ffffffff8113bd9c>] path_walk+0x5c/0xc0
[<ffffffff8113be5b>] do_path_lookup+0x5b/0xa0
[<ffffffff8113bfe7>] user_path_at+0x57/0xa0
[<ffffffff811340dc>] vfs_fstatat+0x3c/0x80
[<ffffffff8113424b>] vfs_stat+0x1b/0x20
[<ffffffff81134274>] sys_newstat+0x24/0x50
[<ffffffff81013132>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 7811
hardirqs last enabled at (7811): [<ffffffff810c037f>] call_rcu+0x5f/0x90
hardirqs last disabled at (7810): [<ffffffff810c0353>] call_rcu+0x33/0x90
softirqs last enabled at (3764): [<ffffffff810631da>] __do_softirq+0x14a/0x220
softirqs last disabled at (3751): [<ffffffff8101440c>] call_softirq+0x1c/0x30
other info that might help us debug this:
2 locks held by kswapd0/323:
#0: (shrinker_rwsem){++++..}, at: [<ffffffff810f67ed>] shrink_slab+0x3d/0x190
#1: (&type->s_umount_key#35){.+.+..}, at: [<ffffffff811429a1>] prune_dcache+0xd1/0x1b0
stack backtrace:
Pid: 323, comm: kswapd0 Tainted: G C 2.6.31-2-generic #14~rbd3
Call Trace:
[<ffffffff8108ad6c>] print_usage_bug+0x18c/0x1a0
[<ffffffff8108aff0>] ? check_usage_forwards+0x0/0xc0
[<ffffffff8108bac2>] mark_lock_irq+0xf2/0x280
[<ffffffff8108bd87>] mark_lock+0x137/0x1d0
[<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
[<ffffffff8108bee6>] mark_irqflags+0xc6/0x1a0
[<ffffffff8108d337>] __lock_acquire+0x287/0x430
[<ffffffff8108d585>] lock_acquire+0xa5/0x150
[<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
[<ffffffff8108d2e7>] ? __lock_acquire+0x237/0x430
[<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0
[<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
[<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
[<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
[<ffffffff8129a91e>] ? _raw_spin_unlock+0x5e/0xb0
[<ffffffff81552b36>] mutex_lock_nested+0x46/0x60
[<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100
[<ffffffff81145d27>] destroy_inode+0x87/0xd0
[<ffffffff81146b4c>] generic_delete_inode+0x12c/0x1a0
[<ffffffff81145832>] iput+0x62/0x70
[<ffffffff811423c8>] dentry_iput+0x98/0x110
[<ffffffff81142550>] d_kill+0x50/0x80
[<ffffffff81142623>] prune_one_dentry+0xa3/0xc0
[<ffffffff811428b1>] __shrink_dcache_sb+0x271/0x290
[<ffffffff811429d9>] prune_dcache+0x109/0x1b0
[<ffffffff81142abf>] shrink_dcache_memory+0x3f/0x50
[<ffffffff810f68dd>] shrink_slab+0x12d/0x190
[<ffffffff810f9377>] balance_pgdat+0x4d7/0x640
[<ffffffff8104c4c0>] ? finish_task_switch+0x40/0x150
[<ffffffff810f63c0>] ? isolate_pages_global+0x0/0x60
[<ffffffff810f95f7>] kswapd+0x117/0x170
[<ffffffff810777a0>] ? autoremove_wake_function+0x0/0x40
[<ffffffff810f94e0>] ? kswapd+0x0/0x170
[<ffffffff810773be>] kthread+0x9e/0xb0
[<ffffffff8101430a>] child_rip+0xa/0x20
[<ffffffff81013c90>] ? restore_args+0x0/0x30
[<ffffffff81077320>] ? kthread+0x0/0xb0
[<ffffffff81014300>] ? child_rip+0x0/0x20
Signed-off-by: Roland Dreier <roland@digitalvampire.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
In ocfs2_file_aio_write, we will prevent direct io if
we find that we are appending(changing i_size) and call
generic_file_aio_write_nolock. But actually O_DIRECT flag
is there and this function will call generic_file_direct_write
eventually which will update i_size and leave di->i_size
alone. The bug is
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1173.
So this patch let ocfs2_direct_IO returns 0 directly if we
are appending so that buffered write will be called and
di->i_size get updated successfully. And this is also
what we want in ocfs2_file_aio_write.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
when we check/modify lockres->purge, we should with the protection of lockres->spinlock.
in dlm_purge_lockres(), the checking/modifying is not with the protectin.
this patch fixes it.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This patch adds the missed mlog_exit() and mlog_exit_void() lines when routines
return.
Signed-off-by: Coly Li <coly.li@suse.de>
Acked-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In a clustered setup, we have to panic the box on journal abort. This is
because we don't have the facility to go hard readonly. With hard ro, another
node would detect node failure and initiate recovery.
Having said that, we shouldn't force panic if the volume is mounted locally.
This patch defers the handling to the mount option, errors.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The ioctl will take 3 parameters: old_path, new_path and
preserve and call vfs_reflink. It is useful when we backport
reflink features to old kernels.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
reflink has 2 options for the destination file:
1. snapshot: reflink will attempt to preserve ownership, permissions,
and all other security state in order to create a full snapshot.
2. new file: it will acquire the data extent sharing but will see the
file's security state and attributes initialized as a new file.
So add the option to ocfs2.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
reflink is a very complicated process, so it can't be integrated
into one transaction. So if the system panic in the operation, we
may leave a unfinished inode in the destication directory.
So we will try to create an inode in orphan_dir first, reflink it
to the src file and then move it to the destication file in the end.
In that way we won't be afraid of any corruption during the reflink.
This patch adds 2 functions for orphan_dir operation:
1. Create a new inode in orphand dir.
2. Move an inode to a target dir.
Note:
fsck.ocfs2 should work for us to remove the unfinished file in the
orphan_dir.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
In order to make the original function more suitable for reflink,
we modify the following inode operations. Both are tiny.
1. ocfs2_mknod_locked only use dentry for mlog, so move it to
the caller so that reflink can use it without dentry.
2. ocfs2_prepare_orphan_dir only want inode to get its ip_blkno.
So use ip_blkno instead.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
In ocfs2_extend_rotate_transaction, op_credits is the orignal
credits in the handle and we only want to extend the credits
for the rotation, but the old solution always double it. It
is harmless for some minor operations, but for actions like
reflink we may rotate tree many times and cause the credits
increase dramatically. So this patch try to only increase
the desired credits.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Actually the whole reflink will touch refcount tree 2 times:
1. It will add the clusters in the extent record to the tree if it
isn't refcounted before.
2. It will add 1 refcount to these clusters when it add these
extent records to the tree.
So actually we shouldn't do merge in the 1st operation since the 2nd
one will soon be called and we may have to split it again. Do a merge
first and split soon is a waste of time. So we only merge in the 2nd
round. This is done by adding a new internal __ocfs2_increase_refcount
and call it with "not-merge" for 1st refcount operation in reflink.
This also has a side-effect that we don't need to worry too much about
the metadata allocation in the 2nd round since it will only merge and
no split will happen for those records.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
The old xattr value remove is quite simple, it just erase the
tree and free the clusters. But as we have added refcount support,
The process is a little complicated.
We have to lock the refcount tree at the beginning, what's more,
we may split the refcount tree in some cases, so meta/credits are
needed.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
With reflink, there is a need that we create a new xattr indexed
block from the very beginning. So add a new parameter for
ocfs2_create_xattr_block.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Now with xattr refcount support, we need to check whether
we have xattr refcounted before we remove the refcount tree.
Now the mechanism is:
1) Check whether i_clusters == 0, if no, exit.
2) check whether we have i_xattr_loc in dinode. if yes, exit.
2) Check whether we have inline xattr stored outside, if yes, exit.
4) Remove the tree.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
In ocfs2, when xattr's value is larger than OCFS2_XATTR_INLINE_SIZE,
it will be kept outside of the blocks we store xattr entry. And they
are stored in a b-tree also. So this patch try to attach all these
clusters to refcount tree also.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Currently we have ocfs2_iterate_xattr_buckets which can receive
a para and a callback to iterate a series of bucket. It is good.
But actually the 2 callers ocfs2_xattr_tree_list_index_block and
ocfs2_delete_xattr_index_block are almost the same. The only
difference is that the latter need to handle the extent record
also. So add a new function named ocfs2_iterate_xattr_index_block.
It can be given func callback which are used for exten record.
So now we only have one iteration function for the xattr index
block. Ane what's more, it is useful for our future reflink
operations.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
In order to make 2 transcation(xattr and cow) independent with each other,
we CoW the whole xattr out in case we are setting them.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We currently use pagecache to duplicate clusters in CoW,
but it isn't suitable for xattr case. So abstract it out
so that the caller can decide which method it use.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
With the new refcount tree, xattr value can also be refcounted
among multiple files. So return the appropriate extent flags
so that CoW can used it later.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
A reflink creates a snapshot of a file, that means the attributes
must be identical except for three exceptions - nlink, ino, and ctime.
As for time changes, Here is a brief description:
1. Source file:
1) atime: Ignore. Let the lazy atime code handle that.
2) mtime: don't touch.
3) ctime: If we change the tree (adding REFCOUNTED to at least one
extent), update it.
2. Destination file:
1) atime: ignore.
2) mtime: we want it to appear identical to the source.
3) ctime: update.
The idea here is that an ls -l will show the same time for the
src and target - it shows mtime. Backup software like rsync and tar
will treat the new file correctly too.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
2 major functions are added in this patch.
ocfs2_attach_refcount_tree will create a new refcount tree to the
old file if it doesn't have one and insert all the extent records
to the tree if they are not refcounted.
ocfs2_create_reflink_node will:
1. set the refcount tree to the new file.
2. call ocfs2_duplicate_extent_list which will iterate all the
extents for the old file, insert it to the new file and increase
the corresponding referennce count.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
When we truncate a file to a specific size which resides in a reflinked
cluster, we need to CoW it since ocfs2_zero_range_for_truncate will
zero the space after the size(just another type of write).
So we add a "max_cpos" in ocfs2_refcount_cow so that it will stop when
it hit the max cluster offset.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
When we use mmap, we CoW the refcountd clusters in
ocfs2_write_begin_nolock. While for normal file
io(including directio), we do CoW in
ocfs2_prepare_inode_for_write.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
During CoW, if the old extent record is refcounted, we allocate
som new clusters and do CoW. Actually we can have some improvement
here. If the old extent has refcount=1, that means now it is only
used by this file. So we don't need to allocate new clusters, just
remove the refcounted flag and it is OK. We also have to remove
it from the refcount tree while not deleting it.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
This patch try CoW support for a refcounted record.
the whole process will be:
1. Calculate how many clusters we need to CoW and where we start.
Extents that are not completely encompassed by the write will
be broken on 1MB boundaries.
2. Do CoW for the clusters with the help of page cache.
3. Change the b-tree structure with the new allocated clusters.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Add 'Decrement refcount for delete' in to the normal truncate
process. So for a refcounted extent record, call refcount rec
decrementation instead of cluster free.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Given a physical cpos and length, decrement the refcount
in the tree. If the refcount for any portion of the extent goes
to zero, that portion is queued for freeing.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Given a physical cpos and length, increment the refcount
in the tree. If the extent has not been seen before, a refcount
record is created for it. Refcount records may be merged or
split by this operation.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Now fs/ocfs2/alloc.c has more than 7000 lines. It contains our
basic b-tree operation. Although we have already make our b-tree
operation generic, the basic structrue ocfs2_path which is used
to iterate one b-tree branch is still static and limited to only
used in alloc.c. As refcount tree need them and I don't want to
add any more b-tree unrelated code to alloc.c, export them out.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Add refcount b-tree as a new extent tree so that it can
use the b-tree to store and maniuplate ocfs2_refcount_rec.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
ocfs2_mark_extent_written actually does the following things:
1. check the parameters.
2. initialize the left_path and split_rec.
3. call __ocfs2_mark_extent_written. it will do:
1) check the flags of unwritten
2) do the real split work.
The whole process is packed tightly somehow. So this patch
will abstract 2 different functions so that future b-tree
operation can work with it.
1. __ocfs2_split_extent will accept path and split_rec and do
the real split work.
2. ocfs2_change_extent_flag will accept a new flag and initialize
path and split_rec.
So now ocfs2_mark_extent_written will do:
1. check the parameters.
2. call ocfs2_change_extent_flag.
1) initalize the left_path and split_rec.
2) check whether the new flags conflict with the old one.
3) call __ocfs2_split_extent to do the split.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Add a new operation eo_ocfs2_extent_contig int the extent tree's
operations vector. So that with the new refcount tree, We want
this so that refcount trees can always return CONTIG_NONE and
prevent extent merging.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Implement locking around struct ocfs2_refcount_tree. This protects
all read/write operations on refcount trees. ocfs2_refcount_tree
has its own lock and its own caching_info, protecting buffers among
multiple nodes.
User must call ocfs2_lock_refcount_tree before his operation on
the tree and unlock it after that.
ocfs2_refcount_trees are referenced by the block number of the
refcount tree root block, So we create an rb-tree on the ocfs2_super
to look them up.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
refcount tree should use its own caching info so that when
we downconvert the refcount tree lock, we can drop all the
cached buffer head.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
In meta downconvert, we need to checkpoint the metadata in an inode.
For refcount tree, we also need it. So abstract the process out.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
We now do extra checks before a balance to make sure
there is room for the balance to take place. One of
the checks was testing to see if we were trying to
balance away the last block group of a given type.
If there is no space available for new chunks, we
should not try and balance away the last block group
of a give type. But, the code wasn't checking for
available chunk space, and so it was exiting too soon.
The fix here is to combine some of the checks and make
sure we try to allocate new chunks when we're balancing
the last block group.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After a balance it is briefly possible for the space info
field in the inode to be NULL. This adds some checks
to make sure things properly deal with the NULL value.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for-2.6.32' of git://linux-nfs.org/~bfields/linux: (68 commits)
nfsd4: nfsv4 clients should cross mountpoints
nfsd: revise 4.1 status documentation
sunrpc/cache: avoid variable over-loading in cache_defer_req
sunrpc/cache: use list_del_init for the list_head entries in cache_deferred_req
nfsd: return success for non-NFS4 nfs4_state_start
nfsd41: Refactor create_client()
nfsd41: modify nfsd4.1 backchannel to use new xprt class
nfsd41: Backchannel: Implement cb_recall over NFSv4.1
nfsd41: Backchannel: cb_sequence callback
nfsd41: Backchannel: Setup sequence information
nfsd41: Backchannel: Server backchannel RPC wait queue
nfsd41: Backchannel: Add sequence arguments to callback RPC arguments
nfsd41: Backchannel: callback infrastructure
nfsd4: use common rpc_cred for all callbacks
nfsd4: allow nfs4 state startup to fail
SUNRPC: Defer the auth_gss upcall when the RPC call is asynchronous
nfsd4: fix null dereference creating nfsv4 callback client
nfsd4: fix whitespace in NFSPROC4_CLNT_CB_NULL definition
nfsd41: sunrpc: add new xprt class for nfsv4.1 backchannel
sunrpc/cache: simplify cache_fresh_locked and cache_fresh_unlocked.
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
trivial: fix typo in aic7xxx comment
trivial: fix comment typo in drivers/ata/pata_hpt37x.c
trivial: typo in kernel-parameters.txt
trivial: fix typo in tracing documentation
trivial: add __init/__exit macros in drivers/gpio/bt8xxgpio.c
trivial: add __init macro/ fix of __exit macro location in ipmi_poweroff.c
trivial: remove unnecessary semicolons
trivial: Fix duplicated word "options" in comment
trivial: kbuild: remove extraneous blank line after declaration of usage()
trivial: improve help text for mm debug config options
trivial: doc: hpfall: accept disk device to unload as argument
trivial: doc: hpfall: reduce risk that hpfall can do harm
trivial: SubmittingPatches: Fix reference to renumbered step
trivial: fix typos "man[ae]g?ment" -> "management"
trivial: media/video/cx88: add __init/__exit macros to cx88 drivers
trivial: fix typo in CONFIG_DEBUG_FS in gcov doc
trivial: fix missing printk space in amd_k7_smp_check
trivial: fix typo s/ketymap/keymap/ in comment
trivial: fix typo "to to" in multiple files
trivial: fix typos in comments s/DGBU/DBGU/
...
Anyone who wants to do copy to/from user from a kernel thread, needs
use_mm (like what fs/aio has). Move that into mm/, to make reusing and
exporting easier down the line, and make aio use it. Next intended user,
besides aio, will be vhost-net.
Acked-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patchset adds a flag to mmap that allows the user to request that an
anonymous mapping be backed with huge pages. This mapping will borrow
functionality from the huge page shm code to create a file on the kernel
internal mount and use it to approximate an anonymous mapping. The
MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without
both flags being preset.
A new flag is necessary because there is no other way to hook into huge
pages without creating a file on a hugetlbfs mount which wouldn't be
MAP_ANONYMOUS.
To userspace, this mapping will behave just like an anonymous mapping
because the file is not accessible outside of the kernel.
This patchset is meant to simplify the programming model. Presently there
is a large chunk of boiler platecode, contained in libhugetlbfs, required
to create private, hugepage backed mappings. This patch set would allow
use of hugepages without linking to libhugetlbfs or having hugetblfs
mounted.
Unification of the VM code would provide these same benefits, but it has
been resisted each time that it has been suggested for several reasons: it
would break PAGE_SIZE assumptions across the kernel, it makes page-table
abstractions really expensive, and it does not provide any benefit on
architectures that do not support huge pages, incurring fast path
penalties without providing any benefit on these architectures.
This patch:
There are two means of creating mappings backed by huge pages:
1. mmap() a file created on hugetlbfs
2. Use shm which creates a file on an internal mount which essentially
maps it MAP_SHARED
The internal mount is only used for shared mappings but there is very
little that stops it being used for private mappings. This patch extends
hugetlbfs_file_setup() to deal with the creation of files that will be
mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is
used in a subsequent patch to implement the MAP_HUGETLB mmap() flag.
Signed-off-by: Eric Munson <ebmunson@us.ibm.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when
CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make
more sense of it by removing CONFIG_TMPFS altogether, always enabling its
code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on
CONFIG_TMPFS off that we'd better leave that as is.
But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off:
make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL
shmem_acl.o being pointlessly built into the kernel when SHMEM is off.
And a selfish change, to prevent the world from being rebuilt when I
switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the
header files is mm.h shmem_lock() - give that a shmem.c stub instead.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In preparation for the next patch, add a simple get_dump_page(addr)
interface for the CONFIG_ELF_CORE dumpers to use, instead of calling
get_user_pages() directly. They're not interested in errors: they
just want to use holes as much as possible, to save space and make
sure that the data is aligned where the headers said it would be.
Oh, and don't use that horrid DUMP_SEEK(off) macro!
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrew Morton pointed out oom_adjust_write() has very strange EIO
and new line handling. this patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
oom-killer kills a process, not task. Then oom_score should be calculated
as per-process too. it makes consistency more and makes speed up
select_bad_process().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, OOM logic callflow is here.
__out_of_memory()
select_bad_process() for each task
badness() calculate badness of one task
oom_kill_process() search child
oom_kill_task() kill target task and mm shared tasks with it
example, process-A have two thread, thread-A and thread-B and it have very
fat memory and each thread have following oom_adj and oom_score.
thread-A: oom_adj = OOM_DISABLE, oom_score = 0
thread-B: oom_adj = 0, oom_score = very-high
Then, select_bad_process() select thread-B, but oom_kill_task() refuse
kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory()
call select_bad_process() again. but select_bad_process() select the same
task. It mean kernel fall in livelock.
The fact is, select_bad_process() must select killable task. otherwise
OOM logic go into livelock.
And root cause is, oom_adj shouldn't be per-thread value. it should be
per-process value because OOM-killer kill a process, not thread. Thus
This patch moves oomkilladj (now more appropriately named oom_adj) from
struct task_struct to struct signal_struct. it naturally prevent
select_bad_process() choose wrong task.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sizing of memory allocations shouldn't depend on the number of physical
pages found in a system, as that generally includes (perhaps a huge amount
of) non-RAM pages. The amount of what actually is usable as storage
should instead be used as a basis here.
Some of the calculations (i.e. those not intending to use high memory)
should likely even use (totalram_pages - totalhigh_pages).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Dave Airlie <airlied@linux.ie>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/kcore has its own routine to access vmallc area. It can be replaced
with vread(). And by this, /proc/kcore can do safe access to vmalloc
area.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Mike Smith <scgtrp@gmail.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The patch makes the clear_refs more versatile in adding the option to
select anonymous pages or file backed pages for clearing. This addition
has a measurable impact on user space application performance as it
decreases the number of pagewalks in scenarios where one is only
interested in a specific type of page (anonymous or file mapped).
The patch adds anonymous and file backed filters to the clear_refs interface.
echo 1 > /proc/PID/clear_refs resets the bits on all pages
echo 2 > /proc/PID/clear_refs resets the bits on anonymous pages only
echo 3 > /proc/PID/clear_refs resets the bits on file backed pages only
Any other value is ignored
Signed-off-by: Moussa A. Ba <moussa.a.ba@gmail.com>
Signed-off-by: Jared E. Hulbert <jaredeh@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KSM will need to identify its kernel merged pages unambiguously, and
/proc/kpageflags will probably like to do so too.
Since KSM will only be substituting anonymous pages, statistics are best
preserved by making a PageKsm page a special PageAnon page: one with no
anon_vma.
But KSM then needs its own page_add_ksm_rmap() - keep it in ksm.h near
PageKsm; and do_wp_page() must COW them, unlike singly mapped PageAnons.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Recently we encountered OOM problems due to memory use of the GEM cache.
Generally a large amuont of Shmem/Tmpfs pages tend to create a memory
shortage problem.
We often use the following calculation to determine the amount of shmem
pages:
shmem = NR_ACTIVE_ANON + NR_INACTIVE_ANON - NR_ANON_PAGES
however the expression does not consider isolated and mlocked pages.
This patch adds explicit accounting for pages used by shmem and tmpfs.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The amount of memory allocated to kernel stacks can become significant and
cause OOM conditions. However, we do not display the amount of memory
consumed by stacks.
Add code to display the amount of memory used for stacks in /proc/meminfo.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In theory it could happen that on one CPU we initialize a new inode but
clearing of I_NEW | I_LOCK gets reordered before some of the
initialization. Thus on another CPU we return not fully uptodate inode
from iget_locked().
This seems to fix a corruption issue on ext3 mounted over NFS.
[akpm@linux-foundation.org: add some commentary]
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As we get closer to proper -ENOSPC handling in btrfs, we need more accurate
space accounting for the space info's. Currently we exclude the free space for
the super mirrors, but the space they take up isn't accounted for in any of the
counters. This patch introduces bytes_super, which keeps track of the amount
of bytes used for a super mirror in the block group cache and space info. This
makes sure that our free space caclucations will be completely accurate.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There is a slight problem with the extent entry threshold calculation for the
free space cache. We only adjust the threshold down as we add bitmaps, but
never actually adjust the threshold up as we add bitmaps. This means we could
fragment the free space so badly that we end up using all bitmaps to describe
the free space, use all the free space which would result in the bitmaps being
freed, but then go to add free space again as we delete things and immediately
add bitmaps since the extent threshold would still be 0. Now as we free
bitmaps the extent threshold will be ratcheted up to allow more extent entries
to be added.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch removes a bunch of dead code from the snapshot removal stuff. It
was confusing me when doing the metadata ENOSPC stuff so I killed it.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When we first go to add free space, we allocate a new info and set the offset
and bytes to the space we are adding. This is fine, except we actually set the
size of a bitmap as we set the bits in it, so if we add space to a bitmap, we'd
end up counting the same space twice. This isn't a huge deal, it just makes
the allocator behave weirdly since it will think that a bitmap entry has more
space than it ends up actually having. I used a BUG_ON() to catch when this
problem happened, and with this patch I no longer get the BUG_ON().
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The box can get locked up in the allocator if we happen upon a block group
under these conditions:
1) During a commit, so caching threads cannot make progress
2) Our block group currently is in the middle of being cached
3) Our block group currently has plenty of free space in it
4) Our block group is so fragmented that it ends up having no free space chunks
larger than min_bytes calculated by btrfs_find_space_cluster.
What happens is we try and do btrfs_find_space_cluster, which fails because it
is unable to find enough free space chunks that are large than min_bytes and
are close enough together. Since the block group is not cached we do a
wait_block_group_cache_progress, which waits for the number of bytes we need,
except the block group already has _plenty_ of free space, its just severely
fragmented, so we loop and try again, ad infinitum. This patch keeps us from
waiting on the block group to finish caching if we failed to find a free space
cluster before. It also makes sure that we don't even try to find a free space
cluster if we are on our last loop in the allocator, since we will have tried
everything at this point at it is futile.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Currently, we can panic the box if the first block group we go to move is of a
type where there is no space left to move those extents. For example, if we
fill the disk up with data, and then we try to balance and we have no room to
move the data nor room to allocate new chunks, we will panic. Change this by
checking to see if we have room to move this chunk around, and if not, return
-ENOSPC and move on to the next chunk. This will make sure we remove block
groups that are moveable, like if we have alot of empty metadata block groups,
and then that way we make room to be able to balance our data chunks as well.
Tested this with an fs that would panic on btrfs-vol -b normally, but no longer
panics with this patch.
V1->V2:
-actually search for a free extent on the device to make sure we can allocate a
chunk if need be.
-fix btrfs_shrink_device to make sure we actually try to relocate all the
chunks, and then if we can't return -ENOSPC so if we are doing a btrfs-vol -r
we don't remove the device with data still on it.
-check to make sure the block group we are going to relocate isn't the last one
in that particular space
-fix a bug in btrfs_shrink_device where we would change the device's size and
not fix it if we fail to do our relocate
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Allow NFS v4 clients to seamlessly cross mount point without
have to set either the 'crossmnt' or the 'nohide' export
options.
Signed-Off-By: Steve Dickson <steved@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Fix an arithmetic error that was breaking extents cloned via the clone
ioctl starting in the second half of a file.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch adds snapshot/subvolume destroy ioctl. A subvolume that isn't being
used and doesn't contains links to other subvolumes can be destroyed.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
btrfs allows subvolumes and snapshots anywhere in the directory tree.
If we snapshot a subvolume that contains a link to other subvolume
called subvolA, subvolA can be accessed through both the original
subvolume and the snapshot. This is similar to creating hard link to
directory, and has the very similar problems.
The aim of this patch is enforcing there is only one access point to
each subvolume. Only the first directory entry (the one added when
the subvolume/snapshot was created) is treated as valid access point.
The first directory entry is distinguished by checking root forward
reference. If the corresponding root forward reference is missing,
we know the entry is not the first one.
This patch also adds snapshot/subvolume rename support, the code
allows rename subvolume link across subvolumes.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The new back reference format does not allow reusing objectid of
deleted snapshot/subvol. So we use ++highest_objectid to allocate
objectid for new snapshot/subvol.
Now we use ++highest_objectid to allocate objectid for both new inode
and new snapshot/subvolume, so this patch removes 'find hole' code in
btrfs_find_free_objectid.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This patch contains two changes to avoid unnecessary tree block reads during
snapshot dropping.
First, check tree block's reference count and flags before reading the tree
block. if reference count > 1 and there is no need to update backrefs, we can
avoid reading the tree block.
Second, save when snapshot was created in root_key.offset. we can compare block
pointer's generation with snapshot's creation generation during updating
backrefs. If a given block was created before snapshot was created, the
snapshot can't be the tree block's owner. So we can avoid reading the block.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'perfcounters-rename-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf: Tidy up after the big rename
perf: Do the big rename: Performance Counters -> Performance Events
perf_counter: Rename 'event' to event_id/hw_event
perf_counter: Rename list_entry -> group_entry, counter_list -> group_list
Manually resolved some fairly trivial conflicts with the tracing tree in
include/trace/ftrace.h and kernel/trace/trace_syscalls.c.
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
nfs: initialize the backing_dev_info when creating the server
writeback: make balance_dirty_pages() gradually back more off
writeback: don't use schedule_timeout() without setting runstate
nfs: nfs_kill_super() should call bdi_unregister() after killing super
NFS may free the server structure without ever having used the
bdi, so we either need to flag the bdi as being uninitialized or
initialize it up front. This does the latter.
This fixes a crash with mounting more than one NFS file system,
should people ever need that kind of obscure NFS functionality.
Tested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Otherwise we could be attempting to flush data for a writeback
thread and bdi that have already disappeared.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Cpus_allowed fields in /proc/<pid>/status is currently only
shown in case of CONFIG_CPUSETS. However their contents are also
useful for the !CONFIG_CPUSETS case.
So change the current behaviour and always show these fields.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090921090627.GD4649@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We may end up doing DMA to/from these. Until the new MTD API fixes the
issues, this should stop things from falling over.
Original idea from Gilles Casse <list@gcasse.net>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
If we didn't check sb->s_dirt, it will update the FSINFO
unconditionally. It will reduce the filetime of flash base device.
So, this checks sb->s_dirt. sb->s_dirt is racy, however FSINFO is just
hint. So even if there is race, and we hit it, it would not become big
problem.
And this also is as workaround of suspend problem.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
The allocator has some nice knobs for sending hints about where
to try and allocate new blocks, but when we're doing file allocations
we're not sending any hint at all.
This commit adds a simple extent map search to see if we can
quickly and easily find a hint for the allocator.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs fills a delayed allocation, it tries to increase
the wbc nr_to_write to cover a big part of allocation. The
theory is that we're doing contiguous IO and writing a few
more blocks will save seeks overall at a very low cost.
The problem is that extent_write_cache_pages could ignore
the new higher nr_to_write if nr_to_write had already gone
down to zero. We fix that by rechecking the nr_to_write
for every page that is processed in the pagevec.
This updates the math around bumping the nr_to_write value
to make sure we don't leave a tiny amount of IO hanging
around for the very end of a new extent.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (64 commits)
ext4: Update documentation about quota mount options
ext4: replace MAX_DEFRAG_SIZE with EXT_MAX_BLOCK
ext4: Fix the alloc on close after a truncate hueristic
ext4: Add a tracepoint for ext4_alloc_da_blocks()
ext4: store EXT4_EXT_MIGRATE in i_state instead of i_flags
ext4: limit block allocations for indirect-block files to < 2^32
ext4: Fix different block exchange issue in EXT4_IOC_MOVE_EXT
ext4: Add null extent check to ext_get_path
ext4: Replace BUG_ON() with ext4_error() in move_extents.c
ext4: Replace get_ext_path macro with an inline funciton
ext4: Fix include/trace/events/ext4.h to work with Systemtap
ext4: Fix initalization of s_flex_groups
ext4: Always set dx_node's fake_dirent explicitly.
ext4: Fix async commit mode to be safe by using a barrier
ext4: Don't update superblock write time when filesystem is read-only
ext4: Clarify the locking details in mballoc
ext4: check for need init flag in ext4_mb_load_buddy
ext4: move ext4_mb_init_group() function earlier in the mballoc.c
ext4: Make non-journal fsync work properly
ext4: Assure that metadata blocks are written during fsync in no journal mode
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: add fusectl interface to max_background
fuse: limit user-specified values of max background requests
fuse: use drop_nlink() instead of direct nlink manipulation
fuse: document protocol version negotiation
fuse: make the number of max background requests and congestion threshold tunable
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: use kernel_sendpage
dlm: fix connection close handling
dlm: fix double-release of socket in error exit path
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
ext3: Flush disk caches on fsync when needed
ext3: Add locking to ext3_do_update_inode
ext3: Fix possible deadlock between ext3_truncate() and ext3_get_blocks()
jbd: Annotate transaction start also for journal_restart()
jbd: Journal block numbers can ever be only 32-bit use unsigned int for them
ext3: Update MAINTAINERS for ext3 and JBD
JBD: round commit timer up to avoid uncommitted transaction
This patch gets rid of two limitations of async block group caching.
The old code delays handling pinned extents when block group is in
caching. To allocate logged file extents, the old code need wait
until block group is fully cached. To get rid of the limitations,
This patch introduces a data structure to track the progress of
caching. Base on the caching progress, we know which extents should
be added to the free space cache when handling the pinned extents.
The logged file extents are also handled in a similar way.
This patch also changes how pinned extents are tracked. The old
code uses one tree to track pinned extents, and copy the pinned
extents tree at transaction commit time. This patch makes it use
two trees to track pinned extents. One tree for extents that are
pinned in the running transaction, one tree for extents that can
be unpinned. At transaction commit time, we swap the two trees.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There's no reason to redefine the maximum allowable offset
in an extent-based file just for defrag;
EXT_MAX_BLOCK already does this.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In an attempt to avoid doing an unneeded flush after opening a
(previously non-existent) file with O_CREAT|O_TRUNC, the code only
triggered the hueristic if ei->disksize was non-zero. Turns out that
the VFS doesn't call ->truncate() if the file doesn't exist, and
ei->disksize is always zero even if the file previously existed. So
remove the test, since it isn't necessary and in fact disabled the
hueristic.
Thanks to Clemens Eisserer that he was seeing problems with files
written using kwrite and eclipse after sudden crashes caused by a
buggy Intel video driver.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In 'dbg_check_space_info()' we want to dump current lprops statistics,
but actually dump old statistics. Fix this.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Alexander Beregalov reported the following warning:
=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.31-03149-gdcc030a #1
-------------------------------------------------------
udevadm/716 is trying to acquire lock:
(&mm->mmap_sem){++++++}, at: [<c107249a>] might_fault+0x4a/0xa0
but task is already holding lock:
(sysfs_mutex){+.+.+.}, at: [<c10cb9aa>] sysfs_readdir+0x5a/0x200
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (sysfs_mutex){+.+.+.}:
[...]
-> #2 (&bdev->bd_mutex){+.+.+.}:
[...]
-> #1 (&REISERFS_SB(s)->lock){+.+.+.}:
[...]
-> #0 (&mm->mmap_sem){++++++}:
[...]
On reiserfs mount path, we take the reiserfs lock and while
initializing the journal, we open the device, taking the
bdev->bd_mutex. Then rescan_partition() may signal the change
to sysfs.
We have then the following dependency:
reiserfs_lock -> bd_mutex -> sysfs_mutex
Later, while entering reiserfs_readpage() after a pagefault in an
mmaped reiserfs file, we are holding the mm->mmap_sem, and we are going
to take the reiserfs lock too.
We have then the following dependency:
mm->mmap_sem -> reiserfs_lock
which, expanded with the previous dependency gives us:
mm->mmap_sem -> reiserfs_lock -> bd_mutex -> sysfs_mutex
Now while entering the sysfs readdir path, we are holding the
sysfs_mutex. And when we copy a directory entry to the user buffer, we
might fault and then take the mm->mmap_sem lock. Which leads to the
circular locking dependency reported.
We can fix that by relaxing the reiserfs lock during the call to
journal_init_dev(), which is the place where we open the mounted
device.
This is fine to relax the lock here because we are in the begining of
the reiserfs mount path and there is nothing to protect at this time,
the journal is not intialized.
We just keep this lock around for paranoid reasons.
Reported-by: Alexander Beregalov <a.beregalov@gmail.com>
Tested-by: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
EXT4_EXT_MIGRATE is only intended to be used for an in-memory flag,
and the hex value assigned to it collides with FS_DIRECTIO_FL (which
is also stored in i_flags). There's no reason for the
EXT4_EXT_MIGRATE bit to be stored in i_flags, so we switch it to use
i_state instead.
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.
This patch limits such allocations to < 2^32, and adds
BUG_ONs if we do get blocks larger than that.
This should address RH Bug 519471, ext4 bitmap allocator
must limit blocks to < 2^32
* ext4_find_goal() is modified to choose a goal < UINT_MAX,
so that our starting point is in an acceptable range.
* ext4_xattr_block_set() is modified such that the goal block
is < UINT_MAX, as above.
* ext4_mb_regular_allocator() is modified so that the group
search does not continue into groups which are too high
* ext4_mb_use_preallocated() has a check that we don't use
preallocated space which is too far out
* ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs
No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes. Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.
For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If logical block offset of original file which is passed to
EXT4_IOC_MOVE_EXT is different from donor file's,
a calculation error occurs in ext4_calc_swap_extents(),
therefore wrong block is exchanged between original file and donor file.
As a result, we hit ext4_error() in check_block_validity().
To detect the logical offset difference in EXT4_IOC_MOVE_EXT,
add checks to mext_calc_swap_extents() and handle it as error,
since data exchange must be done between the same blocks in EXT4_IOC_MOVE_EXT.
Reported-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There is the possibility that path structure which is taken
by ext4_ext_find_extent() indicates null extents.
Because during data block exchanging in ext4_move_extents(),
constitution of an extent tree may be changed.
As a solution, the patch adds null extent check
to ext_get_path().
Reported-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Replace BUG_ON calls with a call to ext4_error()
to print an error message if EXT4_IOC_MOVE_EXT failed
with some kind of reasons. This will help to debug.
Ted pointed this out, thanks.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Replace get_ext_path macro with an inline function,
since this macro looks like a function call but its arguments
get modified. Ted pointed this out, thanks.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In case we fsync() a file and inode is not dirty, we don't force a transaction
to disk and hence don't flush disk caches. Thus file data could be just in disk
caches and not on persistent storage. Fix the problem by flushing disk caches
if we didn't force a transaction commit.
Signed-off-by: Jan Kara <jack@suse.cz>
I've been struggling with this off and on while I've been testing the
data=guarded work. The symptom is corrupted orphan lists and inodes
with the wrong i_size stored on disk. I was convinced the
data=guarded code was just missing a call to ext3_mark_inode_dirty, but
tracing showed the i_disksize I was sending to ext3_mark_inode_dirty
wasn't actually making it to the drive.
ext3_mark_inode_dirty can be called without locks held (atime updates
and a few others), so the data=guarded code uses locks while updating
the in-memory inode, and then calls ext3_mark_inode_dirty
without any locks held.
But, ext3_mark_inode_dirty has no internal locking to make sure that
only one CPU is updating the buffer head at a time. Generally this
works out ok because everyone that changes the inode then calls
ext3_mark_inode_dirty themselves. Even though it races, eventually
someone updates the buffer heads and things move on.
But there is still a risk of the wrong values getting in, and the
data=guarded code seems to hit the race very often.
Since everyone that changes the inode also logs it, it should be
possible to fix this with some memory barriers. I'll leave that as an
exercise to the reader and lock the buffer head instead.
It it probably a good idea to have a different patch series for lockless
bit flipping on the ext3 i_state field. ext3_do_update_inode &= clears
EXT3_STATE_NEW without any locks held.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
During truncate we are sometimes forced to start a new transaction as the
amount of blocks to be journaled is both quite large and hard to predict. So
far we restarted a transaction while holding truncate_mutex and that violates
lock ordering because truncate_mutex ranks below transaction start (and it
can lead to a real deadlock with ext3_get_blocks() allocating new blocks
from ext3_writepage()).
Luckily, the problem is easy to fix: We just drop the truncate_mutex before
restarting the transaction and acquire it afterwards. We are safe to do this as
by the time ext3_truncate() is called, all the page cache for the truncated
part of the file is dropped and so writepage() cannot come and allocate new
blocks in the part of the file we are truncating. The rest of writers is
stopped by us holding i_mutex.
Signed-off-by: Jan Kara <jack@suse.cz>
lockdep annotation for a transaction start has been at the end of
journal_start(). But a transaction is also started from journal_restart(). Move
the lockdep annotation to start_this_handle() which covers both cases.
Signed-off-by: Jan Kara <jack@suse.cz>
It does not make sense to store block number for journal as unsigned long
since they can be only 32-bit (because of on-disk format limitation). So
change in-memory structures and variables to use unsigned int instead.
Signed-off-by: Jan Kara <jack@suse.cz>
Fix jiffie rounding in jbd commit timer setup code. Rounding down could cause
the timer to be fired before the corresponding transaction has expired. That
transaction can stay not committed forever if no new transaction is created or
explicit sync/umount happens.
Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: Jan Kara <jack@suse.cz>
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6:
Driver Core: devtmpfs - kernel-maintained tmpfs-based /dev
debugfs: Modify default debugfs directory for debugging pktcdvd.
debugfs: Modified default dir of debugfs for debugging UHCI.
debugfs: Change debugfs directory of IWMC3200
debugfs: Change debuhgfs directory of trace-events-sample.h
debugfs: Fix mount directory of debugfs by default in events.txt
hpilo: add poll f_op
hpilo: add interrupt handler
hpilo: staging for interrupt handling
driver core: platform_device_add_data(): use kmemdup()
Driver core: Add support for compatibility classes
uio: add generic driver for PCI 2.3 devices
driver-core: move dma-coherent.c from kernel to driver/base
mem_class: fix bug
mem_class: use minor as index instead of searching the array
driver model: constify attribute groups
UIO: remove 'default n' from Kconfig
Driver core: Add accessor for device platform data
Driver core: move dev_get/set_drvdata to drivers/base/dd.c
Driver core: add new device to bus's list before probing
wb_clear_pending AFAIKS should not be called after the item has been
put on the list, except by the worker threads. It could lead to the
situation where the refcount is decremented below 0 and cause lots of
problems.
Presumably the !wb_has_dirty_io case is not a common one, so it can
be discovered when the thread wakes up to check?
Also add a comment in bdi_work_clear.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
By the time bdi_work_on_stack gets evaluated again in bdi_work_free, it
can already have been deallocated and used for something else in the
!on stack case, giving a false positive in this test and causing
corruption.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If you're going to do an atomic RMW on each list entry, there's not much
point in all the RCU complexities of the list walking. This is only going
to help the multi-thread case I guess, but it doesn't hurt to do now.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
bdi_start_writeback() is currently split into two paths, one for
WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
only WB_SYNC_NONE.
Push down the writeback_control allocation and only accept the
parameters that make sense for each function. This cleans up
the API considerably.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This gets rid of work == NULL in bdi_queue_work() and puts the
OOM handling where it belongs.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Now that bdi_writeback_all() no longer handles integrity writeback,
it doesn't have to block anymore. This means that we can switch
bdi_list reader side protection to RCU.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Data integrity writeback must use bdi_start_writeback() and ensure
that wbc->sb and wbc->bdi are set.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().
Note that ->s_bdi assignment is required for proper writeback!
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
We need to be able to pass in range_cyclic as well, so instead
of growing yet another argument, split the arguments into a
struct wb_writeback_args structure that we can use internally.
Also makes it easier to just copy all members to an on-stack
struct, since we can't access work after clearing the pending
bit.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Since it's an opportunistic writeback and not a data integrity action,
don't punt to blocking writeback. Just wakeup the thread and it will
flush old data.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
It has been unused since it was introduced in:
commit 520808bf20e90fdbdb320264ba7dd5cf9d47dcac
Author: Andrew Morton <akpm@osdl.org>
Date: Fri May 21 00:46:17 2004 -0700
[PATCH] block device layer: separate backing_dev_info infrastructure
So lets just kill it.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Make the max_background and congestion_threshold parameters of a FUSE
mount tunable at runtime by adding the respective knobs to its directory
within the fusectl filesystem.
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
An untrusted user could DoS the system if s/he were allowed to accumulate an
arbitrary number of pending background requests by setting the above limits
to extremely high values in INIT. This patch excludes this possibility by
imposing global upper limits on the possible values of per-mount "max
background requests" and "congestion threshold" parameters for unprivileged
FUSE filesystems.
These global limits are implemented as module parameters.
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
drop_nlink() is the API function to decrease the link count of an inode.
However, at a place the control filesystem used the decrement operator
on i_nlink directly. Fix this.
Cc: Anand Avati <avati@gluster.com>
Signed-off-by: Csaba Henk <csaba@gluster.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Enable hardware memory error handling for NFS
Truncation of data pages at runtime should be safe in NFS,
even when it doesn't support migration so far.
Trond tells me migration is also queued up for 2.6.32.
Acked-by: Trond.Myklebust@netapp.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Enable removing of corrupted pages through truncation
for a bunch of file systems: ext*, xfs, gfs2, ocfs2, ntfs
These should cover most server needs.
I chose the set of migration aware file systems for this
for now, assuming they have been especially audited.
But in general it should be safe for all file systems
on the data area that support read/write and truncate.
Caveat: the hardware error handler does not take i_mutex
for now before calling the truncate function. Is that ok?
Cc: tytso@mit.edu
Cc: hch@infradead.org
Cc: mfasheh@suse.com
Cc: aia21@cantab.net
Cc: hugh.dickins@tiscali.co.uk
Cc: swhiteho@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Add the high level memory handler that poisons pages
that got corrupted by hardware (typically by a two bit flip in a DIMM
or a cache) on the Linux level. The goal is to prevent everyone
from accessing these pages in the future.
This done at the VM level by marking a page hwpoisoned
and doing the appropriate action based on the type of page
it is.
The code that does this is portable and lives in mm/memory-failure.c
To quote the overview comment:
High level machine check handler. Handles pages reported by the
hardware as being corrupted usually due to a 2bit ECC memory or cache
failure.
This focuses on pages detected as corrupted in the background.
When the current CPU tries to consume corruption the currently
running process can just be killed directly instead. This implies
that if the error cannot be handled for some reason it's safe to
just ignore it because no corruption has been consumed yet. Instead
when that happens another machine check will happen.
Handles page cache pages in various states. The tricky part
here is that we can access any page asynchronous to other VM
users, because memory failures could happen anytime and anywhere,
possibly violating some of their assumptions. This is why this code
has to be extremely careful. Generally it tries to use normal locking
rules, as in get the standard locks, even if that means the
error handling takes potentially a long time.
Some of the operations here are somewhat inefficient and have non
linear algorithmic complexity, because the data structures have not
been optimized for this case. This is in particular the case
for the mapping from a vma to a process. Since this case is expected
to be rare we hope we can get away with this.
There are in principle two strategies to kill processes on poison:
- just unmap the data and wait for an actual reference before
killing
- kill as soon as corruption is detected.
Both have advantages and disadvantages and should be used
in different situations. Right now both are implemented and can
be switched with a new sysctl vm.memory_failure_early_kill
The default is early kill.
The patch does some rmap data structure walking on its own to collect
processes to kill. This is unusual because normally all rmap data structure
knowledge is in rmap.c only. I put it here for now to keep
everything together and rmap knowledge has been seeping out anyways
Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
Nick Piggin (who did a lot of great work) and others.
Cc: npiggin@suse.de
Cc: riel@redhat.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Move common initialization of 'struct nfs4_client' inside create_client().
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[nfsd41: Remember the auth flavor to use for callbacks]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This patch enables the use of the nfsv4.1 backchannel.
Signed-off-by: Alexandros Batsakis <batsakis@netapp.com>
[initialize rpc_create_args.bc_xprt too]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Implement the cb_sequence callback conforming to draft-ietf-nfsv4-minorversion1
Note: highest slot id and target highest slot id do not have to be 0
as was previously implemented. They can be greater than what the
nfs server sent if the client supports a larger slot table on the
backchannel. At this point we just ignore that.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[Rework the back channel xdr using the shared v4.0 and v4.1 framework.]
Signed-off-by: Andy Adamson <andros@netapp.com>
[fixed indentation]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: use nfsd4_cb_sequence for callback minorversion]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: fix verification of CB_SEQUENCE highest slot id[
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: Backchannel: Remove old backchannel serialization]
[nfsd41: Backchannel: First callback sequence ID should be 1]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: decode_cb_sequence does not need to actually decode ignored fields]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Follows the model used by the NFS client. Setup the RPC prepare and done
function pointers so that we can populate the sequence information if
minorversion == 1. rpc_run_task() is then invoked directly just like
existing NFS client operations do.
nfsd4_cb_prepare() determines if the sequence information needs to be setup.
If the slot is in use, it adds itself to the wait queue.
nfsd4_cb_done() wakes anyone sleeping on the callback channel wait queue
after our RPC reply has been received. It also sets the task message
result pointer to NULL to clearly indicate we're done using it.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[define and initialize cl_cb_seq_nr here]
[pulled out unused defintion of nfsd4_cb_done]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
RPC callback requests will wait on this wait queue if the backchannel
is out of slots.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Follow the model we use in the client. Make the sequence arguments
part of the regular RPC arguments. None of the callbacks that are
soon to be implemented expect results that need to be passed back
to the caller, so we don't define a separate RPC results structure.
For session validation, the cb_sequence decoding will use a pointer
to the sequence arguments that are part of the RPC argument.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[define struct nfsd4_cb_sequence here]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Keep the xprt used for create_session in cl_cb_xprt.
Mark cl_callback.cb_minorversion = 1 and remember
the client provided cl_callback.cb_prog rpc program number.
Use it to probe the callback path.
Use the client's network address to initialize as the
callback's address as expected by the xprt creation
routines.
Define xdr sizes and code nfs4_cb_compound header to be able
to send a null callback rpc.
Signed-off-by: Andy Adamson<andros@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
[get callback minorversion from fore channel's]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: change bc_sock to bc_xprt]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[pulled definition for cl_cb_xprt]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: set up backchannel's cb_addr]
[moved rpc_create_args init to "nfsd: modify nfsd4.1 backchannel to use new xprt class"]
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Callbacks are always made using the machine's identity, so we can use a
single auth_generic credential shared among callbacks to all clients and
let the rpc code take care of the rest.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
On setting up the callback to the client, we attempt to use the same
authentication flavor the client did. We find an rpc cred to use by
calling rpcauth_lookup_credcache(), which assumes that the given
authentication flavor has a credentials cache. However, this is not
required to be true--in particular, auth_null does not use one.
Instead, we should call the auth's lookup_cred() method.
Without this, a client attempting to mount using nfsv4 and auth_null
triggers a null dereference.
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
It was possible for an async worker thread to be selected to
receive a new work item, but exit before the work item was
actually placed into that thread's work list.
This commit fixes the race by incrementing the num_pending
counter earlier, and making sure to check the number of pending
work items before a thread exits.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The exit-on-idle code for async worker threads was incorrectly
calling spin_lock_irq with interrupts already off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
After a new worker thread starts, it is placed into the
list of idle threads. But, this may race with a
check for idle done by the worker thread itself, resulting
in a double list_add operation.
This fix adds a check to make sure the idle thread addition
is done properly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
It's possible that this struct will outlive the filp to which it is
attached. If it does and it needs to do some work on the inode, then
it'll need a reference.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
...rather than a write lock. It doesn't change the list so a read lock
should be sufficient.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
It's set on oplock break but nothing ever looks at it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
cifs_posix_open takes a "poplock" argument that's intended to be used in
the actual posix open call to set the "Flags" field. It ignores this
value however and declares an "oplock" parameter on the stack that it
passes uninitialized to the CIFSPOSIXOpen function. Not only does this
mean that the oplock request flags are bogus, but the result that's
expected to be in that variable is unchanged.
Fix this, and also clean up the type of the oplock parameter used. Since
it's expected to be __u32, we should use that everywhere and not
implicitly cast it from a signed type.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
commit ac68392460 ("[CIFS] Allow raw
ntlmssp code to be enabled with sec=ntlmssp") added a new bit to the
allowed security flags mask but seems to have inadvertently removed
Lanman security from the allowed flags. Add it back.
CC: Stable <stable@kernel.org>
Signed-off-by: Chuck Ebbert <cebbert@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
fix the following 'make includecheck' warning:
fs/xfs/linux-2.6/xfs_iops.c: xfs_acl.h is included more than once.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Alex Elder <aelder@sgi.com>
Let attribute group vectors be declared "const". We'd
like to let most attribute metadata live in read-only
sections... this is a start.
Signed-off-by: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Improve 'dbg_dump_lprop()' and print dark and dead space there,
decode flags, and journal heads.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The journal head names and numbers are part of the UBIFS format, so
they should be in the ubifs-media.h.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Using relative pathnames in #include statements interacts badly with
SystemTap, since the fs/ext4/*.h header files are not packaged up as
part of a distribution kernel's header files. Since systemtap doesn't
use TP_fast_assign(), we can use a blind structure definition and then
make sure the needed header files are defined before the ext4 source
files #include the trace/events/ext4.h header file.
https://bugzilla.redhat.com/show_bug.cgi?id=512478
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for-2.6.32' of git://git.kernel.dk/linux-2.6-block: (29 commits)
block: use blkdev_issue_discard in blk_ioctl_discard
Make DISCARD_BARRIER and DISCARD_NOBARRIER writes instead of reads
block: don't assume device has a request list backing in nr_requests store
block: Optimal I/O limit wrapper
cfq: choose a new next_req when a request is dispatched
Seperate read and write statistics of in_flight requests
aoe: end barrier bios with EOPNOTSUPP
block: trace bio queueing trial only when it occurs
block: enable rq CPU completion affinity by default
cfq: fix the log message after dispatched a request
block: use printk_once
cciss: memory leak in cciss_init_one()
splice: update mtime and atime on files
block: make blk_iopoll_prep_sched() follow normal 0/1 return convention
cfq-iosched: get rid of must_alloc flag
block: use interrupts disabled version of raise_softirq_irqoff()
block: fix comment in blk-iopoll.c
block: adjust default budget for blk-iopoll
block: fix long lines in block/blk-iopoll.c
block: add blk-iopoll, a NAPI like approach for block devices
...
* 'osync_cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
fsync: wait for data writeout completion before calling ->fsync
vfs: Remove generic_osync_inode() and sync_page_range{_nolock}()
fat: Opencode sync_page_range_nolock()
pohmelfs: Use new syncing helper
xfs: Convert sync_page_range() to simple filemap_write_and_wait_range()
ocfs2: Update syncing after splicing to match generic version
ntfs: Use new syncing helpers and update comments
ext4: Remove syncing logic from ext4_file_write
ext3: Remove syncing logic from ext3_file_write
ext2: Update comment about generic_osync_inode
vfs: Introduce new helpers for syncing after writing to O_SYNC file or IS_SYNC inode
vfs: Rename generic_file_aio_write_nolock
ocfs2: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
pohmelfs: Use __generic_file_aio_write instead of generic_file_aio_write_nolock
vfs: Remove syncing from generic_file_direct_write() and generic_file_buffered_write()
vfs: Export __generic_file_aio_write() and add some comments
vfs: Introduce filemap_fdatawait_range
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
GFS2: Whitespace fixes
GFS2: Remove unused sysfs file
GFS2: Be extra careful about deallocating inodes
GFS2: Remove no_formal_ino generating code
GFS2: Rename eattr.[ch] as xattr.[ch]
GFS2: Clean up of extended attribute support
GFS2: Add explanation of extended attr on-disk format
GFS2: Add "-o errors=panic|withdraw" mount options
GFS2: jumping to wrong label?
GFS2: free disk inode which is deleted by remote node -V2
GFS2: Add a document explaining GFS2's uevents
GFS2: Add sysfs link to device
GFS2: Replace assertion with proper error handling
GFS2: Improve error handling in inode allocation
GFS2: Add some more info to uevents
GFS2: Add online uevent to GFS2
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: Fix possible corruption when close races with write
udf: Perform preallocation only for regular files
udf: Remove wrong assignment in udf_symlink
udf: Remove dead code
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: (21 commits)
fs/Kconfig: move nilfs2 outside misc filesystems
nilfs2: convert nilfs_bmap_lookup to an inline function
nilfs2: allow btree code to directly call dat operations
nilfs2: add update functions of virtual block address to dat
nilfs2: remove individual gfp constants for each metadata file
nilfs2: stop zero-fill of btree path just before free it
nilfs2: remove unused btree argument from btree functions
nilfs2: remove nilfs_dat_abort_start and nilfs_dat_abort_free
nilfs2: shorten freeze period due to GC in write operation v3
nilfs2: add more check routines in mount process
nilfs2: An unassigned variable is assigned to a never used structure member
nilfs2: use GFP_NOIO for bio_alloc instead of GFP_NOWAIT
nilfs2: stop using periodic write_super callback
nilfs2: clean up nilfs_write_super
nilfs2: fix disorder of nilfs_write_super in nilfs_sync_fs
nilfs2: remove redundant super block commit
nilfs2: implement nilfs_show_options to display mount options in /proc/mounts
nilfs2: always lookup disk block address before reading metadata block
nilfs2: use semaphore to protect pointer to a writable FS-instance
nilfs2: fix format string compile warning (ino_t)
...
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: consolidate reconnect logic in smb_init routines
cifs: Replace wrtPending with a real reference count
cifs: protect GlobalOplock_Q with its own spinlock
cifs: use tcon pointer in cifs_show_options
cifs: send IPv6 addr in upcall with colon delimiters
[CIFS] Fix checkpatch warnings
PATCH] cifs: fix broken mounts when a SSH tunnel is used (try #4)
[CIFS] Memory leak in ntlmv2 hash calculation
[CIFS] potential NULL dereference in parse_DFS_referrals()
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1623 commits)
netxen: update copyright
netxen: fix tx timeout recovery
netxen: fix file firmware leak
netxen: improve pci memory access
netxen: change firmware write size
tg3: Fix return ring size breakage
netxen: build fix for INET=n
cdc-phonet: autoconfigure Phonet address
Phonet: back-end for autoconfigured addresses
Phonet: fix netlink address dump error handling
ipv6: Add IFA_F_DADFAILED flag
net: Add DEVTYPE support for Ethernet based devices
mv643xx_eth.c: remove unused txq_set_wrr()
ucc_geth: Fix hangs after switching from full to half duplex
ucc_geth: Rearrange some code to avoid forward declarations
phy/marvell: Make non-aneg speed/duplex forcing work for 88E1111 PHYs
drivers/net/phy: introduce missing kfree
drivers/net/wan: introduce missing kfree
net: force bridge module(s) to be GPL
Subject: [PATCH] appletalk: Fix skb leak when ipddp interface is not loaded
...
Fixed up trivial conflicts:
- arch/x86/include/asm/socket.h
converted to <asm-generic/socket.h> in the x86 tree. The generic
header has the same new #define's, so that works out fine.
- drivers/net/tun.c
fix conflict between 89f56d1e9 ("tun: reuse struct sock fields") that
switched over to using 'tun->socket.sk' instead of the redundantly
available (and thus removed) 'tun->sk', and 2b980dbd ("lsm: Add hooks
to the TUN driver") which added a new 'tun->sk' use.
Noted in 'next' by Stephen Rothwell.
When we close a file, we remove preallocated blocks from it. But this
truncation was not protected by i_mutex and thus it could have raced with a
write through a different fd and cause crashes or even filesystem corruption.
Signed-off-by: Jan Kara <jack@suse.cz>
So far we preallocated blocks also for directories but that brings a
problem, when to get rid of preallocated blocks we don't need. So far
we removed them in udf_clear_inode() which has a disadvantage that
1) blocks are unavailable long after writing to a directory finished
and thus one can get out of space unnecessarily early
2) releasing blocks from udf_clear_inode is problematic because VFS
does not expect us to redirty inode there and it also slows down
memory reclaim.
So preallocate blocks only for regular files where we can drop preallocation
in udf_release_file.
Signed-off-by: Jan Kara <jack@suse.cz>
Recomputation of the pointer was wrong (it should have been just increment).
Luckily, we never use the computed value. Remove it.
Signed-off-by: Jan Kara <jack@suse.cz>
Currenly vfs_fsync(_range) first calls filemap_fdatawrite to write out
the data, the calls into ->fsync to write out the metadata and then finally
calls filemap_fdatawait to wait for the data I/O to complete. What sounds
like a clever micro-optimization actually is nast trap for many filesystems.
For many modern filesystems i_size or other inode information is only
updated on I/O completion and we need to wait for I/O to finish before
we can write out the metadata. For old fashionen filesystems that
instanciate blocks during the actual write and also update the metadata
at that point it opens up a large window were we could expose uninitialized
blocks after a crash. While a few filesystems that need it already wait
for the I/O to finish inside their ->fsync methods it is rather suboptimal
as it is done under the i_mutex and also always for the whole file instead
of just a part as we could do for O_SYNC handling.
Here is a small audit of all fsync instances in the tree:
- spufs_mfc_fsync:
- ps3flash_fsync:
- vol_cdev_fsync:
- printer_fsync:
- fb_deferred_io_fsync:
- bad_file_fsync:
- simple_sync_file:
don't care - filesystems/drivers do't use the page cache or are
purely in-memory.
- simple_fsync:
- file_fsync:
- affs_file_fsync:
- fat_file_fsync:
- jfs_fsync:
- ubifs_fsync:
- reiserfs_dir_fsync:
- reiserfs_sync_file:
never touch pagecache themselves. We need to wait before if we do
not want to expose stale data after an allocation.
- afs_fsync:
- fuse_fsync_common:
do the waiting writeback itself in awkward ways, would benefit from
proper semantics
- block_fsync:
Does a filemap_write_and_wait on the block device inode. Because we
now have f_mapping that is the same inode we call it on in vfs_fsync.
So just removing it and letting the VFS do the work in one go would
be an improvement.
- btrfs_sync_file:
- cifs_fsync:
- xfs_file_fsync:
need the wait first and currently do it themselves. would benefit from
doing it outside i_mutex.
- coda_fsync:
- ecryptfs_fsync:
- exofs_file_fsync:
- shm_fsync:
only passes the fsync through to the lower layer
- ext3_sync_file:
doesn't seem to care, comments are confusing.
- ext4_sync_file:
would need the wait to work correctly for delalloc mode with late
i_size updates. Otherwise the ext3 comment applies.
currently implemens it's own writeback and wait in an odd way,
could benefit from doing it properly.
- gfs2_fsync:
not needed for journaled data mode, but probably harmless there.
Currently writes back data asynchronously itself. Needs some
major audit.
- hostfs_fsync:
just calls fsync/datasync on the host FD. Without the wait before
data might not even be inflight yet if we're unlucky.
- hpfs_file_fsync:
- ncp_fsync:
no-ops. Dangerous before and after.
- jffs2_fsync:
just calls jffs2_flush_wbuf_gc, not sure how this relates to data.
- nfs_fsync_dir:
just increments stats, claims all directory operations are synchronous
- nfs_file_fsync:
only writes out data??? Looks very odd.
- nilfs_sync_file:
looks like it expects all data done, but not sure from the code
- ntfs_dir_fsync:
- ntfs_file_fsync:
appear to do their own data writeback. Very convoluted code.
- ocfs2_sync_file:
does it's own data writeback, but no wait. probably needs the wait.
- smb_fsync:
according to a comment expects all pages written already, probably needs
the wait before.
This patch only changes vfs_fsync_range, removal of the wait in the methods
that have it is left to the filesystem maintainers. Note that most
filesystems really do need an audit for their fsync methods given the
gems found in this very brief audit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
fat_cont_expand() is the only user of sync_page_range_nolock(). It's also the
only user of generic_osync_inode() which does not have a file open. So
opencode needed actions for FAT so that we can convert generic_osync_inode() to
a standard syncing path.
Update a comment about generic_osync_inode().
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
Christoph Hellwig says that it is enough for XFS to call
filemap_write_and_wait_range() instead of sync_page_range() because we do
all the metadata syncing when forcing the log.
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Update ocfs2 specific splicing code to use generic syncing helper. The sync now
does not happen under rw_lock because generic_write_sync() acquires i_mutex
which ranks above rw_lock. That should not matter because standard fsync path
does not hold it either.
Acked-by: Joel Becker <Joel.Becker@oracle.com>
Acked-by: Mark Fasheh <mfasheh@suse.com>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
Use new syncing helpers in .write and .aio_write functions. Also
remove superfluous syncing in ntfs_file_buffered_write() and update
comments about generic_osync_inode().
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
The syncing is now properly handled by generic_file_aio_write() so
no special ext4 code is needed.
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>
Syncing is now properly done by generic_file_aio_write() so no special logic is
needed in ext3.
CC: linux-ext4@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Introduce new function for generic inode syncing (vfs_fsync_range) and use
it from fsync() path. Introduce also new helper for syncing after a sync
write (generic_write_sync) using the generic function.
Use these new helpers for syncing from generic VFS functions. This makes
O_SYNC writes to block devices acquire i_mutex for syncing. If we really
care about this, we can make block_fsync() drop the i_mutex and reacquire
it before it returns.
CC: Evgeniy Polyakov <zbr@ioremap.net>
CC: ocfs2-devel@oss.oracle.com
CC: Joel Becker <joel.becker@oracle.com>
CC: Felix Blyakher <felixb@sgi.com>
CC: xfs@oss.sgi.com
CC: Anton Altaparmakov <aia21@cantab.net>
CC: linux-ntfs-dev@lists.sourceforge.net
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
generic_file_aio_write_nolock() is now used only by block devices and raw
character device. Filesystems should use __generic_file_aio_write() in case
generic_file_aio_write() doesn't suit them. So rename the function to
blkdev_aio_write() and move it to fs/blockdev.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Use the new helper. We have to submit data pages ourselves in case of O_SYNC
write because __generic_file_aio_write does not do it for us. OCFS2 developpers
might think about moving the sync out of i_mutex which seems to be easily
possible but that's out of scope of this patch.
CC: ocfs2-devel@oss.oracle.com
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Some people asked me questions like the following:
On Wed, 15 Jul 2009 13:11:21 +0200, Leon Woestenberg wrote:
> just wondering, any reasons why NILFS2 is one of the miscellaneous
> filesystems and, for example, btrfs, is not in Kconfig?
Actually, nilfs is NOT a filesystem came from other operating systems,
but a filesystem created purely for Linux. Nor is it a flash
filesystem but that for generic block devices.
So, this moves nilfs outside the misc category as I responded in LKML
"Re: Why does NILFS2 hide under Miscellaneous filesystems?"
(Message-Id: <20090716.002526.93465395.ryusuke@osrg.net>).
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The nilfs_bmap_lookup() is now a wrapper function of
nilfs_bmap_lookup_at_level().
This moves the nilfs_bmap_lookup() to a header file converting it to
an inline function and gives an opportunity for optimization.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The current btree code is written so that btree functions call dat
operations via wrapper functions in bmap.c when they allocate, free,
or modify virtual block addresses.
This abstraction requires additional function calls and causes
frequent call of nilfs_bmap_get_dat() function since it is used in the
every wrapper function.
This removes the wrapper functions and makes them available from
btree.c and direct.c, which will increase the opportunity of
compiler optimization.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This is a preparation for the successive cleanup ("nilfs2: allow btree
to directly call dat operations").
This adds functions bundling a few operations to change an entry of
virtual block address on the dat file.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This gets rid of NILFS_CPFILE_GFP, NILFS_SUFILE_GFP, NILFS_DAT_GFP,
and NILFS_IFILE_GFP. All of these constants refer to NILFS_MDT_GFP,
and can be removed.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The btree path object is cleared just before it is freed.
This will remove the code doing the unnecessary clear operation.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Even though many btree functions take a btree object as their first
argument, most of them are not used in their functions.
This sticky use of the btree argument is hurting code readability and
giving the possibility of inefficient code generation.
So, this removes the unnecessary btree arguments.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This is a re-revised patch to shorten freeze period.
This version include a fix of the bug Konishi-san mentioned last time.
When GC is runnning, GC moves live block to difference segments.
Copying live blocks into memory is done in a transaction,
however it is not necessarily to be in the transaction.
This patch will get the nilfs_ioctl_move_blocks() out from
transaction lock and put it before the transaction.
I ran sysbench fileio test against nilfs partition.
I copied some DVD/CD images and created snapshot to create live blocks
before starting the benchmark.
Followings are summary of rc8 and rc8 w/ the patch of per-request
statistics, which is min/max and avg. I ran each test three times and
bellow is average of those numers.
According to this benchmark result, average time is slightly degrated.
However, worstcase (max) result is significantly improved.
This can address a few seconds write freeze.
- random write per-request performance of rc8
min 0.843ms
max 680.406ms
avg 3.050ms
- random write per-request performance of rc8 w/ this patch
min 0.843ms -> 100.00%
max 380.490ms -> 55.90%
avg 3.233ms -> 106.00%
- sequential write per-request performance of rc8
min 0.736ms
max 774.343ms
avg 2.883ms
- sequential write per-request performance of rc8 w/ this patch
min 0.720ms -> 97.80%
max 644.280ms-> 83.20%
avg 3.130ms -> 108.50%
-----8<-----8<-----nilfs_cleanerd.conf-----8<-----8<-----
protection_period 150
selection_policy timestamp # timestamp in ascend order
nsegments_per_clean 2
cleaning_interval 2
retry_interval 60
use_mmap
log_priority info
-----8<-----8<-----nilfs_cleanerd.conf-----8<-----8<-----
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs2: Add more safeguard routines and protections in mount process,
which also makes nilfs2 report consistency error messages when
checkpoint number is invalid.
Signed-off-by: Zhu Yanhai <zhu.yanhai@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
nilfs2: In procedure 'nilfs_get_sb()', when a nilfs filesysttem is
mounted for the first time, local variable 'nilfs->ns_last_cno' is
used before loading the latest checkpoint number from disk (in
'nilfs_fill_super'). 'nilfs->ns_last_cno' is assigned to 'sd.cno', but
'sd.cno' has never been used in the procedure.
Signed-off-by: Zhang Qiang <zhangqiang.buaa@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Alberto Bertogli advised me about bio_alloc() use in nilfs:
On Sat, 13 Jun 2009 22:52:40 -0300, Alberto Bertogli wrote:
> By the way, those bio_alloc()s are using GFP_NOWAIT but it looks
> like they could use at least GFP_NOIO or GFP_NOFS, since the caller
> can (and sometimes do) sleep. The only caller is nilfs_submit_bh(),
> which calls nilfs_submit_seg_bio() which can sleep calling
> wait_for_completion().
This takes in the comment and replaces the use of GFP_NOWAIT flag with
GFP_NOIO.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This removes nilfs_write_super and commit super block in nilfs
internal thread, instead of periodic write_super callback.
VFS layer calls ->write_super callback periodically. However,
it looks like that calling back is ommited when disk I/O is busy.
And when cleanerd (nilfs GC) is runnig, disk I/O tend to be busy thus
nilfs superblock is not synchronized as nilfs designed.
To avoid it, syncing superblock by nilfs thread instead of pdflush.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Separate conditions that check if syncing super block and alternative
super block are required as inline functions to reuse the conditions.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This fixes disorder of nilfs_write_super in nilfs_sync_fs. Commiting
super block must be the end of the function so that every changes are
reflected.
->sync_fs() is not called frequently so this makes nilfs_sync_fs call
nilfs_commit_super instead of nilfs_write_super.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This removes redundant super block commit.
nilfs_write_super will call nilfs_commit_super to store super block
into block device. However, nilfs_put_super will call
nilfs_commit_super right after calling nilfs_write_super. So calling
nilfs_write_super in nilfs_put_super would be redundant.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This is a patch to display mount options in procfs.
Mount options will show up in the /proc/mounts as other fs does.
...
/dev/sda6 /mnt nilfs2 ro,relatime,barrier=off,cp=3,order=strict 0 0
...
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The current metadata file code skips disk address lookup for its data
block if the buffer has a mapped flag.
This has a potential risk to cause read request to be performed
against the stale block address that GC moved, and it may lead to meta
data corruption. The mapped flag is safe if the buffer has an
uptodate flag, otherwise it may prevent necessary update of disk
address in the next read.
This will avoid the potential problem by ensuring disk address lookup
before reading metadata block even for buffers with the mapped flag.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
will get rid of nilfs_get_writer() and nilfs_put_writer() pair used to
retain a writable FS-instance for a period.
The pair functions were making up some kind of recursive lock with a
mutex, but they became overkill since the commit
201913ed74. Furthermore, they caused
the following lockdep warning because the mutex can be released by a
task which didn't lock it:
=====================================
[ BUG: bad unlock balance detected! ]
-------------------------------------
kswapd0/422 is trying to release lock (&nilfs->ns_writer_mutex) at:
[<c1359ff5>] mutex_unlock+0x8/0xa
but there are no more locks to release!
other info that might help us debug this:
no locks held by kswapd0/422.
stack backtrace:
Pid: 422, comm: kswapd0 Not tainted 2.6.31-rc4-nilfs #51
Call Trace:
[<c1358f97>] ? printk+0xf/0x18
[<c104fea7>] print_unlock_inbalance_bug+0xcc/0xd7
[<c11578de>] ? prop_put_global+0x3/0x35
[<c1050195>] lock_release+0xed/0x1dc
[<c1359ff5>] ? mutex_unlock+0x8/0xa
[<c1359f83>] __mutex_unlock_slowpath+0xaf/0x119
[<c1359ff5>] mutex_unlock+0x8/0xa
[<d1284add>] nilfs_mdt_write_page+0xd8/0xe1 [nilfs2]
[<c1092653>] shrink_page_list+0x379/0x68d
[<c109171b>] ? isolate_pages_global+0xb4/0x18c
[<c1092bd2>] shrink_list+0x26b/0x54b
[<c10930be>] shrink_zone+0x20c/0x2a2
[<c10936b7>] kswapd+0x407/0x591
[<c1091667>] ? isolate_pages_global+0x0/0x18c
[<c1040603>] ? autoremove_wake_function+0x0/0x33
[<c10932b0>] ? kswapd+0x0/0x591
[<c104033b>] kthread+0x69/0x6e
[<c10402d2>] ? kthread+0x0/0x6e
[<c1003e33>] kernel_thread_helper+0x7/0x1a
This patch uses a reader/writer semaphore instead of the own lock and
kills this warning.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Unlike on most other architectures ino_t is an unsigned int on s390.
So add an explicit cast to avoid this compile warning:
fs/nilfs2/recovery.c: In function 'recover_dsync_blocks':
fs/nilfs2/recovery.c:555: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'ino_t'
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
The __nilfs_read_inode function is ignoring the error code returned
from nilfs_read_inode_common(), and wrongly delivers a success code
(zero) when it escapes from the function in erroneous cases.
This adds the missing error handling.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
blk_ioctl_discard duplicates large amounts of code from blkdev_issue_discard,
the only difference between the two is that blkdev_issue_discard needs to
send a barrier discard request and blk_ioctl_discard a non-barrier one,
and blk_ioctl_discard needs to wait on the request. To facilitates this
add a flags argument to blkdev_issue_discard to control both aspects of the
behaviour. This will be very useful later on for using the waiting
funcitonality for other callers.
Based on an earlier patch from Matthew Wilcox <matthew@wil.cx>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Currently, there is a single in_flight counter measuring the number of
requests in the request_queue. But some monitoring tools would like to
know how many read requests and write requests are in progress. Split the
current in_flight counter into two seperate counters for read and write.
This information is exported as a sysfs attribute, as changing the
currently available stat files would break the existing tools.
Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Until now, trying to unlock the reiserfs write lock whereas the current
task doesn't hold it lead to a simple warning.
We should actually warn and panic in this case to avoid the user datas
to reach an unstable state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
reiserfs_commit_write() is always called with the write lock held.
Thus the current calls to reiserfs_write_lock() in this function are
acquiring the lock recursively.
We can safely drop them.
This also solves further assumptions for this lock to be really
released while calling reiserfs_write_unlock().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
reiserfs_mkdir() acquires the reiserfs lock, assuming it has been called
from the dir inodes callbacks, without the lock held.
But it can also be called from other internal sites such as
reiserfs_xattr_init() which already holds the lock. This recursive
locking leads to further wrong assumptions. For example, later calls
to reiserfs_mutex_lock_safe() won't actually unlock the reiserfs lock
the time we acquire a given mutex, creating unexpected lock inversions.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
reiserfs_xattr_init is called with the reiserfs write lock held, but
if the ".reiserfs_priv" entry is not created, we take the superblock
root directory inode mutex until .reiserfs_priv is created.
This creates a lock dependency inversion against other sites such as
reiserfs_file_release() which takes an inode mutex and the reiserfs
lock after.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Laurent Riffard <laurent.riffard@free.fr>
When do_balance() balances the tree, a trick is performed to
provide the ability for other tree writers/readers to check whether
do_balance() is executing concurrently (requires CONFIG_REISERFS_CHECK).
This is done to protect concurrent accesses to the tree. The trick
is the following:
When do_balance is called, a unique global variable called cur_tb
takes a pointer to the current tree to be rebalanced.
Once do_balance finishes its work, cur_tb takes the NULL value.
Then, concurrent tree readers/writers just have to check the value
of cur_tb to ensure do_balance isn't executing concurrently.
If it is, then it proves that schedule() occured on do_balance(),
which then relaxed the bkl that protected the tree.
Now that the bkl has be turned into a mutex, this check is still
fine even though do_balance() becomes preemptible: the write lock
will not be automatically released on schedule(), so the tree is
still protected.
But this is only fine if we have a single reiserfs mountpoint.
Indeed, because the bkl is a global lock, it didn't allowed
concurrent executions between a tree reader/writer in a mount point
and a do_balance() on another tree from another mountpoint.
So assuming all these readers/writers weren't supposed to be
reentrant, the current check now sometimes detect false positives with
the current per-superblock mutex which allows this reentrancy.
This patch keeps the concurrent tree accesses check but moves it
per superblock, so that only trees from a same mount point are
checked to be not accessed concurrently.
[ Impact: fix spurious panic while running several reiserfs mount-points ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
While searching a pathname, an inode mutex can be acquired
in do_lookup() which calls reiserfs_lookup() which in turn
acquires the write lock.
On the other side reiserfs_fill_super() can acquire the write_lock
and then call reiserfs_lookup_privroot() which can acquire an
inode mutex (the root of the mount point).
So we theoretically risk an AB - BA lock inversion that could lead
to a deadlock.
As for other lock dependencies found since the bkl to mutex
conversion, the fix is to use reiserfs_mutex_lock_safe() which
drops the lock dependency to the write lock.
[ Impact: fix a possible deadlock with reiserfs ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
search_by_key() is the site which most requires the lock.
This is mostly because it is a very central function and also
because it releases/reaqcuires the write lock at least once each
time it is called.
Such release/reacquire creates a lot of contention in this place and
also opens more the window which let another thread changing the tree.
When it happens, the current path searching over the tree must be
retried from the beggining (the root) which is a wasteful and
time consuming recovery.
This patch factorizes two release/reacquire sequences:
- reading leaf nodes blocks
- reading current block
The latter immediately follows the former.
The whole sequence is safe as a single unlocked section because
we check just after if the tree has changed during these operations.
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
reiserfs_mutex_lock_safe() is a hack to avoid any dependency between
an internal reiserfs mutex and the write lock, it has been proposed
to follow the old bkl logic.
The code does the following:
while (!mutex_trylock(m)) {
reiserfs_write_unlock(s);
schedule();
reiserfs_write_lock(s);
}
It then imitate the implicit behaviour of the lock when it was
a Bkl and hadn't such dependency:
mutex_lock(m) {
if (fastpath)
let's go
else {
wait_for_mutex() {
schedule() {
unlock_kernel()
reacquire_lock_kernel()
}
}
}
}
The problem is that by using such explicit schedule(), we don't
benefit of the adaptive mutex spinning on owner.
The logic in use now is:
reiserfs_write_unlock(s);
mutex_lock(m); // -> possible adaptive spinning
reiserfs_write_lock(s);
[ Impact: restore the use of adaptive spinning mutexes in reiserfs ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
reiserfs_write_end() is a hot path in reiserfs.
We have two wasteful write lock lock/release inside that can be gathered
without changing the code logic.
This patch factorizes them out in a single protected section, reducing the
number of contentions inside.
[ Impact: reduce lock contention in a reiserfs hotpath ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
search_by_key() is a central function in reiserfs which searches
the patch in the fs tree from the root to a node given its key.
It is the function that is most requesting the write lock
because it's a path very often used.
Also we forget to release the lock while reading the next tree node,
making us holding the lock in a wasteful way.
Then we release the lock while reading the current node and its childs,
all-in-one. It should be safe because we have a reference to these
blocks and even if we read a block that will be concurrently changed,
we have an fs_changed check later that will make us retry the path from
the root.
[ Impact: release the write lock while unused in a hot path ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
The write lock can be acquired recursively in reiserfs_lookup(). But we may
want to *really* release the lock before possible rescheduling from a
reiserfs_lookup() callee.
Hence we want to only acquire the lock once (ie: not recursively).
[ Impact: prevent from possible false unreleased write lock on sleeping ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
reiserfs_get_block() is one of these sites where the write lock might
be acquired recursively.
It's a particular problem because this function is called very often.
It's a hot spot which needs to reschedule() periodically while converting
direct items to indirect ones because it can take some time.
Then if we are applying the write lock release/reacquire pattern on
schedule() here, it may not produce the desired effect since we may have
locked in more than one depth.
The solution is to use reiserfs_write_lock_once() which won't try
to reacquire the lock recursively. Then the lock will be *really*
released before schedule().
Also, we only release the lock if TIF_NEED_RESCHED is set to not
create wasteful numerous contentions.
[ Impact: fix a too long holded lock case in reiserfs_get_block() ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
ll_rw_block() might sleep, and the bkl was released at this point. Then
we can also relax the write lock at this point.
[ Impact: release the reiserfs write lock when it is not needed ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block. This
helper might sleep.
Then, when the bkl was used, it was released at this point. We can then
relax the write lock too here.
[ Impact: release the reiserfs write lock when it is not needed ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
get_neighbors() is used to get the left and/or right blocks
against a given one in order to balance a tree.
sb_bread() is used to read the buffer of these neighors blocks and
while it waits for this operation, it might sleep.
The bkl was released at this point, and then we can also release
the write lock before calling sb_bread().
This is safe because if the filesystem is changed after this
lock release, the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED
in the function header comments) in order to repeat the neighbhor
research.
[ Impact: release the reiserfs write lock when it is not needed ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
prepare_for_delete_or_cut() can process several types of items, including
indirect items, ie: items which contain no file data but pointers to
unformatted nodes scattering the datas of a file.
In this case it has to zero out these pointers to block numbers of
unformatted nodes and release the bitmap from these block numbers.
It can take some time, so a rescheduling() is performed between each
block processed. We can safely release the write lock while
rescheduling(), like the bkl did, because the code checks just after
if the item has moved after sleeping.
[ Impact: release the reiserfs write lock when it is not needed ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
When do_journal_end() copies data to the journal blocks buffers in memory,
it reschedules if needed between each block copied and dirtyfied.
We can also release the write lock at this rescheduling stage,
like did the bkl implicitly.
[ Impact: release the reiserfs write lock when it is not needed ]
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Impact: fix a deadlock
reiserfs_dirty_inode() is the super_operations::dirty_inode() callback
of reiserfs. It can be called from different contexts where the write
lock can be already held.
But this function also grab the write lock (possibly recursively).
Subsequent release of the lock before sleep will actually not release
the lock if the caller of mark_inode_dirty() (which in turn calls
reiserfs_dirty_inode()) already owns the lock.
A typical case:
reiserfs_write_end() {
acquire_write_lock()
mark_inode_dirty() {
reiserfs_dirty_inode() {
reacquire_write_lock() {
journal_begin() {
do_journal_begin_r() {
/*
* fail to release, still
* one depth of lock
*/
release_write_lock()
reiserfs_wait_on_write_block() {
wait_event()
The event is usually provided by something which needs the write lock but
it hasn't been released.
We use reiserfs_write_lock_once() here to ensure we only grab the
write lock in one level.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@texware.it>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
LKML-Reference: <1239680065-25013-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix a deadlock
reiserfs_truncate_file() can be called from multiple context where
the write lock can be already hold or not.
This function also acquire (possibly recursively) the write
lock. Subsequent releases before sleeping will not actually release
the lock because we may be in more than one lock depth degree.
A typical case is:
reiserfs_file_release {
acquire_the_lock()
reiserfs_truncate_file()
reacquire_the_lock()
journal_begin() {
do_journal_begin_r() {
reiserfs_wait_on_write_block() {
/*
* Not released because still one
* depth owned
*/
release_lock()
wait_for_event()
At this stage the event never happen because the one which provides
it needs the write lock.
We use reiserfs_write_lock_once() here to ensure that we don't acquire the
write lock recursively.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@texware.it>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
LKML-Reference: <1239680065-25013-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Sometimes we don't want to recursively hold the per superblock write
lock because we want to be sure it is actually released when we come
to sleep.
This patch introduces the necessary tools for that.
reiserfs_write_lock_once() does the same job than reiserfs_write_lock()
except that it won't try to acquire recursively the lock if the current
task already owns it. Also the lock_depth before the call of this function
is returned.
reiserfs_write_unlock_once() unlock only if reiserfs_write_lock_once()
returned a depth equal to -1, ie: only if it actually locked.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@texware.it>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Cc: Chris Mason <chris.mason@oracle.com>
LKML-Reference: <1239680065-25013-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix a deadlock
The j_flush_mutex is acquired safely in journal.c:
if we can't take it, we free the reiserfs per superblock lock
and wait a bit.
But we have a remaining place in kupdate_transactions() where
j_flush_mutex is still acquired traditionnaly. Thus the following
scenario (warned by lockdep) can happen:
A B
mutex_lock(&write_lock) mutex_lock(&write_lock)
mutex_lock(&j_flush_mutex) mutex_lock(&j_flush_mutex) //block
mutex_unlock(&write_lock)
sleep...
mutex_lock(&write_lock) //deadlock
Fix this by using reiserfs_mutex_lock_safe() in kupdate_transactions().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@texware.it>
Cc: Jeff Mahoney <jeffm@suse.com>
LKML-Reference: <1239660635-12940-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch is an attempt to remove the Bkl based locking scheme from
reiserfs and is intended.
It is a bit inspired from an old attempt by Peter Zijlstra:
http://lkml.indiana.edu/hypermail/linux/kernel/0704.2/2174.html
The bkl is heavily used in this filesystem to prevent from
concurrent write accesses on the filesystem.
Reiserfs makes a deep use of the specific properties of the Bkl:
- It can be acqquired recursively by a same task
- It is released on the schedule() calls and reacquired when schedule() returns
The two properties above are a roadmap for the reiserfs write locking so it's
very hard to simply replace it with a common mutex.
- We need a recursive-able locking unless we want to restructure several blocks
of the code.
- We need to identify the sites where the bkl was implictly relaxed
(schedule, wait, sync, etc...) so that we can in turn release and
reacquire our new lock explicitly.
Such implicit releases of the lock are often required to let other
resources producer/consumer do their job or we can suffer unexpected
starvations or deadlocks.
So the new lock that replaces the bkl here is a per superblock mutex with a
specific property: it can be acquired recursively by a same task, like the
bkl.
For such purpose, we integrate a lock owner and a lock depth field on the
superblock information structure.
The first axis on this patch is to turn reiserfs_write_(un)lock() function
into a wrapper to manage this mutex. Also some explicit calls to
lock_kernel() have been converted to reiserfs_write_lock() helpers.
The second axis is to find the important blocking sites (schedule...(),
wait_on_buffer(), sync_dirty_buffer(), etc...) and then apply an explicit
release of the write lock on these locations before blocking. Then we can
safely wait for those who can give us resources or those who need some.
Typically this is a fight between the current writer, the reiserfs workqueue
(aka the async commiter) and the pdflush threads.
The third axis is a consequence of the second. The write lock is usually
on top of a lock dependency chain which can include the journal lock, the
flush lock or the commit lock. So it's dangerous to release and trying to
reacquire the write lock while we still hold other locks.
This is fine with the bkl:
T1 T2
lock_kernel()
mutex_lock(A)
unlock_kernel()
// do something
lock_kernel()
mutex_lock(A) -> already locked by T1
schedule() (and then unlock_kernel())
lock_kernel()
mutex_unlock(A)
....
This is not fine with a mutex:
T1 T2
mutex_lock(write)
mutex_lock(A)
mutex_unlock(write)
// do something
mutex_lock(write)
mutex_lock(A) -> already locked by T1
schedule()
mutex_lock(write) -> already locked by T2
deadlock
The solution in this patch is to provide a helper which releases the write
lock and sleep a bit if we can't lock a mutex that depend on it. It's another
simulation of the bkl behaviour.
The last axis is to locate the fs callbacks that are called with the bkl held,
according to Documentation/filesystem/Locking.
Those are:
- reiserfs_remount
- reiserfs_fill_super
- reiserfs_put_super
Reiserfs didn't need to explicitly lock because of the context of these callbacks.
But now we must take care of that with the new locking.
After this patch, reiserfs suffers from a slight performance regression (for now).
On UP, a high volume write with dd reports an average of 27 MB/s instead
of 30 MB/s without the patch applied.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Ingo Molnar <mingo@elte.hu>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Bron Gondwana <brong@fastmail.fm>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
LKML-Reference: <1239070789-13354-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* git://git.linux-nfs.org/projects/trondmy/nfs-2.6: (87 commits)
NFSv4: Disallow 'mount -t nfs4 -overs=2' and 'mount -t nfs4 -overs=3'
NFS: Allow the "nfs" file system type to support NFSv4
NFS: Move details of nfs4_get_sb() to a helper
NFS: Refactor NFSv4 text-based mount option validation
NFS: Mount option parser should detect missing "port="
NFS: out of date comment regarding O_EXCL above nfs3_proc_create()
NFS: Handle a zero-length auth flavor list
SUNRPC: Ensure that sunrpc gets initialised before nfs, lockd, etc...
nfs: fix compile error in rpc_pipefs.h
nfs: Remove reference to generic_osync_inode from a comment
SUNRPC: cache must take a reference to the cache detail's module on open()
NFS: Use the DNS resolver in the mount code.
NFS: Add a dns resolver for use with NFSv4 referrals and migration
SUNRPC: Fix a typo in cache_pipefs_files
nfs: nfs4xdr: optimize low level decoding
nfs: nfs4xdr: get rid of READ_BUF
nfs: nfs4xdr: simplify decode_exchange_id by reusing decode_opaque_inline
nfs: nfs4xdr: get rid of COPYMEM
nfs: nfs4xdr: introduce decode_sessionid helper
nfs: nfs4xdr: introduce decode_verifier helper
...
The s_flex_groups array should have been initialized using atomic_add
to sum up the free counts from the block groups that make up a
flex_bg. By using atomic_set, the value of the s_flex_groups array
was set to the values of the last block group in the flex_bg.
The impact of this bug is that the block and inode allocation
algorithms might not pick the best flex_bg for new allocation.
Thanks to Damien Guibouret for pointing out this problem!
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (64 commits)
sched: Fix sched::sched_stat_wait tracepoint field
sched: Disable NEW_FAIR_SLEEPERS for now
sched: Keep kthreads at default priority
sched: Re-tune the scheduler latency defaults to decrease worst-case latencies
sched: Turn off child_runs_first
sched: Ensure that a child can't gain time over it's parent after fork()
sched: enable SD_WAKE_IDLE
sched: Deal with low-load in wake_affine()
sched: Remove short cut from select_task_rq_fair()
sched: Turn on SD_BALANCE_NEWIDLE
sched: Clean up topology.h
sched: Fix dynamic power-balancing crash
sched: Remove reciprocal for cpu_power
sched: Try to deal with low capacity, fix update_sd_power_savings_stats()
sched: Try to deal with low capacity
sched: Scale down cpu_power due to RT tasks
sched: Implement dynamic cpu_power
sched: Add smt_gain
sched: Update the cpu_power sum during load-balance
sched: Add SD_PREFER_SIBLING
...
When btrfs_get_extent is reading inline file items for readpage,
it needs to copy the inline extent into the page. If the
inline extent doesn't cover all of the page, that means there
is a hole in the file, or that our file is smaller than one
page.
readpage does zeroing for the case where the file is smaller than one
page, but nobody is currently zeroing for the case where there is
a hole after the inline item.
This commit changes btrfs_get_extent to zero fill the page past
the end of the inline item.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This closes a whole where the page may be written before
the page_mkwrite caller has a chance to dirty it
(thanks to Nick Piggin)
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Data COW means that whenever we write to a file, we replace any old
extent pointers with new ones. There was a window where a readpage
might find the old extent pointers on disk and cache them in the
extent_map tree in ram in the middle of a given write replacing them.
Even though both the readpage and the write had their respective bytes
in the file locked, the extent readpage inserts may cover more bytes than
it had locked down.
This commit closes the race by keeping the new extent pinned in the extent
map tree until after the on-disk btree is properly setup with the new
extent pointers.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs writes go through delalloc to the data=ordered code. This
makes sure that all of the data is on disk before the metadata
that references it. The tracking means that we have to make sure
each page in an extent is fully written before we add that extent into
the on-disk btree.
This was done in the past by setting the EXTENT_ORDERED bit for the
range of an extent when it was added to the data=ordered code, and then
clearing the EXTENT_ORDERED bit in the extent state tree as each page
finished IO.
One of the reasons we had to do this was because sometimes pages are
magically dirtied without page_mkwrite being called. The EXTENT_ORDERED
bit is checked at writepage time, and if it isn't there, our page become
dirty without going through the proper path.
These bit operations make for a number of rbtree searches for each page,
and can cause considerable lock contention.
This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
As pages go into the ordered code, PagePrivate2 is set on each one.
This is a cheap operation because we already have all the pages locked
and ready to go.
As IO finishes, the PagePrivate2 bit is cleared and the ordered
accoutning is updated for each page.
At writepage time, if the PagePrivate2 bit is missing, we go into the
writepage fixup code to handle improperly dirtied pages.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the btrfs code to find delalloc ranges in the extent state
tree to use the new state caching code from set/test bit. It reduces
one of the biggest causes of rbtree searches in the writeback path.
test_range_bit is also modified to take the cached state as a starting
point while searching.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
At writepage time, we have the page locked and we have the
extent_map entry for this extent pinned in the extent_map tree.
So, the page can't go away and its mapping can't change.
There is no need for the extra extent_state lock bits during writepage.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Many of the btrfs extent state tree users follow the same pattern.
They lock an extent range in the tree, do some operation and then
unlock.
This translates to at least 2 rbtree searches, and maybe more if they
are doing operations on the extent state tree. A locked extent
in the tree isn't going to be merged or changed, and so we can
safely return the extent state structure as a cached handle.
This changes set_extent_bit to give back a cached handle, and also
changes both set_extent_bit and clear_extent_bit to use the cached
handle if it is available.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Btrfs is currently mirroring some of the page state bits into
its extent state tree. The goal behind this was to use it in supporting
blocksizes other than the page size.
But, we don't currently support that, and we're using quite a lot of CPU
on the rb tree and its spin lock. This commit starts a series of
cleanups to reduce the amount of work done in the extent state tree as
part of each IO.
This commit:
* Adds the ability to lock an extent in the state tree and also set
other bits. The idea is to do locking and delalloc in one call
* Removes the EXTENT_WRITEBACK and EXTENT_DIRTY bits. Btrfs is using
a combination of the page bits and the ordered write code for this
instead.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
As the extent state tree is manipulated, there are call backs
that are used to take extra actions when different state bits are set
or cleared. One example of this is a counter for the total number
of delayed allocation bytes in a single inode and in the whole FS.
When new states are inserted, this callback is being done before we
properly setup the new state. This hasn't caused problems before
because the lock bit was always done first, and the existing call backs
don't care about the lock bit.
This patch makes sure the state is properly setup before using the
callback, which is important for later optimizations that do more work
without using the lock bit.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
There are two main users of the extent_map tree. The
first is regular file inodes, where it is evenly spread
between readers and writers.
The second is the chunk allocation tree, which maps blocks from
logical addresses to phyiscal ones, and it is 99.99% reads.
The mapping tree is a point of lock contention during heavy IO
workloads, so this commit switches things to a rw lock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs io submission thread tries to back off congested devices in
favor of rotating off to another disk.
But, it tries to make sure it submits at least some IO before rotating
on (the others may be congested too), and so it has a magic number of
requests it tries to write before it hops.
This makes the magic number smaller. Testing shows that we're spending
too much time on congested devices and leaving the other devices idle.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When btrfs fills a large delayed allocation extent, it is a good idea
to try and convince the write_cache_pages caller to go ahead and
write a good chunk of that extent. The extra IO is basically free
because we know it is contiguous.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This changes the btrfs worker threads to batch work items
into a local list. It allows us to pull work items in
large chunks and significantly reduces the number of times we
need to take the worker thread spinlock.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The btrfs worker thread spinlock was being used both for the
queueing of IO and for the processing of ordered events.
The ordered events never happen from end_io handlers, and so they
don't need to use the _irq version of spinlocks. This adds a
dedicated lock to the ordered lists so they don't have to run
with irqs off.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The Btrfs set_extent_bit call currently searches the rbtree
every time it needs to find more extent_state objects to fill
the requested operation.
This adds a simple test with rb_next to see if the next object
in the tree was adjacent to the one we just found. If so,
we skip the search and just use the next object.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The Btrfs worker threads don't currently die off after they have
been idle for a while, leading to a lot of threads sitting around
doing nothing for each mount.
Also, they are unable to start atomically (from end_io hanlders).
This commit reworks the worker threads so they can be started
from end_io handlers (just setting a flag that asks for a thread
to be added at a later date) and so they can exit if they
have been idle for a long time.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
writeback: check for registered bdi in flusher add and inode dirty
writeback: add name to backing_dev_info
writeback: add some debug inode list counters to bdi stats
writeback: get rid of pdflush completely
writeback: switch to per-bdi threads for flushing data
writeback: move dirty inodes from super_block to backing_dev_info
writeback: get rid of generic_sync_sb_inodes() export
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (57 commits)
binfmt_elf: fix PT_INTERP bss handling
TPM: Fixup boot probe timeout for tpm_tis driver
sysfs: Add labeling support for sysfs
LSM/SELinux: inode_{get,set,notify}secctx hooks to access LSM security context information.
VFS: Factor out part of vfs_setxattr so it can be called from the SELinux hook for inode_setsecctx.
KEYS: Add missing linux/tracehook.h #inclusions
KEYS: Fix default security_session_to_parent()
Security/SELinux: includecheck fix kernel/sysctl.c
KEYS: security_cred_alloc_blank() should return int under all circumstances
IMA: open new file for read
KEYS: Add a keyctl to install a process's session keyring on its parent [try #6]
KEYS: Extend TIF_NOTIFY_RESUME to (almost) all architectures [try #6]
KEYS: Do some whitespace cleanups [try #6]
KEYS: Make /proc/keys use keyid not numread as file position [try #6]
KEYS: Add garbage collection for dead, revoked and expired keys. [try #6]
KEYS: Flag dead keys to induce EKEYREVOKED [try #6]
KEYS: Allow keyctl_revoke() on keys that have SETATTR but not WRITE perm [try #6]
KEYS: Deal with dead-type keys appropriately [try #6]
CRED: Add some configurable debugging [try #6]
selinux: Support for the new TUN LSM hooks
...
Splice should update the modification and access times on regular
files just like read and write. Not updating mtime will confuse
backup tools, etc...
This patch only adds the time updates for regular files. For pipes
and other special files that splice touches the need for updating the
times is less clear. Let's discuss and fix that separately.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Get rid of any functions that test for these bits and make callers
use bio_rw_flagged() directly. Then it is at least directly apparent
what variable and flag they check.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Also a debugging aid. We want to catch dirty inodes being added to
backing devices that don't do writeback.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45
where vanilla tends to fluctuate a lot in the creation phase:
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54
A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as does random mmap'ed
writes.
A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This is a first step at introducing per-bdi flusher threads. We should
have no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, as there's no easy way to answer that question.
Not a huge problem, since it'll be deleted in subsequent patches.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This adds two new exported functions:
- writeback_inodes_sb(), which only attempts to writeback dirty inodes on
this super_block, for WB_SYNC_NONE writeout.
- sync_inodes_sb(), which writes out all dirty inodes on this super_block
and also waits for the IO to complete.
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When ext4_dx_add_entry() has to split an index node, it has to ensure that
name_len of dx_node's fake_dirent is also zero, because otherwise e2fsck
won't recognise it as an intermediate htree node and consider the htree to
be corrupted.
Signed-off-by: Andreas Schlick <schlick@lavabit.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Previously the journal_async_commit mount option was equivalent to
using barrier=0 (and just as unsafe). This patch fixes it so that we
eliminate the barrier before the commit block (by not using ordered
mode), and explicitly issuing an empty barrier bio after writing the
commit block. Because of the journal checksum, it is safe to do this;
if the journal blocks are not all written before a power failure, the
checksum in the commit block will prevent the last transaction from
being replayed.
Using the fs_mark benchmark, using journal_async_commit shows a 50%
improvement:
FSUse% Count Size Files/sec App Overhead
8 1000 10240 30.5 28242
vs.
FSUse% Count Size Files/sec App Overhead
8 1000 10240 45.8 28620
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This avoids updating the superblock write time when we are mounting
the root file system read/only but we need to replay the journal; at
that point, for people who are east of GMT and who make their clock
tick in localtime for Windows bug-for-bug compatibility, and this will
cause e2fsck to complain and force a full file system check.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* topic/soundcore-preclaim:
sound: make OSS device number claiming optional and schedule its removal
sound: request char-major-* module aliases for missing OSS devices
chrdev: implement __[un]register_chrdev()
In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
the PT_LOAD has no PROT_WRITE and no .bss. This generates EFAULT.
Here is a small test case. (Yes, there are other, useful PT_INTERP
which have only .text and no .data/.bss.)
----- ptinterp.S
_start: .globl _start
nop
int3
-----
$ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
$ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
$ ./hello
Segmentation fault # during execve() itself
After applying the patch:
$ ./hello
Trace trap # user-mode execution after execve() finishes
If the ELF headers are actually self-inconsistent, then dying is fine.
But having no PROT_WRITE segment is perfectly normal and correct if
there is no segment with p_memsz > p_filesz (i.e. bss). John Reiser
suggested checking for PROT_WRITE in the bss logic. I think it makes
most sense to simply apply the bss logic only when there is bss.
This patch looks less trivial than it is due to some reindentation.
It just moves the "if (last_bss > elf_bss) {" test up to include the
partial-page bss logic as well as the more-pages bss logic.
Reported-by: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
This patch amends and nicifies commentaries in file.c, as well as
fixes some spelling problems.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
The 'ubifs_scan()' function returns -EUCLEAN if something is corrupted
and recovery is needed, otherwise it returns other error codes. However,
in few places UBIFS does not check the error codes and runs recovery.
This patch changes this behavior and makes UBIFS start recovery only
on -EUCLEAN errors.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
At the moment UBIFS print large and scary error messages and
flash dumps in case of nearly any corruption, even if it is
a recoverable corruption. For example, if the master node is
corrupted, ubifs_scan() prints error dumps, then UBIFS recovers
just fine and goes on.
This patch makes UBIFS print scary error messages only in
real cases, which are not recoverable. It adds 'quiet' argument
to the 'ubifs_scan()' function, so the caller may ask 'ubi_scan()'
not to print error messages if the caller is able to do recovery.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
Add one more check to UBIFS - a check that makes sure that there
are no data nodes beyond inode size. And few commantaries fixes
along the line.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Adrian Hunter <Adrian.Hunter@nokia.com>
We don't need to take the alloc_sem lock when we are adding new
groups, since mballoc won't see the new group added until we bump
sbi->s_groups_count.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
We should check for need init flag with the group's alloc_sem held, to
make sure while we are loading the buddy cache and holding a reference
to it, a file system resize can't add new blocks to same group.
The patch also drops the need init flag check in
ext4_mb_regular_allocator() because doing the check without holding
alloc_sem is racy.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
This moves the function around so that it can be called from
ext4_mb_load_buddy().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* lookup-permissions-cleanup:
jffs2/jfs/xfs: switch over to 'check_acl' rather than 'permission()'
ext[234]: move over to 'check_acl' permission model
shmfs: use 'check_acl' instead of 'permission'
Make 'check_acl()' a first-class filesystem op
Simplify exec_permission_lite(), part 3
Simplify exec_permission_lite() further
Simplify exec_permission_lite() logic
Do not call 'ima_path_check()' for each path component
In fs/binfmt_elf.c, load_elf_interp() calls padzero() for .bss even if
the PT_LOAD has no PROT_WRITE and no .bss. This generates EFAULT.
Here is a small test case. (Yes, there are other, useful PT_INTERP
which have only .text and no .data/.bss.)
----- ptinterp.S
_start: .globl _start
nop
int3
-----
$ gcc -m32 -nostartfiles -nostdlib -o ptinterp ptinterp.S
$ gcc -m32 -Wl,--dynamic-linker=ptinterp -o hello hello.c
$ ./hello
Segmentation fault # during execve() itself
After applying the patch:
$ ./hello
Trace trap # user-mode execution after execve() finishes
If the ELF headers are actually self-inconsistent, then dying is fine.
But having no PROT_WRITE segment is perfectly normal and correct if
there is no segment with p_memsz > p_filesz (i.e. bss). John Reiser
suggested checking for PROT_WRITE in the bss logic. I think it makes
most sense to simply apply the bss logic only when there is bss.
This patch looks less trivial than it is due to some reindentation.
It just moves the "if (last_bss > elf_bss) {" test up to include the
partial-page bss logic as well as the more-pages bss logic.
Reported-by: John Reiser <jreiser@bitwagon.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Teach ext4_write_inode() and ext4_do_update_inode() about non-journal
mode: If we're not using a journal, ext4_write_inode() now calls
ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc())
with a new "do_sync" parameter. If that parameter is nonzero _and_ we're
not using a journal, ext4_do_update_inode() calls sync_dirty_buffer()
instead of ext4_handle_dirty_metadata().
This problem was found in power-fail testing, checking the amount of
loss of files and blocks after a power failure when using fsync() and
when not using fsync(). It turned out that using fsync() was actually
worse than not doing so, possibly because it increased the likelihood
that the inodes would remain unflushed and would therefore be lost at
the power failure.
Signed-off-by: Frank Mayhar <fmayhar@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When there is no journal present, we must attach buffer heads
associated with extent tree and indirect blocks to the inode's
mapping->private_list via mark_buffer_dirty_inode() so that
ext4_sync_file() --- which is called to service fsync() and
fdatasync() system calls --- can write out the inode's metadata blocks
by calling sync_mapping_buffers().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When ext4 is using a journal, a metadata block which is deallocated
must be passed into the journal layer so it can be dropped from the
current transaction and/or revoked. This is done by calling the
functions ext4_journal_forget() and ext4_journal_revoke(), which call
jbd2_journal_forget(), and jbd2_journal_revoke(), respectively.
Since the jbd2_journal_forget() and jbd2_journal_revoke() call
bforget(), if ext4 is not using a journal, ext4_journal_forget() and
ext4_journal_revoke() must call bforget() to avoid a dirty metadata
block overwriting a block after it has been reallocated and reused for
another inode's data block.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch adds a setxattr handler to the file, directory, and symlink
inode_operations structures for sysfs. The patch uses hooks introduced in the
previous patch to handle the getting and setting of security information for
the sysfs inodes. As was suggested by Eric Biederman the struct iattr in the
sysfs_dirent structure has been replaced by a structure which contains the
iattr, secdata and secdata length to allow the changes to persist in the event
that the inode representing the sysfs_dirent is evicted. Because sysfs only
stores this information when a change is made all the optional data is moved
into one dynamically allocated field.
This patch addresses an issue where SELinux was denying virtd access to the PCI
configuration entries in sysfs. The lack of setxattr handlers for sysfs
required that a single label be assigned to all entries in sysfs. Granting virtd
access to every entry in sysfs is not an acceptable solution so fine grained
labeling of sysfs is required such that individual entries can be labeled
appropriately.
[sds: Fixed compile-time warnings, coding style, and setting of inode security init flags.]
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Signed-off-by: Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
This factors out the part of the vfs_setxattr function that performs the
setting of the xattr and its notification. This is needed so the SELinux
implementation of inode_setsecctx can handle the setting of the xattr while
maintaining the proper separation of layers.
Signed-off-by: David P. Quigley <dpquigl@tycho.nsa.gov>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
We added the ENOSPC handling patch in xfs_create just after it got mered
with xfs_mkdir. Change the log reservation to the variable for either
the create or mkdir value so it does the right thing if get here for creating
a directory.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
The /sys/fs/gfs2/<fsname>/lock_module/id file has been unused for
some time now, so we can remove it. We still accept the mount option
though, as userspace still sends that.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When mounting an "nfs" type file system, recognize "v4," "vers=4," or
"nfsvers=4" mount options, and convert the file system to "nfs4" under
the covers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[trondmy: fixed up binary mount code so it sets the 'version' field too]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Refactor nfs4_get_sb() to allow its guts to be invoked by
nfs_get_sb().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Refactor the part of nfs4_validate_mount_options() that
handles text-based options, so we can call it from the NFSv2/v3
option validation function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The meaning of not specifying the "port=" mount option is different
for "-t nfs" and "-t nfs4" mounts. The default port value for
NFSv2/v3 mounts is 0, but the default for NFSv4 mounts is 2049.
To support "-t nfs -o vers=4", the mount option parser must detect
when "port=" is missing so that the correct default port value can be
set depending on which NFS version is requested.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi Trond,
Recently we were observing the behaviour difference between a 2.4.x and
2.6.x kernel with respect to O_EXCL. A comment from 2.4.x era, "For now,
we don't implement O_EXCL." seems inaccurate in TOT.
If so, here's a patch to remove the comment.
This patch is against:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This avoids an indirect call in the VFS for each path component lookup.
Well, at least as long as you own the directory in question, and the ACL
check is unnecessary.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't implement per-filesystem 'extX_permission()' functions that have
to be called for every path component operation, and instead just expose
the actual ACL checking so that the VFS layer can now do it for us.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is stage one in flattening out the callchains for the common
permission testing. Rather than have most filesystem implement their
own inode->i_op->permission function that just calls back down to the
VFS layers 'generic_permission()' with the per-filesystem ACL checking
function, the filesystem can just expose its 'check_acl' function
directly, and let the VFS layer do everything for it.
This is all just preparatory - no filesystem actually enables this yet.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Don't call down to the generic inode_permission() function just to
call the inode-specific permission function - just do it directly.
The generic inode_permission() code does things like checking MAY_WRITE
and devcgroup_inode_permission(), neither of which are relevant for the
light pathname walk permission checks (we always do just MAY_EXEC, and
the inode is never a special device).
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function is only called for path components that are already known
to be directories (they have a '->lookup' method). So don't bother
doing that whole S_ISDIR() testing, the whole point of the 'lite()'
version is that we know that we are looking at a directory component,
and that we're only checking name lookup permission.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of returning EAGAIN and having the caller do something
special for that case, just do the special case directly.
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Not only is that a supremely timing-critical path, but it's hopefully
some day going to be lockless for the common case, and ima can't do
that.
Plus the integrity code doesn't even care about non-regular files, so it
was always a total waste of time and effort.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a potential race in the inode deallocation code if two
nodes try to deallocate the same inode at the same time. Most of
the issue is solved by the iopen locking. There is still a small
window which is not covered by the iopen lock. This patches fixes
that and also makes the deallocation code more robust in the face of
any errors in the rgrp bitmaps, or erroneous iopen callbacks from
other nodes.
This does introduce one extra disk read, but that is generally not
an issue since its the same block that must be written to later
in the deallocation process. The total disk accesses therefore stay
the same,
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Drop the WARN_ON(1), as he stack trace is not appropriate, since it is
triggered by file system corruption, and it misleads users into
thinking there is a kernel bug. In addition, change the message
displayed by ext4_error() to make it clear that this is a file system
corruption problem.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In order to check whether the buffer_heads are mapped we need to hold
page lock. Otherwise a reclaim can cleanup the attached buffer_heads.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
- As ima_counts_put() may be called after the inode has been freed,
verify that the inode is not NULL, before dereferencing it.
- Maintain the IMA file counters in may_open() properly, decrementing
any counter increments on subsequent errors.
Reported-by: Ciprian Docan <docan@eden.rutgers.edu>
Reported-by: J.R. Okajima <hooanon05@yahoo.co.jp>
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Acked-by: Eric Paris <eparis@redhat.com
Signed-off-by: James Morris <jmorris@namei.org>
This function means moving extents every page, so change its name from
move_exgtent_par_page().
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.co.jp>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Return exchanged blocks count (moved_len) to user space,
if ext4_move_extents() failed on the way.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The ext4_move_extents() functions checks with BUG_ON() whether the
exchanged blocks count accords with request blocks count. But, if the
target range (orig_start + len) includes sparse block(s), 'moved_len'
(exchanged blocks count) does not agree with 'len' (request blocks
count), since sparse block is not counted in 'moved_len'. This causes
us to hit the BUG_ON(), even though the function succeeded.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The mext_check_arguments() function in move_extents.c has wrong
comparisons. orig_start which is passed from user-space is block
unit, but i_size of inode is byte unit, therefore the checks do not
work fine. This mis-check leads to the overflow of 'len' and then
hits BUG_ON() in ext4_move_extents(). The patch fixes this issue.
Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
Reviewed-by: Greg Freemyer <greg.freemyer@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We need to flush the write cache unconditionally in ->fsync, otherwise
writes into already allocated blocks can get lost. Writes into fully
allocated files are very common when using disk images for
virtualization, and without this fix can easily lose data after
an fdatasync, which is the typical implementation for a cache flush on
the virtual drive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext2_rename(), dir_page is acquired through ext2_dotdot(). It is
then released through ext2_set_link() but only if old_dir != new_dir.
Failing that, the pkmap reference count is never decremented and the
page remains pinned forever. Repeat that a couple times with highmem
pages and all pkmap slots get exhausted, and every further kmap() calls
end up stalling on the pkmap_map_wait queue at which point the whole
system comes to a halt.
Signed-off-by: Nicolas Pitre <nico@marvell.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2: ocfs2_write_begin_nolock() should handle len=0
ocfs2: invalidate dentry if its dentry_lock isn't initialized.
Tom Horsley reports that his debugger hangs when it tries to read
/proc/pid_of_tracee/maps, this happens since
"mm_for_maps: take ->cred_guard_mutex to fix the race with exec"
04b836cbf19e885f8366bccb2e4b0474346c02d
commit in 2.6.31.
But the root of the problem lies in the fact that do_execve() path calls
tracehook_report_exec() which can stop if the tracer sets PT_TRACE_EXEC.
The tracee must not sleep in TASK_TRACED holding this mutex. Even if we
remove ->cred_guard_mutex from mm_for_maps() and proc_pid_attr_write(),
another task doing PTRACE_ATTACH should not hang until it is killed or the
tracee resumes.
With this patch do_execve() does not use ->cred_guard_mutex directly and
we do not hold it throughout, instead:
- introduce prepare_bprm_creds() helper, it locks the mutex
and calls prepare_exec_creds() to initialize bprm->cred.
- install_exec_creds() drops the mutex after commit_creds(),
and thus before tracehook_report_exec()->ptrace_stop().
or, if exec fails,
free_bprm() drops this mutex when bprm->cred != NULL which
indicates install_exec_creds() was not called.
Reported-by: Tom Horsley <tom.horsley@att.net>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's no real cost for the journal checksum feature, and we should
make sure it is enabled all the time.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With this commit, extent tree operations are divorced from inodes and
rely on ocfs2_caching_info. Phew!
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We only allow unwritten extents on data, so the toplevel
ocfs2_mark_extent_written() can use an inode all it wants. But the
subfunction isn't even using the inode argument.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_insert_extent() wants to insert a record into the extent map if
it's an inode data extent. But since many btrees can call that
function, let's make it an op on ocfs2_extent_tree. Other tree types
can leave it empty.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_remove_extent() wants to truncate the extent map if it's
truncating an inode data extent. But since many btrees can call that
function, let's make it an op on ocfs2_extent_tree. Other tree types
can leave it empty.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_grow_branch() not really using it other than to pass it to the
subfunctions ocfs2_shift_tree_depth(), ocfs2_find_branch_target(), and
ocfs2_add_branch(). The first two weren't it either, so they drop the
argument. ocfs2_add_branch() only passed it to
ocfs2_adjust_rightmost_branch(), which drops the inode argument and uses
the ocfs2_extent_tree as well.
ocfs2_append_rec_to_path() can be take an ocfs2_extent_tree instead of
the inode. The function ocfs2_adjust_rightmost_records() goes along for
the ride.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
It already gets ocfs2_extent_tree, so we can just use that. This chains
to the same modification for ocfs2_remove_rightmost_path() and
ocfs2_rotate_rightmost_leaf_left().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
It already has struct ocfs2_extent_tree, which has the caching info. So
we don't need to pass it struct inode.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
It already has struct ocfs2_extent_tree, which has the caching info. So
we don't need to pass it struct inode.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Get rid of the inode argument. Use extent_tree instead. This means a
few more functions have to pass an extent_tree around.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Pass the ocfs2_extent_list down through ocfs2_rotate_tree_right() and
get rid of struct inode in ocfs2_rotate_subtree_root_right().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Pass struct ocfs2_extent_tree into ocfs2_create_new_meta_bhs(). It no
longer needs struct inode or ocfs2_super.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_find_path and ocfs2_find_leaf() walk our btrees, reading extent
blocks. They need struct ocfs2_caching_info for that, but not struct
inode.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
extent blocks belong to btrees on more than just inodes, so we want to
pass the ocfs2_caching_info structure directly to
ocfs2_read_extent_block(). A number of places in alloc.c can now drop
struct inode from their argument list.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
What do we cache? Metadata blocks. What are most of our non-inode metadata
blocks? Extent blocks for our btrees. struct ocfs2_extent_tree is the
main structure for managing those. So let's store the associated
ocfs2_caching_info there.
This means that ocfs2_et_root_journal_access() doesn't need struct inode
anymore, and any place that has an et can refer to et->et_ci instead of
INODE_CACHE(inode).
Signed-off-by: Joel Becker <joel.becker@oracle.com>
The next step in divorcing metadata I/O management from struct inode is
to pass struct ocfs2_caching_info to the journal functions. Thus the
journal locks a metadata cache with the cache io_lock function. It also
can compare ci_last_trans and ci_created_trans directly.
This is a large patch because of all the places we change
ocfs2_journal_access..(handle, inode, ...) to
ocfs2_journal_access..(handle, INODE_CACHE(inode), ...).
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Similar ip_last_trans, ip_created_trans tracks the creation of a journal
managed inode. This specifically tracks what transaction created the
inode. This is so the code can know if the inode has ever been written
to disk.
This behavior is desirable for any journal managed object. We move it
to struct ocfs2_caching_info as ci_created_trans so that any object
using ocfs2_caching_info can rely on this behavior.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We have the read side of metadata caching isolated to struct
ocfs2_caching_info, now we need the write side. This means the journal
functions. The journal only does a couple of things with struct inode.
This change moves the ip_last_trans field onto struct
ocfs2_caching_info as ci_last_trans. This field tells the journal
whether a pending journal flush is required.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We are really passing the inode into the ocfs2_read/write_blocks()
functions to get at the metadata cache. This commit passes the cache
directly into the metadata block functions, divorcing them from the
inode.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We don't really want to cart around too many new fields on the
ocfs2_caching_info structure. So let's wrap all our access of the
parent object in a set of operations. One pointer on caching_info, and
more flexibility to boot.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
We want to use the ocfs2_caching_info structure in places that are not
inodes. To do that, it can no longer rely on referencing the inode
directly.
This patch moves the flags to ocfs2_caching_info->ci_flags, stores
pointers to the parent's locks on the ocfs2_caching_info, and renames
the constants and flags to reflect its independant state.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Bug introduced by mainline commit e7432675f8
The bug causes ocfs2_write_begin_nolock() to oops when len=0.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Cc: stable@kernel.org
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Make the return from compose_entry_fh() zero or an error, even though
the returned error isn't used, just to make the meaning of the return
immediately obvious.
Move some repeated code out of main function into helper.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
A number of callers (nfsd4_encode_fattr(), at least) don't bother to
release the filehandle returned to fh_compose() if fh_compose() returns
an error. So, modify fh_compose() to release the filehandle before
returning an error.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Move the jffs2 garbage collecting thread to the new kthread API.
Signed-off-by: Gerard Lledo <gerard.lledo@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
nfsd4_path() allocates a temporary filehandle and then fails to free it
before the function exits, leaking reference counts to the dentry and
export that it refers to.
Also, nfsd4_lookupp() puts the result of exp_pseudoroot() in a temporary
filehandle which it releases on success of exp_pseudoroot() but not on
failure; fix exp_pseudoroot to ensure that on failure it releases the
filehandle before returning.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There's a large cut and paste chunk of code in smb_init and
small_smb_init to handle reconnects. Break it out into a separate
function, clean it up and have both routines call it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
The function jffs2_nor_wbuf_flash_setup() doesn't allocate the verify buffer
if CONFIG_JFFS2_FS_WBUF_VERIFY is defined, so causing a kernel panic when
that macro is enabled and the verify function is called. Similarly the
jffs2_nor_wbuf_flash_cleanup() must free the buffer if
CONFIG_JFFS2_FS_WBUF_VERIFY is enabled.
The following patch fixes the problem.
The following patch applies to 2.6.30 kernel.
Signed-off-by: Massimo Cirillo <maxcir@gmail.com>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: stable@kernel.org
When creating a new file, ima_path_check() assumed the new file
was being opened for write. Call ima_path_check() with the
appropriate acc_mode so that the read/write counters are
incremented correctly.
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
If you enable group or project quotas on an XFS file system, then the
mount table presented through /proc/self/mounts erroneously shows
that both options are in effect for the file system. The root of
the problem is some bad logic in the xfs_showargs() function, which
is used to format the file system type-specific options in effect
for a file system.
The problem originated in this GIT commit:
Move platform specific mount option parse out of core XFS code
Date: 11/22/07
Author: Dave Chinner
SHA1 ID: a67d7c5f5d
For XFS quotas, project and group quota management are mutually
exclusive--only one can be in effect at a time. There are two
parts to managing quotas: aggregating usage information; and
enforcing limits. It is possible to have a quota in effect
(aggregating usage) but not enforced.
These features are recorded on an XFS mount point using these flags:
XFS_PQUOTA_ACCT - Project quotas are aggregated
XFS_GQUOTA_ACCT - Group quotas are aggregated
XFS_OQUOTA_ENFD - Project/group quotas are enforced
The code in error is in fs/xfs/linux-2.6/xfs_super.c:
if (mp->m_qflags & (XFS_PQUOTA_ACCT|XFS_OQUOTA_ENFD))
seq_puts(m, "," MNTOPT_PRJQUOTA);
else if (mp->m_qflags & XFS_PQUOTA_ACCT)
seq_puts(m, "," MNTOPT_PQUOTANOENF);
if (mp->m_qflags & (XFS_GQUOTA_ACCT|XFS_OQUOTA_ENFD))
seq_puts(m, "," MNTOPT_GRPQUOTA);
else if (mp->m_qflags & XFS_GQUOTA_ACCT)
seq_puts(m, "," MNTOPT_GQUOTANOENF);
The problem is that XFS_OQUOTA_ENFD will be set in mp->m_qflags
if either group or project quotas are enforced, and as a result
both MNTOPT_PRJQUOTA and MNTOPT_GRPQUOTA will be shown as mount
options.
Signed-off-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Add a config option (CONFIG_DEBUG_CREDENTIALS) to turn on some debug checking
for credential management. The additional code keeps track of the number of
pointers from task_structs to any given cred struct, and checks to see that
this number never exceeds the usage count of the cred struct (which includes
all references, not just those from task_structs).
Furthermore, if SELinux is enabled, the code also checks that the security
pointer in the cred struct is never seen to be invalid.
This attempts to catch the bug whereby inode_has_perm() faults in an nfsd
kernel thread on seeing cred->security be a NULL pointer (it appears that the
credential struct has been previously released):
http://www.kerneloops.org/oops.php?number=252883
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Merge reason: bump from rc5 to rc8, but also pick up TP_perf_assign()
API, a patch will be queued that depends on it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use NFSD_SLOT_CACHE_SIZE size buffers for sessions DRC instead of holding nfsd
pages in cache.
Connectathon testing has shown that 1024 bytes for encoded compound operation
responses past the sequence operation is sufficient, 512 bytes is a little too
small. Set NFSD_SLOT_CACHE_SIZE to 1024.
Allocate memory for the session DRC in the CREATE_SESSION operation
to guarantee that the memory resource is available for caching responses.
Allocate each slot individually in preparation for slot table size negotiation.
Remove struct nfsd4_cache_entry and helper functions for the old page-based
DRC.
The iov_len calculation in nfs4svc_encode_compoundres is now always
correct. Replay is now done in nfsd4_sequence under the state lock, so
the session ref count is only bumped on non-replay. Clean up the
nfs4svc_encode_compoundres session logic.
The nfsd4_compound_state statp pointer is also not used.
Remove nfsd4_set_statp().
Move useful nfsd4_cache_entry fields into nfsd4_slot.
Signed-off-by: Andy Adamson <andros@netapp.com
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
nfserr_resource is not a legal error for NFSv4.1. Replace it with
nfserr_serverfault for EXCHANGE_ID and CREATE_SESSION processing.
We will also need to map nfserr_resource to other errors in routines shared
by NFSv4.0 and NFSv4.1
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This fixes a bug in the sequence operation reply.
The sequence operation returns the highest slotid it will accept in the future
in sr_highest_slotid, and the highest slotid it prefers the client to use.
Since we do not re-negotiate the session slot table yet, these should both
always be set to the session ca_maxrequests.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
By using the requested ca_maxresponsesize_cached * ca_maxresponses to bound
a forechannel drc request size, clients can tailor a session to usage.
For example, an I/O session (READ/WRITE only) can have a much smaller
ca_maxresponsesize_cached (for only WRITE compound responses) and a lot larger
ca_maxresponses to service a large in-flight data window.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
xfs_inobt_lookup is also used in xfs_itable.c, remove the STATIC modifier
from it's declaration to fix non-debug builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The fact that the filesystem doesn't currently list any alternate
locations does _not_ imply that the fs_locations attribute should be
marked as "unsupported".
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Currently, cifs_close() tries to wait until all I/O is complete and then
frees the file private data. If I/O does not completely in a reasonable
amount of time it frees the structure anyway, leaving a potential use-
after-free situation.
This patch changes the wrtPending counter to a complete reference count and
lets the last user free the structure.
Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Right now, the GlobalOplock_Q is protected by the GlobalMid_Lock. That
lock is also used for completely unrelated purposes (mostly for managing
the global mid queue). Give the list its own dedicated spinlock
(cifs_oplock_lock) and rename the list to cifs_oplock_list to
eliminate the camel-case.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Minor nit: we already have a tcon pointer so we don't need to
dereference cifs_sb again.
Also initialize the vars in the declaration.
Reported-by: Peter Staubach <staubach@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Make it easier on the upcall program by adding ':' delimiters between
each group of hex digits.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Fix a small typo in the compat ioctl handler that cause the swapext
compat handler to never be called.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Fix a small typo in the compat ioctl handler that cause the swapext
compat handler to never be called.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Tested-by: Torsten Kaiser <just.for.lkml@googlemail.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_trans_iget is a wrapper for xfs_iget that adds the inode to the
transaction after it is read. Except when the inode already is in the
inode cache, in which case it returns the existing locked inode with
increment lock recursion counts.
Now, no one in the tree every decrements these lock recursion counts,
so any user of this gets a potential double unlock when both the original
owner of the inode and the xfs_trans_iget caller unlock it. When looking
back in a git bisect in the historic XFS tree there was only one place
that decremented these counts, xfs_trans_iput. Introduced in commit
ca25df7a840f426eb566d52667b6950b92bb84b5 by Adam Sweeney in 1993,
and removed in commit 19f899a3ab155ff6a49c0c79b06f2f61059afaf3 by
Steve Lord in 2003. And as long as it didn't slip through git bisects
cracks never actually used in that time frame.
A quick audit of the callers of xfs_trans_iget shows that no caller
really relies on this behaviour fortunately - xfs_ialloc allows this
inode from disk so it must not be there before, and all the RT allocator
routines only every add each RT bitmap inode once.
In addition to removing lots of code and reducing the size of the inode
item this patch also avoids the double inode cache lookup in each
create/mkdir/mknod transaction.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The guarantees for O_SYNC are exactly the same as the ones we need to
make for an fsync call (and given that Linux O_SYNC is O_DSYNC the
equivalent is fdadatasync, but we treat both the same in XFS), except
with a range data writeout. Jan Kara has started unifying these two
path for filesystems using the generic helpers, and I've started to
look at XFS.
The actual transaction commited by xfs_fsync and xfs_write_sync_logforce
has a different transaction number, but actually is exactly the same.
We'll only use the fsync transaction going forward. One major difference
is that xfs_write_sync_logforce never issues a cache flush unless we
commit a transaction causing that as a side-effect, which is an obvious
bug in the O_SYNC handling. Second all the locking and i_update_size
vs i_update_core changes from 978b723712
never made it to xfs_write_sync_logforce, so we add them back.
To make xfs_fsync easily usable from the O_SYNC path, the filemap_fdatawait
call is moved up to xfs_file_fsync, so that we don't wait on the whole
file after we already waited for our portion in xfs_write.
We'll also use a plain call to filemap_write_and_wait_range instead
of the previous sync_page_rang which did it in two steps including
an half-hearted inode write out that doesn't help us.
Once we're done with this also remove the now useless i_update_size
tracking.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Don't search too far - abort if it is outside a certain radius and simply do
a linear search for the first free inode. In AGs with a million inodes this
can speed up allocation speed by 3-4x.
[hch: ported to the new xfs_ialloc.c world order]
Signed-off-by: Dave Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Currenly we have a xfs_inobt_lookup* variant for each comparism direction,
and all these get all three fields of the inobt records passed, while the
common case is just looking for the inode number and we have only marginally
more callers than xfs_inobt_lookup* variants.
So opencode a direct call to xfs_btree_lookup for the single case where we
need all fields, and replace xfs_inobt_lookup* with a xfs_inobt_looku that
just takes the inode number and the direction for all other callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Clarify the control flow in xfs_dialloc. Factor out a helper to go to the
next node from the current one and improve the control flow by expanding
composite if statements and using gotos.
The xfs_ialloc_next_rec helper is borrowed from Dave Chinners dynamic
allocation policy patches.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Factor out a common helper from repeated debug checks in xfs_dialloc and
xfs_difree.
[hch: split out from Dave's dynamic allocation policy patches]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Both callers of xfs_inobt_update have the record in form of a
xfs_inobt_rec_incore_t, so just pass a pointer to it instead of the
individual variables.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Most callers of xfs_inobt_get_rec need to fill a xfs_inobt_rec_incore_t, and
those who don't yet are fine with a xfs_inobt_rec_incore_t, instead of the
three individual variables, too. So just change xfs_inobt_get_rec to write
the output into a xfs_inobt_rec_incore_t directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Factor out code to initialize new inode clusters into a function of it's own.
This keeps xfs_ialloc_ag_alloc smaller and better structured and enables a
future inode cluster initialization transaction. Also initialize the agno
variable earlier in xfs_ialloc_ag_alloc to avoid repeated byte swaps.
[hch: The original patch is from Dave from his unpublished inode create
transaction patch series, with some modifcations by me to apply stand-alone]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
One more try..
It seems there is a regression that got introduced while Jeff fixed
all the mount/umount races. While attempting to find whether a tcp
session is already existing, we were not checking whether the "port"
used are the same. When a second mount is attempted with a different
"port=" option, it is being ignored. Because of this the cifs mounts
that uses a SSH tunnel appears to be broken.
Steps to reproduce:
1. create 2 shares
# SSH Tunnel a SMB session
2. ssh -f -L 6111:127.0.0.1:445 root@localhost "sleep 86400"
3. ssh -f -L 6222:127.0.0.1:445 root@localhost "sleep 86400"
4. tcpdump -i lo 6111 &
5. mkdir -p /mnt/mnt1
6. mkdir -p /mnt/mnt2
7. mount.cifs //localhost/a /mnt/mnt1 -o username=guest,ip=127.0.0.1,port=6111
#(shows tcpdump activity on port 6111)
8. mount.cifs //localhost/b /mnt/mnt2 -o username=guest,ip=127.0.0.1,port=6222
#(shows tcpdump activity only on port 6111 and not on 6222
Fix by adding a check to compare the port _only_ if the user tries to
override the tcp port with "port=" option, before deciding that an
existing tcp session is found. Also, clean up a bit by replacing
if-else if by a switch statment while at it as suggested by Jeff.
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
in function calc_ntlmv2_hash memory is not released.
1. If in the line 333 we successfully allocate memory and assign it to
pctxt variable:
pctxt = kmalloc(sizeof(struct HMACMD5Context), GFP_KERNEL);
then we go to line 376 and exit wihout releasing memory pointed to by pctxt
variable.
Add a memory releasing for pctxt variable before exit from function
calc_ntlmv2_hash.
Signed-off-by: Alexander Strakh <strakh@ispras.ru>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In the recent change by Al Viro that changes verious subsystems
to use "struct path" one case was missed in the autofs4 module
which causes mounts to no longer expire.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new tracepoint which shows the pages that will be written using
write_cache_pages() by ext4_da_writepages().
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
To solve a lock inversion problem, we implement part of the
range_cyclic algorithm in ext4_da_writepages(). (See commit 2acf2c26
for more details.)
As part of that change wbc->range_start was modified by ext4's
writepages function, which causes its callers to get confused since
they aren't expecting the filesystem to modify it. The simplest fix
is to save and restore wbc->range_start in ext4_da_writepages.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
bp was tested for NULL a few lines before, followed by a return, and there
is no intervening modification of its value.
A simplified version of the semantic match that finds this problem is as
follows: (http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@r exists@
local idexpression x;
expression E;
position p1,p2;
@@
if (x == NULL || ...) { ... when forall
return ...; }
... when != \(x=E\|x--\|x++\|--x\|++x\|x-=E\|x+=E\|x|=E\|x&=E\|&x\)
(
*x == NULL
|
*x != NULL
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Commit a19d9f887d removed the
ino64 option but left the XFS_INO64_OFFSET define it used
in place - just remove it.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
CONFIG_XFS_DEBUG builds still need xfs_read_agf to be
non-static, oops.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
A lot more functions could be made static, but they need
forward declarations; this does some easy ones, and also
found a few unused functions in the process.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
memory allocation may fail, prevent a NULL dereference
Pointed out by Roel Kluin
CC: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
In ext4_link we need to check using EXT4_LINK_MAX, and not
EXT4_DIR_LINK_MAX(), since ext4_link() is creating hard links of
regular files, and not directories.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use EXT4_DIR_LINK_MAX so that rename() can move a directory into new
parent directory without running into the EXT4_LINK_MAX limit.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Seperating the addition and update of marks in inotify resulted in a
regression in that inotify never gets events. The inotify group mask is
always 0. This mask should be updated any time a new mark is added.
Signed-off-by: Eric Paris <eparis@redhat.com>
Compounds consisting of only a sequence operation don't need any
additional caching beyond the sequence information we store in the slot
entry. Fix nfsd4_is_solo_sequence to identify this case correctly.
The additional check for a failed sequence in nfsd4_store_cache_entry()
is redundant, since the nfsd4_is_solo_sequence call lower down catches
this case.
The final ce_cachethis set in nfsd4_sequence is also redundant.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
0db501bd06 introduced a regresion in that it now sends a nul
terminator but the length accounting when checking for space or
reporting to userspace did not take this into account. This corrects
all of the rounding logic.
Signed-off-by: Eric Paris <eparis@redhat.com>
The extents sanity-checking code depends on the ext4_ext_space_*()
functions returning the maximum alloable size for eh_max; however,
when the debugging #ifdef AGGRESSIVE_TEST is enabled to test the
extent tree handling code, this prevents a normally created ext4
filesystem from being mounted with the errors:
Aug 26 15:43:50 bsd086 kernel: [ 96.070277] EXT4-fs error (device sda8): ext4_ext_check_inode: bad header/extent in inode #8: too large eh_max - magic f30a, entries 1, max 4(3), depth 0(0)
Aug 26 15:43:50 bsd086 kernel: [ 96.070526] EXT4-fs (sda8): no journal found
Bug reported by Akira Fujita.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When an event has no pathname, there's no need to pad it with a null byte and
therefore generate an inotify_event sized block of zeros. This fixes a
regression introduced by commit 0db501bd06 where
my system wouldn't finish booting because some process was being confused by
this.
Signed-off-by: Brian Rogers <brian@xyzw.org>
Signed-off-by: Eric Paris <eparis@redhat.com>
In commit a5a0a63092, when
ocfs2_attch_dentry_lock fails, we call an extra iput and reset
dentry->d_fsdata to NULL. This resolve a bug, but it isn't
completed and the dentry is still there. When we want to use
it again, ocfs2_dentry_revalidate doesn't catch it and return
true. That make future ocfs2_dentry_lock panic out.
One bug is http://oss.oracle.com/bugzilla/show_bug.cgi?id=1162.
The resolution is to add a check for dentry->d_fsdata in
revalidate process and return false if dentry->d_fsdata is NULL,
so that a new ocfs2_lookup will be called again.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
RFC 3530 says "ACE4_IDENTIFIER_GROUP flag MUST be ignored on entries
with these special identifiers. When encoding entries with these
special identifiers, the ACE4_IDENTIFIER_GROUP flag SHOULD be set to
zero." It really shouldn't matter either way, but the point is that
this flag is used to distinguish named users from named groups (since
unix allows a group to have the same name as a user), so it doesn't
really make sense to use it on a special identifier such as this.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Until we work out the state locking so we can use a spin lock to protect
the cl_lru, we need to take the state_lock to renew the client.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: Do not renew state on error]
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
[nfsd41: Simplify exit code]
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'for-linus' of git://git.infradead.org/users/eparis/notify:
inotify: Ensure we alwasy write the terminating NULL.
inotify: fix locking around inotify watching in the idr
inotify: do not BUG on idr entries at inotify destruction
inotify: seperate new watch creation updating existing watches
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
9p: update documentation pointers
9p: remove unnecessary v9fses->options which duplicates the mount string
net/9p: insulate the client against an invalid error code sent by a 9p server
9p: Add missing cast for the error return value in v9fs_get_inode
9p: Remove redundant inode uid/gid assignment
9p: Fix possible regressions when ->get_sb fails.
9p: Fix v9fs show_options
9p: Fix possible memleak in v9fs_inode_from fid.
9p: minor comment fixes
9p: Fix possible inode leak in v9fs_get_inode.
9p: Check for error in return value of v9fs_fid_add
kAFS crashes when asked to read a symbolic link because page_getlink()
passes a NULL file pointer to read_mapping_page(), but afs_readpage()
expects a file pointer from which to extract a key.
Modify afs_readpage() to request the appropriate key from the calling
process's keyrings if a file struct is not supplied with one attached.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The inum structure used throughout GFS2 has two fields. One
no_addr is the disk block number of the inode in question and
is used everywhere as the inode number. The other, no_formal_ino,
is used only as the generation number for NFS.
Historically the no_formal_ino field was set using a complicated
system of one global and one per-node file containing inode numbers
in order to ensure that each no_formal_ino was unique. Also this
code made no provision for what would happen when eventually the
(64 bit) numbers ran out. Now I know that is pretty unlikely to
happen given the large space of numbers, but it is possible
nevertheless.
The only guarantee required for no_formal_ino is that, for any
single inode, the same number doesn't get reused too quickly.
We already have a generation number which is kept in the inode
and initialised from a counter in the resource group (almost
no overhead, since we have to touch the resource group anyway
in order to allocate an inode in the first place). Aside from
ensuring that we never use the value 0 in the no_formal_ino
field, we can use that counter directly.
As a result of that change, we lose about 200 lines of code and
also gain about 10 creates/sec on the postmark benchmark (on
my test machine).
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Before the rewrite copy_event_to_user always wrote a terqminating '\0'
byte to user space after the filename. Since the rewrite that
terminating byte was skipped if your filename is exactly a multiple of
event_size. Ouch!
So add one byte to name_size before we round up and use clear_user to
set userspace to zero like /dev/zero does instead of copying the
strange nul_inotify_event. I can't quite convince myself len_to_zero
will never exceed 16 and even if it doesn't clear_user should be more
efficient and a more accurate reflection of what the code is trying to
do.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
The are races around the idr storage of inotify watches. It's possible
that a watch could be found from sys_inotify_rm_watch() in the idr, but it
could be removed from the idr before that code does it's removal. Move the
locking and the refcnt'ing so that these have to happen atomically.
Signed-off-by: Eric Paris <eparis@redhat.com>
If an inotify watch is left in the idr when an fsnotify group is destroyed
this will lead to a BUG. This is not a dangerous situation and really
indicates a programming bug and leak of memory. This patch changes it to
use a WARN and a printk rather than killing people's boxes.
Signed-off-by: Eric Paris <eparis@redhat.com>
There is nothing known wrong with the inotify watch addition/modification
but this patch seperates the two code paths to make them each easy to
verify as correct.
Signed-off-by: Eric Paris <eparis@redhat.com>
Use the more conventional name for the extended attribute
support code. Update all the places which care.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This has been on my list for some time. We need to change the way
in which we handle extended attributes to allow faster file creation
times (by reducing the number of transactions required) and the
extended attribute code is the main obstacle to this.
In addition to that, the VFS provides a way to demultiplex the xattr
calls which we ought to be using, rather than rolling our own. This
patch changes the GFS2 code to use that VFS feature and as a result
the code shrinks by a couple of hundred lines or so, and becomes
easier to read.
I'm planning on doing further clean up work in this area, but this
patch is a good start. The cleaned up code also uses the more usual
"xattr" shorthand, I plan to eliminate the use of "eattr" eventually
and in the mean time it serves as a flag as to which bits of the code
have been updated.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
unsigned short is potentially too small to track blocks within
a group; today it is safe due to restrictions in e2fsprogs but
we have _lo / _hi bits for group blocks with the intent to go
up to 32 bits, so clean this up now.
There are many more places where we use unsigned/int/unsigned int
to contain a group block but this should at least fix all the
short types.
I added a few comments to the struct ext4_group_info definition
as well.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Precursor to changing some types; to keep things in sync, it
seems better to allocate/memset based on the size of the
variables we are using rather than on some disconnected
basic type like "unsigned short"
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
ext3: Improve error message that changing journaling mode on remount is not possible
ext3: Update Kconfig description of EXT3_DEFAULTS_TO_ORDERED
lock_kernel() in knfsd was replaced with a mutex. The later
commit 03cf6c9f49 ("knfsd:
add file to export stats about nfsd pools") did not follow
that change. This patch fixes the issue.
Also move the get and put of nfsd_serv to the open and close methods
(instead of start and stop methods) to allow atomic check and increment
of reference count in the open method (where we can still return an
error).
Signed-off-by: Ryusei Yamaguchi <mandel59@gmail.com>
Signed-off-by: Isaku Yamahata <yamahata@valinux.co.jp>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Cc: Greg Banks <gnb@fmeh.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The group deny entries end up denying tcy even though tcy was just
allowed by the allow entry. This appears to be due to:
ace->access_mask = mask_from_posix(deny, flags);
instead of:
ace->access_mask = deny_mask_from_posix(deny, flags);
Denying a previously allowed bit has no effect, so this shouldn't affect
behavior, but it's ugly.
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Commit 76db6d9500 (nfs41: add session setup
to the state manager) introduces an infinite loop possibility in the NFSv4
state manager. By first checking nfs4_has_session() before clearing the
NFS4CLNT_SESSION_SETUP flag, it allows for a situation where someone sets
that flag, but it never gets cleared, and so the state manager loops.
In fact commit c3fad1b1aa (nfs41: add session
reset to state manager) causes this to happen every time we get a network
partition error.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Daniel J Blueman <daniel.blueman@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
ocfs2/dlm: Wait on lockres instead of erroring cancel requests
ocfs2: Add missing lock name
ocfs2: Don't oops in ocfs2_kill_sb on a failed mount
ocfs2: release the buffer head in ocfs2_do_truncate.
ocfs2: Handle quota file corruption more gracefully
2.6.30's commit 8a0bdec194 removed
user_shm_lock() calls in hugetlb_file_setup() but left the
user_shm_unlock call in shm_destroy().
In detail:
Assume that can_do_hugetlb_shm() returns true and hence user_shm_lock()
is not called in hugetlb_file_setup(). However, user_shm_unlock() is
called in any case in shm_destroy() and in the following
atomic_dec_and_lock(&up->__count) in free_uid() is executed and if
up->__count gets zero, also cleanup_user_struct() is scheduled.
Note that sched_destroy_user() is empty if CONFIG_USER_SCHED is not set.
However, the ref counter up->__count gets unexpectedly non-positive and
the corresponding structs are freed even though there are live
references to them, resulting in a kernel oops after a lots of
shmget(SHM_HUGETLB)/shmctl(IPC_RMID) cycles and CONFIG_USER_SCHED set.
Hugh changed Stefan's suggested patch: can_do_hugetlb_shm() at the
time of shm_destroy() may give a different answer from at the time
of hugetlb_file_setup(). And fixed newseg()'s no_id error path,
which has missed user_shm_unlock() ever since it came in 2.6.9.
Reported-by: Stefan Huber <shuber2@gmail.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Tested-by: Stefan Huber <shuber2@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using kernel_sendpage() is cleaner and safer than following
sock->ops ourselves.
Signed-off-by: Paolo Bonzini <bonzini@gnu.org>
Signed-off-by: David Teigland <teigland@redhat.com>
Closing a connection to a node can create problems if there are
outstanding messages for that node. The problems include dlm_send
spinning attempting to reconnect, or BUG from tcp_connect_to_sock()
attempting to use a partially closed connection.
To cleanly close a connection, we now first attempt to send any pending
messages, cancel any remaining workqueue work, and flag the connection
as closed to avoid reconnect attempts.
Signed-off-by: Lars Marowsky-Bree <lmb@suse.de>
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch makes the error message about changing journaling mode on remount
more descriptive. Some people are going to hit this error now due to commit
bbae8bcc49 if they configure a kernel to default
to data=writeback mode. The problem happens if they have data=ordered set for
the root filesystem in /etc/fstab but not in the kernel command line (and they
don't use initrd). Their filesystem then gets mounted as data=writeback by
kernel but then their boot fails because init scripts won't be able to remount
the filesystem rw. Better error message will hopefully make it easier for them
to find the error in their setup and bother us less with error reports :).
Signed-off-by: Jan Kara <jack@suse.cz>
The old description for this configuration option was perhaps not
completely balanced in terms of describing the tradeoffs of using a
default of data=writeback vs. data=ordered. Despite the fact that old
description very strongly recomended disabling this feature, all of
the major distributions have elected to preserve the existing 'legacy'
default, which is a strong hint that it perhaps wasn't telling the
whole story.
This revised description has been vetted by a number of ext3
developers as being better at informing the user about the tradeoffs
of enabling or disabling this configuration feature.
Cc: linux-ext4@vger.kernel.org
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
This patch adds "-o errors=panic" and "-o errors=withdraw" to the
gfs2 mount options. The "errors=withdraw" option is today's
current behaviour, meaning to withdraw from the file system if a
non-serious gfs2 error occurs. The new "errors=panic" option
tells gfs2 to force a kernel panic if a non-serious gfs2 file
system error occurs. This may be useful, for example, where
fabric-level fencing is used that has no way to reboot (such as
fence_scsi).
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
vfs_read() offset is defined as loff_t, but kernel_read()
offset is only defined as unsigned long. Redefine
kernel_read() offset as loff_t.
Cc: stable@kernel.org
Signed-off-by: Mimi Zohar <zohar@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
Some releases of Linux rpc.mountd (nfs-utils 1.1.4 and later) return an
empty auth flavor list if no sec= was specified for the export. This is
notably broken server behavior.
The new auth flavor list checking added in a recent commit rejects this
case. The OpenSolaris client does too.
The broken mountd implementation is already widely deployed. To avoid
a behavioral regression, the kernel's mount client skips flavor checking
(ie reverts to the pre-2.6.32 behavior) if mountd returns an empty
flavor list.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch adds 'const' qualifier to UBIFS xattr inode and file
operations.
Pointed-out-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
In commit a8e7d49aa7 ("Fix race in
create_empty_buffers() vs __set_page_dirty_buffers()"), I removed a test
for a NULL page mapping unintentionally when some of the code inside
__set_page_dirty() was moved to the callers.
That removal generally didn't matter, since a filesystem would serialize
truncation (which clears the page mapping) against writing (which marks
the buffer dirty), so locking at a higher level (either per-page or an
inode at a time) should mean that the buffer page would be stable. And
indeed, nothing bad seemed to happen.
Except it turns out that apparently reiserfs does something odd when
under load and writing out the journal, and we have a number of bugzilla
entries that look similar:
http://bugzilla.kernel.org/show_bug.cgi?id=13556http://bugzilla.kernel.org/show_bug.cgi?id=13756http://bugzilla.kernel.org/show_bug.cgi?id=13876
and it looks like reiserfs depended on that check (the common theme
seems to be "data=journal", and a journal writeback during a truncate).
I suspect reiserfs should have some additional locking, but in the
meantime this should get us back to the pre-2.6.29 behavior.
Pattern-pointed-out-by: Roland Kletzing <devzero@web.de>
Cc: stable@kernel.org (2.6.29 and 2.6.30)
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a SETCLIENTID call comes in, one of the args given is the svc_rqst.
This struct contains an rq_addr field which holds the address that sent
the call. If this is an IPv6 address, then we can use the sin6_scope_id
field in this address to populate the sin6_scope_id field in the
callback address.
AFAICT, the rq_addr.sin6_scope_id is non-zero if and only if the client
mounted the server's link-local address.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The framework to add this is all in place. Now, add the code to allow
support for establishing a callback channel on an IPv6 socket.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
...rather than as a separate address and port fields. This will be
necessary for implementing callbacks over IPv6. Also, convert
gen_callback to use the standard rpcuaddr2sockaddr routine rather than
its own private one.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
It's currently a __be32, which isn't big enough to hold an IPv6 address.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
lockd needs these sort of routines, as does the NFSv4 callback code.
Move lockd's routines into common code and rename them so that they can
be used by others.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Acked-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Node may not be inserted over existing node. This causes inode tree
corruption and I was seeing crashes in inode_tree_del which I can not
reproduce after this patch.
The other way to fix this would be to tie inode lifetime in the rbtree
with inode while not in freeing state. I had a look at this but it is
not so trivial at this point. At least this patch gets things working again.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Acked-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
When suid is set and the non-owner user has write permission, any writing
into this file should be allowed and suid should be removed after that.
However, current kernel only allows writing without truncations, when we
do truncations on that file, we get EPERM. This is a bug.
Steps to reproduce this bug:
% ls -l rootdir/file1
-rwsrwsrwx 1 root root 3 Jun 25 15:42 rootdir/file1
% echo h > rootdir/file1
zsh: operation not permitted: rootdir/file1
% ls -l rootdir/file1
-rwsrwsrwx 1 root root 3 Jun 25 15:42 rootdir/file1
% echo h >> rootdir/file1
% ls -l rootdir/file1
-rwxrwxrwx 1 root root 5 Jun 25 16:34 rootdir/file1
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric Sandeen <esandeen@redhat.com>
Acked-by: Eric Paris <eparis@redhat.com>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: James Morris <jmorris@namei.org>
In case a downconvert is queued, and a flock receives a signal,
BUG_ON(lockres->l_action != OCFS2_AST_INVALID) is triggered
because a lock cancel triggers a dlmunlock while an AST is
scheduled.
To avoid this, allow a LKM_CANCEL to pass through, and let it
wait on __dlm_wait_on_lockres().
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
Acked-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
There is missing name for NFSSync cluster lock. This makes lockdep unhappy
because we end up passing NULL to lockdep when initializing lock key. Fix it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
generic_file_direct_write() no longer calls generic_osync_inode() so remove the
comment.
CC: linux-nfs@vger.kernel.org
CC: Neil Brown <neilb@suse.de>
CC: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the referral code, use it to look up the new server's ip address if the
fs_locations attribute contains a hostname.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFSv4 and NFSv4.1 protocols both allow for the redirection of a client
from one server to another in order to support filesystem migration and
replication. For full protocol support, we need to add the ability to
convert a DNS host name into an IP address that we can feed to the RPC
client.
We'll reuse the sunrpc cache, now that it has been converted to work with
rpc_pipefs.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: fix oopses with doubly mounted snapshots
nilfs2: missing a read lock for segment writer in nilfs_attach_checkpoint()
The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
the mm_struct. It was a very good first step for sanitize OOM.
However Paul Menage reported the commit makes regression to his job
scheduler. Current OOM logic can kill OOM_DISABLED process.
Why? His program has the code of similar to the following.
...
set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
...
if (vfork() == 0) {
set_oom_adj(0); /* Invoked child can be killed */
execve("foo-bar-cmd");
}
....
vfork() parent and child are shared the same mm_struct. then above
set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
change oom_adj for vfork() parent. Then, vfork() parent (job scheduler)
lost OOM immune and it was killed.
Actually, fork-setting-exec idiom is very frequently used in userland program.
We must not break this assumption.
Then, this patch revert commit 2ff05b2b and related commit.
Reverted commit list
---------------------
- commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
- commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
- commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
- commit 933b787b57 (mm: copy over oom_adj value at fork time)
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_sb_pseudo sets s_maxbytes to ~0ULL which becomes negative when cast
to a signed value. Fix it to use MAX_LFS_FILESIZE which casts properly
to a positive signed value.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Steve French <smfrench@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Robert Love <rlove@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The last correction to the tcp_connect_to_sock error exit path,
commit a89d63a159, can free an already
freed socket, due to collision with a previous (incomplete) attempt
to fix the same issue, commit 311f6fc77c.
Signed-off-by: Casey Dahlin <cdahlin@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
will fix kernel oopses like the following:
# mount -t nilfs2 -r -o cp=20 /dev/sdb1 /test1
# mount -t nilfs2 -r -o cp=20 /dev/sdb1 /test2
# umount /test1
# umount /test2
BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1069
in_atomic(): 0, irqs_disabled(): 1, pid: 3886, name: umount.nilfs2
1 lock held by umount.nilfs2/3886:
#0: (&type->s_umount_key#31){+.+...}, at: [<c10b398a>] deactivate_super+0x52/0x6c
irq event stamp: 1219
hardirqs last enabled at (1219): [<c135c774>] __mutex_unlock_slowpath+0xf8/0x119
hardirqs last disabled at (1218): [<c135c6d5>] __mutex_unlock_slowpath+0x59/0x119
softirqs last enabled at (1214): [<c1033316>] __do_softirq+0x1a5/0x1ad
softirqs last disabled at (1205): [<c1033354>] do_softirq+0x36/0x5a
Pid: 3886, comm: umount.nilfs2 Not tainted 2.6.31-rc6 #55
Call Trace:
[<c1023549>] __might_sleep+0x107/0x10e
[<c13603c0>] do_page_fault+0x246/0x397
[<c136017a>] ? do_page_fault+0x0/0x397
[<c135e753>] error_code+0x6b/0x70
[<c136017a>] ? do_page_fault+0x0/0x397
[<c104f805>] ? __lock_acquire+0x91/0x12fd
[<c1050a62>] ? __lock_acquire+0x12ee/0x12fd
[<c1050a62>] ? __lock_acquire+0x12ee/0x12fd
[<c1050b2b>] lock_acquire+0xba/0xdd
[<d0d17d3f>] ? nilfs_detach_segment_constructor+0x2f/0x2fa [nilfs2]
[<c135d4fe>] down_write+0x2a/0x46
[<d0d17d3f>] ? nilfs_detach_segment_constructor+0x2f/0x2fa [nilfs2]
[<d0d17d3f>] nilfs_detach_segment_constructor+0x2f/0x2fa [nilfs2]
[<c104ea2c>] ? mark_held_locks+0x43/0x5b
[<c104ecb1>] ? trace_hardirqs_on_caller+0x10b/0x133
[<c104ece4>] ? trace_hardirqs_on+0xb/0xd
[<d0d09ac1>] nilfs_put_super+0x2f/0xca [nilfs2]
[<c10b3352>] generic_shutdown_super+0x49/0xb8
[<c10b33de>] kill_block_super+0x1d/0x31
[<c10e6599>] ? vfs_quota_off+0x0/0x12
[<c10b398f>] deactivate_super+0x57/0x6c
[<c10c4bc3>] mntput_no_expire+0x8c/0xb4
[<c10c5094>] sys_umount+0x27f/0x2a4
[<c10c50c6>] sys_oldumount+0xd/0xf
[<c10031a4>] sysenter_do_call+0x12/0x38
...
This turns out to be a bug brought by an -rc1 patch ("nilfs2: simplify
remaining sget() use").
In the patch, a new "put resource" function, nilfs_put_sbinfo()
was introduced to delay freeing nilfs_sb_info struct.
But the nilfs_put_sbinfo() mistakenly used atomic_dec_and_test()
function to check the reference count, and it caused the nilfs_sb_info
was freed when user mounted a snapshot twice.
This bug also suggests there was unseen memory leak in usual mount
/umount operations for nilfs.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
this patch is for the same problem that Benjamin Marzinski fixes at commit
b94a170e96
quotation of the original problem:
---cut here---
When a file is deleted from a gfs2 filesystem on one node, a dcache
entry for it may still exist on other nodes in the cluster. If this
happens, gfs2 will be unable to free this file on disk. Because of this,
it's possible to have a gfs2 filesystem with no files on it and no free
space. With this patch, when a node receives a callback notifying it
that the file is being deleted on another node, it schedules a new
workqueue thread to remove the file's dcache entry.
---end cut---
after applying Benjamin's patch, I think there is still a case in which the disk
inode remains even when "no space" is hit. the case is that when running
d_prune_aliases() against the inode, there are one or more dentries(aliases)
which have reference count number > 0. in this case the dentries won't be pruned.
and even later, the reference count becomes to 0, the dentries can still be
cached in memory. unfortunately, no callback come again, things come back to
the state before the callback runs. thus the on disk inode remains there until
in memoryinode is removed for some other reason(shrinking inode cache or unmount
the volume..).
this patch is to remove those dentries when their reference count becomes to 0 and
the inode is deleted by remote node. for implementation, gfs2_dentry_delete() is
added as dentry_operations.d_delete. the function returns true when the inode is
deleted by remote node. in dput(), gfs2_dentry_delete() is called and since it
returns true, the dentry is unhashed from dcache and then removed. when all dentries
are removed, the in memory inode get removed so that the on disk inode is freed.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
'ns_cno' of structure 'the_nilfs' must be protected from segment
writer, in other words, the caller of nilfs_get_checkpoint should hold
read lock for nilfs->ns_segctor_sem. This patch adds the lock/unlock
operations in nilfs_attach_checkpoint() when calling
nilfs_cpfile_get_checkpoint().
Signed-off-by: Zhang Qiang <zhangqiang.buaa@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
A user reported that although his root ext4 filesystem was mounting
fine, other filesystems would not mount, with the:
"Filesystem with huge files cannot be mounted RDWR without CONFIG_LBDAF"
error on his 32-bit box built without CONFIG_LBDAF. This is because
the test at mount time for this situation was not being re-checked
on remount, and the normal boot process makes an ro->rw transition,
so this was being missed.
Refactor to make a common helper function to test the filesystem
features against the type of mount request (RO vs. RW) so that we
stay consistent.
Addresses Red-Hat-Bugzilla: #517650
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
While reading through some of the mballoc code it seems that a couple
spots in the size normalization function could be streamlined.
The test for non-overlapping PAs can be or'd for the start & end
conditions, and the tests for adjacent PAs can be else-if'd -
it's essentially independently testing:
if (A + B <= C)
...
if (A > C)
...
These cannot both be true so it seems like the else-if might
be slightly more efficient and/or informative.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_mb_update_group_info is only called in one place, and it's
extremely simple. There's no reason to have it in a separate function
in a separate file as far as I can tell, it just obfuscates what's
really going on.
Perhaps it was intended to keep the grp->bb_* manipulation local to
mballoc.c but we're already accessing other grp-> fields in balloc.c
directly so this seems ok.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4 will happily mount a > 16T filesystem on a 32-bit box, but
this is not safe; writes to the block device will wrap past 16T
and the page cache can't index past 16T (232 index * 4k pages).
Adding another test to the existing "too many sectors" test
should do the trick.
Add a comment, a relevant return value, and fix the reference
to the CONFIG_LBD(AF) option as well.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
During truncate we are sometimes forced to start a new transaction as
the amount of blocks to be journaled is both quite large and hard to
predict. So far we restarted a transaction while holding i_data_sem
and that violates lock ordering because i_data_sem ranks below a
transaction start (and it can lead to a real deadlock with
ext4_get_blocks() mapping blocks in some page while having a
transaction open).
We fix the problem by dropping the i_data_sem before restarting the
transaction and acquire it afterwards. It's slightly subtle that this
works:
1) By the time ext4_truncate() is called, all the page cache for the
truncated part of the file is dropped so get_block() should not be
called on it (we only have to invalidate extent cache after we
reacquire i_data_sem because some extent from not-truncated part could
extend also into the part we are going to truncate).
2) Writes, migrate or defrag hold i_mutex so they are stopped for all
the time of the truncate.
This bug has been found and analyzed by Theodore Tso <tytso@mit.edu>.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
lockdep annotation for a transaction start has been at the end of
jbd2_journal_start(). But a transaction is also started from
jbd2_journal_restart(). Move the lockdep annotation to start_this_handle()
which covers both cases.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_ext_show_leaf() will display the leaf extents when extent
debugging is enabled.
Printing out the unwritten bit is useful for debugging unwritten
extent, allow us to see the unwritten extents vs written extents,
after the unwritten extents are splitted or converted.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
When EXT_DEBUG is enabled I received the following compile warning on
PPC64:
CC [M] fs/ext4/inode.o
CC [M] fs/ext4/extents.o
fs/ext4/extents.c: In function ‘ext4_ext_rm_leaf’:
fs/ext4/extents.c:2097: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘ext4_lblk_t’
fs/ext4/extents.c: In function ‘ext4_ext_get_blocks’:
fs/ext4/extents.c:2789: warning: format ‘%u’ expects type ‘unsigned int’, but argument 4 has type ‘long unsigned int’
fs/ext4/extents.c:2852: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 3 has type ‘ext4_lblk_t’
fs/ext4/extents.c:2953: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 4 has type ‘unsigned int’
CC [M] fs/ext4/migrate.o
The patch fixes compile warning.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Index: linux-2.6.31-rc4/fs/ext4/extents.c
===================================================================
Currently the group preallocation code tries to find a large (512)
free block from which to do per-cpu group allocation for small files.
The problem with this scheme is that it leaves the filesystem horribly
fragmented. In the worst case, if the filesystem is unmounted and
remounted (after a system shutdown, for example) we forget the fact
that wee were using a particular (now-partially filled) 512 block
extent. So the next time we try to allocate space for a small file,
we will find *another* completely free 512 block chunk to allocate
small files. Given that there are 32,768 blocks in a block group,
after 64 iterations of "mount, write one 4k file in a directory,
unmount", the block group will have 64 files, each separated by 511
blocks, and the block group will no longer have any free 512
completely free chunks of blocks for group preallocation space.
So if we try to allocate blocks for a file that has been closed, such
that we know the final size of the file, and the filesystem is not
busy, avoid using group preallocation.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The mount options string is saved in sb->s_options. This patch removes
the redundant duplicating of the mount options. Also, since we are not
displaying anything special in show options, we replace v9fs_show_options
with generic_show_options for now.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Cast the error return value (ENOMEM) in v9fs_get_inode() to its
correct type using ERR_PTR.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
If we fail to mount the filesystem, we have to be careful not to dereference
uninitialized structures in ocfs2_kill_sb.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Remove a redundant update of inode's i_uid and i_gid
after v9fs_get_inode() since the latter already sets up
a new inode and sets the proper uid and gid values.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
->get_sb can fail causing some badness. this patch fixes
* clear sb->fs_s_info in kill_sb.
* deactivate_locked_super() calls kill_sb (v9fs_kill_super) which closes the
destroys the client, clunks all its fids and closes the v9fs session.
Attempting to do it twice will cause an oops.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Add the delimiter ',' before the options when they are passed
and check if no option parameters are passed to prevent displaying
NULL in /proc/mounts.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Add missing p9stat_free in v9fs_inode_from_fid to avoid
any possible leaks.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Fix the comments -- mostly the improper and/or missing descriptions
of function parameters.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Add a missing iput when cleaning up if v9fs_get_inode
fails after returning a valid inode.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Check if v9fs_fid_add was successful or not based on its
return value.
Signed-off-by: Abhishek Kulkarni <adkulkar@umail.iu.edu>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
The inotify_add_watch man page specifies that inotify_add_watch() will
return a non-negative integer. However, historically the inotify
watches started at 1, not at 0.
Turns out that the inotifywait program provided by the inotify-tools
package doesn't properly handle a 0 watch descriptor. In 7e790dd5 we
changed from starting at 1 to starting at 0. This patch starts at 1,
just like in previous kernels, but also just like in previous kernels
it's possible for it to wrap back to 0. This preserves the kernel
functionality exactly like it was before the patch (neither method broke
the spec)
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In f44aebcc the tail drop logic of events with no file backing
(q_overflow and in_ignored) was reversed so IN_IGNORED events would
never be tail dropped. This now means that Q_OVERFLOW events are NOT
tail dropped. The fix is to not tail drop IN_IGNORED, but to tail drop
Q_OVERFLOW.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
inotify decides if private data it passed to get added to an event was
used by checking list_empty(). But it's possible that the event may
have been dequeued and the private event removed so it would look empty.
The fix is to use the return code from fsnotify_add_notify_event rather
than looking at the list.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In ocfs2_do_truncate, we forget to release last_eb_bh which
will cause memleak. So call brelse in the end.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_read_virt_blocks() does BUG when we try to read a block from a file
beyond its end. Since this can happen due to filesystem corruption, it
is not really an appropriate answer. Make ocfs2_read_quota_block() check
the condition and handle it by calling ocfs2_error() and returning EIO.
[ Modified to print ip_blkno in the error - Joel ]
Reported-by: Tristan Ye <tristan.ye@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This adds a link from the per-gfs2 sb sysfs directory to
the block device upon which the filesystem is mounted. The
link is called "device", strangely enough :-)
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
A little while back, block allocation was given some improved
error handling which meant that -EIO was returned in the case
of there being a problem in the resource group data. In addition
a message is printed explaning what went wrong and how to fix it.
This extends that error handling so that it also covers inode
allocation too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
With each uevent, we now always include the journal ID. We
can't call it JID since that is already in use by some of
the individual events relating to recovery, so we use
JOURNALID instead. We don't send the JOURNALID for spectator
mounts, since there isn't one.
Also the ADD event now has both RDONLY and SPECTATOR information
to match that of the ONLINE event.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We already have an offline uevent (used when a withdraw occurs)
but no online uevent. This adds an online uevent so that userspace
will be able to detect a successful mount by means other than
not receiving a remove event after the add & recovery (change)
uevents.
It has also been added to the remount path as well - we can't use
a change uevent there as older GFS2 userspace acts on change uevents
according to the state that it thinks the fs is in, so we can't
easily add any new ones.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The locking in xfs_iget_cache_hit currently has numerous problems:
- we clear the reclaim tag without i_flags_lock which protects
modifications to it
- we call inode_init_always which can sleep with pag_ici_lock
held (this is oss.sgi.com BZ #819)
- we acquire and drop i_flags_lock a lot and thus provide no
consistency between the various flags we set/clear under it
This patch fixes all that with a major revamp of the locking in
the function. The new version acquires i_flags_lock early and
only drops it once we need to call into inode_init_always or before
calling xfs_ilock.
This patch fixes a bug seen in the wild where we race modifying the
reclaim tag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
The triggered field of struct poll_wqueues introduced in commit
5f820f648c ("poll: allow f_op->poll to
sleep").
It was first set to 1 in pollwake() (now __pollwake() ), tested and
later set to 0 in poll_schedule_timeout(), but not initialized before.
As a result when the process needs to sleep, triggered was likely to be
non-zero even if pollwake() is not called before the first
poll_schedule_timeout(), meaning schedule_hrtimeout_range() would not be
called and an extra loop calling all ->poll() would be done.
This patch initialize triggered to 0 in poll_initwait() so the ->poll()
are not called twice before the process goes to sleep when it needs to.
Signed-off-by: Guillaume Knispel <gknispel@proformatique.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do not increment decoding ptr if not needed.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Part fo the nfs4xdr cleanup. READ_BUF will go away.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
do not increment encoding ptr if not needed.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In order to open code and expose the result pointer assignment.
Alternatively, we can open code the call to xdr_reserve_space
and do the BUG_ON an the error case at the call site.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is already done by xdr_reserve_space and since encode_compound_hdr
is adding a byte count to "12" which is already word aligned, the xdr
level rounding will work just as well.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Although this file is only ever written and not read by
userspace, it seems that the utils are opening this
file O_RDWR, so we need to allow that.
Also fixes the whitespace which seemed to be broken.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: David Teigland <teigland@redhat.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (22 commits)
ocfs2: Fix possible deadlock when extending quota file
ocfs2: keep index within status_map[]
ocfs2: Initialize the cluster we're writing to in a non-sparse extend
ocfs2: Remove redundant BUG_ON in __dlm_queue_ast()
ocfs2/quota: Release lock for error in ocfs2_quota_write.
ocfs2: Define credit counts for quota operations
ocfs2: Remove syncjiff field from quota info
ocfs2: Fix initialization of blockcheck stats
ocfs2: Zero out padding of on disk dquot structure
ocfs2: Initialize blocks allocated to local quota file
ocfs2: Mark buffer uptodate before calling ocfs2_journal_access_dq()
ocfs2: Make global quota files blocksize aligned
ocfs2: Use ocfs2_rec_clusters in ocfs2_adjust_adjacent_records.
ocfs2: Fix deadlock on umount
ocfs2: Add extra credits and access the modified bh in update_edge_lengths.
ocfs2: Fail ocfs2_get_block() immediately when a block needs allocation
ocfs2: Fix error return in ocfs2_write_cluster()
ocfs2: Fix compilation warning for fs/ocfs2/xattr.c
ocfs2: Initialize count in aio_write before generic_write_checks
ocfs2: log the actual return value of ocfs2_file_aio_write()
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: fix spin_is_locked assert on uni-processor builds
xfs: check for dinode realtime flag corruption
use XFS_CORRUPTION_ERROR in xfs_btree_check_sblock
xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_get
xfs: switch to NOFS allocation under i_lock in xfs_readlink_bmap
xfs: switch to NOFS allocation under i_lock in xfs_attr_rmtval_set
xfs: switch to NOFS allocation under i_lock in xfs_buf_associate_memory
xfs: switch to NOFS allocation under i_lock in xfs_dir_cilookup_result
xfs: switch to NOFS allocation under i_lock in xfs_da_buf_make
xfs: switch to NOFS allocation under i_lock in xfs_da_state_alloc
xfs: switch to NOFS allocation under i_lock in xfs_getbmap
xfs: avoid memory allocation under m_peraglock in growfs code
We can't call nfs_readdata_release()/nfs_writedata_release() without
first initialising and referencing args.context. Doing so inside
nfs_direct_read_schedule_segment()/nfs_direct_write_schedule_segment()
causes an Oops.
We should rather be calling nfs_readdata_free()/nfs_writedata_free() in
those cases.
Looking at the O_DIRECT code, the "struct nfs_direct_req" is already
referencing the nfs_open_context for us. Since the readdata and writedata
structures carry a reference to that, we can simplify things by getting rid
of the extra nfs_open_context references, so that we can replace all
instances of nfs_readdata_release()/nfs_writedata_release().
Reported-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Without SMP or preemption spin_is_locked always returns false,
so we can't do an assert with it. Instead use assert_spin_locked,
which does the right thing on all builds.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Johannes Engel <jcnengel@googlemail.com>
Tested-by: Johannes Engel <jcnengel@googlemail.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Ramon tested XFS with a modified version of fsfuzzer and hit a NULL
pointer dereference in __xfs_get_blocks due to the RT device target
pointer being NULL.
To fix this reject inode with the realtime bit set on a a filesystem
without an RT subvolume during inode read.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Reported-by: Ramon de Carvalho Valle <ramon@risesecurity.org>
Tested-by: Ramon de Carvalho Valle <ramon@risesecurity.org>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
In Red Hat Bug 512552
- Can't write to XFS mount during raid5 resync
a user ran into corruption while resyncing a raid, and we failed
a consistency test, but didn't get much more info; it'd be nice
to call XFS_CORRUPTION_ERROR here so we can see the buffer
contents.
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_attr_rmtval_get is always called with i_lock held, but i_lock is taken
in reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_readlink_bmap is called with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_attr_rmtval_set is always called with i_lock held, and i_lock is taken
in reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_buf_associate_memory is used for setting up the spare buffer for the
log wrap case in xlog_sync which can happen under i_lock when called from
xfs_fsync. The i_lock mutex is taken in reclaim context so all allocations
under it must avoid recursions into the filesystem. There are a couple
more uses of xfs_buf_associate_memory in the log recovery code that are
also affected by this, but I'd rather keep the code simple than passing on
a gfp_mask argument. Longer term we should just stop requiring the memoery
allocation in xlog_sync by some smaller rework of the buffer layer.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_dir_cilookup_result is always called with i_lock held, but i_lock is taken
in reclaim context so all allocations under it must avoid recursions into the
filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
i_lock is taken in the reclaim context so all allocations under it
must avoid recursions into the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_da_state_alloc is always called with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into the
filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
xfs_getbmap allocates memory with i_lock held, but i_lock is taken in
reclaim context so all allocations under it must avoid recursions into
the filesystem.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Allocate the memory for the larger m_perag array before taking the
per-AG lock as the per-AG lock can be taken under the i_lock which
can be taken from reclaim context.
Reported by the new reclaim context tracing in lockdep.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
In OCFS2, allocator locks rank above transaction start. Thus we
cannot extend quota file from inside a transaction less we could
deadlock.
We solve the problem by starting transaction not already in
ocfs2_acquire_dquot() but only in ocfs2_local_read_dquot() and
ocfs2_global_read_dquot() and we allocate blocks to quota files before starting
the transaction. In case we crash, quota files will just have a few blocks
more but that's no problem since we just use them next time we extend the
quota file.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Subject: [PATCH] nfs: remove superfluous BUG_ON()s
Remove duplicated BUG_ON()s from nfs[4]_create_server()
(we make the same checks earlier in both functions).
This takes care of the following entries from Dan's list:
fs/nfs/client.c +1078 nfs_create_server(47) warning: variable derefenced before check 'server->nfs_client'
fs/nfs/client.c +1079 nfs_create_server(48) warning: variable derefenced before check 'server->nfs_client->rpc_ops'
fs/nfs/client.c +1363 nfs4_create_server(43) warning: variable derefenced before check 'server->nfs_client'
fs/nfs/client.c +1364 nfs4_create_server(44) warning: variable derefenced before check 'server->nfs_
Reported-by: Dan Carpenter <error27@gmail.com>
Cc: corbet@lwn.net
Cc: eteo@redhat.com
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi.
I have a proposal for possibly resolving this issue.
I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.
The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page. The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.
When a read call comes in for a portion of the page, the
contents of the page must be read in the from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in the from server.
And, since the writing and reading can not be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes. This means either a
FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.
This algorithm could be described as modify-write-read. It
is most efficient when the application only updates pages
and does not read them.
My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path. The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.
If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data. If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.
This would optimize for applications which randomly access
and update portions of files. The linkage editor for the
C compiler is an example of such a thing.
I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel. The kernel without the
patch generated about 269,500 WRITE requests. The modified
kernel containing the patch generated about 261,000 WRITE
requests. Thus, about 8,500 fewer WRITE requests were
generated. I suspect that many of these additional
WRITE requests were probably FILE_SYNC requests to WRITE
a single page, but I didn't test this theory.
The difference between this patch and the previous one was
to remove the unneeded PageDirty() test. I then retested to
ensure that the resulting system continued to behave as
desired.
Thanx...
ps
Signed-off-by: Peter Staubach <staubach@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[un]register_chrdev() assume minor range 0-255. This patch adds __
prefixed versions which take @minorbase and @count explicitly.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
The problem is minor, but without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
Now we do not need to re-check task->mm after ptrace_may_access(), it
can't be changed to the new mm under us.
Strictly speaking, this also fixes another very minor problem. Unless
security check fails or the task exits mm_for_maps() should never
return NULL, the caller should get either old or new ->mm.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
mm_for_maps() takes ->mmap_sem after security checks, this looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
It would be nice to kill __ptrace_may_access(). It requires task_lock(),
but this lock is only needed to read mm->flags in the middle.
Convert mm_for_maps() to use ptrace_may_access(), this also simplifies
the code a little bit.
Also, we do not need to take ->mmap_sem in advance. In fact I think
mm_for_maps() should not play with ->mmap_sem at all, the caller should
take this lock.
With or without this patch, without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
The problem is minor, but without ->cred_guard_mutex held we can race
with exec() and get the new ->mm but check old creds.
Now we do not need to re-check task->mm after ptrace_may_access(), it
can't be changed to the new mm under us.
Strictly speaking, this also fixes another very minor problem. Unless
security check fails or the task exits mm_for_maps() should never
return NULL, the caller should get either old or new ->mm.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
mm_for_maps() takes ->mmap_sem after security checks, this looks
strange and obfuscates the locking rules. Move this lock to its
single caller, m_start().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: James Morris <jmorris@namei.org>
The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all
screwed up. These fields were getting unconditionally all the time,
set even when stream allocation had not taken place, and if they were
being used when the file was smaller than s_mb_stream_request, which
is when the allocation should _not_ be doing stream allocation.
Fix this by determining whether or not we stream allocation should
take place once, in ext4_mb_group_or_file(), and setting a flag which
gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found().
This simplifies the code and assures that we are consistently using
(or not using) the stream allocation logic.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
move_extent_par_page calls a_ops->write_begin() to increase journal
handler's reference count. However, if either mext_replace_branches()
or ext4_get_block fails, the increased reference count isn't
decreased. This will cause a later attempt to umount of the fs to hang
forever. The patch addresses the issue by calling ext4_journal_stop()
if page is not NULL (which means a_ops->write_end() isn't invoked).
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
fix jiffie rounding in jbd commit timer setup code. Rounding down
could cause the timer to be fired before the corresponding transaction
has expired. That transaction can stay not committed forever if no
new transaction is created or expicit sync/umount happens.
Signed-off-by: Alex Zhuravlev (Tomas) <alex.zhuravlev@sun.com>
Signed-off-by: Andreas Dilger <adilger@sun.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
For events that are rare, such as referral DNS lookups, it makes limited
sense to have a daemon constantly listening for upcalls on a channel. An
alternative in those cases might simply be to run the app that fills the
cache using call_usermodehelper_exec() and friends.
The following patch allows the cache_detail to specify alternative upcall
mechanisms for these particular cases.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In write_failover_ip(), replace the sscanf() with a call to the common
sunrpc.ko presentation address parser.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Use shared rpc_set_port() function instead of nlm_clear_port().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: Use the common routine now provided in sunrpc.ko for parsing mount
addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Introduce a set of functions in the kernel's RPC implementation for
converting between a socket address and either a standard
presentation address string or an RPC universal address.
The universal address functions will be used to encode and decode
RPCB_FOO and NFSv4 SETCLIENTID arguments. The other functions are
part of a previous promise to deliver shared functions that can be
used by upper-layer protocols to display and manipulate IP
addresses.
The kernel's current address printf formatters were designed
specifically for kernel to user-space APIs that require a particular
string format for socket addresses, thus are somewhat limited for the
purposes of sunrpc.ko. The formatter for IPv6 addresses, %pI6, does
not support short-handing or scope IDs. Also, these printf formatters
are unique per address family, so a separate formatter string is
required for printing AF_INET and AF_INET6 addresses.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit a14017db added support in the kernel's NFS mount client to
decode the authentication flavor list returned by mountd.
The NFS client can now use this list to determine whether the
authentication flavor requested by the user is actually supported
by the server.
Note we don't actually negotiate the security flavor if none was
specified by the user. Instead, we try to use AUTH_SYS, and fail if
the server does not support it. This prevents us from negotiating
an inappropriate security flavor (some servers list AUTH_NULL first).
If the server does not support AUTH_SYS, the user must provide an
appropriate security flavor by specifying the "sec=" mount option.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Previous logic in the NFS mount parsing code path assumed
auth_flavor_len was set to zero for simple authentication flavors
(like AUTH_UNIX), and 1 for compound flavors (like AUTH_GSS).
At some earlier point (maybe even before the option parsers were
merged?) specific checks for auth_flavor_len being zero were removed
from the functions that validate the mount option that sets the mount
point's authentication flavor.
Since we are populating an array for authentication flavors, the
auth_flavor_len should always be set to the number of flavors. Let's
eliminate some cleverness here, and prepare for new logic that needs
to know the number of flavors in the auth_flavors[] array.
(auth_flavors[] is an array because at some point we want to allow a
list of acceptable authentication flavors to be specified via the sec=
mount option. For now it remains a single element array).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
After certain failure modes of an NFS mount, an NFS client should send
a MOUNTPROC_UMNT request to remove the just-added mount entry from the
server's mount table. While no-one should rely on the accuracy of the
server's mount table, sending a UMNT is simply being a good internet
neighbor.
Since NFS mount processing is handled in the kernel now, we will need
a function in the kernel's mountd client that can post a MOUNTRPC_UMNT
request, in order to handle these failure modes.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The new minorversion= mount option (commit 3fd5be9e) was merged at
the same time as the recent sloppy parser fixes (commit a5a16bae),
so minorversion= still uses the old value parsing logic.
If the minorversion= option specifies a bogus value, it should fail
with "bad value" not "bad option."
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tighten up the validity checking in param_set_port: check for NULL pointers.
Ensure that the option shows up on 'modinfo' output.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the NFSv4 server doesn't support a POSIX attribute, the generic NFS code
needs to know that, so that it don't keep trying to poll for it.
However, by the same count, if the NFSv4 server does support that
attribute, then we should ensure that the inode metadata is appropriately
labelled as being untrusted. For instance, if we don't know the correct
value of the file's uid, we should certainly not be caching ACLs or ACCESS
results.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If the server is broken, then retrying forever won't fix it. We
should just give up after a while, and return an error to the user.
We set the number of retries to 10 for now...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Ensure that index i remains within array mnt_errtbl[] and mnt3_errtbl[].
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Do not exceed array status_map[]
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In a non-sparse extend, we correctly allocate (and zero) the clusters between
the old_i_size and pos, but we don't zero the portions of the cluster we're
writing to outside of pos<->len.
It handles clustersize > pagesize and blocksize < pagesize.
[Cleaned up by Joel Becker.]
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
invalidate_inode_pages2_range may return -EBUSY occasionally
which results Oops. This patch fixes the issue by moving
invalidate_inode_pages2_range into a loop and keeping calling
it until the return value is not -EBUSY.
The EBUSY return is temporary, and can happen when the btrfs release page
function is unable to release a page because the EXTENT_LOCK
bit is set.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
find_zlib_workspace returns an ERR_PTR value in an error case instead of NULL.
A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@match exists@
expression x, E;
statement S1, S2;
@@
x = find_zlib_workspace(...)
... when != x = E
(
* if (x == NULL || ...) S1 else S2
|
* if (x == NULL && ...) S1 else S2
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
This takes care of the following entry from Dan's list:
fs/btrfs/inode.c +4788 btrfs_rename(36) warning: variable derefenced before check 'old_inode'
Reported-by: Dan Carpenter <error27@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
* git://git.infradead.org/mtd-2.6:
jffs2: Fix return value from jffs2_do_readpage_nolock()
mtd: mtdblock: introduce mtdblks_lock
mtd: remove 'SBC8240 Wind River' Device Driver Code
mtd: OneNAND: OMAP2/3: free GPMC CS on module removal
mtd: OneNAND: fix incorrect bufferram offset
mtd: blkdevs: do not forget to get MTD devices
mtd: fix the conversion from dev to mtd_info
mtd: let include/linux/mtd/partitions.h stand on its own
The new credentials code broke load_flat_shared_library() as it now uses
an uninitialized cred pointer.
Reported-by: Bernd Schmidt <bernds_cb1@t-online.de>
Tested-by: Bernd Schmidt <bernds_cb1@t-online.de>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I suspect that mnt_want_write_file() may have wrong assumption. I think
mnt_want_write_file() is assuming it increments ->mnt_writers if
(file->f_mode & FMODE_WRITE). But, if it's special_file(), it is false?
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The FIEMAP_IOC_FIEMAP mapping ioctl was missing a 32-bit compat handler,
which means that 32-bit suerspace on 64-bit kernels cannot use this ioctl
command.
The structure is nicely aligned, padded, and sized, so it is just this
simple.
Tested w/ 32-bit ioctl tester (from Josef) on a 64-bit kernel on ext4.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Mark Lord <lkml@rtr.ca>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Josef Bacik <josef@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When freeing an inode that lost race getting added to the inode cache we
must not call into ->destroy_inode, because that would delete the inode
that won the race from the inode cache radix tree.
This patch uses splits a new xfs_inode_free helper out of xfs_ireclaim
and uses that plus __destroy_inode to make sure we really only free
the memory allocted for the inode that lost the race, and not mess with
the inode cache state.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Reported-by: Alex Samad <alex@samad.com.au>
Reported-by: Andrew Randrianasulu <randrik@mail.ru>
Reported-by: Stephane <sharnois@max-t.com>
Reported-by: Tommy <tommy@news-service.com>
Reported-by: Miah Gregory <mace@darksilence.net>
Reported-by: Gabriel Barazer <gabriel@oxeva.fr>
Reported-by: Leandro Lucarella <llucax@gmail.com>
Reported-by: Daniel Burr <dburr@fami.com.au>
Reported-by: Nickolay <newmail@spaces.ru>
Reported-by: Michael Guntsche <mike@it-loops.com>
Reported-by: Dan Carley <dan.carley+linuxkern-bugs@gmail.com>
Reported-by: Michael Ole Olsen <gnu@gmx.net>
Reported-by: Michael Weissenbacher <mw@dermichi.com>
Reported-by: Martin Spott <Martin.Spott@mgras.net>
Reported-by: Christian Kujau <lists@nerdbynature.de>
Tested-by: Michael Guntsche <mike@it-loops.com>
Tested-by: Dan Carley <dan.carley+linuxkern-bugs@gmail.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
When we want to tear down an inode that lost the add to the cache race
in XFS we must not call into ->destroy_inode because that would delete
the inode that won the race from the inode cache radix tree.
This patch provides the __destroy_inode helper needed to fix this,
the actual fix will be in th next patch. As XFS was the only reason
destroy_inode was exported we shift the export to the new __destroy_inode.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
Currently inode_init_always calls into ->destroy_inode if the additional
initialization fails. That's not only counter-intuitive because
inode_init_always did not allocate the inode structure, but in case of
XFS it's actively harmful as ->destroy_inode might delete the inode from
a radix-tree that has never been added. This in turn might end up
deleting the inode for the same inum that has been instanciated by
another process and cause lots of cause subtile problems.
Also in the case of re-initializing a reclaimable inode in XFS it would
free an inode we still want to keep alive.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Eric Sandeen <sandeen@sandeen.net>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2:
nilfs2: fix missing unlock in error path of nilfs_mdt_write_page
nilfs2: fix oops due to inconsistent state in page with discrete b-tree nodes
Check whether index is within bounds before testing the element.
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Since forceuid is the default, we now need to show when it's disabled.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This adds a missing unlock of nilfs->ns_writer_mutex in
nilfs_mdt_write_page() function.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
This patch fixes the regression reported here:
http://bugzilla.kernel.org/show_bug.cgi?id=13861
commit 4ae1507f6d changed the default
behavior when the uid= or gid= option was specified for a mount. The
existing behavior was to always clobber the ownership information
provided by the server when these options were specified. The above
commit changed this behavior so that these options simply provided
defaults when the server did not provide this information (unless
"forceuid" or "forcegid" were specified)
This patch reverts this change so that the default behavior is restored.
It also adds "noforceuid" and "noforcegid" options to make it so that
ownership information from the server is preserved, even when the mount
has uid= or gid= options specified.
It also adds a couple of printk notices that pop up when forceuid or
forcegid options are specified without a uid= or gid= option.
Reported-by: Tom Chiverton <bugzilla.kernel.org@falkensweb.com>
Reviewed-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Andrea Gelmini gave me a report that a kernel oops hit on a nilfs
filesystem with a 1KB block size when doing rsync.
This turned out to be caused by an inconsistency of dirty state
between a page and its buffers storing b-tree node blocks.
If the page had multiple buffers split over multiple logs, and if the
logs were written at a time, a dirty flag remained in the page even
every dirty flag in the buffers was cleared.
This will fix the failure by dropping the dirty flag properly for
pages with the discrete multiple b-tree nodes.
Reported-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: stable@kernel.org
Because, with "shortname=lower", copying one FAT filesystem tree to
another FAT filesystem tree using Linux results in semantically
different filesystems. (E.g.: Filenames which were once "all
uppercase" are now "all lowercase").
So, this changes the default of "shortname=lower" to "shortname=mixed".
Signed-off-by: Paul Wise <pabs3@bonedaddy.net>
[change fat_show_options()]
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
With utf8 option, vfat allowed the duplicated filenames.
Normal nls returns -EINVAL for invalid char. But utf8s_to_utf16s()
skipped the invalid char historically.
So, this changes the utf8s_to_utf16s() directly to return -EINVAL for
invalid char, because vfat is only user of it.
mkdir /mnt/fatfs
FILENAME=`echo -ne "invalidutf8char_\\0341_endofchar"`
echo "Using filename: $FILENAME"
dd if=/dev/zero of=fatfs bs=512 count=128
mkdosfs -F 32 fatfs
mount -o loop,utf8 fatfs /mnt/fatfs
touch "/mnt/fatfs/$FILENAME"
umount /mnt/fatfs
mount -o loop,utf8 fatfs /mnt/fatfs
touch "/mnt/fatfs/$FILENAME"
ls -l /mnt/fatfs
umount /mnt/fatfs
---- And the output is:
Using filename: invalidutf8char_\0341_endofchar
128+0 records in
128+0 records out
65536 bytes (66 kB) copied, 0.000388118 s, 169 MB/s
mkdosfs 2.11 (12 Mar 2005)
total 0
-rwxr-xr-x 1 root root 0 Jun 28 19:46 invalidutf8char__endofchar
-rwxr-xr-x 1 root root 0 Jun 28 19:46 invalidutf8char__endofchar
Tested-by: Marton Balint <cus@fazekas.hu>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
The async caching thread can end up looping forever if a given
search puts it at the last key in a leaf. It will end up calling
btrfs_next_leaf and then checking if it needs to politely drop
the read semaphore.
Most of the time this looping isn't noticed because it is able to
make progress the next time around. But, during log replay,
we wait on the async caching thread to finish, and the async thread
is waiting on the commit, and no progress is really made.
The fix used here is to copy the key out of the next leaf,
that way our search lands there properly.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Yan Zheng hit a problem where we tried to remove some free space but failed
because we couldn't find the free space entry. This is because the free space
was held within a bitmap that had a starting offset well before the actual
offset of the free space, and there were free space extents that were in the
same range as that offset, so tree_search_offset returned with NULL because we
couldn't find a free space extent that had that offset. This is fixed by
making sure that if we fail to find the entry, we re-search again with
bitmap_only set to 1 and do an offset_to_bitmap so we can get the appropriate
bitmap. A similar problem happens in btrfs_alloc_from_bitmap for the
clustering code, but that is not as bad since we will just go and redo our
cluster allocation.
Also this adds some debugging checks to make sure that the free space we are
trying to remove from the bitmap is in fact there. This can probably go away
after a while, but since this code is only used by the tree-logging stuff it
would be nice to run with it for a while to make sure there are no problems.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
VM calculation for nr_to_write seems off. Bump it way
up, this gets simple streaming writes zippy again.
To be reviewed again after Jens' writeback changes.
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Cc: Chris Mason <chris.mason@oracle.com>
Reviewed-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
commit 6321e3ed2a caused
the full bmv_count's worth of getbmapx structures to get
allocated; telling it to do MAXEXTNUM was a bit insane,
resulting in ENOMEM every time.
Chop it down to something reasonable, the number of slots
in the caller's input buffer. If this is too large the
caller may get ENOMEM but the reason should not be a
mystery, and they can try again with something smaller.
We add 1 to the value because in the normal getbmap
world, bmv_count includes the header and xfs_getbmap does:
nex = bmv->bmv_count - 1;
if (nex <= 0)
return XFS_ERROR(EINVAL);
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Reviewed-by: Olaf Weber <olaf@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Felix Blyakher <felixb@sgi.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: be more polite in the async caching threads
Btrfs: preserve commit_root for async caching
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6:
udf: Fix loading of VAT inode when drive wrongly reports number of recorded blocks
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes:
GFS2: remove dcache entries for remote deleted inodes
GFS2: Fix incorrent statfs consistency check
GFS2: Don't put unlikely reclaim candidates on the reclaim list.
GFS2: Don't try and dealloc own inode
GFS2: Fix panic in glock memory shrinker
GFS2: keep statfs info in sync on grows
GFS2: Shrink the shrinker