It looks the error paths in jffs2_block_check_erase() have wrong return
values. A block that failed to be erased never gets marked as bad.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Show peer group ID of nearest dominating group that has intersection
with the mount's namespace.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[mszeredi@suse.cz] rewrite and split big patch into managable chunks
/proc/mounts in its current form lacks important information:
- propagation state
- root of mount for bind mounts
- the st_dev value used within the filesystem
- identifier for each mount and it's parent
It also suffers from the following problems:
- not easily extendable
- ambiguity of mountpoints within a chrooted environment
- doesn't distinguish between filesystem dependent and independent options
- doesn't distinguish between per mount and per super block options
This patch introduces /proc/<pid>/mountinfo which attempts to address
all these deficiencies.
Code shared between /proc/<pid>/mounts and /proc/<pid>/mountinfo is
extracted into separate functions.
Thanks to Al Viro for the help in getting the design right.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allow /proc/<pid>/mountinfo to use the root of <pid> to calculate
mountpoints.
- move definition of 'struct proc_mounts' to <linux/mnt_namespace.h>
- add the process's namespace and root to this structure
- pass a pointer to 'struct proc_mounts' into seq_operations
In addition the following cleanups are made:
- use a common open function for /proc/<pid>/{mounts,mountstat}
- surround namespace.c part of these proc files with #ifdef CONFIG_PROC_FS
- make the seq_operations structures const
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a unique ID to each peer group using the IDR infrastructure. The
identifiers are reused after the peer group dissolves.
The IDR structures are protected by holding namepspace_sem for write
while allocating or deallocating IDs.
IDs are allocated when a previously unshared vfsmount becomes the
first member of a peer group. When a new member is added to an
existing group, the ID is copied from one of the old members.
IDs are freed when the last member of a peer group is unshared.
Setting the MNT_SHARED flag on members of a subtree is done as a
separate step, after all the IDs have been allocated. This way an
allocation failure can be cleaned up easilty, without affecting the
propagation state.
Based on design sketch by Al Viro.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a unique ID to each vfsmount using the IDR infrastructure. The
identifiers are reused after the vfsmount is freed.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a new function:
seq_file_root()
This is similar to seq_path(), but calculates the path relative to the
given root, instead of current->fs->root. If the path was unreachable
from root, then modify the root parameter to reflect this.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[mszeredi@suse.cz] split big patch into managable chunks
Add the following functions:
dentry_path()
seq_dentry()
These are similar to d_path() and seq_path(). But instead of
calculating the path within a mount namespace, they calculate the path
from the root of the filesystem to a given dentry, ignoring mounts
completely.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
[PATCH] get rid of __exit_files(), __exit_fs() and __put_fs_struct()
[PATCH] proc_readfd_common() race fix
[PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe()
[PATCH] teach seq_file to discard entries
[PATCH] umount_tree() will unhash everything itself
[PATCH] get rid of more nameidata passing in namespace.c
[PATCH] switch a bunch of LSM hooks from nameidata to path
[PATCH] lock exclusively in collect_mounts() and drop_collected_mounts()
[PATCH] move a bunch of declarations to fs/internal.h
Haven't had any complaints about it recently, despite having the test
code enabled to verify that the calculated length is correct.
Kill it off, just by #undef TEST_TOTLEN for now; removing it for real
can come a little later.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
The problem fixed in commit 014b164e13
(space leak with in-band cleanmarkers) would have been caught a lot
quicker if our paranoid debugging mode had included adding up the size
counts from all the eraseblocks and comparing the totals with the counts
in the superblock. Add that.
Make jffs2_mark_erased_block() file the newly-erased block on the
free_list before calling the debug function, to make it happy.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Since we drop the rcu_read_lock inside the loop, we can't assume
that files->fdt will remain unchanged (and not freed) between
iterations.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We were accounting for the cleanmarker by calling jffs2_link_node_ref()
(without locking!), which adjusted both superblock and per-eraseblock
accounting, subtracting the size of the cleanmarker from {jeb,c}->free_size
and adding it to {jeb,c}->used_size.
But only _then_ were we adding the size of the newly-erased block back
to the superblock counts, and we were adding each of jeb->{free,used}_size
to the corresponding superblock counts. Thus, the size of the cleanmarker
was effectively subtracted from the superblock's free_size _twice_.
Fix this, by always adding a full eraseblock size to c->free_size when
we've erased a block. And call jffs2_link_node_ref() under the proper
lock, while we're at it.
Thanks to Alexander Yurchenko and/or Damir Shayhutdinov for (almost)
pinpointing the problem.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: linux/{dlm,dlm_device}.h: cleanup for userspace
dlm: common max length definitions
dlm: move plock code from gfs2
dlm: recover nodes that are removed and re-added
dlm: save master info after failed no-queue request
dlm: make dlm_print_rsb() static
dlm: match signedness between dlm_config_info and cluster_set
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-udf-2.6: (41 commits)
udf: use crc_itu_t from lib instead of udf_crc
udf: Fix compilation warnings when UDF debug is on
udf: Fix bug in VAT mapping code
udf: Add read-only support for 2.50 UDF media
udf: Fix handling of multisession media
udf: Mount filesystem read-only if it has pseudooverwrite partition
udf: Handle VAT packed inside inode properly
udf: Allow loading of VAT inode
udf: Fix detection of VAT version
udf: Silence warning about accesses beyond end of device
udf: Improve anchor block detection
udf: Cleanup anchor block detection.
udf: Move processing of virtual partitions
udf: Move filling of partition descriptor info into a separate function
udf: Improve error recovery on mount
udf: Cleanup volume descriptor sequence processing
udf: fix anchor point detection
udf: Remove declarations of arrays of size UDF_NAME_LEN (256 bytes)
udf: Remove checking of existence of filename in udf_add_entry()
udf: Mark udf_process_sequence() as noinline
...
We have a problem in scsi_transport_spi in that we need to customise
not only the visibility of the attributes, but also their mode. Fix
this by making the is_visible() callback return a mode, with 0
indicating is not visible.
Also add a sysfs_update_group() API to allow us to change either the
visibility or mode of the files at any time on the fly.
Acked-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Add the write verification buffer to the dataflash. The mtd_dataflash has
the CONFIG_DATAFLASH_WRITE_VERIFY so is better a change to Kconfig.
Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
fs/jffs2/gc.c:1147:29: warning: symbol 'jeb' shadows an earlier one
fs/jffs2/gc.c:1084:89: originally declared here
fs/jffs2/gc.c:1197:29: warning: symbol 'jeb' shadows an earlier one
fs/jffs2/gc.c:1084:89: originally declared here
Rename the unused 'jeb' argument to avoid this. We could potentially
remove the argument, but GCC should be doing that anyway.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
fs/jffs2/write.c:585:28: warning: symbol 'fd' shadows an earlier one
fs/jffs2/write.c:536:27: originally declared here
No need to redeclare fd, use the original one, after this point,
fd is always reassigned before it used again.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
fs/jffs2/nodemgmt.c:60:8: warning: symbol 'ret' shadows an earlier one
fs/jffs2/nodemgmt.c:45:6: originally declared here
(reported by Harvey Harrison)
Just remove the offending declaration of 'int ret' and use the earlier one.
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
fs/jffs2/ioctl.c:14:5: warning: symbol 'jffs2_ioctl' was not declared.
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Further reduction of stack footprint (sys_pivot_root());
lose useless BKL in there, while we are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Taking namespace_sem shared there isn't worth the trouble, especially with
vfsmount ID allocation about to be added. That way we know that umount_tree(),
copy_tree() and clone_mnt() are _always_ serialized by namespace_sem.
umount_tree() still needs vfsmount_lock (it manipulates hash chains, among
other things), but that's a separate story.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/juhl/trivial: (24 commits)
DOC: A couple corrections and clarifications in USB doc.
Generate a slightly more informative error msg for bad HZ
fix typo "is" -> "if" in Makefile
ext*: spelling fix prefered -> preferred
DOCUMENTATION: Use newer DEFINE_SPINLOCK macro in docs.
KEYS: Fix the comment to match the file name in rxrpc-type.h.
RAID: remove trailing space from printk line
DMA engine: typo fixes
Remove unused MAX_NODES_SHIFT
MAINTAINERS: Clarify access to OCFS2 development mailing list.
V4L: Storage class should be before const qualifier (sn9c102)
V4L: Storage class should be before const qualifier
sonypi: Storage class should be before const qualifier
intel_menlow: Storage class should be before const qualifier
DVB: Storage class should be before const qualifier
arm: Storage class should be before const qualifier
ALSA: Storage class should be before const qualifier
acpi: Storage class should be before const qualifier
firmware_sample_driver.c: fix coding style
MAINTAINERS: Add ati_remote2 driver
...
Fixed up trivial conflicts in firmware_sample_driver.c
* 'for-2.6.26' of git://git.kernel.dk/linux-2.6-block:
block: fix blk_register_queue() return value
block: fix memory hotplug and bouncing in block layer
block: replace remaining __FUNCTION__ occurrences
Kconfig: clean up block/Kconfig help descriptions
cciss: fix warning oops on rmmod of driver
cciss: Fix race between disk-adding code and interrupt handler
block: move the padding adjustment to blk_rq_map_sg
block: add bio_copy_user_iov support to blk_rq_map_user_iov
block: convert bio_copy_user to bio_copy_user_iov
loop: manage partitions in disk image
cdrom: use kmalloced buffers instead of buffers on stack
cdrom: make unregister_cdrom() return void
cdrom: use list_head for cdrom_device_info list
cdrom: protect cdrom_device_info list by mutex
cdrom: cleanup hardcoded error-code
cdrom: remove ifdef CONFIG_SYSCTL
* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-2.6: (36 commits)
SCSI: convert struct class_device to struct device
DRM: remove unused dev_class
IB: rename "dev" to "srp_dev" in srp_host structure
IB: convert struct class_device to struct device
memstick: convert struct class_device to struct device
driver core: replace remaining __FUNCTION__ occurrences
sysfs: refill attribute buffer when reading from offset 0
PM: Remove destroy_suspended_device()
Firmware: add iSCSI iBFT Support
PM: Remove legacy PM (fix)
Kobject: Replace list_for_each() with list_for_each_entry().
SYSFS: Explicitly include required header file slab.h.
Driver core: make device_is_registered() work for class devices
PM: Convert wakeup flag accessors to inline functions
PM: Make wakeup flags available whenever CONFIG_PM is set
PM: Fix misuse of wakeup flag accessors in serial core
Driver core: Call device_pm_add() after bus_add_device() in device_add()
PM: Handle device registrations during suspend/resume
block: send disk "change" event for rescan_partitions()
sysdev: detect multiple driver registrations
...
Fixed trivial conflict in include/linux/memory.h due to semaphore header
file change (made irrelevant by the change to mutex).
These are small cleanups all over the tree.
Trivial style and comment changes to
fs/select.c, kernel/signal.c, kernel/stop_machine.c & mm/pdflush.c
Signed-off-by: Pavel Machek <pavel@suse.cz>
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Add central definitions for max lockspace name length and max resource
name length. The lack of central definitions has resulted in scattered
private definitions which we can now clean up, including an unused one
in dlm_device.h.
Signed-off-by: David Teigland <teigland@redhat.com>
Move the code that handles cluster posix locks from gfs2 into the dlm
so that it can be used by both gfs2 and ocfs2.
Signed-off-by: David Teigland <teigland@redhat.com>
If a node is removed from a lockspace, and then added back before the
dlm is notified of the removal, the dlm will not detect the removal
and won't clear the old state from the node. This is fixed by using a
list of added nodes so the membership recovery can detect when a newly
added node is already in the member list.
Signed-off-by: David Teigland <teigland@redhat.com>
When a NOQUEUE request fails, the rsb res_master field is unnecessarily
reset to -1, instead of leaving the valid master setting in place. We
want to save the looked-up master values while the rsb is on the "toss
list" so that another lookup can be avoided if the rsb is soon reused.
The fix is to simply leave res_master value alone.
Signed-off-by: David Teigland <teigland@redhat.com>
cluster_set is only called from the macro CLUSTER_ATTR which defines read/write
access functions. Make the signedness match to avoid sparse warnings every time
CLUSTER_ATTR is used (lines 149-159) all of the form:
fs/dlm/config.c:149:1: warning: incorrect type in argument 3 (different signedness)
fs/dlm/config.c:149:1: expected unsigned int *info_field
fs/dlm/config.c:149:1: got int extern [toplevel] *<noident>
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch enables bio_copy_user to take struct sg_iovec (renamed
bio_copy_user_iov). bio_copy_user uses bio_copy_user_iov internally as
bio_map_user uses bio_map_user_iov.
The major changes are:
- adds sg_iovec array to struct bio_map_data
- adds __bio_copy_iov that copy data between bio and
sg_iovec. bio_copy_user_iov and bio_uncopy_user use it.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Tejun Heo <htejun@gmail.com>
Cc: Mike Christie <michaelc@cs.wisc.edu>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Requiring userspace to close and re-open sysfs attributes has been the
policy since before 2.6.12. It allows userspace to get a consistent
snapshot of kernel state and consume it with incremental reads and seeks.
Now, if the file position is zero the kernel assumes userspace wants to see
the new value. The application for this change is to allow a userspace
RAID metadata handler to check the state of an array without causing any
memory allocations. Thus not causing writeback to a raid array that might
be blocked waiting for userspace to take action.
Cc: Neil Brown <neilb@suse.de>
Acked-by: Tejun Heo <htejun@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
After an experimental deletion of the unnecessary inclusion of
<linux/slab.h> from the header file <linux/percpu.h>, the following
files under fs/sysfs were exposed as needing to explicitly include
<linux/slab.h>.
Signed-off-by: Robert P. J. Day <rpjday@crashcourse.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Userspace likes to get notified that the disk may have changed, when
rescan_partitions() is called after partitioning or media change. It will
make it possible to update the state of the disk with the "change" event,
before the following partition "add" events are handled.
Cc: David Zeuthen <david@fubar.dk>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
There is possible NULL pointer dereference if kstr[n]dup failed.
So fix them for safety.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We need to try to ensure that we always use the same credentials whenever
we re-establish the clientid on the server. If not, the server won't
recognise that we're the same client, and so may not allow us to recover
state.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
With the recent change to generic creds, we can no longer use
cred->cr_ops->cr_name to distinguish between RPCSEC_GSS principals and
AUTH_SYS/AUTH_NULL identities. Replace it with the rpc_authops->au_name
instead...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Benny points out that zero-padding of multiword bitfields is necessary,
and that delimiting each word is nice to avoid endianess confusion.
bhalevy: without zero padding output can be ambiguous. Also,
since the printed array of two 32-bit unsigned integers is not a
64-bit number, delimiting the output with a semicolon makes more sense.
Signed-off-by: Fred Isaman <iisaman@citi.umich.edu>
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
All use sites for nfs{,4}_stat_to_errno negate their return value.
It's more efficient to return a negative error from the stat_to_errno convertors
rather than negating its return value everywhere. This also produces slightly
smaller code.
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now that we've added the 'generic' credentials (that are independent of the
rpc_client) to the nfs_open_context, we can use those in the NLM client to
ensure that the lock/unlock requests are authenticated to whoever
originally opened the file.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Both NLM and NFSv4 should be able to clean up adequately in the case where
the user interrupts the RPC call...
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We shouldn't remove the lock from the list of blocked locks until the
CANCEL call has completed since we may be racing with a GRANTED callback.
Also ensure that we send an UNLOCK if the CANCEL request failed. Normally
that should only happen if the process gets hit with a fatal signal.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, it returns success as long as the RPC call was sent. We'd like
to know if the CANCEL operation succeeded on the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Peter Staubach comments:
> In the course of investigating testing failures in the locking phase of
> the Connectathon testsuite, I discovered a couple of things. One was
> that one of the tests in the locking tests was racy when it didn't seem
> to need to be and two, that the NFS client asynchronously releases locks
> when a process is exiting.
...
> The Single UNIX Specification Version 3 specifies that: "All locks
> associated with a file for a given process shall be removed when a file
> descriptor for that file is closed by that process or the process holding
> that file descriptor terminates.".
>
> This does not specify whether those locks must be released prior to the
> completion of the exit processing for the process or not. However,
> general assumptions seem to be that those locks will be released. This
> leads to more deterministic behavior under normal circumstances.
The following patch converts the NFSv2/v3 locking code to use the same
mechanism as NFSv4 for sending asynchronous RPC calls and then waiting for
them to complete. This ensures that the UNLOCK and CANCEL RPC calls will
complete even if the user interrupts the call, yet satisfies the
above request for synchronous behaviour on process exit.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we replace the existing synchronous RPC calls with asynchronous calls,
the reference count will be needed in order to allow us to examine the
result of the RPC call.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Also fix up nlmclnt_lock() so that it doesn't pass modified versions of
fl->fl_flags to nlmclnt_cancel() and other helpers.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is quite possible that the OPEN, CLOSE, LOCK, LOCKU,... compounds fail
before the actual stateful operation has been executed (for instance in the
PUTFH call). There is no way to tell from the overall status result which
operations were executed from the COMPOUND.
The fix is to move incrementing of the sequence id into the XDR layer,
so that we do it as we process the results from the stateful operation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There should be no need to invalidate a perfectly good state owner just
because of a stale filehandle. Doing so can cause the state recovery code
to break, since nfs4_get_renew_cred() and nfs4_get_setclientid_cred() rely
on finding active state owners.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
In the case of readpage() we need to ensure that the pages get unlocked,
and that the error is flagged.
In the case of O_DIRECT, we need to ensure that the pages are all released.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
It is possible for nfs_wb_page() to sometimes exit with 0 return value, yet
the page is left in a dirty state.
For instance in the case where the server rebooted, and the COMMIT request
failed, then all the previously "clean" pages which were cached by the
server, but were not guaranteed to have been writted out to disk,
have to be redirtied and resent to the server.
The fix is to have nfs_wb_page_priority() check that the page is clean
before it exits...
This fixes a condition that triggers the BUG_ON(PagePrivate(page)) in
nfs_create_request() when we're in the nfs_readpage() path.
Also eliminate a redundant BUG_ON(!PageLocked(page)) while we're at it. It
turns out that clear_page_dirty_for_io() has the exact same test.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
There have been a few oopses caused by 'struct file's with NULL f_vfsmnts.
There was also a set of potentially missed mnt_want_write()s from
dentry_open() calls.
This patch provides a very simple debugging framework to catch these kinds of
bugs. It will WARN_ON() them, but should stop us from having any oopses or
mnt_writer count imbalances.
I'm quite convinced that this is a good thing because it found bugs in the
stuff I was working on as soon as I wrote it.
[hch: made it conditional on a debug option.
But it's still a little bit too ugly]
[hch: merged forced remount r/o fix from Dave and akpm's fix for the fix]
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Originally from: Herbert Poetzl <herbert@13thfloor.at>
This is the core of the read-only bind mount patch set.
Note that this does _not_ add a "ro" option directly to the bind mount
operation. If you require such a mount, you must first do the bind, then
follow it up with a 'mount -o remount,ro' operation:
If you wish to have a r/o bind mount of /foo on bar:
mount --bind /foo /bar
mount -o remount,ro /bar
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is the real meat of the entire series. It actually
implements the tracking of the number of writers to a mount.
However, it causes scalability problems because there can be
hundreds of cpus doing open()/close() on files on the same mnt at
the same time. Even an atomic_t in the mnt has massive scalaing
problems because the cacheline gets so terribly contended.
This uses a statically-allocated percpu variable. All want/drop
operations are local to a cpu as long that cpu operates on the same
mount, and there are no writer count imbalances. Writer count
imbalances happen when a write is taken on one cpu, and released
on another, like when an open/close pair is performed on two
Upon a remount,ro request, all of the data from the percpu
variables is collected (expensive, but very rare) and we determine
if there are any outstanding writers to the mount.
I've written a little benchmark to sit in a loop for a couple of
seconds in several cpus in parallel doing open/write/close loops.
http://sr71.net/~dave/linux/openbench.c
The code in here is a a worst-possible case for this patch. It
does opens on a _pair_ of files in two different mounts in parallel.
This should cause my code to lose its "operate on the same mount"
optimization completely. This worst-case scenario causes a 3%
degredation in the benchmark.
I could probably get rid of even this 3%, but it would be more
complex than what I have here, and I think this is getting into
acceptable territory. In practice, I expect writing more than 3
bytes to a file, as well as disk I/O to mask any effects that this
has.
(To get rid of that 3%, we could have an #defined number of mounts
in the percpu variable. So, instead of a CPU getting operate only
on percpu data when it accesses only one mount, it could stay on
percpu data when it only accesses N or fewer mounts.)
[AV] merged fix for __clear_mnt_mount() stepping on freed vfsmount
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we depend on the inodes for writeability, we will not catch the r/o mounts
when implemented.
This patches uses __mnt_want_write(). It does not guarantee that the mount
will stay writeable after the check. But, this is OK for one of the checks
because it is just for a printk().
The other two are probably unnecessary and duplicate existing checks in the
VFS. This won't make them better checks than before, but it will make them
detect r/o mounts.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Elevate the write count during the xfs m/ctime updates.
XFS has to do it's own timestamp updates due to an unfortunate VFS
design limitation, so it will have to track writers by itself aswell.
[hch: split out from the touch_atime patch as it's not related to it at all]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It is OK to let access() go without using a mnt_want/drop_write() pair because
it doesn't actually do writes to the filesystem, and it is inherently racy
anyway. This is a rare case when it is OK to use __mnt_is_readonly()
directly.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
chown/chmod,etc... don't call permission in the same way that the normal
"open for write" calls do. They still write to the filesystem, so bump the
write count during these operations.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is the first really tricky patch in the series. It elevates the writer
count on a mount each time a non-special file is opened for write.
We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps. This will cover even those cases, while a
call in may_open() would not have.
There is also an elevated count around the vfs_create() call in open_namei().
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.
Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Some ioctl()s can cause writes to the filesystem. Take these, and make them
use mnt_want/drop_write() instead.
[AV: updated]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now includes fix for oops seen by akpm.
"never let a libc developer write your kernel code" - hch
"nor, apparently, a kernel developer" - akpm
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove handling of NULL mnt while we are at it - that can't happen these days.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This basically audits the callers of xattr_permission(), which calls
permission() and can perform writes to the filesystem.
[AV: add missing parts - removexattr() and nfsd posix acls, plug for a leak
spotted by Miklos]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This also uses the little helper in the NFS code to make an if() a little bit
less ugly. We introduced the helper at the beginning of the series.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().
So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch and into a helper function
suggested by Christoph.
This also encapsulates a fix for mknod(S_IFREG) that Miklos
found.
[AV: merged mkdir handling, added missing nfsd pieces]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Elevate the write count during the vfs_rmdir() and vfs_unlink().
[AV: merged rmdir and unlink parts, added missing pieces in nfsd]
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The emergency remount code forcibly removes FMODE_WRITE from
filps. The r/o bind mount code notices that this was done
without a proper mnt_drop_write() and properly gives a
warning.
This patch does a mnt_drop_write() to keep everything
balanced.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If someone decides to demote a file from r/w to just
r/o, they can use this same code as __fput().
NFS does just that, and will use this in the next
patch.
AV: drop write access in __fput() only after we evict from file list.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch adds two function mnt_want_write() and mnt_drop_write(). These are
used like a lock pair around and fs operations that might cause a write to the
filesystem.
Before these can become useful, we must first cover each place in the VFS
where writes are performed with a want/drop pair. When that is complete, we
can actually introduce code that will safely check the counts before allowing
r/w<->r/o transitions to occur.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
open_namei() will, in the future, need to take mount write counts
over its creation and truncation (via may_open()) operations. It
needs to keep these write counts until any potential filp that is
created gets __fput()'d.
This gets complicated in the error handling and becomes very murky
as to how far open_namei() actually got, and whether or not that
mount write count was taken. That makes it a bad interface.
All that the current do_filp_open() really does is allocate the
nameidata on the stack, then call open_namei().
So, this merges those two functions and moves filp_open() over
to namei.c so it can be close to its buddy: do_filp_open(). It
also gets a kerneldoc comment in the process.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
My end goal here is to make sure all users of may_open()
return filps. This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().
This patch moves the sys_open flags to namei flags
calculation into fs/namei.c. We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.26: (1090 commits)
[NET]: Fix and allocate less memory for ->priv'less netdevices
[IPV6]: Fix dangling references on error in fib6_add().
[NETLABEL]: Fix NULL deref in netlbl_unlabel_staticlist_gen() if ifindex not found
[PKT_SCHED]: Fix datalen check in tcf_simp_init().
[INET]: Uninline the __inet_inherit_port call.
[INET]: Drop the inet_inherit_port() call.
SCTP: Initialize partial_bytes_acked to 0, when all of the data is acked.
[netdrvr] forcedeth: internal simplifications; changelog removal
phylib: factor out get_phy_id from within get_phy_device
PHY: add BCM5464 support to broadcom PHY driver
cxgb3: Fix __must_check warning with dev_dbg.
tc35815: Statistics cleanup
natsemi: fix MMIO for PPC 44x platforms
[TIPC]: Cleanup of TIPC reference table code
[TIPC]: Optimized initialization of TIPC reference table
[TIPC]: Remove inlining of reference table locking routines
e1000: convert uint16_t style integers to u16
ixgb: convert uint16_t style integers to u16
sb1000.c: make const arrays static
sb1000.c: stop inlining largish static functions
...
When a share was in DFS and the server was Unix/Linux, we were sending paths of the form
\\server\share/dir/file
rather than
//server/share/dir/file
There was some discussion between me and jra over whether we should use
/server/share/dir/file
as MS sometimes says - but the documentation for this claims it should be
doubleslash for this type of UNC-like path format and that works, so leaving
it as doubleslash but converting the \ to / in the the //server/share portion.
This gets Samba to now correctly return STATUS_PATH_NOT_COVERED when it is
supposed to (Windows already did since the direction of the slash was not an issue
for them). Still need another minor change to fully enable DFS (need to finish
some chages to SMBGetDFSRefer
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (64 commits)
ocfs2/net: Add debug interface to o2net
ocfs2: Only build ocfs2/dlm with the o2cb stack module
ocfs2/cluster: Get rid of arguments to the timeout routines
ocfs2: Put tree in MAINTAINERS
ocfs2: Use BUG_ON
ocfs2: Convert ocfs2 over to unlocked_ioctl
ocfs2: Improve rename locking
fs/ocfs2/aops.c: test for IS_ERR rather than 0
ocfs2: Add inode stealing for ocfs2_reserve_new_inode
ocfs2: Add ac_alloc_slot in ocfs2_alloc_context
ocfs2: Add a new parameter for ocfs2_reserve_suballoc_bits
ocfs2: Enable cross extent block merge.
ocfs2: Add support for cross extent block
ocfs2: Move /sys/o2cb to /sys/fs/o2cb
sysfs: Allow removal of symlinks in the sysfs root
ocfs2: Reconnect after idle time out.
ocfs2/dlm: Cleanup lockres print
ocfs2/dlm: Fix lockname in lockres print function
ocfs2/dlm: Move dlm_print_one_mle() from dlmmaster.c to dlmdebug.c
ocfs2/dlm: Dumps the purgelist into a debugfs file
...
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (49 commits)
[GFS2] fix assertion in log_refund()
[GFS2] fix GFP_KERNEL misuses
[GFS2] test for IS_ERR rather than 0
[GFS2] Invalidate cache at correct point
[GFS2] fs/gfs2/recovery.c: suppress warnings
[GFS2] Faster gfs2_bitfit algorithm
[GFS2] Streamline quota lock/check for no-quota case
[GFS2] Remove drop of module ref where not needed
[GFS2] gfs2_adjust_quota has broken unstuffing code
[GFS2] possible null pointer dereference fixup
[GFS2] Need to ensure that sector_t is 64bits for GFS2
[GFS2] re-support special inode
[GFS2] remove gfs2_dev_iops
[GFS2] fix file_system_type leak on gfs2meta mount
[GFS2] Allow bmap to allocate extents
[GFS2] Fix a page lock / glock deadlock
[GFS2] proper extern for gfs2/locking/dlm/mount.c:gdlm_ops
[GFS2] gfs2/ops_file.c should #include "ops_inode.h"
[GFS2] be*_add_cpu conversion
[GFS2] Fix bug where we called drop_bh incorrectly
...
New WAFS filer uses ioctls which are shown to be available
on a share by querying this info level
Acked-by: Sam Liddicott <sam@liddicott.com>
Signed-off-by: Stevef French <sfrench@us.ibm.com>
This patch exposes o2net information via debugfs. The information includes
the list of sockets (sock_containers) as well as the list of outstanding
messages (send_tracking). Useful for o2dlm debugging.
(This patch is derived from an earlier one written by Zach Brown that
exposed the same information via /proc.)
[Mark: checkpatch fixes]
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
fs/ocfs2/dlm/ocfs2_dlm.ko and fs/ocfs2/dlm/ocfs2_dlmfs.ko get built if
CONFIG_FS_OCFS2 is specified. This isn't quite how it should happen any more
- the "o2cb" dlm modules should only be built if CONFIG_FS_OCFS2_O2CB is
set, so update the dlm Makefile accordingly.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
We keep seeing bug reports related to NULL pointer derefs in
o2net_set_nn_state(). When I originally wrote up the configurable timeout
patch, I had tried to plan for multiple clusters. This was silly.
The timeout routines all use o2nm_single_cluster so there's no point in
passing an argument at all. This patch removes the arguments and kills those
bugs dead.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
if (...) BUG(); should be replaced with BUG_ON(...) when the test has no
side-effects to allow a definition of BUG_ON that drops the code completely.
The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@ disable unlikely @ expression E,f; @@
(
if (<... f(...) ...>) { BUG(); }
|
- if (unlikely(E)) { BUG(); }
+ BUG_ON(E);
)
@@ expression E,f; @@
(
if (<... f(...) ...>) { BUG(); }
|
- if (E) { BUG(); }
+ BUG_ON(E);
)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
As far as I can see there is nothing in ocfs2_ioctl that requires the BKL,
so use unlocked_ioctl
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2_rename() was being too aggressive with the rename lock - we only need
it for certain forms of directory rename.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The function ocfs2_start_trans always returns either a valid pointer or a
value made with ERR_PTR, so its result should be tested with IS_ERR, not
with a test for 0.
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Inode allocation is modified to look in other nodes allocators during
extreme out of space situations. We retry our own slot when space is freed
back to the global bitmap, or whenever we've allocated more than 1024 inodes
from another slot.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In inode stealing, we no longer restrict the allocation to
happen in the local node. So it is neccessary for us to add
a new member in ocfs2_alloc_context to indicate which slot
we are using for allocation. We also modify the process of
local alloc so that this member can be used there also.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In some cases(Inode stealing from other nodes), we may not want
ocfs2_reserve_suballoc_bits to allocate new groups from the
global_bitmap since it may already be full. So add a new parameter
for this.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_figure_merge_contig_type, we judge whether there exists
a cross extent block merge and enable it by setting CONTIG_LEFT
and CONTIG_RIGHT accordingly.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
In ocfs2_merge_rec_left, when we find the merge extent is "CONTIG_RIGHT"
with the first extent record of the next extent block, we will merge it to
the next extent block and change all the related extent blocks accordingly.
In ocfs2_merge_rec_right, when we find the merge extent is "CONTIG_LEFT"
with the last extent record of the previous extent block, we will merge
it to the prevoius extent block and change all the related extent blocks
accordingly.
As for CONTIG_LEFTRIGHT, we will handle CONTIG_RIGHT first so that when
the index is zero, the merge process will be more efficient and easier.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
/sys/fs is where we really want file system specific sysfs objects.
Ocfs2-tools has been updated to look in /sys/fs/o2cb. We can maintain
backwards compatibility with old ocfs2-tools by using a sysfs symlink. After
some time (2 years), the symlink can be safely removed. This patch also adds
documentation to make it easier for people to figure out what /sys/fs/o2cb
is used for.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Allow callers of sysfs_remove_link() to pass a NULL kobj, in which case
sysfs_root will be used as the parent directory. This allows us to tear down
top level symlinks created via sysfs_create_link(), which already has
similar handling of a NULL parent object.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@suse.de>
Currently, o2net connects to a node on hb_up and disconnects on
hb_down and net timeout.
It disconnects on net timeout is ok, but it should attempt to
reconnect back. This is because sometimes nodes get overloaded
enough that the network connection breaks but the disk hb does not.
And if we get into that situation, we either fence (unnecessarily)
or wait for its disk hb to die (and sometimes hang in the process).
So in this updated scheme, when the network disconnects, we keep
attempting to reconnect till we succeed or we get a disk hb down
event.
If the other node is really dead, then we will eventually get a
node down event. If not, we should be able to connect again and
continue.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
A previous patch added KERN_NOTICE to printks printing the lockres that
cluttered the output. This patch removes the log level. For people concerned
with syslog clutter, please note we now use this facility to print lockres
only during an error.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
__dlm_print_one_lock_resource was printing lockname incorrectly.
Also, we now use printk directly instead of mlog as the latter prints
the line context which is not useful for this print.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch helps in consolidating debugging related functions in dlmdebug.c.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch dumps all the lockres' on the purgelist it can fit in one page
into a debugfs file. Useful for debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch dumps all mles it can fit in one page into a debugfs file.
Useful for debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch moves some mle related definitions from dlmmaster.c
to dlmcommon.h. Future patches need these definitions to dump mle
debugging information.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.beckeroracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch dumps all the lockres' alongwith all the locks into
a debugfs file. Useful for debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch dumps the dlm state (dlm_ctxt) into a debugfs file.
Useful for debugging.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch creates the debugfs directories that will hold the
files to be used to dump the dlm state.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch links all the lockres' to a tracking list in dlm_ctxt.
We will use this in an upcoming patch that will walk the entire
list and to dump the lockres states to a debugfs file.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch makes the o2dlm allocate memory for lockres, lockname and lock
structures from slabcaches rather than kmalloc. This allows us to not only
make these allocs more efficient but also allows us to track the memory being
consumed by these structures.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This patch renames dlm_mle_slabcache to prevent namespace clashes with fs/dlm.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 now supports plug-ins for the classic O2CB stack as well as
userspace cluster stacks in conjunction with fs/dlm. This allows zero,
one, or both of the plug-ins to be selected in Kconfig. For local mounts
(non-clustered), neither plug-in is needed. Both plugins can be loaded
at one time, the runtime will select the one needed for the cluster
systme in use.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add ocfs2_stack_user.ko to the Makefile so that it builds.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The masklog code is in the o2cb stack, but ocfs2_lockid.h now needs to
be included by the user stack. The BUG() in ocfs2_lock_type_string()
does not need masklog support, so change it to a regular BUG_ON().
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add code to use fs/dlm.
[ Modified to be part of the stack_user module -- Joel ]
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The "SETV" message sets the filesystem locking protocol version as
negotiated by the client. The client negotiates based on the maximum
version advertised in /sys/fs/ocfs2/max_locking_protocol.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is the second part of the ocfs2_control handshake. After
negotiating the ocfs2_control protocol, the daemon tells the filesystem
what the local node id is via the SETN message.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When the control daemon sees a node go down, it sends a DOWN message
through the ocfs2_control device.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When a control daemon opens the ocfs2_control device, it must perform a
handshake to tell the filesystem it is something capable of monitoring
cluster status. Only after the handshake is complete will the filesystem
allow mounts.
This is the first part of the handshake. The daemon reads all supported
ocfs2_control protocols, then writes in the protocol it will use.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2_control misc device is how a userspace control daemon (controld)
talks to the filesystem. Introduce the bare-bones filesystem ops.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Add a skeleton for the stack_user module. It's just the barebones module
code.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Userspace can now query and specify the cluster stack in use via the
/sys/fs/ocfs2/cluster_stack file. By default, it is 'o2cb', which is
the classic stack. Thus, old tools that do not know how to modify this
file will work just fine. The stack cannot be modified if there is a
live filesystem.
ocfs2_cluster_connect() now takes the expected cluster stack as an
argument. This way, the filesystem and the stack glue ensure they are
speaking to the same backend.
If the stack is 'o2cb', the o2cb stack plugin is used. For any other
value, the fsdlm stack plugin is selected.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The filesystem gains the USERSPACE_STACK incomat bit and the
s_cluster_info field on the superblock. When a userspace stack is in
use, the name of the stack is stored on-disk for mount-time
verification.
The "cluster_stack" option is added to mount(2) processing. The mount
process needs to pass the matching stack name. If the passed name and
the on-disk name do not match, the mount is failed.
When using the classic o2cb stack, the incompat bit is *not* set and no
mount option is used other than the usual heartbeat=local. Thus, the
filesystem is compatible with older tools.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Introduce a set of sysfs files that describe the current stack glue
state. The files live under /sys/fs/ocfs2. The locking_protocol file
displays the version of ocfs2's locking code. The
loaded_cluster_plugins file displays all of the currently loaded stack
plugins. When filesystems are mounted, the active_cluster_plugin file
will display the plugin in use.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We define the ocfs2_stack_plugin structure to represent a stack driver.
The o2cb stack code is split into stack_o2cb.c. This becomes the
ocfs2_stack_o2cb.ko module.
The stackglue generic functions are similarly split into the
ocfs2_stackglue.ko module. This module now provides an interface to
register drivers. The ocfs2_stack_o2cb driver registers itself. As
part of this interface, ocfs2_stackglue can load drivers on demand.
This is accomplished in ocfs2_cluster_connect().
ocfs2_cluster_disconnect() is now notified when a _hangup() is pending.
If a hangup is pending, it will not release the driver module and will
let _hangup() do that.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Define the ocfs2_stack_operations structure. Build o2cb_stack_ops from
all of the o2cb-specific stack functions. Change the generic stack glue
functions to call the stack_ops instead of the o2cb functions directly.
The o2cb functions are moved to stack_o2cb.c. The headers are cleaned up
to where only needed headers are included.
In this code, stackglue.c and stack_o2cb.c refer to some shared
extern variables. When they become modules, that will change.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Split off the o2cb-specific funtionality from the generic stack glue
calls. This is a precurser to wrapping the o2cb functionality in an
operations vector.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The stack glue initialization function needs a better name so that it can be
used cleanly when stackglue becomes a module.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
dlmglue.c was still referencing a raw o2dlm lksb in one instance. Let's
create a generic ocfs2_dlm_dump_lksb() function. This allows underlying
DLMs to print whatever they want about their lock.
We then move the o2dlm dump into stackglue.c where it belongs.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
When using fsdlm, -EAGAIN is returned in the async callback for NOQUEUE
requests. Fix up dlmglue to expect this.
Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
o2dlm has the non-standard behavior of providing a cancel callback
(unlock_ast) even when the cancel has failed (the locking operation
succeeded without canceling). This is called CANCELGRANT after the
status code sent to the callback. fs/dlm does not provide this
callback, so dlmglue must be changed to live without it.
o2dlm_unlock_ast_wrapper() in stackglue now ignores CANCELGRANT calls.
Because dlmglue no longer sees CANCELGRANT, ocfs2_unlock_ast() no longer
needs to check for it. ocfs2_locking_ast() must catch that a cancel was
tried and clear the cancel state.
Making these changes opens up a locking race. dlmglue uses the the
OCFS2_LOCK_BUSY flag to ensure only one thread is calling the dlm at any
one time. But dlmglue must unlock the lockres before calling into the
dlm. In the small window of time between unlocking the lockres and
calling the dlm, the downconvert thread can try to cancel the lock. The
downconvert thread is checking the OCFS2_LOCK_BUSY flag - it doesn't
know that ocfs2_dlm_lock() has not yet been called.
Because ocfs2_dlm_lock() has not yet been called, the cancel operation
will just be a no-op. There's nothing to cancel. With CANCELGRANT,
dlmglue uses the CANCELGRANT callback to clear up the cancel state.
When it comes around again, it will retry the cancel. Eventually, the
first thread will have called into ocfs2_dlm_lock(), and either the
lock or the cancel will succeed. The downconvert thread can then do its
downconvert.
Without CANCELGRANT, there is nothing to clean up the cancellation
state. The downconvert thread does not know to retry its operations.
More importantly, the original lock may be blocking on the other node
that is trying to cancel us. With neither able to make progress, the
ast is never called and the cancellation state is never cleaned up that
way. dlmglue is deadlocked.
The OCFS2_LOCK_PENDING flag is introduced to remedy this window. It is
set at the same time OCFS2_LOCK_BUSY is. Thus, the downconvert thread
can check whether the lock is cancelable. If not, it just loops around
to try again. Once ocfs2_dlm_lock() is called, the thread then clears
OCFS2_LOCK_PENDING and wakes the downconvert thread. Now, if the
downconvert thread finds the lock BUSY, it can safely try to cancel it.
Whether the cancel works or not, the state will be properly set and the
lock processing can continue.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
It doesn't make sense to query for a node number before connecting to the
cluster stack. This should be safe to do because node_num is only just
printed,
and we're actually only moving the setting of node num a small amount
further in the mount process.
[ Disconnect when node query fails -- Joel ]
Reviewed-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The last bit of classic stack used directly in ocfs2 code is o2hb.
Specifically, the check for heartbeat during mount and the call to
ocfs2_hb_ctl during unmount.
We create an extra API, ocfs2_cluster_hangup(), to encapsulate the call
to ocfs2_hb_ctl. Other stacks will just leave hangup() empty.
The check for heartbeat is moved into ocfs2_cluster_connect(). It will
be matched by a similar check for other stacks.
With this change, only stackglue.c includes cluster/ headers.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
ocfs2 asks the cluster stack for the local node's node number for two
reasons; to fill the slot map and to print it. While the slot map isn't
necessary for userspace cluster stacks, the printing is very nice for
debugging. Thus we add ocfs2_cluster_this_node() as a generic API to get
this value. It is anticipated that the slot map will not be used under a
userspace cluster stack, so validity checks of the node num only need to
exist in the slot map code. Otherwise, it just gets used and printed as an
opaque value.
[ Fixed up some "int" versus "unsigned int" issues and made osb->node_num
truly opaque. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This step introduces a cluster stack agnostic API for initializing and
exiting. fs/ocfs2/dlmglue.c no longer uses o2cb/o2dlm knowledge to
connect to the stack. It is all handled in stackglue.c.
heartbeat.c no longer needs to know how it gets called.
ocfs2_do_node_down() is now a clean recovery trigger.
The big gotcha is the ordering of initializations and de-initializations done
underneath ocfs2_cluster_connect(). ocfs2_dlm_init() used to do all
o2dlm initialization in one block. Thus, the o2dlm functionality of
ocfs2_cluster_connect() is very straightforward. ocfs2_dlm_shutdown(),
however, did a few things between de-registration of the eviction
callback and actually shutting down the domain. Now de-registration and
shutdown of the domain are wrapped within the single
ocfs2_cluster_disconnect() call. I've checked the code paths to make
sure we can safely tear down things in ocfs2_dlm_shutdown() before
calling ocfs2_cluster_disconnect(). The filesystem has already set
itself to ignore the callback.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Wrap the lock status block (lksb) in a union. Later we will add a union
element for the fs/dlm lksb. Create accessors for the status and lvb
fields.
Other than a debugging function, dlmglue.c does not directly reference
the o2dlm locking path anymore.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Change the ocfs2_dlm_lock/unlock() functions to return -errno values.
This is the first step towards elminiating dlm_status in
fs/ocfs2/dlmglue.c. The change also passes -errno values to
->unlock_ast().
[ Fix a return code in dlmglue.c and change the error translation table into
an array of ints. --Mark ]
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The ocfs2 generic code should use the values in <linux/dlmconstants.h>.
stackglue.c will convert them to o2dlm values.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
This is the first in a series of patches to isolate ocfs2 from the
underlying cluster stack. Here we wrap the dlm locking functions with
ocfs2-specific calls. Because ocfs2 always uses the same dlm lock status
callbacks, we can eliminate the callbacks from the filesystem visible
functions.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The old slot map had a few limitations:
- It was limited to one block, so the maximum slot count was 255.
- Each slot was signed 16bits, limiting node numbers to INT16_MAX.
- An empty slot was marked by the magic 0xFFFF (-1).
The new slot map format provides 32bit node numbers (UINT32_MAX), a
separate space to mark a slot in use, and extra room to grow. The slot
map is now bounded by i_size, not a block.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The slot map file is merely an array of __le16. Wrap it in a structure for
cleaner reference.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The in-memory slot map uses the same magic as the on-disk one. There is
a special value to mark a slot as invalid. It relies on the size of
certain types and so on.
Write a new in-memory map that keeps validity as a separate field. Outside
of the I/O functions, OCFS2_INVALID_SLOT now means what it is supposed to.
It also is no longer tied to the type size.
This also means that only the I/O functions refer to 16bit quantities.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The slot map code assumed a slot_map file has one block allocated.
This changes the code to I/O as many blocks as will cover max_slots.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
The old recovery map was a bitmap of node numbers. This was sufficient
for the maximum node number of 254. Going forward, we want node numbers
to be UINT32. Thus, we need a new recovery map.
Note that we can't keep track of slots here. We must write down the
node number to recovery *before* we get the locks needed to convert a
node number into a slot number.
The recovery map is now an array of unsigned ints, max_slots in size.
It moves to journal.c with the rest of recovery.
Because it needs to be initialized, we move all of recovery initialization
into a new function, ocfs2_recovery_init(). This actually cleans up
ocfs2_initialize_super() a little as well. Following on, recovery cleaup
becomes part of ocfs2_recovery_exit().
A number of node map functions are rendered obsolete and are removed.
Finally, waiting on recovery is wrapped in a function rather than naked
checks on the recovery_event. This is a cleanup from Mark.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Just use osb_lock around the ocfs2_slot_info data. This allows us to
take the ocfs2_slot_info structure private in slot_info.c. All access
is now via accessors.
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
journal.c and dlmglue.c would refresh the slot map by hand. Instead, have
the update and clear functions do the work inside slot_map.c. The eventual
result is to make ocfs2_slot_info defined privately in slot_map.c
Signed-off-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6: (87 commits)
[XFS] Fix merge failure
[XFS] The forward declarations for the xfs_ioctl() helpers and the
[XFS] Update XFS documentation for noikeep/ikeep.
[XFS] Update XFS Documentation for ikeep and ihashsize
[XFS] Remove unused HAVE_SPLICE macro.
[XFS] Remove CONFIG_XFS_SECURITY.
[XFS] xfs_bmap_compute_maxlevels should be based on di_forkoff
[XFS] Always use di_forkoff when checking for attr space.
[XFS] Ensure the inode is joined in xfs_itruncate_finish
[XFS] Remove periodic logging of in-core superblock counters.
[XFS] fix logic error in xfs_alloc_ag_vextent_near()
[XFS] Don't error out on good I/Os.
[XFS] Catch log unmount failures.
[XFS] Sanitise xfs_log_force error checking.
[XFS] Check for errors when changing buffer pointers.
[XFS] Don't allow silent errors in xfs_inactive().
[XFS] Catch errors from xfs_imap().
[XFS] xfs_bulkstat_one_dinode() never returns an error.
[XFS] xfs_iflush_fork() never returns an error.
[XFS] Catch unwritten extent conversion errors.
...
associated comment about gcc behavior really aren't needed; all of these
functions are marked STATIC which includes noinline, and the stack usage
won't be a problem.
This effectively just removes the forward declarations and moves
xfs_ioctl() back to the end of the file.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30534a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
HAVE_SPLICE was part of the infrastructure for building 2.4 and 2.6
kernels out of the same tree. Now we don't build 2.4 kernels this
SGI-PV: 971046
SGI-Modid: xfs-linux-melb:xfs-kern:30878a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
There is no point to the CONFIG_XFS_SECURITY option; it disables the
ability to set security attributes at runtime, but it does not actually
slim down or remove any code for runtime. Just remove it and always allow
security attributes to be set.
SGI-PV: 980310
SGI-Modid: xfs-linux-melb:xfs-kern:30877a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Fix up xfs_bmap_compute_maxlevels() to account for the case when we go
from using attr2 to using attr1. In that case attr1 will no longer
necessarily be at m_attr_offset>>3, but could be at a different value for
di_forkoff. Therefore, we return the worst case scenario using MINDBTPTRS
and MINABTPTRS, as this function is used for determining the maximum log
space.
SGI-PV: 979606
SGI-Modid: xfs-linux-melb:xfs-kern:30862a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
In the case where we mount a filesystem which was previously using the
attr2 format as attr1, returning the default mp->m_attroffset instead of
the per-inode di_forkoff for inline attribute fit calculations, may result
in corruption, if for example, the data fork is already taking more space
than the default fork offset and we try to add an extended attribute. Fix
tested by xfstests/186.
SGI-PV: 979606
SGI-Modid: xfs-linux-melb:xfs-kern:30861a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On success, we still need to join the inode to the current transaction in
xfs_itruncate_finish(). Fixes regression from error handling changes.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30845a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfssyncd triggers the logging of superblock counters every 30s if the
filesystem is made with lazy-count=1. This will prevent disks from idling
and spinning down as there will be a log write every 30s. With the way
counter recovery works for lazy-count=1, this code is unnecessary and
provides no real benefit, so just remove it.
SGI-PV: 980145
SGI-Modid: xfs-linux-melb:xfs-kern:30840a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Fix a logic error in xfs_alloc_ag_vextent_near(). This is a regression
introduced by the error handling changes.
SGI-PV: 890084
SGI-Modid: xfs-linux-melb:xfs-kern:30838a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Barry Naujok <bnaujok@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfsbdstrat() made all I/Os error out, good or bad. Fix it.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30836a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Unmounting the log can fail. unlikely, but it can. Catch all the error
conditions an make sure it's propagated upwards.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30833a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_log_force() is declared to return an error, but we almost never check
it. We don't need to check it in most cases; if there's a log I/O error
then we'll be shutting down the filesystem anyway and that means we'll
catch the error somewhere else.
However, on certain calls we should be returning an error - sync
transactions, fsync, sync writes, etc. so this isn't a pure black and
white distinction. Hence make xfs_log_force() a void function that issues
a warning to the syslog on error, and call _xfs_log_force() in all the
places where we actually care about the error status returned.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30832a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_buf_associate_memory() can fail, but the return is never checked.
Propagate the error through XFS_BUF_SET_PTR() so that failures are
detected.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30831a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_inactive() fails to report errors when committing the inactive
transaction. Hence we can get silent failures either finishing off the
truncation or committing the transaction. Even if we get errors, we need
to continue, so simply warn loudly to the system if we get errors here.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30830a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Catch errors from xfs_imap() in log recovery when we might be trying to
map an invalid inode number due to a corrupted log.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30829a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_iflush_fork() never returns an error. Mark it void and clean up the
code calling it that checks for errors.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30827a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On unwritten I/O completion, we fail to propagate an error when converting
the extent to a written extent. This means that the I/O silently fails.
propagate the error onto the ioend so that the inode is marked with an
error appropriately.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30826a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_bdwrite() cannot return an error; it only queues buffers to the
delayed write list and as such never encounters anything that can fail.
Mark it void.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30825a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_bawrite() can return immediate error status on async writes. Unlike
xfsbdstrat() we don't ever check the error on the buffer after the call,
so we currently do not catch errors at all here. Ensure we catch and
propagate or warn to the syslog about up-front async write errors.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30824a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfsbdstrat() is declared to return an error. That is never checked because
the error is propagated by the xfs_buf_t that is passed through the
function.
Mark xfsbdstrat() as returning void and comment the prototype on the
methods needed for error checking.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30823a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_bmap_last_offset() can fail and return an error.
xfs_iomap_write_allocate() fails to detect and propagate the error.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30802a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_free_extent() can fail, but log recovery never bothers to check if it
successfully free the extent it was supposed to. This could lead to silent
corruption during log recovery. Abort log recovery if we fail to free an
extent.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30801a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
block_truncate_page() can return errors that we currently ignore and
silently discard. We should not ever get errors reported here - an error
indicates a bug somewhere else. Hence catch the error and issue a stack
dump to the syslog because we cannot propagate the error any further up
the call chain.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30800a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_alloc_compute_aligned() returns a value based on a comparison of the
computed extent length and the minimum length allowed. This is only used
by some callers - the other four return parameters are used more often.
Hence move the comparison to the code that actually needs to do it and
make xfs_alloc_compute_aligned() a void function.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30797a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_alloc_search_busy() returns an index into the busy array if the extent
was found in the array. This is never checked, and the
xfs_alloc_search_busy() does a log force to prevent reuse of the extent
before the free transaction hits the disk. Hence the return value is
useless. Declare the function void and remove the slot number from the
tracing as well.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30796a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_trans_commit() can return errors when there are problems in the
transaction subsystem. They are indicative that the entire transaction may
be incomplete, and hence the error should be propagated as there is a good
possibility that there is something fatally wrong in the filesystem. Catch
and propagate or warn about commit errors in the places where they are
currently ignored.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30795a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_trans_reserve() reports errors that should not be ignored. For
example, a shutdown filesystem will report errors through
xfs_trans_reserve() to prevent further changes from being attempted on a
damaged filesystem. Catch and propagate all error conditions from
xfs_trans_reserve().
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30794a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Removing an ACL can return an error. Propagate it.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30793a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Propagate the error status from xfs_acl_setmode() so that callers know if
the ACl was set correctly or not.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30792a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Truncating the quota files can silently fail. Ensure that truncation
errors are propagated to the callers.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30791a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When turning off quota, we need to write various transactions to the log
to ensure that they are cleanly removed in the case of a crash. We need to
check that the transactions hit the disk correctly. If we fail to write
the final quota off transaction, we are corrupt in memory and so the only
option is to shut the filesystem down at this point.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30790a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Warn to the syslog if we fail to reset the quota flags in the superblock
when a quota check fails.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30789a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_qm_mount_quotas() returns an error status that is ignored. If we fail
to mount quotas, we continue with quota's turned off, which is all handled
inside xfs_qm_mount_quotas(). Mark it as void to indicate that errors need
not be returned to the callers.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30788a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_qm_dqflush() can fail, but the return is not checked anywhere. Hence
we never know if we've failed to flush a dquot to disk. Propagate the
error and warn to the syslog if a flush ever fails.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30787a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_qm_dqflush_all() can return flush errors. Ensure they are propagated
into the quotacheck code to determine if the quotacheck succeeded or not.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30786a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
xfs_reserve_blocks() can fail in interesting ways. In neither case is it a
fatal error, but the result can lead to sub-optimal behaviour. Warn to the
syslog if the call fails but otherwise continue.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30784a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Makes it simpler to annotate function prototypes with __must_check via sed
scripts.
SGI-PV: 980084
SGI-Modid: xfs-linux-melb:xfs-kern:30781a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
This target component validation is not POSIX conformant and it is not
done by any other Linux filesystem so remove it from XFS.
SGI-PV: 980080
SGI-Modid: xfs-linux-melb:xfs-kern:30776a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove the remaining uses of __inline in the XFS code base.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30774a
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Recent changes to xlog_state_release_iclog() placed the grant_lock inside
the icloglock. forced unmount of the log does this the opposite way
around, but does not depend on the order for correct working. Fix the
inversion by changing the order locks are gained in
xfs_log_force_umount().
SGI-PV: 979661
SGI-Modid: xfs-linux-melb:xfs-kern:30773a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
To reduce contention on the log in large CPU count, separate out different
parts of the xlog_t structure onto different cachelines. Move each lock
onto a different cacheline along with all the members that are
accessed/modified while that lock is held.
Also, move the debugging code into debug code.
SGI-PV: 978729
SGI-Modid: xfs-linux-melb:xfs-kern:30772a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The ticket allocator is just a simple slab implementation internal to the
log. It requires the icloglock to be held when manipulating it and this
contributes to contention on that lock.
Just kill the entire allocator and use a memory zone instead. While there,
allow us to gracefully fail allocation with ENOMEM.
SGI-PV: 978729
SGI-Modid: xfs-linux-melb:xfs-kern:30771a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Rather than use the icloglock for protecting the iclog completion callback
chain, use a new per-iclog lock so that walking the callback chain doesn't
require holding a global lock.
This reduces contention on the icloglock during transaction commit and log
I/O completion by reducing the number of times we need to hold the global
icloglock during these operations.
SGI-PV: 978729
SGI-Modid: xfs-linux-melb:xfs-kern:30770a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
While investigating the extent corruption bug I ran into this bug in debug
only code. xfs_bmap_check_leaf_extents() loops through the leaf blocks of
the extent btree checking that every extent is entirely before the next
extent. It also compares the last extent in the previous block to the
first extent in the current block when the previous block has been
released and potentially unmapped. So take a copy of the last extent
instead of a pointer. Also move the last extent check out of the loop
because we only need to do it once.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30718a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Most VN_RELE calls either directly contain a XFS_ITOV or have the
corresponding xfs_inode already in scope. Use the IRELE helper instead of
VN_RELE to clarify the code. With a little more work we can kill VN_RELE
altogether and define IRELE in terms of iput directly.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30710a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The three subcases of xfs_ioc_xattr don't share any semantics and almost
no code, so split it into three separate helpers.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30709a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
- rename rootvp to root for clarify
- remove useless vn_to_inode call
- check is_bad_inode before calling d_alloc_root
- use iput instead of VN_RELE in the error case
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30708a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When writing into preallocated regions there is a case where XFS can oops
or hang doing the unwritten extent conversion on I/O completion. It turns
out that the problem is related to the btree cursor being invalid.
When we do an insert into the tree, we may need to split blocks in the
tree. When we only split at the leaf level (i.e. level 0), everything
works just fine. However, if we have a multi-level split in the btreee,
the cursor passed to the insert function is no longer valid once the
insert is complete.
The leaf level split is handled correctly because all the operations at
level 0 are done using the original cursor, hence it is updated correctly.
However, when we need to update the next level up the tree, we don't use
that cursor - we use a cloned cursor that points to the index in the next
level up where we need to do the insert.
Hence if we need to split a second level, the changes to the tree are
reflected in the cloned cursor and not the original cursor. This
clone-and-move-up-a-level-on-split behaviour recurses all the way to the
top of the tree.
The complexity here is that these cloned cursors do not point to the
original index that was inserted - they point to the newly allocated block
(the right block) and the original cursor pointer to that level may still
point to the left block. Hence, without deep examination of the cloned
cursor and buffers, we cannot update the original cursor with the new path
from the cloned cursor.
In these cases the original cursor could be pointing to the wrong block(s)
and hence a subsequent modification to the tree using that cursor will
lead to corruption of the tree.
The crash case occurs when the tree changes height - we insert a new level
in the tree, and the cursor does not have a buffer in it's path for that
level. Hence any attempt to walk back up the cursor to the root block will
result in a null pointer dereference.
To make matters even more complex, the BMAP BT is rooted in an inode, so
we can have a change of height in the btree *without a root split*. That
is, if the root block in the inode is full when we split a leaf node, we
cannot fit the pointer to the new block in the root, so we allocate a new
block, migrate all the ptrs out of the inode into the new block and point
the inode root block at the newly allocated block. This changes the height
of the tree without a root split having occurred and hence invalidates the
path in the original cursor.
The patch below prevents xfs_bmbt_insert() from returning with an invalid
cursor by detecting the cases that invalidate the original cursor and
refresh it by do a lookup into the btree for the original index we were
inserting at.
Note that the INOBT, AGFBNO and AGFCNT btree implementations also have
this bug, but the cursor is currently always destroyed or revalidated
after an insert for those trees. Hence this patch only address the problem
in the BMBT code.
SGI-PV: 979339
SGI-Modid: xfs-linux-melb:xfs-kern:30701a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
At ENOSPC, we can get a filesystem shutdown due to a cancelling a dirty
transaction in xfs_mkdir or xfs_create. This is due to the initial
allocation attempt not taking into account inode alignment and hence we
can prepare the AGF freelist for allocation when it's not actually
possible to do an allocation. This results in inode allocation returning
ENOSPC with a dirty transaction, and hence we shut down the filesystem.
Because the first allocation is an exact allocation attempt, we must tell
the allocator that the alignment does not affect the allocation attempt.
i.e. we will accept any extent alignment as long as the extent starts at
the block we want. Unfortunately, this means that if the longest free
extent is less than the length + alignment necessary for fallback
allocation attempts but is long enough to attempt a non-aligned
allocation, we will modify the free list.
If we then have the exact allocation fail, all other allocation attempts
will also fail due to the alignment constraint being taken into account.
Hence the initial attempt needs to set the "alignment slop" field so that
alignment, while not required, must be taken into account when determining
if there is enough space left in the AG to do the allocation.
That means if the exact allocation fails, we will not dirty the freelist
if there is not enough space available fo a subsequent allocation to
succeed. Hence we get an ENOSPC error back to userspace without shutting
down the filesystem.
SGI-PV: 978886
SGI-Modid: xfs-linux-melb:xfs-kern:30699a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Replace the xfs_ail_entry_t with a struct list_head and clean the
surrounding code up. Also fixes a livelock in xfs_trans_first_push_ail()
by terminating the loop at the head of the list correctly.
SGI-PV: 978682
SGI-Modid: xfs-linux-melb:xfs-kern:30636a
Signed-off-by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When xfs_mountfs is called by xfs_mount xfs_readsb was called 35 lines
above unconditionally, so there is no need to try to read the superblock
if it's not present. If any other port doesn't have the superblock read at
this point it should just call it directly from it's xfs_mount equivalent.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30603a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
It's completely unused so we might aswell kill it. Note that there is
another t_sema in struct xlog_ticket, which is used and actually an sv_t
despite the name. That one is left untouched by this patch.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30591a
Signed-off-by: Niv Sardi <xaiki@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now that the ktrace_enter() code is using atomics, the non-power-of-2
buffer sizes - which require modulus operations to get the index - are
showing up as using substantial CPU in the profiles.
Force the buffer sizes to be rounded up to the nearest power of two and
use masking rather than modulus operations to convert the index counter to
the buffer index. This reduces ktrace_enter overhead to 8% of a CPU time,
and again almost halves the trace intensive test runtime.
SGI-PV: 977546
SGI-Modid: xfs-linux-melb:xfs-kern:30538a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
ktrace_enter() is consuming vast amounts of CPU time due to the use of a
single global lock for protecting buffer index increments. Change it to
use per-buffer atomic counters - this reduces ktrace_enter() overhead
during a trace intensive test on a 4p machine from 58% of all CPU time to
12% and halves test runtime.
SGI-PV: 977546
SGI-Modid: xfs-linux-melb:xfs-kern:30537a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
XFS changes the c/mtime of an inode when truncating it to the same size.
The c/mtime is only supposed to change if the size is changed. Not to be
confused with ftruncate, where the c/mtime is supposed to be changed even
if the size is not changed.
The Linux VFS encodes this semantic difference in the flags it sends down
to ->setattr, which XFS currently ignores. We need to make XFS pay
attention to the VFS flags and hence Do The Right Thing.
SGI-PV: 977547
SGI-Modid: xfs-linux-melb:xfs-kern:30536a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
As Dave pointed out after the export ops changes we now always encode the
parent into the filehandle for regular files, but it's not actually needed
when the filesystem is export with no_subtree_check. This one-liner fixes
xfs_fs_encode_fh to skip encoding the parent unless nessecary.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30535a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
We can just use xfs_ilock/xfs_iunlock instead and get rid of the ugly
bhv_vrwlock_t.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30533a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Instead of of xfs_get_dir_entry use a macro to get the xfs_inode from the
dentry in the callers and grab the reference manually.
Only grab the reference once as it's fine to keep it over the dmapi calls.
(And even that reference is actually superflous in Linux but I'll leave
that for another patch)
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30531a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Cleanup the unneeded intermediate vnode step in the flushing helpers and
go directly from the xfs_inode to the struct address_space.
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30530a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
- use proper goto based unwinding instead of the current mess of
multiple conditionals
- rename ip to inode because that's the normal convention for Linux
inodes while ip is the convention for xfs_inodes
- remove unlikely checks for the default_acl - branches marked unlikely
might lead to extreme branch bredictor slowdons if taken and for some
workloads a default acl is quite common
- properly indent the switch statements
- remove xfs_has_fs_struct as nfsd has a fs_struct in any semi-recent
kernel
SGI-PV: 976035
SGI-Modid: xfs-linux-melb:xfs-kern:30529a
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Now that we update the log tail LSN less frequently on transaction
completion, we pass the contention straight to the global log state lock
(l_iclog_lock) during transaction completion.
We currently have to take this lock to decrement the iclog reference
count. there is a reference count on each iclog, so we need to take he
global lock for all refcount changes.
When large numbers of processes are all doing small trnasctions, the iclog
reference counts will be quite high, and the state change that absolutely
requires the l_iclog_lock is the except rather than the norm.
Change the reference counting on the iclogs to use atomic_inc/dec so that
we can use atomic_dec_and_lock during transaction completion and avoid the
need for grabbing the l_iclog_lock for every reference count decrement
except the one that matters - the last.
SGI-PV: 975671
SGI-Modid: xfs-linux-melb:xfs-kern:30505a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When hundreds of processors attempt to commit transactions at the same
time, they can contend on the AIL lock when updating the tail LSN held in
the in-core log structure.
At the moment, the tail LSN is only needed when actually writing out an
iclog, so it really does not need to be updated on every single
transaction completion - only those that result in switching iclogs and
flushing them to disk.
The result is that we reduce the number of times we need to grab the AIL
lock and the log grant lock by up to two orders of magnitude on large
processor count machines. The problem has previously been hidden by AIL
lock contention walking the AIL list which was recently solved and
uncovered this issue.
SGI-PV: 975671
SGI-Modid: xfs-linux-melb:xfs-kern:30504a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove open coded checks for the whether the inode is clean and replace
them with an inlined function.
SGI-PV: 977461
SGI-Modid: xfs-linux-melb:xfs-kern:30503a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove the xfs_icluster structure and replace with a radix tree lookup.
We don't need to keep a list of inodes in each cluster around anymore as
we can look them up quickly when we need to. The only time we need to do
this now is during inode writeback.
Factor the inode cluster writeback code out of xfs_iflush and convert it
to use radix_tree_gang_lookup() instead of walking a list of inodes built
when we first read in the inodes.
This remove 3 pointers from each xfs_inode structure and the xfs_icluster
structure per inode cluster. Hence we reduce the cache footprint of the
xfs_inodes by between 5-10% depending on cluster sparseness.
To be truly efficient we need a radix_tree_gang_lookup_range() call to
stop searching once we are past the end of the cluster instead of trying
to find a full cluster's worth of inodes.
Before (ia64):
$ cat /sys/slab/xfs_inode/object_size 536
After:
$ cat /sys/slab/xfs_inode/object_size 512
SGI-PV: 977460
SGI-Modid: xfs-linux-melb:xfs-kern:30502a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
When pdflush is writing back inodes, it can get stuck on inode cluster
buffers that are currently under I/O. This occurs when we write data to
multiple inodes in the same inode cluster at the same time.
Effectively, delayed allocation marks the inode dirty during the data
writeback. Hence if the inode cluster was flushed during the writeback of
the first inode, the writeback of the second inode will block waiting for
the inode cluster write to complete before writing it again for the newly
dirtied inode.
Basically, we want to avoid this from happening so we don't block pdflush
and slow down all of writeback. Hence we introduce a non-blocking async
inode flush flag that pdflush uses. If this flag is set, we use
non-blocking operations (e.g. try locks) whereever we can to avoid
blocking or extra I/O being issued.
SGI-PV: 970925
SGI-Modid: xfs-linux-melb:xfs-kern:30501a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
The only difference between the functions is one passes an inode for the
lookup, the other passes an inode number. However, they don't do the same
validity checking or set all the same state on the buffer that is returned
yet they should.
Factor the functions into a common implementation.
SGI-PV: 970925
SGI-Modid: xfs-linux-melb:xfs-kern:30500a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Remove the xfs_refcache, it was only needed while we were still
building for 2.4 kernels.
SGI-PV: 971186
SGI-Modid: xfs-linux-melb:xfs-kern:30472a
Signed-off-by: Donald Douwsma <donaldd@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
On a forced shutdown, xfs_finish_reclaim() will skip flushing the inode.
If the inode flush lock is not already held and there is an outstanding
xfs_iflush_done() then we might free the inode prematurely. By acquiring
and releasing the flush lock we will synchronise with xfs_iflush_done().
SGI-PV: 909874
SGI-Modid: xfs-linux-melb:xfs-kern:30468a
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Signed-off-by: David Chinner <dgc@sgi.com>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
__FUNCTION__ is gcc-specific, use __func__
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
jbd2 debugfs and stats entries should only be created if cache initialisation
is successful. At the moment they are being created unconditionally which
will leave them dangling if cache (and hence module) initialisation fails.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If an error occurs during jbd2 cache initialisation it is possible for the
journal_head_cache to be NULL when jbd2_journal_destroy_journal_head_cache is
called. Replace the J_ASSERT with an if block to handle the situation
correctly.
Note that even with this fix things will break badly if jbd2 is statically
compiled in and cache initialisation fails.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The revocation table initialisation/destruction code is repeated for each of
the two revocation tables stored in the journal. Refactoring the duplicated
code into functions is tidier, simplifies the logic in initialisation in
particular, and slightly reduces the code size.
There should not be any functional change.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Make revocation cache destruction safe to call if initialisation fails
partially or entirely. This allows it to be used to cleanup in the case of
initialisation failure, simplifying the code slightly.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
This patch makes the needlessly global ext4_xattr_list() static.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The function prototype of ext4_new_blocks_old() is defined in ext4_fs.h,
so we don't need the extra function prototype in mballoc.c
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use ext4_get_group_desc() in ext4_get_inode_block() instead of open
coding the functionality.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: adilger@clusterfs.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mingming Cao <cmm@us.ibm.com>
Use ext4_group_first_block_no() and assign the return values to
ext4_fsblk_t variables.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: adilger@clusterfs.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mingming Cao <cmm@us.ibm.com>
Convert byte order of constant instead of variable which can be done at
compile time (vs run time).
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The list_for_each_entry_rcu() primitive should be used instead of
list_for_each_rcu(), as the former is easier to use and provides
better type safety.
http://groups.google.com/group/linux.kernel/browse_thread/thread/45749c83451cebeb/0633a65759ce7713?lnk=raot
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Roel Kluin <12o3l@tiscali.nl>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
mballoc.c is a whole lot of static functions, which gcc seems to
really like to inline.
With the changes below, on x86, I can at least get from:
432 ext4_mb_new_blocks
240 ext4_mb_free_blocks
208 ext4_mb_discard_group_preallocations
188 ext4_mb_seq_groups_show
164 ext4_mb_init_cache
152 ext4_mb_release_inode_pa
136 ext4_mb_seq_history_show
...
to
220 ext4_mb_free_blocks
188 ext4_mb_seq_groups_show
176 ext4_mb_regular_allocator
164 ext4_mb_init_cache
156 ext4_mb_new_blocks
152 ext4_mb_release_inode_pa
136 ext4_mb_seq_history_show
124 ext4_mb_release_group_pa
...
which still has some big functions in there, but not 432 bytes!
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
I checked ext4_ioctl and it looked largely safe to not be used
without BKL. So convert it over to unlocked_ioctl.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When we convert an uninitialized extent to an initialized extent
we need to make sure we return the number of blocks in the
extent from the file system block corresponding to logical
file block. Otherwise we cache wrong extent details and this
results in file system corruption.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_ext_get_blocks() returns the number of blocks allocated with buffer
head unmapped for a read from prealloc space. This is needed so that
delayed allocation doesn't do block reservation for prealloc space
since the blocks are already reserved on disk. Mark the buffer head
unwritten. Some code paths try to read the block if the buffer_head is
not new and no uptodate. Marking the buffer head unwritten avoids this
reading.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_ext_get_blocks() returns number of blocks allocated with buffer
heads unmapped for a read from prealloc space. This is needed so that
delayed allocation doesn't do block reservation for prealloc space since
the blocks are already resevred on disk. Fix ext4_ext_get_blocks to not
return greater than max_blocks, since some of the code paths cannot
handle such a return value.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4_fallocate needs to update file size in each transaction. Otherwise
if we crash the file size won't be seen. We were also not marking
the inode dirty after updating file size before. Also when we try to
retry allocation due to ENOSPC, make sure we reset the variable ret so
that we actually do a retry.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Fail migrate if we allocated new blocks via mmap write.
If we write to holes in the file via mmap, we end up allocating
new blocks. This block allocation happens without taking inode->i_mutex.
Since migrate is protected by i_mutex and migrate expects that no
new blocks get allocated during migrate, fail migrate if new blocks
get allocated.
We can't take inode->i_mutex in the mmap write path because that
would result in a locking order violation between i_mutex and mmap_sem.
Also adding a separate rw_sempahore for protection is really high overhead
for a rare operation such as migrate.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If the preallocated area is small zero out the full extent
instead of splitting them. This should avoid the "write
every alternate block" problem that could grow the number
of extents dramatically.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch handles possible ENOSPC errors when writing to an
uninitialized extent in case the filesystem is full.
A write to a prealloc area causes the split of an unititalized extent
into initialized and uninitialized extents. If we don't have
space to add new extent information, instead of returning error,
convert the existing uninitialized extent to initialized one. We
need to zero out the blocks corresponding to the entire extent to
prevent uninitialized data reaching userspace.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch enables extent-formatted normal symlinks. Using extents
format allows a symlink to refer to a block number larger than 2^32
on large filesystems. We still don't enable extent format for fast
symlinks, which are contained in the inode itself.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Put the old extent details back if we fail to split the
uninitialized extent.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The "resize" option won't be noticed as it comes after the NULL option,
so if you try to mount (or in this case remount) with that option it
won't be recognized.
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently fdatasync is identical to fsync in ext3.
I think fdatasync should skip journal flush in data=ordered and
data=writeback mode when it overwrites to already-instantiated blocks on
HDD. When I_DIRTY_DATASYNC flag is not set, fdatasync should skip journal
writeout because this indicates only atime or/and mtime updates.
Following patch is the same approach of ext2's fsync code(ext2_sync_file).
I did a performance test using the sysbench.
#sysbench --num-threads=128 --max-requests=50000 --test=fileio --file-total-size=128G
--file-test-mode=rndwr --file-fsync-mode=fdatasync run
The result on ext3 was:
-2.6.24
Operations performed: 0 Read, 50080 Write, 59600 Other = 109680 Total
Read 0b Written 782.5Mb Total transferred 782.5Mb (12.116Mb/sec)
775.45 Requests/sec executed
Test execution summary:
total time: 64.5814s
total number of events: 50080
total time taken by event execution: 3713.9836
per-request statistics:
min: 0.0000s
avg: 0.0742s
max: 0.9375s
approx. 95 percentile: 0.2901s
Threads fairness:
events (avg/stddev): 391.2500/23.26
execution time (avg/stddev): 29.0155/1.99
-2.6.24-patched
Operations performed: 0 Read, 50009 Write, 61596 Other = 111605 Total
Read 0b Written 781.39Mb Total transferred 781.39Mb (16.419Mb/sec)
1050.83 Requests/sec executed
Test execution summary:
total time: 47.5900s
total number of events: 50009
total time taken by event execution: 2934.5768
per-request statistics:
min: 0.0000s
avg: 0.0587s
max: 0.8938s
approx. 95 percentile: 0.1993s
Threads fairness:
events (avg/stddev): 390.6953/22.64
execution time (avg/stddev): 22.9264/1.17
Filesystem I/O throughput was improved.
Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Acked-by: Jan Kara <jack@suse.cz>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch fix a panic while running fsfuzzer.
We are improperly checking the return of ext4_orphan_get.
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
There are several cases where the running transaction can get buffers
added to its BJ_Metadata list which it never dirtied, which makes its
t_nr_buffers counter end up larger than its t_outstanding_credits
counter.
This will cause issues when starting new transactions as while we are
logging buffers we decrement t_outstanding_buffers, so when
t_outstanding_buffers goes negative, we will report that we need less
space in the journal than we actually need, so transactions will be
started even though there may not be enough room for them. In the worst
case scenario (which admittedly is almost impossible to reproduce) this
will result in the journal running out of space.
The fix is to only refile buffers from the committing transaction to the
running transactions BJ_Modified list when b_modified is set on that
journal, which is the only way to be sure if the running transaction has
modified that buffer.
This patch also fixes an accounting error in journal_forget, it is
possible that we can call journal_forget on a buffer without having
modified it, only gotten write access to it, so instead of freeing a
credit, we only do so if the buffer was modified. The assert will help
catch if this problem occurs. Without these two patches I could hit
this assert within minutes of running postmark, with them this issue no
longer arises.
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently at the start of a journal commit we loop through all of the buffers
on the committing transaction and clear the b_modified flag (the flag that is
set when a transaction modifies the buffer) under the j_list_lock.
The problem is that everywhere else this flag is modified only under the jbd2
lock buffer flag, so it will race with a running transaction who could
potentially set it, and have it unset by the committing transaction.
This is also a big waste, you can have several thousands of buffers that you
are clearing the modified flag on when you may not need to. This patch
removes this code and instead clears the b_modified flag upon entering
do_get_write_access/journal_get_create_access, so if that transaction does
indeed use the buffer then it will be accounted for properly, and if it does
not then we know we didn't use it.
That will be important for the next patch in this series. Tested thoroughly
by myself using postmark/iozone/bonnie++.
Cc: <linux-ext4@vger.kernel.org>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
As pointed out by Sergey Vlasov, UDF implements its own version of
the CRC ITU-T V.41. Convert it to use the one in the library.
Signed-off-by: Bob Copeland <me@bobcopeland.com>
Cc: Sergey Vlasov <vsu@altlinux.ru>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix two compilation warnings (and actual bugs in message formatting)
when UDF debugging is turned on.
Signed-off-by: Sebastian Manciulea <manciuleas@yahoo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Fix mapping of blocks using VAT when it is stored in an inode.
UDF_I(inode)->i_data already points to the beginning of VAT header so there's
no need to add udf_ext0_offset(inode).
Signed-off-by: Sebastian Manciulea <manciuleas@yahoo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
This patch implements parsing of metadata partitions and reading of Metadata
File thus allowing to read UDF 2.50 media. Error resilience is implemented
through accessing the Metadata Mirror File in case the data the Metadata File
cannot be read. The patch is based on the original patch by Sebastian Manciulea
<manciuleas@yahoo.com> and Mircea Fedoreanu <mirceaf_spl@yahoo.com>.
Signed-off-by: Sebastian Manciulea <manciuleas@yahoo.com>
Signed-off-by: Mircea Fedoreanu <mirceaf_spl@yahoo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
According to OSTA UDF specification, only anchor blocks and primary volume
descriptors are placed on media relative to the last session. All other block
numbers are absolute (in the partition or the whole media). This seems to be
confirmed by multisession media created by other systems.
Signed-off-by: Sebastian Manciulea <manciuleas@yahoo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
As we don't properly support writing to pseudooverwrite partition (we should
add entries to VAT and relocate blocks instead of just writing them), mount
filesystems with such partition as read-only.
Signed-off-by: Jan Kara <jack@suse.cz>
We didn't handle VAT packed inside the inode - we tried to call udf_block_map()
on such file which lead to strange results at best. Add proper handling of
packed VAT as we do it with other packed files.
Signed-off-by: Jan Kara <jack@suse.cz>
UDF media with VAT could have never worked because udf_fill_inode() didn't
know about case FILE_TYPE_VAT20. Fix this.
Signed-off-by: Jan Kara <jack@suse.cz>
We incorrectly (way to strictly) checked version of VAT on loading and thus
refuse to mount correct media. There are just two format versions - below 2.0
and above 2.0 and we understand both. So update the version check accordingly.
Signed-off-by: Jan Kara <jack@suse.cz>
Add <last block>+1 and <last block>-1 to a list of blocks which can be the
real last recorded block on a UDF media. Sebastian Manciulea
<manciuleas@yahoo.com> claims this helps some drive + media combinations
he is able to test.
Signed-off-by: Jan Kara <jack@suse.cz>
UDF anchor block detection is complicated by several things - there are several
places where the anchor point can be, some of them relative to the last
recorded block which some devices report wrongly. Moreover some devices on some
media seem to have 7 spare blocks sectors for every 32 blocks (at least as far
as I understand the old code) so we have to count also with that possibility.
This patch splits anchor block detection into several functions so that it is
clearer what we actually try to do. We fix several bugs of the type "for such
and such media, we fail to check block blah" as a result of the cleanup.
Signed-off-by: Jan Kara <jack@suse.cz>
This patch move processing of UDF virtual partitions close to the place
where other partition types are processed. As a result we now also
properly fill in partition access type.
Signed-off-by: Jan Kara <jack@suse.cz>
Report error when we fail to allocate memory for a bitmap and properly
release allocated memory and inodes for all the partitions in case of
mount failure and umount.
Signed-off-by: Jan Kara <jack@suse.cz>
Cleanup processing of volume descriptor sequence so that it is more readable,
make code handle errors (e.g. media problems) better.
Signed-off-by: Jan Kara <jack@suse.cz>
According to ECMA 167 rev. 3 (see 3/8.4.2.1), Anchor Volume Descriptor
Pointer should be recorded at two or more anchor points located at sectors
256, N, N - 256, where N - is a largest logical sector number at volume
space.
So we should always try to detect N on UDF volume before trying to find
Anchor Volume Descriptor (i.e. calling to udf_find_anchor()).
That said, all this patch does is updates the s_last_block even if the
udf_vrs() returns positive value.
Originally written and tested by Yuri Per, ported on latest mainline by me.
Signed-off-by: Yuri Per <Yuri.Per@acronis.com>
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Max Lyadvinsky <Max.Lyadvinsky@acronis.com>
Cc: Vladimir Simonov <Vladimir.Simonov@acronis.com>
Cc: Andrew Neporada <Andrew.Neporada@acronis.com>
Cc: Kirill Korotaev <dev@openvz.org>
Signed-off-by: Jan Kara <jack@suse.cz>
There are several places in UDF where we declared temporary arrays of
UDF_NAME_LEN bytes on stack. This is not nice to stack usage so this patch
changes those places to use kmalloc() instead. Also clean up bail-out paths
in those functions when we are changing them.
Signed-off-by: Jan Kara <jack@suse.cz>
We don't have to check whether a directory entry already exists in a directory
when creating a new one since we've already checked that earlier by lookup and
we are holding directory i_mutex all the time.
Signed-off-by: Jan Kara <jack@suse.cz>
reorganize few code blocks in super.c which
were needlessly indented (and hard to read):
so change from:
rettype fun()
{
init;
if (sth) {
long block of code;
}
}
to:
rettype fun()
{
init;
if (!sth)
return;
long block of code;
}
or
from:
rettype fun2()
{
init;
while (sth) {
init2();
if (sth2) {
long block of code;
}
}
}
to:
rettype fun2()
{
init;
while (sth) {
init2();
if (!sth2)
continue;
long block of code;
}
}
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
remove now unneeded kernel_timestamp type with conversion functions
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
* kernel_timestamp type was almost unused - only callers of udf_stamp_to_time
and udf_time_to_stamp used it, so let these functions handle endianness
internally and don't clutter code with conversions
* rename udf_stamp_to_time to udf_disk_stamp_to_time
and udf_time_to_stamp to udf_time_to_disk_stamp
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
block cannot be less than 0, because it's sector_t,
so remove unneeded checks
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
- translate udf_file_entry_alloc_offset macro into function
- translate udf_ext0_offset macro into function
- add comment about crypticly named fields in struct udf_inode_info
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
- move all brelse(ibh) after main if, because it's called
on every path except one where ibh is null
- move variables to the most inner blocks
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
replace all:
little_endian_variable = cpu_to_leX(leX_to_cpu(little_endian_variable) +
expression_in_cpu_byteorder);
with:
leX_add_cpu(&little_endian_variable, expression_in_cpu_byteorder);
sparse didn't generate any new warning with this patch
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
- remove one indentation level by little code reorganization
- convert "if (smth) BUG();" to "BUG_ON(smth);"
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
- constify internal crc table
- mark udf_crc "in" parameter as const
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
- fix error handling - always zero output variable
- don't zero explicitely fields zeroed by memset
- mark "in" paramater as const
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
udf_build_ustr was broken:
- size == 1:
dest->u_len = ptr[1 - 1], but at ptr[0] there's cmpID,
so we created string with wrong length
it should not happen, so we BUG() it
- size > 1 and size < UDF_NAME_LEN:
we set u_len correctly, but memcpy copied one needless byte
- size == UDF_NAME_LEN - 1:
memcpy overwrited u_len - with correct value, but...
- size >= UDF_NAME_LEN:
we copied UDF_NAME_LEN - 1 bytes, but dest->u_name is array
of UDF_NAME_LEN - 2 bytes, so we were overwriting u_len with
character from input string
nobody noticed because all callers set size
to acceptable values (constants within range)
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
- fix error handling - always zero output variable
- don't zero explicitely fields zeroed by memset
- mark "in" paramater as const
- remove outdated comment
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
The kernel.h macro DIV_ROUND_UP performs the computation (((n) + (d) - 1) /
(d)) but is perhaps more readable.
An extract of the semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)
// <smpl>
@haskernel@
@@
#include <linux/kernel.h>
@depends on haskernel@
expression n,d;
@@
(
- (n + d - 1) / d
+ DIV_ROUND_UP(n,d)
|
- (n + (d - 1)) / d
+ DIV_ROUND_UP(n,d)
)
@depends on haskernel@
expression n,d;
@@
- DIV_ROUND_UP((n),d)
+ DIV_ROUND_UP(n,d)
@depends on haskernel@
expression n,d;
@@
- DIV_ROUND_UP(n,(d))
+ DIV_ROUND_UP(n,d)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Jan Kara <jack@suse.cz>
There's really no reason to keep udf headers in include/linux as they're
not used by anything but fs/udf/.
This patch merges most of include/linux/udf_fs_i.h into fs/udf/udf_i.h,
include/linux/udf_fs_sb.h into fs/udf/udf_sb.h and
include/linux/udf_fs.h into fs/udf/udfdecl.h.
The only thing remaining in include/linux/ is a stub of udf_fs_i.h
defining the four user-visible udf ioctls. It's also moved from
unifdef-y to headers-y because it can be included unconditionally now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
There's not need to document vfs method invocation rules, we have
Documentation/filesystems/vfs.txt and Documentation/filesystems/Locking
for that. Also a lot of these comments where either plain wrong or
horrible out of date.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
This helper has been quite useless since sb_min_blocksize was introduced
and is misnamed while we're at it. Just opencode the few lines in the
caller instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Describe debug parameters with their names (and not their values).
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch makes the needlessly global cifs_dfs_automount_list static.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
mb_cache_entry_alloc() was allocating cache entries with GFP_KERNEL. But
filesystems are calling this function while holding xattr_sem so possible
recursion into the fs violates locking ordering of xattr_sem and transaction
start / i_mutex for ext2-4. Change mb_cache_entry_alloc() so that filesystems
can specify desired gfp mask and use GFP_NOFS from all of them.
Signed-off-by: Jan Kara <jack@suse.cz>
Reported-by: Dave Jones <davej@redhat.com>
Cc: <linux-ext4@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a tcon is being freed in call tconInfoFree, clean up any entries that may
exist in global oplock queue as the tcon structure hanging off of those entries
will be invalid and can cause oops while accesing any elements in the
tcon structure.
Signed-off-by: Shirish Pargaonkar <shirishp@us.ibm.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
This fixes a regression introduced in commit
205c109a7a when switching to
write_begin/write_end operations in JFFS2.
The page offset is miscalculated, leading to corruption of the fragment
lists and subsequently to memory corruption and panics.
[ Side note: the bug is a fairly direct result of the naming. Nick was
likely misled by the use of "offs", since we tend to use the notion of
"offset" not as an absolute position, but as an offset _within_ a page
or allocation.
Alternatively, a "pgoff_t" is a page index, but not a byte offset -
our VM naming can be a bit confusing.
So in this case, a VM person would likely have called this a "pos",
not an "offs", or perhaps talked about byte offsets rather than page
offsets (since it's counted in bytes, not pages). - Linus ]
Signed-off-by: Alexey Korolev <akorolev@infradead.org>
Signed-off-by: Vasiliy Leonenko <vasiliy.leonenko@mail.ru>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miklos Szeredi found the bug:
"Basically what happens is that on the server nlm_fopen() calls
nfsd_open() which returns -EACCES, to which nlm_fopen() returns
NLM_LCK_DENIED.
"On the client this will turn into a -EAGAIN (nlm_stat_to_errno()),
which in will cause fcntl_setlk() to retry forever."
So, for example, opening a file on an nfs filesystem, changing
permissions to forbid further access, then trying to lock the file,
could result in an infinite loop.
And Trond Myklebust identified the culprit, from Marc Eshel and I:
7723ec9777 "locks: factor out
generic/filesystem switch from setlock code"
That commit claimed to just be reshuffling code, but actually introduced
a behavioral change by calling the lock method repeatedly as long as it
returned -EAGAIN.
We assumed this would be safe, since we assumed a lock of type SETLKW
would only return with either success or an error other than -EAGAIN.
However, nfs does can in fact return -EAGAIN in this situation, and
independently of whether that behavior is correct or not, we don't
actually need this change, and it seems far safer not to depend on such
assumptions about the filesystem's ->lock method.
Therefore, revert the problematic part of the original commit. This
leaves vfs_lock_file() and its other callers unchanged, while returning
fcntl_setlk and fcntl_setlk64 to their former behavior.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Tested-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Marc Eshel <eshel@almaden.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'docs' of git://git.lwn.net/linux-2.6:
Add additional examples in Documentation/spinlocks.txt
Move sched-rt-group.txt to scheduler/
Documentation: move rpc-cache.txt to filesystems/
Documentation: move nfsroot.txt to filesystems/
Spell out behavior of atomic_dec_and_lock() in kerneldoc
Fix a typo in highres.txt
Fixes to the seq_file document
Fill out information on patch tags in SubmittingPatches
Add the seq_file documentation
Documentation/ is a little large, and filesystems/ seems an obvious
place for this file.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Michael Kerrisk found out that signalfd was not reporting back user data
pushed using sigqueue:
http://groups.google.com/group/linux.kernel/msg/9397cab8551e3123
The following patch makes signalfd report back the ssi_ptr and ssi_int members
of the signalfd_siginfo structure.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Acked-by: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jeff Roberson discovered a race when using kaio eventfd based notifications.
When it occurs it can lead tomissed wakeups and hung userspace.
This patch fixes the race by moving the notification inside the spinlocked
section of kaio. The operation is safe since eventfd spinlock and kaio one
are unrelated.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Jeff Roberson <jroberson@chesapeake.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use asmlinkage_protect in sys_io_getevents, because GCC for i386 with
CONFIG_FRAME_POINTER=n can decide to clobber an argument word on the
stack, i.e. the user struct pt_regs. Here the problem is not a tail
call, but just the compiler's use of the stack when it inlines and
optimizes the body of the called function. This seems to avoid it.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The prevent_tail_call() macro works around the problem of the compiler
clobbering argument words on the stack, which for asmlinkage functions
is the caller's (user's) struct pt_regs. The tail/sibling-call
optimization is not the only way that the compiler can decide to use
stack argument words as scratch space, which we have to prevent.
Other optimizations can do it too.
Until we have new compiler support to make "asmlinkage" binding on the
compiler's own use of the stack argument frame, we have work around all
the manifestations of this issue that crop up.
More cases seem to be prevented by also keeping the incoming argument
variables live at the end of the function. This makes their original
stack slots attractive places to leave those variables, so the compiler
tends not clobber them for something else. It's still no guarantee, but
it handles some observed cases that prevent_tail_call() did not.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://oss.sgi.com:8090/xfs/xfs-2.6:
[XFS] Ensure "both" features2 slots are consistent
[XFS] Fix superblock features2 field alignment problem
[XFS] remove shouting-indirection macros from xfs_sb.h
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
cfq-iosched: do not leak ioc_data across iosched switches
splice: fix infinite loop in generic_file_splice_read()
Some time ago while attempting to handle invalid link counts, I botched
the unlink of links itself, so this patch fixes this now correctly, so
that only the link count of nodes that don't point to links is ignored.
Thanks to Vlado Plaga <rechner@vlado-do.de> to notify me of this
problem.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There are several places where GFP_KERNEL allocations happen under a glock,
which will result in hangs if we're under memory pressure and go to re-enter the
fs in order to flush stuff out. This patch changes the culprits to GFS_NOFS to
keep this problem from happening. Thank you,
Signed-off-by: Josef Bacik <jbacik@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since older kernels may look in the sb_bad_features2 slot for flags,
rather than zeroing it out on fixup, we should make it equal to the
sb_features2 value.
Also, if the ATTR2 flag was not found prior to features2 fixup, it was not
set in the mount flags, so re-check after the fixup so that the current
session will use the feature.
Also fix up the comments to reflect these changes.
SGI-PV: 980085
SGI-Modid: xfs-linux-melb:xfs-kern:30778a
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>
Due to the xfs_dsb_t structure not being 64 bit aligned, the last field of
the on-disk superblock can vary in location This causes problems when the
filesystem gets moved to a different platform, or there is a 32 bit
userspace and 64 bit kernel.
This patch detects the defect at mount time, logs a warning such as:
XFS: correcting sb_features alignment problem
in dmesg and corrects the problem so that everything is OK. it also
blacklists the bad field in the superblock so it does not get used for
something else later on.
SGI-PV: 977636
SGI-Modid: xfs-linux-melb:xfs-kern:30539a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Eric Sandeen <sandeen@sandeen.net>
Signed-off-by: Lachlan McIlroy <lachlan@sgi.com>