There are three cases found that in error cases, journal transactions are not
committed nor aborted. We should take care of these case by committing the
transactions. Otherwise, there would left a journal handle which will lead to
, in same process context, the comming ocfs2_start_trans() gets wrong credits.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
When we deleting a direntry from a directory, if it's the first in a block we
invalid it by setting inode to 0; otherwise, we merge the deleted one to the
prior and contiguous direntry. And we don't truncate directories.
There is a problem for the later case since inode is not set to 0.
This problem happens when the caller passes a file position as parameter to
ocfs2_dir_foreach_blk(). If the position happens to point to a stale(not
the first, deleted in betweens of ocfs2_dir_foreach_blk()s) direntry, we are
not able to recognize its staleness. So that we treat it as a live one wrongly.
The fix is to set inode to 0 in both cases indicating the direntry is stale.
This won't introduce additional IOs.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Memory allocated using kmem_cache_zalloc should be freed using
kmem_cache_free, not kfree.
The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression x,e,e1,e2;
@@
x = kmem_cache_zalloc(e1,e2)
... when != x = e
?-kfree(x)
+kmem_cache_free(e1,x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
When someone writes to an inode, readers accessing the same inode via
ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
do_generic_file_read() looks up the page again and retries ->readpage()
when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
readers, they can occupy all CPUs and in non-preempt kernel the system is
deadlocked because writer holding ip_alloc_sem is never run to release the
semaphore. Fix the problem by making reader block on ip_alloc_sem to break
the busy loop.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
Fix a corruption that can happen when we have (two or more) outstanding
aio's to an overlapping unaligned region. Ext4
(e9e3bcecf4) and xfs recently had to fix
similar issues.
In our case what happens is that we can have an outstanding aio on a region
and if a write comes in with some bytes overlapping the original aio we may
decide to read that region into a page before continuing (typically because
of buffered-io fallback). Since we have no ordering guarantees with the
aio, we can read stale or bad data into the page and then write it back out.
If the i/o is page and block aligned, then we avoid this issue as there
won't be any need to read data from disk.
I took the same approach as Eric in the ext4 patch and introduced some
serialization of unaligned async direct i/o. I don't expect this to have an
effect on the most common cases of AIO. Unaligned aio will be slower
though, but that's far more acceptable than data corruption.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <jlbec@evilplan.org>
ocfs2 implements its own llseek() to provide the SEEK_HOLE/SEEK_DATA
functionality.
SEEK_HOLE sets the file pointer to the start of either a hole or an unwritten
(preallocated) extent, that is greater than or equal to the supplied offset.
SEEK_DATA sets the file pointer to the start of an allocated extent (not
unwritten) that is greater than or equal to the supplied offset.
If the supplied offset is on a desired region, then the file pointer is set
to it. Offsets greater than or equal to the file size return -ENXIO.
Unwritten (preallocated) extents are considered holes because the file system
treats reads to such regions in the same way as it does to holes.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
This patch address two shortcomings in ocfs2_page_mkwrite():
1. Makes the function return better VM_FAULT_* errors.
2. It handles a error that is triggered when a page is dropped from the mapping
due to memory pressure. This patch locks the page to prevent that.
[Patch was cleaned up by Sunil Mushran.]
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
The cluster up check only checks to see if the node is heartbeating or not.
If yes it continues assuming that the node is connected to all the nodes. But
if that is not the case, the cluster join aborts with a stack of errors that
are not easy to comprehend.
This patch adds the network connect check upfront and prints the nodes that
the node is not yet connected to, before aborting.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Patch adds function o2net_fill_node_map() to return the bitmap of nodes that
it is connected to. This bitmap is also accessible by the user via the debugfs
file, /sys/kernel/debug/o2net/connected_nodes.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
The o2hb debugfs file, elapsed_time_in_ms, should return values only after the
timer is armed atleast once.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
In dlmlock_remote(), we wait for the resource to stop being active before
setting the inprogress flag. Active includes recovery, migration, etc.
The problem here is that if the resource was being recovered or migrated, the
new owner could very well be that node itself (and thus not a remote node).
This problem was observed in Oracle bug#12583620. The error messages observed
were as follows:
dlm_send_remote_lock_request:337 ERROR: Error -40 (ELOOP) when sending message 503 (key 0xd6d8c7) to node 2
dlmlock_remote:271 ERROR: dlm status = DLM_BADARGS
dlmlock:751 ERROR: dlm status = DLM_BADARGS
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
The inflight reference count, in the lock resource, is taken to pin the resource
in memory. We take it when a new resource is created and release it after a
lock is attached to it. We do this to prevent the resource from getting purged
prematurely.
Earlier this reference count was being taken for locally mastered resources
only. This patch extends the same functionality for remotely mastered ones.
We are doing this because the same premature purging could occur for remotely
mastered resources if the remote node were to die before completion of the
create lock.
Fix for Oracle bug#12405575.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
Currently if the heartbeat device is hard-ro, the o2hb thread keeps chugging
along and dumping errors along the way. The user needs to manually stop the
heartbeat.
The patch addresses this shortcoming by adding a limit to the number of times
the hb thread will iterate in an unsteady state. If the hb thread does not
ready steady state in that many interation, the start is aborted.
Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
unlinkat - Remove a directory entry
size[4] Tunlinkat tag[2] dirfid[4] name[s] flag[4]
size[4] Runlinkat tag[2]
older Tremove have the below request format
size[4] Tremove tag[2] fid[4]
The remove message is used to remove a directory entry either file or directory
The remove opreation is actually a directory opertation and should ideally have
dirfid, if not we cannot represent the fid on server with anything other than
name. We will have to derive the directory name from fid in the Tremove request.
NOTE: The operation doesn't clunk the unlink fid.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
renameat - change name of file or directory
size[4] Trenameat tag[2] olddirfid[4] oldname[s] newdirfid[4] newname[s]
size[4] Rrenameat tag[2]
older Trename have the below request format
size[4] Trename tag[2] fid[4] newdirfid[4] name[s]
The rename message is used to change the name of a file, possibly moving it
to a new directory. The rename opreation is actually a directory opertation
and should ideally have olddirfid, if not we cannot represent the fid on server
with anything other than name. We will have to derive the old directory name
from fid in the Trename request.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This make sure we don't end up reusing the unlinked inode object.
The ideal way is to use inode i_generation. But i_generation is
not available in userspace always.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Without this fix, if any invalid mount options/args are passed while mouting
the 9p fs, no error (-EINVAL) is returned and default arg value is assigned.
This fix returns -EINVAL when an invalid arguement is found while parsing
mount options.
Signed-off-by: Prem Karat <prem.karat@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
This make sure we don't use wrong inode from the inode hash. The inode number
of the file deleted is reused by the next file system object created
and if we only use inode number for inode hash lookup we could end up
with wrong struct inode.
Also compare inode generation number. Not all Linux file system provide
st_gen in userspace. So it could be 0;
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
Now that VFS does the right thing remove the work around.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (107 commits)
vfs: use ERR_CAST for err-ptr tossing in lookup_instantiate_filp
isofs: Remove global fs lock
jffs2: fix IN_DELETE_SELF on overwriting rename() killing a directory
fix IN_DELETE_SELF on overwriting rename() on ramfs et.al.
mm/truncate.c: fix build for CONFIG_BLOCK not enabled
fs:update the NOTE of the file_operations structure
Remove dead code in dget_parent()
AFS: Fix silly characters in a comment
switch d_add_ci() to d_splice_alias() in "found negative" case as well
simplify gfs2_lookup()
jfs_lookup(): don't bother with . or ..
get rid of useless dget_parent() in btrfs rename() and link()
get rid of useless dget_parent() in fs/btrfs/ioctl.c
fs: push i_mutex and filemap_write_and_wait down into ->fsync() handlers
drivers: fix up various ->llseek() implementations
fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek
Ext4: handle SEEK_HOLE/SEEK_DATA generically
Btrfs: implement our own ->llseek
fs: add SEEK_HOLE and SEEK_DATA flags
reiserfs: make reiserfs default to barrier=flush
...
Fix up trivial conflicts in fs/xfs/linux-2.6/xfs_super.c due to the new
shrinker callout for the inode cache, that clashed with the xfs code to
start the periodic workers later.
sbi->s_mutex isn't needed for isofs at all so we can just remove it. Generally,
since isofs is always mounted read-only, filesystem structure cannot change
under us. So buffer_head contents stays constant after it's filled in. That
leaves us with possible changes of global data structures. Superblock changes
only during filesystem mount (even remount does not change it), inodes are only
filled in during reading from disk. So there are no changes of these structures
to bother about.
Arguments why sbi->s_mutex can be removed at each place:
isofs_readdir: Accesses sb, inode, filp, local variables => s_mutex not needed
isofs_lookup: Protected by directory's i_mutex. Accesses sb, inode, dentry,
local variables => s_mutex not needed
rock_ridge_symlink_readpage: Protected by page lock. Accesses sb, inode,
local variables => s_mutex not needed.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We don't generate IN_DELETE_SELF on victim of overwriting rename() if
it happens to be a directory. Trivially fixed by doing to ->i_nlink
what we do ->pino_nlink a couple of lines later in jffs2_rename().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
On ramfs and other simple_rename() users IN_DELETE_SELF is not generated
for victim of overwriting rename() if it's is a directory. Works on
most of the local filesystems and really trivial to fix...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (39 commits)
ptrace: do_wait(traced_leader_killed_by_mt_exec) can block forever
ptrace: fix ptrace_signal() && STOP_DEQUEUED interaction
connector: add an event for monitoring process tracers
ptrace: dont send SIGSTOP on auto-attach if PT_SEIZED
ptrace: mv send-SIGSTOP from do_fork() to ptrace_init_task()
ptrace_init_task: initialize child->jobctl explicitly
has_stopped_jobs: s/task_is_stopped/SIGNAL_STOP_STOPPED/
ptrace: make former thread ID available via PTRACE_GETEVENTMSG after PTRACE_EVENT_EXEC stop
ptrace: wait_consider_task: s/same_thread_group/ptrace_reparented/
ptrace: kill real_parent_is_ptracer() in in favor of ptrace_reparented()
ptrace: ptrace_reparented() should check same_thread_group()
redefine thread_group_leader() as exit_signal >= 0
do not change dead_task->exit_signal
kill task_detached()
reparent_leader: check EXIT_DEAD instead of task_detached()
make do_notify_parent() __must_check, update the callers
__ptrace_detach: avoid task_detached(), check do_notify_parent()
kill tracehook_notify_death()
make do_notify_parent() return bool
ptrace: s/tracehook_tracer_task()/ptrace_parent()/
...
* 'for-linus' of git://oss.sgi.com/xfs/xfs: (49 commits)
xfs: add size update tracepoint to IO completion
xfs: convert AIL cursors to use struct list_head
xfs: remove confusing ail cursor wrapper
xfs: use a cursor for bulk AIL insertion
xfs: failure mapping nfs fh to inode should return ESTALE
xfs: Remove the second parameter to xfs_sb_count()
xfs: remove the dead XFS_DABUF_DEBUG code
xfs: remove leftovers of the old btree tracing code
xfs: remove the dead QUOTADEBUG code
xfs: remove the unused xfs_buf_delwri_sort function
xfs: remove wrappers around b_iodone
xfs: remove wrappers around b_fspriv
xfs: add a proper transaction pointer to struct xfs_buf
xfs: factor out xfs_da_grow_inode_int
xfs: factor out xfs_dir2_leaf_find_stale
xfs: cleanup struct xfs_dir2_free
xfs: reshuffle dir2 headers
xfs: start periodic workers later
Revert "xfs: fix filesystsem freeze race in xfs_trans_alloc"
xfs: remove variables that serve no purpose in xfs_alloc_ag_vextent_exact()
...
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm:
dlm: don't limit active work items
dlm: use workqueue for callbacks
dlm: remove deadlock debug print
dlm: improve rsb searches
dlm: keep lkbs in idr
dlm: fix kmalloc args
dlm: don't do pointless NULL check, use kzalloc and fix order of arguments
dlm: dump address of unknown node
dlm: use vmalloc for hash tables
dlm: show addresses in configfs
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus:
hfsplus: ensure bio requests are not smaller than the hardware sectors
hfsplus: Add additional range check to handle on-disk corruptions
hfsplus: Add error propagation for hfsplus_ext_write_extent_locked
hfsplus: add error checking for hfs_find_init()
hfsplus: lift the 2TB size limit
hfsplus: fix overflow in hfsplus_read_wrapper
hfsplus: fix overflow in hfsplus_get_block
hfsplus: assignments inside `if' condition clean-up
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
GFS2: combine duplicated block freeing routines
GFS2: Add S_NOSEC support
GFS2: Automatically adjust glock min hold time
GFS2: Cache dir hash table in a contiguous buffer
* 'linux-next' of git://git.infradead.org/ubifs-2.6: (32 commits)
MAINTAINERS: change e-mail of Adrian Hunter
UBIFS: fix master node recovery
UBIFS: improve power cut emulation testing
UBIFS: rename recovery testing variables
UBIFS: remove custom list of superblocks
UBIFS: stop re-defining UBI operations
UBIFS: switch to I/O helpers
UBIFS: switch to ubifs_leb_write
UBIFS: switch to ubifs_leb_read
UBIFS: introduce more I/O helpers
UBIFS: always print stacktrace when switching to R/O mode
UBIFS: remove unused and unneeded debugging function
UBIFS: add global debugfs knobs
UBIFS: introduce debugfs helpers
UBIFS: re-arrange debugging code a bit
UBIFS: be more informative in failure mode
UBIFS: switch self-check knobs to debugfs
UBIFS: lessen amount of debugging check types
UBIFS: introduce helper functions for debugging checks and tests
UBIFS: amend debugging inode size check function prototype
...
Currently all bio requests are 512 bytes, which may fail for media
whose physical sector size is larger than this. Ensure these
requests are not smaller than the block device logical block size.
BugLink: http://bugs.launchpad.net/bugs/734883
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
'recoff' is read from disk and used for an argument to memcpy, so if
the value read from disk is larger than the page size, it result to
"general protection fault". This patch add additional range check for
the value, so that disk fuzz won't cause such fault.
Signed-off-by: Naohiro Aota <naota@elisp.net>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Test-case:
void *tfunc(void *arg)
{
execvp("true", NULL);
return NULL;
}
int main(void)
{
int pid;
if (fork()) {
pthread_t t;
kill(getpid(), SIGSTOP);
pthread_create(&t, NULL, tfunc, NULL);
for (;;)
pause();
}
pid = getppid();
assert(ptrace(PTRACE_ATTACH, pid, 0,0) == 0);
while (wait(NULL) > 0)
ptrace(PTRACE_CONT, pid, 0,0);
return 0;
}
It is racy, exit_notify() does __wake_up_parent() too. But in the
likely case it triggers the problem: de_thread() does release_task()
and the old leader goes away without the notification, the tracer
sleeps in do_wait() without children/tracees.
Change de_thread() to do __wake_up_parent(traced_leader->parent).
Since it is already EXIT_DEAD we can do this without ptrace_unlink(),
EXIT_DEAD threads do not exist from do_wait's pov.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
It seems to hurt performance in real life. Yes, the inode will be used
later, but the conditional doesn't seem to predict all that well
(negative dentries are not uncommon) and it looks like the cost of
prefetching is simply higher than depending on the cache doing the right
thing.
As usual.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The compiler, at least for ix86 and m68k, validly warns that the
comparison:
next <= (loff_t)-1
is always true (and it's always true also for x86-64 and probably all
other arches - as long as pgoff_t isn't wider than loff_t). The
intention appears to be to avoid wrapping of "next", so rather than
eliminating the pointless comparison, fix the loop to indeed get exited
when "next" would otherwise wrap.
On m68k the following warning is observed:
fs/fscache/page.c: In function '__fscache_uncache_all_inode_pages':
fs/fscache/page.c:979: warning: comparison is always false due to limited range of data type
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reported-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Suresh Jayaraman <sjayaraman@suse.de>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix silly characters in a comment in AFS code (some weird characters replaced
the word 'flag' some point way back).
Reported-by: viro@ZenIV.linux.org.uk
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>