We should drop the ->s_umount mutex if an error occurs after the
sget()/grab_super() call. This was introduced when adding support
for multiple instances of devpts and noticed during a code review/reorg.
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The last user of do_pipe is in arch/alpha/, after replacing it with
do_pipe_flags, the do_pipe can be totally dropped.
Signed-off-by: Cheng Renquan <crquan@gmail.com>
Acked-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Copy symlink data into the union member it is accessed through. Although
this shouldn't make a difference to behaviour it makes the code easier
to follow and grep through. It may also prevent problems if the
struct/union definitions change in the future.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Ensure fast symlink targets are NUL-terminated, even if corrupted
on-disk.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ufs2 fast symlinks can be twice as long as ufs ones, however the code
was using the ufs size in various places. Fix that so ufs2 symlinks over
60 characters aren't truncated.
Note that we copy the entire area instead of using the maxsymlinklen field
from the superblock. This way we will be more robust against corruption (of
the superblock).
While we are at it, use memcpy instead of open-coding it with for loops.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The maximum fast symlink size is set in the superblock of certain types
of UFS filesystem. Before using it we need to check that it isn't longer
than the available space we have in the inode.
Signed-off-by: Duane Griffin <duaneg@dghda.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a switch for the various i_mode fmt cases, and remove the comment
about writeability of devices nodes - that part is handled in
inode_permission and comment on (briefly) there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure that comments describe what's going on and not how, and always
use __d_instantiate instead of two separate branches, one with
d_instantiate and one with __d_instantiate.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Due to a different size of ino_t ustat needs a compat handler, but
currently only x86 and mips provide one. Add a generic compat_sys_ustat
and switch all architectures over to it. Instead of doing various
user copy hacks compat_sys_ustat just reimplements sys_ustat as
it's trivial. This was suggested by Arnd Bergmann.
Found by Eric Sandeen when running xfstests/017 on ppc64, which causes
stack smashing warnings on RHEL/Fedora due to the too large amount of
data writen by the syscall.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In two error cases affs_remove_link doesn't call affs_unlock_dir to
release the i_hash_lock semaphore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'bkl-removal' of git://git.lwn.net/linux-2.6:
Rationalize fasync return values
Move FASYNC bit handling to f_op->fasync()
Use f_lock to protect f_flags
Rename struct file->f_ep_lock
* 'for-linus' of git://git390.marist.edu/pub/scm/linux-2.6: (81 commits)
[S390] remove duplicated #includes
[S390] cpumask: use mm_cpumask() wrapper
[S390] cpumask: Use accessors code.
[S390] cpumask: prepare for iterators to only go to nr_cpu_ids/nr_cpumask_bits.
[S390] cpumask: remove cpu_coregroup_map
[S390] fix clock comparator save area usage
[S390] Add hwcap flag for the etf3 enhancement facility
[S390] Ensure that ipl panic notifier is called late.
[S390] fix dfp elf hwcap/facility bit detection
[S390] smp: perform initial cpu reset before starting a cpu
[S390] smp: fix memory leak on __cpu_up
[S390] ipl: Improve checking logic and remove switch defaults.
[S390] s390dbf: Remove needless check for NULL pointer.
[S390] s390dbf: Remove redundant initilizations.
[S390] use kzfree()
[S390] BUG to BUG_ON changes
[S390] zfcpdump: Prevent zcore from beeing built as a kernel module.
[S390] Use csum_partial in checksum.h
[S390] cleanup lowcore.h
[S390] eliminate ipl_device from lowcore
...
* 'for-2.6.30' of git://git.kernel.dk/linux-2.6-block:
Get rid of pdflush_operation() in emergency sync and remount
btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty
Move the default_backing_dev_info out of readahead.c and into backing-dev.c
block: Repeated lines in switching-sched.txt
bsg: Remove bogus check against request_queue->max_sectors
block: WARN in __blk_put_request() for potential bio leak
loop: fix circular locking in loop_clr_fd()
loop: support barrier writes
bsg: add support for tail queuing
cpqarray: enable bus mastering
block: genhd.h cleanup patch
block: add private bio_set for bio integrity allocations
block: genhd.h comment needs updating
block: get rid of unused blkdev_free_rq() define
block: remove various blk_queue_*() setting functions in blk_init_queue_node()
cciss: add BUILD_BUG_ON() for catching bad CommandList_struct alignment
block: don't create bio_vec slabs of less than the inline number
block: cleanup bio_alloc_bioset()
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1750 commits)
ixgbe: Allow Priority Flow Control settings to survive a device reset
net: core: remove unneeded include in net/core/utils.c.
e1000e: update version number
e1000e: fix close interrupt race
e1000e: fix loss of multicast packets
e1000e: commonize tx cleanup routine to match e1000 & igb
netfilter: fix nf_logger name in ebt_ulog.
netfilter: fix warning in ebt_ulog init function.
netfilter: fix warning about invalid const usage
e1000: fix close race with interrupt
e1000: cleanup clean_tx_irq routine so that it completely cleans ring
e1000: fix tx hang detect logic and address dma mapping issues
bridge: bad error handling when adding invalid ether address
bonding: select current active slave when enslaving device for mode tlb and alb
gianfar: reallocate skb when headroom is not enough for fcb
Bump release date to 25Mar2009 and version to 0.22
r6040: Fix second PHY address
qeth: fix wait_event_timeout handling
qeth: check for completion of a running recovery
qeth: unregister MAC addresses during recovery.
...
Manually fixed up conflicts in:
drivers/infiniband/hw/cxgb3/cxio_hal.h
drivers/infiniband/hw/nes/nes_nic.c
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
[CIFS] Fix memory overwrite when saving nativeFileSystem field during mount
[CIFS] Rename compose_mount_options to cifs_compose_mount_options.
[CIFS] work around bug in Samba server handling for posix open
[CIFS] Use posix open on file open when server supports it
cifs: fix buffer format byte on NT Rename/hardlink
[CIFS] Add definitions for remoteably fsctl calls
[CIFS] add extra null attr check
[CIFS] fix build error
[CIFS] reopen file via newer posix open protocol operation if available
[CIFS] Add new nostrictsync cifs mount option to avoid slow SMB flush
[CIFS] DFS no longer experimental
[CIFS] Send SMB flush in cifs_fsync
We don't have to start a transaction in writepage() when all the blocks
are a properly allocated. Even in ordered mode either the data has been
written via write() and they are thus already added to transaction's list
or the data was written via mmap and then it's random in which transaction
they get written anyway.
This should help VM to pageout dirty memory without blocking on transaction
commits.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw:
GFS2: Fix freeze issue
Fix a minor bug in the previous patch
GFS2: Clean up of glops.c
GFS2: Fix locking bug in failed shared to exclusive conversion
GFS2: Pagecache usage optimization on GFS2
GFS2: fix sparse warning: Should it be static?
GFS2: fix sparse warnings: constant is so big it is ...
GFS2: Support quota/noquota mount arguments
GFS2: Fix alignment issue and tidy gfs2_bitfit
GFS2: Add a "demote a glock" interface to sysfs
GFS2: Expose UUID via sysfs/uevent
GFS2: Support generation of discard requests
GFS2: Fix deadlock on journal flush
GFS2: Fix error path ref counting for root inode
GFS2: Remove unused field from glock
GFS2: Merge lock_dlm module into GFS2
GFS2: Remove "double" locking in quota
GFS2: change gfs2_quota_scan into a shrinker
GFS2: Bring back lvb-related stuff to lock_nolock to support quotas
GFS2: Fix remount argument parsing
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits)
SELinux: inode_doinit_with_dentry drop no dentry printk
SELinux: new permission between tty audit and audit socket
SELinux: open perm for sock files
smack: fixes for unlabeled host support
keys: make procfiles per-user-namespace
keys: skip keys from another user namespace
keys: consider user namespace in key_permission
keys: distinguish per-uid keys in different namespaces
integrity: ima iint radix_tree_lookup locking fix
TOMOYO: Do not call tomoyo_realpath_init unless registered.
integrity: ima scatterlist bug fix
smack: fix lots of kernel-doc notation
TOMOYO: Don't create securityfs entries unless registered.
TOMOYO: Fix exception policy read failure.
SELinux: convert the avc cache hash list to an hlist
SELinux: code readability with avc_cache
SELinux: remove unused av.decided field
SELinux: more careful use of avd in avc_has_perm_noaudit
SELinux: remove the unused ae.used
SELinux: check seqno when updating an avc_node
...
Change the default behaviour of the kernel to use relatime for all
filesystems. This can be overridden with the "strictatime" mount
option.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add support for explicitly requesting full atime updates. This makes it
possible for kernels to default to relatime but still allow userspace to
override it.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow atime to be updated once per day even with relatime. This lets
utilities like tmpreaper (which delete files based on last access time)
continue working, making relatime a plausible default for distributions.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Valerie Aurora Henson <vaurora@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The dasd device driver will now support ECKD devices with more then
65520 cylinders.
In the traditional ECKD adressing scheme each track is addressed
by a 16-bit cylinder and 16-bit head number. The new addressing
scheme makes use of the fact that the actual number of heads is
never larger then 15, so 12 bits of the head number can be redefined
to be part of the cylinder address.
Signed-off-by: Stefan Weinhuber <wein@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Opencode a cheasy approach with kevent. The idea here is that we'll
add some generic delayed work infrastructure, which probably wont be
based on pdflush (or maybe it will, in which case we can just add it
back).
This is in preparation for getting rid of pdflush completely.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
ext2_quota_read() doesn't initialize tmp_bh.b_size before calling
ext2_get_block() where we access it. Since it is a local variable it
might contain some garbage. Make sure it is filled with reasonable
value before passing.
Signed-off-by: Manish Katiyar <mkatiyar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Use lowercase names of quota functions instead of old uppercase ones.
CC: bfields@fieldses.org
CC: neilb@suse.de
Signed-off-by: Jan Kara <jack@suse.cz>
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Dave Kleikamp <shaggy@austin.ibm.com>
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Mingming Cao <cmm@us.ibm.com>
CC: linux-ext4@vger.kernel.org
Use lowercase names of quota functions instead of old uppercase ones.
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
Remove bogus typedef which is just a definition of char *.
Remove unnecessary type casts.
Substitute freedqbuf() with kfree.
Signed-off-by: Jan Kara <jack@suse.cz>
Andrew Morton has suggested that three global quota locks can end up in the
same cacheline which can result in bad cacheline ping-pong on SMP machines.
Make locks cacheline aligned so that we avoid this problem (thanks goes to
Andrew for the idea).
Signed-off-by: Jan Kara <jack@suse.cz>
CC: Andrew Morton <akpm@linux-foundation.org>
Uses quota reservation/claim/release to handle quota properly for delayed
allocation in the three steps: 1) quotas are reserved when data being copied
to cache when block allocation is defered 2) when new blocks are allocated.
reserved quotas are converted to the real allocated quota, 2) over-booked
quotas for metadata blocks are released back.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
According to checkpatch: EXPORT_SYMBOL(foo); should immediately follow its
function/variable
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reserved quota will be claimed at the block allocation time. Over-booked
quota could be returned back with the release callback function.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Delayed allocation defers the block allocation at the dirty pages
flush-out time, doing quota charge/check at that time is too late.
But we can't charge the quota blocks until blocks are really allocated,
otherwise users could get overcharged after reboot from system crash.
This patch adds quota reservation for delayed allocation. Quota blocks
are reserved in memory, inode and quota won't gets dirtied until later
block allocation time.
Signed-off-by: Mingming Cao <cmm@us.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Commit 86c9508eb1c0ce5aa07b5cf1d36b60c54efc3d7a
"sysfs: don't block indefinitely for unmapped files" in linux-next
crashes the PowerMac G5 when X starts up. It's caught out by the way
powerpc's pci_mmap of legacy_mem uses shmem_zero_setup(), substituting
a new vma->vm_file whose private_data no longer points to the bin_buffer
(substitution done because some versions of X crash if that mmap fails).
The fix to this is straightforward: the original vm_file is fput() in
that case, so this mmap won't block sysfs at all, so just don't switch
over to bin_vm_ops if vm_file has changed.
But more fixes made before realizing that was the problem:-
It should not be an error if bin_page_mkwrite() finds no underlying
page_mkwrite().
Check that a file already mmap'ed has the same underlying vm_ops
_before_ pointing vma->vm_ops at bin_vm_ops.
If the file being mmap'ed is a shmem/tmpfs file, don't fail the mmap
on CONFIG_NUMA=y, just because that has a set_policy and get_policy:
provide bin_set_policy, bin_get_policy and bin_migrate.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Eric Biederman <ebiederm@aristanetworks.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The only way for a sysfs attribute to remove itself (without
deadlock) is to use the sysfs_schedule_callback() interface.
Vegard Nossum discovered that a poorly written sysfs ->store
callback can repeatedly schedule remove callbacks on the same
device over and over, e.g.
$ while true ; do echo 1 > /sys/devices/.../remove ; done
If the 'remove' attribute uses the sysfs_schedule_callback API
and also does not protect itself from concurrent accesses, its
callback handler will be called multiple times, and will
eventually attempt to perform operations on a freed kobject,
leading to many problems.
Instead of requiring all callers of sysfs_schedule_callback to
implement their own synchronization, provide the protection in
the infrastructure.
Now, sysfs_schedule_callback will only allow one scheduled
callback per kobject. On subsequent calls with the same kobject,
return -EAGAIN.
This is a short term fix. The long term fix is to allow sysfs
attributes to remove themselves directly, without any of this
callback hokey pokey.
[cornelia.huck@de.ibm.com: s390 ccwgroup bits]
Reported-by: vegard.nossum@gmail.com
Signed-off-by: Alex Chiang <achiang@hp.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch implements uevent suppress in kobject and removes it
from struct device, based on the following ideas:
1,Uevent sending should be one attribute of kobject, so suppressing it
in kobject layer is more natural than in device layer. By this way,
we can do it for other objects embedded with kobject.
2,It may save several bytes for each instance of struct device.(On my
omap3(32bit ARM) based box, can save 8bytes per device object)
This patch also introduces dev_set|get_uevent_suppress() helpers to
set and query uevent_suppress attribute in case to help kobject
as private part of struct device in future.
[This version is against the latest driver-core patch set of Greg,please
ignore the last version.]
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Modify sysfs bin files so that we can remove the bin file while they are
still mapped. When the kobject is removed we unmap the bin file and
arrange for future accesses to the mapping to receive SIGBUS.
Implementing this prevents a nasty DOS when pci devices are hot plugged
and unplugged. Where if any of their resources were mmaped the kernel
could not free up their pci resources or release their pci data
structures.
[akpm@linux-foundation.org: remove unused var]
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The sysfs_dirent serves as both an inode and a directory entry
for sysfs. To prevent the sysfs inode numbers from being freed
prematurely hold a reference to sysfs_dirent from the sysfs inode.
[akpm@linux-foundation.org: add comment]
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs: sysfs_add_one WARNs with full path to duplicate filename
As a debugging aid, it can be useful to know the full path to a
duplicate file being created in sysfs.
We now will display warnings such as:
sysfs: cannot create duplicate filename '/foo'
when attempting to create multiple files named 'foo' in the sysfs
root, or:
sysfs: cannot create duplicate filename '/bus/pci/slots/5/foo'
when attempting to create multiple files named 'foo' under a
given directory in sysfs.
The path displayed is always a relative path to sysfs_root. The
leading '/' in the path name refers to the sysfs_root mount
point, and should not be confused with the "real" '/'.
Thanks to Alex Williamson for essentially writing sysfs_pathname.
Cc: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
sysfs_get_inode ultimately calls sysfs_count_nlink when the a
directory inode is fectched. sysfs_count_nlink needs to be
called under the sysfs_mutex to guard against the unlikely
but possible scenario that the root directory is changing
as we are counting the number entries in it, and just in
general to be consistent.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
SYSFS_MAGIC has been added into magic.h, so only use that definition
in magic.h to avoid potential consistency problem.
Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
The integrity bio allocation needs its own bio_set to avoid violating
the mempool allocation rules and risking deadlocks.
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
If we don't have CONFIG_BLK_DEV_INTEGRITY set, then we don't have
any external dependencies on the bio_vec slabs. So don't create
the ones that we will inline anyway.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
this warning (which got fixed by commit b2bf968):
fs/bio.c: In function ‘bio_alloc_bioset’:
fs/bio.c:305: warning: ‘p’ may be used uninitialized in this function
Triggered because the code flow in bio_alloc_bioset() is correct
but a bit complex for the compiler to see through.
Streamline it a bit - this also makes the code a tiny bit more compact:
text data bss dec hex filename
7540 256 40 7836 1e9c bio.o.before
7539 256 40 7835 1e9b bio.o.after
Also remove an older compiler-warnings annotation from this function,
it's not needed.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
This removes some old code that was causing issues during
filesystem freeze.
Reported-by: Andrew Price <andy@andrewprice.me.uk>
Tested-by: Andrew Price <andy@andrewprice.me.uk>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The logic requires that we mark the glock dirty in page_mkwrite
otherwise we might not flush correctly in the case that no
allocation was required in the process of dirying the page.
Also we need to set the shared write flag early for the same
reason.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This cleans up a number of bits of code mostly based in glops.c.
A couple of simple functions have been merged into the callers
to make it more obvious what is going on, the mysterious raising
of i_writecount around the truncate_inode_pages() call has been
removed. The meta_go_* operations have been renamed rgrp_go_*
since that is the only lock type that they are used with.
The unused argument of gfs2_read_sb has been removed. Also
a bug has been fixed where a check for the rindex inode was
in the wrong callback. More comments are added, and the
debugging code is improved too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
After calling out to the dlm, GFS2 sets the new state of a glock to
gl_target in gdlm_ast(). However, gl_target is not always the lock
state that was requested. If a conversion from shared to exclusive
fails, finish_xmote() will call do_xmote() with LM_ST_UNLOCKED, instead
of gl->gl_target, so that it can reacquire the lock in exlusive the next
time around. In this case, setting the lock to gl_target in gdlm_ast()
will make GFS2 think that it has the glock in exclusive mode, when
really, it doesn't have the glock locked at all. This patch adds a new
field to the gfs2_glock structure, gl_req, to track the mode that was
requested.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
I introduced "is_partially_uptodate" aops for GFS2.
A page can have multiple buffers and even if a page is not uptodate, some buffers
can be uptodate on pagesize != blocksize environment.
This aops checks that all buffers which correspond to a part of a file
that we want to read are uptodate. If so, we do not have to issue actual
read IO to HDD even if a page is not uptodate because the portion we
want to read are uptodate.
"block_is_partially_uptodate" function is already used by ext2/3/4.
With the following patch random read/write mixed workloads or random read after
random write workloads can be optimized and we can get performance improvement.
I did a performance test using the sysbench.
#sysbench --num-threads=16 --max-requests=200000 --test=fileio --file-num=1
--file-block-size=8K --file-total-size=2G --file-test-mode=rndrw --file-fsync-freq=0
--file-rw-ratio=1 run
-2.6.29-rc6
Test execution summary:
total time: 202.6389s
total number of events: 200000
total time taken by event execution: 2580.0480
per-request statistics:
min: 0.0000s
avg: 0.0129s
max: 49.5852s
approx. 95 percentile: 0.0462s
-2.6.29-rc6-patched
Test execution summary:
total time: 177.8639s
total number of events: 200000
total time taken by event execution: 2419.0199
per-request statistics:
min: 0.0000s
avg: 0.0121s
max: 52.4306s
approx. 95 percentile: 0.0444s
arch: ia64
pagesize: 16k
blocksize: 4k
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Impact: Make symbol static.
Fix this sparse warning:
fs/gfs2/rgrp.c:188:5: warning: symbol 'gfs2_bitfit' was not declared. Should it be static?
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Fix this sparse warnings:
fs/gfs2/rgrp.c:156:23: warning: constant 0xffffffffffffffff is so big it is unsigned long long
fs/gfs2/rgrp.c:157:23: warning: constant 0xaaaaaaaaaaaaaaaa is so big it is unsigned long long
fs/gfs2/rgrp.c:158:23: warning: constant 0x5555555555555555 is so big it is long long
fs/gfs2/rgrp.c:194:20: warning: constant 0x5555555555555555 is so big it is long long
fs/gfs2/rgrp.c:204:44: warning: constant 0x5555555555555555 is so big it is long long
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds support for "quota" and "noquota" mount options in addition to the
existing "quota=on/off/account" so that we are compatible with the names by
which these options are more generally known.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
An alignment issue with the existing bitfit algorithm was reported
on IA64. This patch attempts to fix that, and also to tidy up the
code a bit. There is now more documentation about how this works
and it has survived a number of different tests.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This adds a sysfs file called demote_rq to GFS2's
per filesystem directory. Its possible to use this
file to demote arbitrary glocks in exactly the same
way as if a request had come in from a remote node.
This is intended for testing issues relating to caching
of data under glocks. Despite that, the interface is
generic enough to send requests to any type of glock,
but be careful as its not always safe to send an
arbitrary message to an arbitrary glock. For that reason
and to prevent DoS, this interface is restricted to root
only.
The messages look like this:
<type>:<glocknumber> <mode>
Example:
echo -n "2:13324 EX" >/sys/fs/gfs2/unity:myfs/demote_rq
Which means "please demote inode glock (type 2) number 13324 so that
I can get an EX (exclusive) lock". The lock modes are those which
would normally be sent by a remote node in its callback so if you
want to unlock a glock, you use EX, to demote to shared, use SH or PR
(depending on whether you like GFS2 or DLM lock modes better!).
If the glock doesn't exist, you'll get -ENOENT returned. If the
arguments don't make sense, you'll get -EINVAL returned.
The plan is that this interface will be used in combination with
the blktrace patch which I recently posted for comments although
it is, of course, still useful in its own right.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since we have a UUID, we ought to expose it to the user via sysfs
and uevents. We already have the fs name in both of these places
(a combination of the lock proto and lock table name) so if we add
the UUID as well, we have a full set.
For older filesystems (i.e. those created before mkfs.gfs2 was writing
UUIDs by default) the sysfs file will appear zero length, and no UUID
env var will be added to the uevents.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch allows GFS2 to generate discard requests for blocks which are
no longer useful to the filesystem (i.e. those which have been freed as
the result of an unlink operation). The requests are generated at the
time which those blocks become available for reuse in the filesystem.
In order to use this new feature, you have to specify the "discard"
mount option. The code coalesces adjacent blocks into a single extent
when generating the discard requests, thus generating the minimum
number.
If an error occurs when the request has been sent to the block device,
then it will print a message and turn off the requests for that
filesystem. If the problem is temporary, then you can use remount to
turn the option back on again. There is also a nodiscard mount option
so that you can use remount to turn discard requests off, if required.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This patch fixes a deadlock when the journal is flushed and there
are dirty inodes other than the one which caused the journal flush.
Originally the journal flushing code was trying to obtain the
transaction glock while running the flush code for an inode glock.
We no longer require the transaction glock at this point in time
since we know that any attempt to get the transaction glock from
another node will result in a journal flush. So if we are flushing
the journal, we can be sure that the transaction lock is still
cached from when the transaction was started.
By inlining a version of gfs2_trans_begin() (minus the bit which
gets the transaction glock) we can avoid the deadlock problems
caused if there is a demote request queued up on the transaction
glock.
In addition I've also moved the umount rwsem so that it covers
the glock workqueue, since it all demotions are done by this
workqueue now. That fixes a bug on umount which I came across
while fixing the original problem.
Reported-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We were keeping hold of an extra ref to the root inode in one
of the error paths, that resulted in a hang.
Reported-by: Nate Straz <nstraz@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Tested-by: Robert Peterson <rpeterso@redhat.com>
The time stamp field is unused in the glock now that we are
using a shrinker, so that we can remove it and save sizeof(unsigned long)
bytes in each glock.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
This is the big patch that I've been working on for some time
now. There are many reasons for wanting to make this change
such as:
o Reducing overhead by eliminating duplicated fields between structures
o Simplifcation of the code (reduces the code size by a fair bit)
o The locking interface is now the DLM interface itself as proposed
some time ago.
o Fewer lookups of glocks when processing replies from the DLM
o Fewer memory allocations/deallocations for each glock
o Scope to do further optimisations in the future (but this patch is
more than big enough for now!)
Please note that (a) this patch relates to the lock_dlm module and
not the DLM itself, that is still a separate module; and (b) that
we retain the ability to build GFS2 as a standalone single node
filesystem with out requiring the DLM.
This patch needs a lot of testing, hence my keeping it I restarted
my -git tree after the last merge window. That way, this has the maximum
exposure before its merged. This is (modulo a few minor bug fixes) the
same patch that I've been posting on and off the the last three months
and its passed a number of different tests so far.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
We only really need a single spin lock for the quota data, so
lets just use the lru lock for now.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Abhijith Das <adas@redhat.com>
Deallocation of gfs2_quota_data objects now happens on-demand through a
shrinker instead of routinely deallocating through the quotad daemon.
Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The quota code uses lvbs and this is currently not implemented in
lock_nolock, thereby causing panics when quota is enabled with
lock_nolock. This patch adds the relevant bits.
Signed-off-by: Abhijith Das <adas@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
The following patch fixes an issue relating to remount and argument
parsing. After this fix is applied, remount becomes atomic in that
it either succeeds changing the mount to the new state, or it fails
and leaves it in the old state. Previously it was possible for the
parsing of options to fail part way though and for the fs to be left
in a state where some of the new arguments had been applied, but some
had not.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Update all previous incarnations of my email address to the correct one.
Signed-off-by: Gertjan van Wingerde <gwingerde@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If ecryptfs_encrypted_view or ecryptfs_xattr_metadata were being
specified as mount options, a NULL pointer dereference of crypt_stat
was possible during lookup.
This patch moves the crypt_stat assignment into
ecryptfs_lookup_and_interpose_lower(), ensuring that crypt_stat
will not be NULL before we attempt to dereference it.
Thanks to Dan Carpenter and his static analysis tool, smatch, for
finding this bug.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Acked-by: Dustin Kirkland <kirkland@canonical.com>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When allocating the memory used to store the eCryptfs header contents, a
single, zeroed page was being allocated with get_zeroed_page().
However, the size of an eCryptfs header is either PAGE_CACHE_SIZE or
ECRYPTFS_MINIMUM_HEADER_EXTENT_SIZE (8192), whichever is larger, and is
stored in the file's private_data->crypt_stat->num_header_bytes_at_front
field.
ecryptfs_write_metadata_to_contents() was using
num_header_bytes_at_front to decide how many bytes should be written to
the lower filesystem for the file header. Unfortunately, at least 8K
was being written from the page, despite the chance of the single,
zeroed page being smaller than 8K. This resulted in random areas of
kernel memory being written between the 0x1000 and 0x1FFF bytes offsets
in the eCryptfs file headers if PAGE_SIZE was 4K.
This patch allocates a variable number of pages, calculated with
num_header_bytes_at_front, and passes the number of allocated pages
along to ecryptfs_write_metadata_to_contents().
Thanks to Florian Streibelt for reporting the data leak and working with
me to find the problem. 2.6.28 is the only kernel release with this
vulnerability. Corresponds to CVE-2009-0787
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Acked-by: Dustin Kirkland <kirkland@canonical.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Eugene Teo <eugeneteo@kernel.sg>
Cc: Greg KH <greg@kroah.com>
Cc: dann frazier <dannf@dannf.org>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Florian Streibelt <florian@f-streibelt.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The libaio test harness turned up a problem whereby lookup_ioctx on a
bogus io context was returning the 1 valid io context from the list
(harness/cases/3.p).
Because of that, an extra put_iocontext was done, and when the process
exited, it hit a BUG_ON in the put_iocontext macro called from exit_aio
(since we expect a users count of 1 and instead get 0).
The problem was introduced by "aio: make the lookup_ioctx() lockless"
(commit abf137dd77).
Thanks to Zach for pointing out that hlist_for_each_entry_rcu will not
return with a NULL tpos at the end of the loop, even if the entry was
not found.
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove a source of fput() call from inside IRQ context. Myself, like Eric,
wasn't able to reproduce an fput() call from IRQ context, but Jeff said he was
able to, with the attached test program. Independently from this, the bug is
conceptually there, so we might be better off fixing it. This patch adds an
optimization similar to the one we already do on ->ki_filp, on ->ki_eventfd.
Playing with ->f_count directly is not pretty in general, but the alternative
here would be to add a brand new delayed fput() infrastructure, that I'm not
sure is worth it.
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable:
Btrfs: Clear space_info full when adding new devices
Btrfs: Fix locking around adding new space_info
Nick Piggin noticed this (very unlikely) race between setting a page
dirty and creating the buffers for it - we need to hold the mapping
private_lock until we've set the page dirty bit in order to make sure
that create_empty_buffers() might not build up a set of buffers without
the dirty bits set when the page is dirty.
I doubt anybody has ever hit this race (and it didn't solve the issue
Nick was looking at), but as Nick says: "Still, it does appear to solve
a real race, which we should close."
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CIFS can allocate a few bytes to little for the nativeFileSystem field
during tree connect response processing during mount. This can result
in a "Redzone overwritten" message to be logged.
Signed-off-by: Sridhar Vinay <vinaysridhar@in.ibm.com>
Acked-by: Shirish Pargaonkar <shirishp@us.ibm.com>
CC: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: fix bb_prealloc_list corruption due to wrong group locking
ext4: fix bogus BUG_ONs in in mballoc code
ext4: Print the find_group_flex() warning only once
ext4: fix header check in ext4_ext_search_right() for deep extent trees.
Although this operation is unsupported by our implementation
we still need to provide an encode routine for it to
merely encode its (error) status back in the compound reply.
Thanks for Bill Baker at sun.com for testing with the Sun
OpenSolaris' client, finding, and reporting this bug at
Connectathon 2009.
This bug was introduced in 2.6.27
Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Commit ee6f779b9e ("filp->f_pos not
correctly updated in proc_task_readdir") changed the proc code to use
filp->f_pos directly, rather than through a temporary variable. In the
process, that caused the operations to be done on the full 64 bits, even
though the offset is never that big.
That's all fine and dandy per se, but for some unfathomable reason gcc
generates absolutely horrid code when using 64-bit values in switch()
statements. To the point of actually calling out to gcc helper
functions like __cmpdi2 rather than just doing the trivial comparisons
directly the way gcc does for normal compares. At which point we get
link failures, because we really don't want to support that kind of
crazy code.
Fix this by just casting the f_pos value to "unsigned long", which
is plenty big enough for /proc, and avoids the gcc code generation issue.
Reported-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Zhang Le <r0bertz@gentoo.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is for Red Hat bug 490026: EXT4 panic, list corruption in
ext4_mb_new_inode_pa
ext4_lock_group(sb, group) is supposed to protect this list for
each group, and a common code flow to remove an album is like
this:
ext4_get_group_no_and_offset(sb, pa->pa_pstart, &grp, NULL);
ext4_lock_group(sb, grp);
list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);
so it's critical that we get the right group number back for
this prealloc context, to lock the right group (the one
associated with this pa) and prevent concurrent list manipulation.
however, ext4_mb_put_pa() passes in (pa->pa_pstart - 1) with a
comment, "-1 is to protect from crossing allocation group".
This makes sense for the group_pa, where pa_pstart is advanced
by the length which has been used (in ext4_mb_release_context()),
and when the entire length has been used, pa_pstart has been
advanced to the first block of the next group.
However, for inode_pa, pa_pstart is never advanced; it's just
set once to the first block in the group and not moved after
that. So in this case, if we subtract one in ext4_mb_put_pa(),
we are actually locking the *previous* group, and opening the
race with the other threads which do not subtract off the extra
block.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
filp->f_pos only get updated at the end of the function. Thus d_off of those
dirents who are in the middle will be 0, and this will cause a problem in
glibc's readdir implementation, specifically endless loop. Because when overflow
occurs, f_pos will be set to next dirent to read, however it will be 0, unless
the next one is the last one. So it will start over again and again.
There is a sample program in man 2 gendents. This is the output of the program
running on a multithread program's task dir before this patch is applied:
$ ./a.out /proc/3807/task
--------------- nread=128 ---------------
i-node# file type d_reclen d_off d_name
506442 directory 16 1 .
506441 directory 16 0 ..
506443 directory 16 0 3807
506444 directory 16 0 3809
506445 directory 16 0 3812
506446 directory 16 0 3861
506447 directory 16 0 3862
506448 directory 16 8 3863
This is the output after this patch is applied
$ ./a.out /proc/3807/task
--------------- nread=128 ---------------
i-node# file type d_reclen d_off d_name
506442 directory 16 1 .
506441 directory 16 2 ..
506443 directory 16 3 3807
506444 directory 16 4 3809
506445 directory 16 5 3812
506446 directory 16 6 3861
506447 directory 16 7 3862
506448 directory 16 8 3863
Signed-off-by: Zhang Le <r0bertz@gentoo.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Most fasync implementations do something like:
return fasync_helper(...);
But fasync_helper() will return a positive value at times - a feature used
in at least one place. Thus, a number of other drivers do:
err = fasync_helper(...);
if (err < 0)
return err;
return 0;
In the interests of consistency and more concise code, it makes sense to
map positive return values onto zero where ->fasync() is called.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Removing the BKL from FASYNC handling ran into the challenge of keeping the
setting of the FASYNC bit in filp->f_flags atomic with regard to calls to
the underlying fasync() function. Andi Kleen suggested moving the handling
of that bit into fasync(); this patch does exactly that. As a result, we
have a couple of internal API changes: fasync() must now manage the FASYNC
bit, and it will be called without the BKL held.
As it happens, every fasync() implementation in the kernel with one
exception calls fasync_helper(). So, if we make fasync_helper() set the
FASYNC bit, we can avoid making any changes to the other fasync()
functions - as long as those functions, themselves, have proper locking.
Most fasync() implementations do nothing but call fasync_helper() - which
has its own lock - so they are easily verified as correct. The BKL had
already been pushed down into the rest.
The networking code has its own version of fasync_helper(), so that code
has been augmented with explicit FASYNC bit handling.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: David Miller <davem@davemloft.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Traditionally, changes to struct file->f_flags have been done under BKL
protection, or with no protection at all. This patch causes all f_flags
changes after file open/creation time to be done under protection of
f_lock. This allows the removal of some BKL usage and fixes a number of
longstanding (if microscopic) races.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This lock moves out of the CONFIG_EPOLL ifdef and becomes f_lock. For now,
epoll remains the only user, but a future patch will use it to protect
f_flags as well.
Cc: Davide Libenzi <davidel@xmailserver.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
If bio_integrity_clone() fails, bio_clone() returns NULL without freeing
the newly allocated bio.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Stricter gfp_mask might be required for clone allocation.
For example, request-based dm may clone bio in interrupt context
so it has to use GFP_ATOMIC.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Alasdair G Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: Fix the fix to Bugzilla #11061, when IPv6 isn't defined...
SUNRPC: xprt_connect() don't abort the task if the transport isn't bound
SUNRPC: Fix an Oops due to socket not set up yet...
Bug 11061, NFS mounts dropped
NFS: Handle -ESTALE error in access()
NLM: Fix GRANT callback address comparison when IPv6 is enabled
NLM: Shrink the IPv4-only version of nlm_cmp_addr()
NFSv3: Fix posix ACL code
NFS: Fix misparsing of nfsv4 fs_locations attribute (take 2)
SUNRPC: Tighten up the task locking rules in __rpc_execute()
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2:
ocfs2: Use xs->bucket to set xattr value outside
ocfs2: Fix a bug found by sparse check.
ocfs2: tweak to get the maximum inline data size with xattr
ocfs2: reserve xattr block for new directory with inline data
eCryptfs has file encryption keys (FEK), file encryption key encryption
keys (FEKEK), and filename encryption keys (FNEK). The per-file FEK is
encrypted with one or more FEKEKs and stored in the header of the
encrypted file. I noticed that the FEK is also being encrypted by the
FNEK. This is a problem if a user wants to use a different FNEK than
their FEKEK, as their file contents will still be accessible with the
FNEK.
This is a minimalistic patch which prevents the FNEKs signatures from
being copied to the inode signatures list. Ultimately, it keeps the FEK
from being encrypted with a FNEK.
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Acked-by: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a ramfs nommu mapping is expanded, contiguous pages are allocated
and added to the pagecache. The caller's reference is then passed on
by moving whole pagevecs to the file lru list.
If the page cache adding fails, make sure that the error path also
moves the pagevec contents which might still contain up to PAGEVEC_SIZE
successfully added pages, of which we would leak references otherwise.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Enrik Berkhan <Enrik.Berkhan@ge.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pages attached to a ramfs inode's pagecache by truncation from nothing
- as done by SYSV SHM for example - may get discarded under memory
pressure.
The problem is that the pages are not marked dirty. Anything that creates
data in an MMU-based ramfs will cause the pages holding that data will
cause the set_page_dirty() aop to be called.
For the NOMMU-based mmap, set_page_dirty() may be called by write(), but
it won't be called by page-writing faults on writable mmaps, and it isn't
called by ramfs_nommu_expand_for_mapping() when a file is being truncated
from nothing to allocate a contiguous run.
The solution is to mark the pages dirty at the point of allocation by the
truncation code.
Signed-off-by: Enrik Berkhan <Enrik.Berkhan@ge.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thiemo Nagel reported that:
# dd if=/dev/zero of=image.ext4 bs=1M count=2
# mkfs.ext4 -v -F -b 1024 -m 0 -g 512 -G 4 -I 128 -N 1 \
-O large_file,dir_index,flex_bg,extent,sparse_super image.ext4
# mount -o loop image.ext4 mnt/
# dd if=/dev/zero of=mnt/file
oopsed, with a BUG_ON in ext4_mb_normalize_request because
size == EXT4_BLOCKS_PER_GROUP
It appears to me (esp. after talking to Andreas) that the BUG_ON
is bogus; a request of exactly EXT4_BLOCKS_PER_GROUP should
be allowed, though larger sizes do indicate a problem.
Fix that an another (apparently rare) codepath with a similar check.
Reported-by: Thiemo Nagel <thiemo.nagel@ph.tum.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
A long time ago, xs->base is allocated a 4K size and all the contents
in the bucket are copied to the it. Now we use ocfs2_xattr_bucket to
abstract xattr bucket and xs->base is initialized to the start of the
bu_bhs[0]. So xs->base + offset will overflow when the value root is
stored outside the first block.
Then why we can survive the xattr test by now? It is because we always
read the bucket contiguously now and kernel mm allocate continguous
memory for us. We are lucky, but we should fix it. So just get the
right value root as other callers do.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
We need to use le32_to_cpu to test rec->e_cpos in
ocfs2_dinode_insert_check.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Replace max_inline_data with max_inline_data_with_xattr
to ensure it correct when xattr inlined.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
If this is a new directory with inline data, we choose to
reserve the entire inline area for directory contents and
force an external xattr block.
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Acked-by: Joel Becker <joel.becker@oracle.com>
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
There was a report of a data corruption
http://lkml.org/lkml/2008/11/14/121. There is a script included to
reproduce the problem.
During testing, I encountered a number of strange things with ext3, so I
tried ext2 to attempt to reduce complexity of the problem. I found that
fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
cleared, even though instrumentation showed that unlock_new_inode had
already been called for that inode. This points to memory scribble, or
synchronisation problme.
i_state of I_NEW inodes is not protected by inode_lock because other
processes are not supposed to touch them until I_LOCK (and I_NEW) is
cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
i_state revealed that generic_sync_sb_inodes is picking up new inodes from
the inode lists and passing them to __writeback_single_inode without
waiting for I_NEW. Subsequently modifying i_state causes corruption. In
my case it would look like this:
CPU0 CPU1
unlock_new_inode() __sync_single_inode()
reg <- inode->i_state
reg -> reg & ~(I_LOCK|I_NEW) reg <- inode->i_state
reg -> inode->i_state reg -> reg | I_SYNC
reg -> inode->i_state
Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.
Fix for this is rather than wait for I_NEW inodes, just skip over them:
inodes concurrently being created are not subject to data integrity
operations, and should not significantly contribute to dirty memory
either.
After this change, I'm unable to reproduce any of the added warnings or
hangs after ~1hour of running. Previously, the new warnings would start
immediately and hang would happen in under 5 minutes.
I'm also testing on ext3 now, and so far no problems there either. I
don't know whether this fixes the problem reported above, but it fixes a
real problem for me.
Cc: "Jorge Boncompte [DTI2]" <jorge@dti2.net>
Reported-by: Adrian Hunter <ext-adrian.hunter@nokia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In sget(), destroy_super(s) is called with s->s_umount held, which makes
lockdep unhappy.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the second fasync_helper() fails, pipe_rdwr_fasync() returns the error
but leaves the file on ->fasync_readers.
This was always wrong, but since 233e70f422
"saner FASYNC handling on file close" we have the new problem. Because in
this case setfl() doesn't set FASYNC bit, __fput() will not do
->fasync(0), and we leak fasync_struct with ->fa_file pointing to the
freed file.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stephen Rothwell reports:
Today's linux-next build (powerpc ppc64_defconfig) failed like this:
fs/built-in.o: In function `.nfs_get_client':
client.c:(.text+0x115010): undefined reference to `.__ipv6_addr_type'
Fix by moving the IPV6 specific parts of commit
d7371c41b0 ("Bug 11061, NFS mounts dropped")
into the '#ifdef IPV6..." section.
Also fix up a couple of formatting issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This is a short-term warning, and even printk_ratelimit() can result
in too much noise in system logs. So only print it once as a warning.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The corrupted filesystem patch added a check against zlib trying to
output too much data in the presence of data corruption. This check
triggered if zlib_inflate asked to be called again (Z_OK) with
avail_out == 0 and no more output buffers available. This check proves
to be rather dumb, as it incorrectly catches the case where zlib has
generated all the output, but there are still input bytes to be processed.
This patch does a number of things. It removes the original check and
replaces it with code to not move to the next output buffer if there
are no more output buffers available, relying on zlib to error if it
wants an extra output buffer in the case of data corruption. It
also replaces the Z_NO_FLUSH flag with the more correct Z_SYNC_FLUSH
flag, and makes the error messages more understandable to
non-technical users.
Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Reported-by: Stefan Lippers-Hollmann <s.L-H@gmx.de>
Samba server (version 3.3.1 and earlier, and 3.2.8 and earlier) incorrectly
required the O_CREAT flag on posix open (even when a file was not being
created). This disables posix open (create is still ok) after the first
attempt returns EINVAL (and logs an error, once, recommending that they
update their server).
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
Discovered at Connnectathon 2009...
The buffer format byte and the pad are transposed in NT_RENAME calls
(which are used to set hardlinks). Most servers seem to ignore this
fact, but NetApp filers throw back an error due to this problem. This
patch fixes it.
CC: Stable <stable@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
There are about 60 fsctl calls which Windows claims would be able
to be sent remotely and handled by the server. This adds the #defines
for them. A few of them look immediately useful, but need to also
add the structure definitions for them so they can be sent as SMBs.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Although attr == NULL can not happen, this makes cifs_set_file_info safer
in the future since it may not be obvious that the caller can not set
attr to NULL.
Signed-off-by: Steve French <sfrench@us.ibm.com>
If the network connection crashes, and we have to reopen files, preferentially
use the newer cifs posix open protocol operation if the server supports it.
Signed-off-by: Steve French <sfrench@us.ibm.com>
If this mount option is set, when an application does an
fsync call then the cifs client does not send an SMB Flush
to the server (to force the server to write all dirty data
for this file immediately to disk), although cifs still sends
all dirty (cached) file data to the server and waits for the
server to respond to the write write. Since SMB Flush can be
very slow, and some servers may be reliable enough (to risk
delaying slightly flushing the data to disk on the server),
turning on this option may be useful to improve performance for
applications that fsync too much, at a small risk of server
crash. If this mount option is not set, by default cifs will
send an SMB flush request (and wait for a response) on every
fsync call.
Signed-off-by: Steve French <sfrench@us.ibm.com>
In contrast to the now-obsolete smbfs, cifs does not send SMB_COM_FLUSH
in response to an explicit fsync(2) to guarantee that all volatile data
is written to stable storage on the server side, provided the server
honors the request (which, to my knowledge, is true for Windows and
Samba with 'strict sync' enabled).
This patch modifies the cifs_fsync implementation to restore the
fsync-behavior of smbfs by triggering SMB_COM_FLUSH after sending
outstanding data on the client side to the server.
Signed-off-by: Horst Reiterer <horst.reiterer@gmail.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
* 'for-linus' of git://oss.sgi.com/xfs/xfs:
xfs: only issues a cache flush on unmount if barriers are enabled
xfs: prevent lockdep false positive in xfs_iget_cache_miss
xfs: prevent kernel crash due to corrupted inode log format
On swapon() path, it has already i_mutex. So, this uses i_alloc_sem
instead of it.
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Laurent GUERBY <laurent@guerby.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Using offsetof() to calculate name length does not work because
it does not produce consistent results with with structure packing.
This caused memcpy to corrupt memory by copying 4 extra bytes off
the end of the buffer on 64 bit kernels with 32 bit userspace
(the only case where this 32/64 compat code is used).
The fix is to calculate name length directly from the start instead
of trying to derive it later using count and offsetof.
Signed-off-by: David Teigland <teigland@redhat.com>
Return immediately from dlm_unlock(CANCEL) if the lock is
granted and not being converted; there's nothing to cancel.
Signed-off-by: David Teigland <teigland@redhat.com>
When a conversion completes successfully and finds that a cancel
of the convert is still in progress (which is now a moot point),
preemptively clear the state associated with outstanding cancel.
That state could cause a subsequent conversion to be ignored.
Also, improve the consistency and content of error and debug
messages in this area.
Signed-off-by: David Teigland <teigland@redhat.com>
Integer nodeids can be too large for the idr code; use a hash
table instead.
Signed-off-by: Christine Caulfield <ccaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Fix kpf_copy_bit(src,dst) to be kpf_copy_bit(dst,src) to match the
actual call patterns, e.g. kpf_copy_bit(kflags, KPF_LOCKED, PG_locked).
This misplacement of src/dst only affected reporting of PG_writeback,
PG_reclaim and PG_buddy. For others kflags==uflags so not affected.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Addresses: http://bugzilla.kernel.org/show_bug.cgi?id=11061
sockaddr structures can't be reliably compared using memcmp() because
there are padding bytes in the structure which can't be guaranteed to
be the same even when the sockaddr structures refer to the same
socket. Instead compare all the relevant fields. In the case of IPv6
sin6_flowinfo is not compared because it only affects QoS and
sin6_scope_id is only compared if the address is "link local" because
"link local" addresses need only be unique to a specific link.
Signed-off-by: Ian Dall <ian@beware.dropbear.id.au>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Hi Trond,
I have been looking at a bugreport where trying to open applications on KDE
on a NFS mounted home fails temporarily. There have been multiple reports on
different kernel versions pointing to this common issue:
http://bugzilla.kernel.org/show_bug.cgi?id=12557https://bugs.launchpad.net/ubuntu/+source/linux/+bug/269954http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=508866.html
This issue can be reproducible consistently by doing this on a NFS mounted
home (KDE):
1. Open 2 xterm sessions
2. From one of the xterm session, do "ssh -X <remote host>"
3. "stat ~/.Xauthority" on the remote SSH session
4. Close the two xterm sessions
5. On the server do a "stat ~/.Xauthority"
6. Now on the client, try to open xterm
This will fail.
Even if the filehandle had become stale, the NFS client should invalidate
the cache/inode and should repeat LOOKUP. Looking at the packet capture when
the failure occurs shows that there were two subsequent ACCESS() calls with
the same filehandle and both fails with -ESTALE error.
I have tested the fix below. Now the client issue a LOOKUP after the
ACCESS() call fails with -ESTALE. If all this makes sense to you, can you
consider this for inclusion?
Thanks,
If the server returns an -ESTALE error due to stale filehandle in response to
an ACCESS() call, we need to invalidate the cache and inode so that LOOKUP()
can be retried. Without this change, the nfs client retries ACCESS() with the
same filehandle, fails again and could lead to temporary failure of
applications running on nfs mounted home.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The NFS mount command may pass an AF_INET server address to lockd. If
lockd happens to be using a PF_INET6 listener, the nlm_cmp_addr() in
nlmclnt_grant() will fail to match requests from that host because they
will all have a mapped IPv4 AF_INET6 address.
Adopt the same solution used in nfs_sockaddr_match_ipaddr() for NFSv4
callbacks: if either address is AF_INET, map it to an AF_INET6 address
before doing the comparison.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix a memory leak due to allocation in the XDR layer. In cases where the
RPC call needs to be retransmitted, we end up allocating new pages without
clearing the old ones. Fix this by moving the allocation into
nfs3_proc_setacls().
Also fix an issue discovered by Kevin Rudd, whereby the amount of memory
reserved for the acls in the xdr_buf->head was miscalculated, and causing
corruption.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The changeset ea31a4437c (nfs: Fix
misparsing of nfsv4 fs_locations attribute) causes the mountpath that is
calculated at the beginning of try_location() to be clobbered when we
later strncpy a non-nul terminated hostname using an incorrect buffer
length.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>