This ASSERT is testing an if_flags flag value against
a di_aformat enum value. di_aformat is never assigned
XFS_IFINLINE.
This happens to work for now, because XFS_IFINLINE has
the same value as XFS_DINODE_FMT_LOCAL, and that's tested
just before we call this function.
However, I think the intention is to assert that we have
read in the data, i.e. XFS_IFINLINE on if_flags, before
we use if_data. This is done in other places through the
code as well.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
In optimising the CIL operations, some of the IOP_* macros for
calling log item operations were removed. Remove the rest of them as
Christoph requested.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
We've been seeing occasional problems with log space leaks and
transaction underruns such as this for some time:
XFS (dm-0): xlog_write: reservation summary:
trans type = FSYNC_TS (36)
unit res = 2740 bytes
current res = -4 bytes
total reg = 0 bytes (o/flow = 0 bytes)
ophdrs = 0 (ophdr space = 0 bytes)
ophdr + reg = 0 bytes
num regions = 0
Turns out that xfstests generic/311 is reliably reproducing this
problem with the test it runs at sequence 16 of it execution. It is
a 100% reliable reproducer with the mkfs configuration of "-b
size=1024 -m crc=1" on a 10GB scratch device.
The problem? Inode forks in btree format are logged in memory
format, not disk format (i.e. bmbt format, not bmdr format). That
means there is a btree block header being logged, when such a
structure is never written to the inode fork in bmdr format. The
bmdr header in the inode is only 4 bytes, while the bmbt header is
24 bytes for v4 filesystems and 72 bytes for v5 filesystems.
We currently reserve the inode size plus the rounded up overhead of
a logging a buffer, which is 128 bytes. That means the reservation
for a 512 byte inode is 640 bytes. What we can actually log is:
inode core, data and attr fork = 512 bytes
inode log format + log op header = 56 + 12 = 68 bytes
data fork bmbt hdr = 24/72 bytes
attr fork bmbt hdr = 24/72 bytes
So, for a v2 inodes we can log at least 628 bytes, but if we split that
inode over the end of the log across log buffers, we need to also
another log op header, which takes us to 640 bytes. If there's
another reservation taken out of this that I haven't taken into
account (perhaps multiple iclog splits?) or I haven't corectly
calculated the bmbt format space used (entirely possible), then
we will overun it.
For v3 inodes the maximum is actually 724 bytes, and even a
single maximally sized btree format fork can blow it (652 bytes).
And that's exactly what is happening with the FSYNC_TS transaction
in the above output - it's consumed 644 bytes of space after the CIL
context took the space reserved for it (2100 bytes).
This problem has always been present in the XFS code - the btree
format inode forks have always been logged in this manner. Hence
there has always been the possibility of an overrun with such a
transaction. The CRC code has just exposed it frequently enough to
be able to debug and understand the root cause....
So, let's fix all the inode log space reservations.
[ I'm so glad we spent the effort to clean up the transaction
reservation code. This is an easy fix now. ]
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The call to xfs_inobt_get_rec() in xfs_dialloc_ag() passes 'j' as
the output status variable. The immediately following
XFS_WANT_CORRUPTED_GOTO() checks the value of 'i,' which is from
the previous lookup call and has already been checked. Fix the
corruption check to use 'j.'
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
CRC enabled filesystems fail log recovery with 100% reliability on
xfstests xfs/085 with the following failure:
XFS (vdb): Mounting Filesystem
XFS (vdb): Starting recovery (logdev: internal)
XFS (vdb): Corruption detected. Unmount and run xfs_repair
XFS (vdb): bad inode magic/vsn daddr 144 #0 (magic=0)
XFS: Assertion failed: 0, file: fs/xfs/xfs_inode_buf.c, line: 95
The problem is that the inode buffer has not been recovered before
the readahead on the inode buffer is issued. The checkpoint being
recovered actually allocates the inode chunk we are doing readahead
from, so what comes from disk during readahead is essentially
random and the verifier barfs on it.
This inode buffer readahead problem affects non-crc filesystems,
too, but xfstests does not trigger it at all on such
configurations....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Log recovery has some strict ordering requirements which unordered
or reordered metadata writeback can defeat. This can occur when an
item is logged in a transaction, written back to disk, and then
logged in a new transaction before the tail of the log is moved past
the original modification.
The result of this is that when we read an object off disk for
recovery purposes, the buffer that we read may not contain the
object type that recovery is expecting and hence at the end of the
checkpoint being recovered we have an invalid object in memory.
This isn't usually a problem, as recovery will then replay all the
other checkpoints and that brings the object back to a valid and
correct state, but the issue is that while the object is in the
invalid state it can be flushed to disk. This results in the object
verifier failing and triggering a corruption shutdown of log
recover. This is correct behaviour for the verifiers - the problem
is that we are not detecting that the object we've read off disk is
newer than the transaction we are replaying.
All metadata in v5 filesystems has the LSN of it's last modification
stamped in it. This enabled log recover to read that field and
determine the age of the object on disk correctly. If the LSN of the
object on disk is older than the transaction being replayed, then we
replay the modification. If the LSN of the object matches or is more
recent than the transaction's LSN, then we should avoid overwriting
the object as that is what leads to the transient corrupt state.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When testing LSN ordering code for v5 superblocks, it was discovered
that the the LSN embedded in the generic btree blocks was
occasionally uninitialised. These values didn't get written to disk
by metadata writeback - they got written by previous transactions in
log recovery.
The issue is here that the when the block is first allocated and
initialised, the LSN field was not initialised - it gets overwritten
before IO is issued on the buffer - but the value that is logged by
transactions that modify the header before it is written to disk
(and initialised) contain garbage. Hence the first recovery of the
buffer will stamp garbage into the LSN field, and that can cause
subsequent transactions to not replay correctly.
The fix is simply to initialise the bb_lsn field to zero when we
initialise the block for the first time.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Sseveral sparse warnings were caused by missing rcu_dereference() annotations
for dereferencing mm->ioctx_table. Thankfully, none of those were actual bugs
as the deref was protected by a spin lock in all instances.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
The calculation doesn't take into account the size of the dir v3
header, so overestimates the hash entries in a node. This causes
directory buffer overruns when splitting and merging nodes.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The commit 36bc08cc01 ("fs/aio: Add support to aio ring pages migration")
added some debugging code that is not required and resulted in a build error
when 98474236f7 ("vfs: make the dentry cache use the lockref infrastructure")
was added to the tree. The code is not required, so just delete it.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
The clnt->cl_principal is being used exclusively to store the service
target name for RPCSEC_GSS/krb5 callbacks. Replace it with something that
is stored only in the RPCSEC_GSS-specific code.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We don't want to pass the context argument to trace_nfs_atomic_open_exit()
after it has been released.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
xfstests xfs/087 fails 100% reliably with this assert:
XFS (vdb): Mounting Filesystem
XFS (vdb): Starting recovery (logdev: internal)
XFS: Assertion failed: bp->b_flags & XBF_STALE, file: fs/xfs/xfs_buf.c, line: 548
while trying to read a dquot buffer in xlog_recover_dquot_ra_pass2().
The issue is that the buffer length to read that is passed to
xfs_buf_readahead is in units of filesystem blocks, not disk blocks.
(i.e. FSB, not daddr). Fix it but putting the correct conversion in
place.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When doing readhaead in log recovery, we check to see if buffers are
cancelled before doing readahead. If we find a cancelled buffer,
however, we always decrement the reference count we have on it, and
that means that readahead is causing a double decrement of the
cancelled buffer reference count.
This results in log recovery *replaying cancelled buffers* as the
actual recovery pass does not find the cancelled buffer entry in the
commit phase of the second pass across a transaction. On debug
kernels, this results in an ASSERT failure like so:
XFS: Assertion failed: !(flags & XFS_BLF_CANCEL), file: fs/xfs/xfs_log_recover.c, line: 1815
xfstests generic/311 reproduces this ASSERT failure with 100%
reproducability.
Fix it by making readahead only peek at the buffer cancelled state
rather than the full accounting that xlog_check_buffer_cancelled()
does.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Don't allow mounting sysfs unless the caller has CAP_SYS_ADMIN rights
over the net namespace. The principle here is if you create or have
capabilities over it you can mount it, otherwise you get to live with
what other people have mounted.
Instead of testing this with a straight forward ns_capable call,
perform this check the long and torturous way with kobject helpers,
this keeps direct knowledge of namespaces out of sysfs, and preserves
the existing sysfs abstractions.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Merge fixes from Andrew Morton:
"Five fixes.
err, make that six. let me try again"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
fs/ocfs2/super.c: Use bigger nodestr to accomodate 32-bit node numbers
memcg: check that kmem_cache has memcg_params before accessing it
drivers/base/memory.c: fix show_mem_removable() to handle missing sections
IPC: bugfix for msgrcv with msgtyp < 0
Omnikey Cardman 4000: pull in ioctl.h in user header
timer_list: correct the iterator for timer_list
While using pacemaker/corosync, the node numbers are generated using IP
address as opposed to serial node number generation. This may not fit
in a 8-byte string. Use a bigger string to print the complete node
number.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This just replaces the dentry count/lock combination with the lockref
structure that contains both a count and a spinlock, and does the
mechanical conversion to use the lockref infrastructure.
There are no semantic changes here, it's purely syntactic. The
reference lockref implementation uses the spinlock exactly the same way
that the old dcache code did, and the bulk of this patch is just
expanding the internal "d_count" use in the dcache code to use
"d_lockref.count" instead.
This is purely preparation for the real change to make the reference
count updates be lockless during the 3.12 merge window.
[ As with the previous commit, this is a rewritten version of a concept
originally from Waiman, so credit goes to him, blame for any errors
goes to me.
Waiman's patch had some semantic differences for taking advantage of
the lockless update in dget_parent(), while this patch is
intentionally a pure search-and-replace change with no semantic
changes. - Linus ]
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We read the size of the name from the disk, but a larger name than
expected would cause memory corruption.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.
The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.
Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do i.e.
# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
and it'll mount even if the devnum has changed, as shown here:
# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0
# mkfs.ext4 -L mylabel -J device=/dev/loop0 /dev/sdb1
Change the journal device number:
# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile
And today it will fail:
# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
# dmesg | tail -n 1
[17343.240702] EXT4-fs (sdb1): error: couldn't read superblock of external journal
But with this new mount option, we can specify the new path:
# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#
(which does update the encoded device number, incidentally):
# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device: 0x0701
But best of all we can just always mount by journal-path, and
it'll always work:
# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#
So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
If the group descriptor fails validation, mark the whole blockgroup
corrupt so that the inode/block allocators skip this group. The
previous approach takes the risk of writing to a damaged group
descriptor; hopefully it was never the case that the [ib]bitmap fields
pointed to another valid block and got dirtied, since the memset would
fill the page with 1s.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we detect either a discrepancy between the inode bitmap and the
inode counts or the inode bitmap fails to pass validation checks, mark
the block group corrupt and refuse to allocate or deallocate inodes
from the group.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we notice a block-bitmap corruption (because of device failure or
something else), we should mark this group as corrupt and prevent
further block allocations/deallocations from it. Currently, we end up
generating one error message for every block in the bitmap. This
potentially could make the system unstable as noticed in some
bugs. With this patch, the error will be printed only the first time
and mark the entire block group as corrupted. This prevents future
access allocations/deallocations from it.
Also tested by corrupting the block
bitmap and forcefully introducing the mb_free_blocks error:
(1) create a largefile (2Gb)
$ dd if=/dev/zero of=largefile oflag=direct bs=10485760 count=200
(2) umount filesystem. use dumpe2fs to see which block-bitmaps
are in use by largefile and note their block numbers
(3) use dd to zero-out the used block bitmaps
$ dd if=/dev/zero of=/dev/hdc4 bs=4096 seek=14 count=8 oflag=direct
(4) mount the FS and delete the largefile.
(5) recreate the largefile. verify that the new largefile does not
get any blocks from the groups marked as bad.
Without the patch, we will see mb_free_blocks error for each bit in
each zero'ed out bitmap at (4). With the patch, we only see the error
once per blockgroup:
[ 309.706803] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 15: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.720824] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 14: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.732858] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.748321] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 13: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.760331] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.769695] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 12: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.781721] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.798166] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 11: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
[ 309.810184] EXT4-fs error (device sdb4) in ext4_free_blocks:4802: IO failure
[ 309.819532] EXT4-fs error (device sdb4): ext4_mb_generate_buddy:735: group 10: 32768 clusters in bitmap, 0 in gd. blk grp corrupted.
Google-Bug-Id: 7258357
[darrick.wong@oracle.com]
Further modifications (by Darrick) to make more obvious that this corruption
bit applies to blocks only. Set the corruption flag if the block group bitmap
verification fails.
Original-author: Aditya Kali <adityakali@google.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The block_group parameter to ext4_validate_block_bitmap is both used
as a ext4_group_t inside the function and the same type is passed in
by all callers. We might as well use the typedef consistently instead
of open-coding the 'unsigned int'.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The block bitmap verification code assumes that calling ext4_error()
either panics the system or makes the fs readonly. However, this is
not always true: when 'errors=continue' is specified, an error is
printed but we don't return any indication of error to the caller,
which is (probably) the block allocator, which pretends that the crud
we read in off the disk is a usable bitmap. Yuck.
A block bitmap that fails the check should at least return no bitmap
to the caller. The block allocator should be told to go look in a
different group, but that's a separate issue.
The easiest way to reproduce this is to modify bg_block_bitmap (on a
^flex_bg fs) to point to a block outside the block group; or you can
create a metadata_csum filesystem and zero out the block bitmaps.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In the jbd2 checksumming code, explicitly declare separate variables with
endianness information so that we don't get confused and screw things up again.
Also fixes sparse warnings.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
After applied the commit (4a092d73), we have reduced the number of
source files that need to #include ext4_extents.h. But we can do
better.
This commit defines ext4_zeroout_es() in extents.c and move
EXT_MAX_BLOCKS into ext4.h in order not to include ext4_extents.h in
indirect.c and ioctl.c. Meanwhile we just need to include this file in
extent_status.c when ES_AGGRESSIVE_TEST is defined. Otherwise, this
commit removes a duplicated declaration in trace/events/ext4.h.
After applied this patch, we just need to include ext4_extents.h file
in {super,migrate,move_extents,extents}.c, and it is easy for us to
define a new extent disk layout.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use wait_for_stable_page() instead of wait_on_page_writeback()
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
If ext_debugging is enabled and path[depth].p_ext is NULL, len
and lblock are printed non initialized
Signed-off-by: Andi Shyti <andi@etezian.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This reverts commit bb2314b479.
It wasn't necessarily wrong per se, but we're still busily discussing
the exact details of this all, so I'm going to revert it for now.
It's true that you can already do flink() through /proc and that flink()
isn't new. But as Brad Spengler points out, some secure environments do
not mount proc, and flink adds a new interface that can avoid path
lookup of the source for those kinds of environments.
We may re-do this (and even mark it for stable backporting back in 3.11
and possibly earlier) once the whole discussion about the interface is done.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Following we will begin to add memcg dirty page accounting around
__set_page_dirty_{buffers,nobuffers} in vfs layer, so we'd better use vfs interface to
avoid exporting those details to filesystems.
Since vfs set_page_dirty() should be called under page lock, here we don't need elaborate
codes to handle racy anymore, and two WARN_ON() are added to detect such exceptions.
Thanks very much for Sage and Yan Zheng's coaching!
I tested it in a two server's ceph environment that one is client and the other is
mds/osd/mon, and run the following fsx test from xfstests:
./fsx 1MB -N 50000 -p 10000 -l 1048576
./fsx 10MB -N 50000 -p 10000 -l 10485760
./fsx 100MB -N 50000 -p 10000 -l 104857600
The fsx does lots of mmap-read/mmap-write/truncate operations and the tests completed
successfully without triggering any of WARN_ON.
Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The writepages function was recently merged between writeback
and ordered mode. This completes the change by doing the same
with writepage. The remaining differences in writepage were
left over from some earlier time and not actually doing anything
useful.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
For sync_read/write, it may do multi stripe operations.If one of those
met erro, we return the former successed size rather than a error value.
There is a exception for write-operation met -EOLDSNAPC.If this occur,we
retry the whole write again.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
cephfs . show_layout
>layyout.data_pool: 0
>layout.object_size: 4194304
>layout.stripe_unit: 4194304
>layout.stripe_count: 1
TestA:
>dd if=/dev/urandom of=test bs=1M count=2 oflag=direct
>dd if=/dev/urandom of=test bs=1M count=2 seek=4 oflag=direct
>dd if=test of=/dev/null bs=6M count=1 iflag=direct
The messages from func striped_read are:
ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
ceph: file.c:350 : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT
ceph: file.c:381 : zero tail 4194304
ceph: file.c:390 : striped_read returns 6291456
The hole of file is from 2M--4M.But actualy it zero the last 4M include
the last 2M area which isn't a hole.
Using this patch, the messages are:
ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT
ceph: file.c:358 : zero gap 2097152 to 4194304
ceph: file.c:350 : striped_read 4194304~2097152 (read 4194304) got 2097152
ceph: file.c:384 : striped_read returns 6291456
TestB:
>echo majianpeng > test
>dd if=test of=/dev/null bs=2M count=1 iflag=direct
The messages are:
ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
ceph: file.c:350 : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT
ceph: file.c:390 : striped_read returns 11
For this case,it did once more striped_read.It's no meaningless.
Using this patch, the message are:
ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT
ceph: file.c:384 : striped_read returns 11
Big thanks to Yan Zheng for the patch.
Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
The f2fs_set_link updates its parent inode number, so we should sync this to
the inode block.
Otherwise, the data can be lost after sudden-power-off.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
client reporting a readdir loop.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.21 (GNU/Linux)
iQIcBAABAgAGBQJSG9IHAAoJEDaohF61QIxkptgP/jTEooTZ2uMmIouqj6rhrc81
avrqtB26Ww74XBzuyiTuSBLUUJXGaLveC/i9rR2XOyA7cXJUbrqvqpZ2OExOURsf
gHeLX20RKwKgaRQ6R9Xcri6wjWql4YHL/z3tI/DGpDkMfA0siocbHW+GZSfmMZ8c
EOcAECY+tqrAgFL1oESTinglGNZ2V1f/IyKB8DiUDgDuMkV6Zw3083Ph7xB/bw9T
Jffg6nvkST2mzZNTKgU8W3extW/X9GxxleYLED2jwEYioOFVAlvSKZTFD65NVUya
1l3YH68jROnDSoHmgOup0C6i51/e7QFuBNdyR/8UK2OyOXkJ1wneXutUIiysWWty
dY9+kxSSdpNuZvflGrZlP7yoEjGgRO/893owUcusiSgLTpQpNs2y7OVjCBwsHGa7
AIWyu+FyOnNnmO6oiYkmNQlE3bAoz3z0CO7IuP5lm5HRgMIxp+k2k8yOHon/PcuF
juQsbMydGcNVAjJlQuxCDh1uijOGLbDol/NekpUnyL02oy294raiWdy4IUaqWIHG
Y5i1yeVwFbr7QsI8RCbEliCRwP5YMWSp4irEUozJTDplAn4AnJ5AJdYq9KM5JHMm
qBHXl/asWbxGQZhmig2xzojZ+HVPFVWilpBz7OvsMXJaulrRqxV3DgAudIlZPP4B
mB8DPzctwmgzmlgCqzRW
=+aYs
-----END PGP SIGNATURE-----
Merge tag 'jfs-3.11-rc8' of git://github.com/kleikamp/linux-shaggy
Pull jfs fix from Dave Kleikamp:
"One JFS patch to fix an incompatibility with NFSv4 resulting in the
nfs client reporting a readdir loop"
* tag 'jfs-3.11-rc8' of git://github.com/kleikamp/linux-shaggy:
jfs: fix readdir cookie incompatibility with NFSv4
Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.
Verify that the mounted filesystem is not covered in any significant
way. I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.
Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs. This makes this
test local to the filesystems involved and the results current of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Don't copy bind mounts of /proc/<pid>/ns/mnt between namespaces.
These files hold references to a mount namespace and copying them
between namespaces could result in a reference counting loop.
The current mnt_ns_loop test prevents loops on the assumption that
mounts don't cross between namespaces. Unfortunately unsharing a
mount namespace and shared substrees can both cause mounts to
propogate between mount namespaces.
Add two flags CL_COPY_UNBINDABLE and CL_COPY_MNT_NS_FILE are added to
control this behavior, and CL_COPY_ALL is redefined as both of them.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Don't allow mounting the proc filesystem unless the caller has
CAP_SYS_ADMIN rights over the pid namespace. The principle here is if
you create or have capabilities over it you can mount it, otherwise
you get to live with what other people have mounted.
Andy pointed out that this is needed to prevent users in a user
namespace from remounting proc and specifying different hidepid and gid
options on already existing proc mounts.
Cc: stable@vger.kernel.org
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
The "di_size" variable comes from the disk and it's a signed 64 bit.
We check the upper limit but we should check for negative numbers as
well.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
TO: Dave Chinner <david@fromorbit.com>
CC: Ben Myers <bpm@sgi.com>
CC: linux-kernel@vger.kernel.org
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
During trinity fuzzing in a kvmtool guest, I stumbled across the
following:
Unable to handle kernel NULL pointer dereference at virtual address 00000004
PC is at v9fs_file_do_lock+0xc8/0x1a0
LR is at v9fs_file_do_lock+0x48/0x1a0
[<c01e2ed0>] (v9fs_file_do_lock+0xc8/0x1a0) from [<c0119154>] (locks_remove_flock+0x8c/0x124)
[<c0119154>] (locks_remove_flock+0x8c/0x124) from [<c00d9bf0>] (__fput+0x58/0x1e4)
[<c00d9bf0>] (__fput+0x58/0x1e4) from [<c0044340>] (task_work_run+0xac/0xe8)
[<c0044340>] (task_work_run+0xac/0xe8) from [<c002e36c>] (do_exit+0x6bc/0x8d8)
[<c002e36c>] (do_exit+0x6bc/0x8d8) from [<c002e674>] (do_group_exit+0x3c/0xb0)
[<c002e674>] (do_group_exit+0x3c/0xb0) from [<c002e6f8>] (__wake_up_parent+0x0/0x18)
I believe this is due to an attempt to access utsname()->nodename, after
exit_task_namespaces() has been called, leaving current->nsproxy->uts_ns
as NULL and causing the above dereference.
A similar issue was fixed for lockd in 9a1b6bf818 ("LOCKD: Don't call
utsname()->nodename from nlmclnt_setlockargs"), so this patch attempts
something similar for 9pfs.
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
0. modified inode structure
--------------------------------------
metadata (e.g., i_mtime, i_ctime, etc)
--------------------------------------
direct pointers [0 ~ 873]
inline xattrs (200 bytes by default)
indirect pointers [0 ~ 4]
--------------------------------------
node footer
--------------------------------------
1. setxattr flow
- read_all_xattrs copies all the xattrs from inline and xattr node block.
- handle xattr entries
- write_all_xattrs copies modified xattrs into inline and xattr node block.
2. getxattr flow
- read_all_xattrs copies all the xattrs from inline and xattr node block.
- check target entries
3. Usage
# mount -t f2fs -o inline_xattr $DEV $MNT
Once mounted with the inline_xattr option, f2fs marks all the newly created
files to reserve an amount of inline xattr space explicitly inside the inode
block. Without the mount option, f2fs will not touch any existing files and
newly created files as well.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The __find_xattr is to search the wanted xattr entry starting from the
base_addr.
If not found, the returned entry is the last empty xattr entry that can be
allocated newly.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch enables the number of direct pointers inside on-disk inode block to
be changed dynamically according to the size of inline xattr space.
The number of direct pointers, ADDRS_PER_INODE, can be changed only if the file
has inline xattr flag.
The number of direct pointers that will be used by inline xattrs is defined as
F2FS_INLINE_XATTR_ADDRS.
Current patch assigns F2FS_INLINE_XATTR_ADDRS to 0 temporarily.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds basic inode flags for inline xattrs, F2FS_INLINE_XATTR,
and add a mount option, inline_xattr, which is enabled when xattr is set.
If the mount option is enabled, all the files are marked with the inline_xattrs
flag.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Fix to return -ENOMEM in the kset create and add error handling
case instead of 0, as done elsewhere in this function.
Introduced by commit b59d0bae6c.
(f2fs: add sysfs support for controlling the gc_thread)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: Namjae Jeon <namjae.jeon@samsung.com>
[Jaegeuk Kim: merge the patch with previous modification]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Pull vfs fixes from Al Viro:
"Assorted fixes from the last week or so"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
VFS: collect_mounts() should return an ERR_PTR
bfs: iget_locked() doesn't return an ERR_PTR
efs: iget_locked() doesn't return an ERR_PTR()
proc: kill the extra proc_readfd_common()->dir_emit_dots()
cope with potentially long ->d_dname() output for shmem/hugetlb
This is a set of small bug fixes for lpfc and zfcp and a fix for a fairly
nasty bug in sg where a process which cancels I/O completes in a kernel thread
which would then try to write back to the now gone userspace and end up
writing to a random kernel address instead.
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQEcBAABAgAGBQJSGIaHAAoJEDeqqVYsXL0MbPUH/3UXceHlgYRrwYZ0C10Ao5XB
WA8RWsDsX9UJxG68zEd8ED1aRHhmkfm4pEdMQ8DHW7+B7mvNhpb6mF0wxvmS5aIj
OVI0G+3KmghA3aDWbTtg8ED0wJ4q3ftcyzl4Fhpat+yA4g/BW7iJNDCv17nvZ90f
hNmdGm23wuYCid7JWNDO79spSp0q6wPJhG6ynJYOtzX1GvpEliZiGB0IOR3K44nW
cF6+Uigs3+6RGXX9UHOMrk9Ug3YFHfok224vvydcbRXVh8uneB8RQ6tziJVFI8tE
WPgAv2oDzly8l+ku71CqrjzG7fSwCr9Urlog9cEugE1iUmFCIQm6xJcSSnGJqaY=
=jKT6
-----END PGP SIGNATURE-----
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"This is a set of small bug fixes for lpfc and zfcp and a fix for a
fairly nasty bug in sg where a process which cancels I/O completes in
a kernel thread which would then try to write back to the now gone
userspace and end up writing to a random kernel address instead"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
[SCSI] zfcp: remove access control tables interface (keep sysfs files)
[SCSI] zfcp: fix schedule-inside-lock in scsi_device list loops
[SCSI] zfcp: fix lock imbalance by reworking request queue locking
[SCSI] sg: Fix user memory corruption when SG_IO is interrupted by a signal
[SCSI] lpfc: Don't force CONFIG_GENERIC_CSUM on
This should actually be returning an ERR_PTR on error instead of NULL.
That was how it was designed and all the callers expect it.
[AV: actually, that's what "VFS: Make clone_mnt()/copy_tree()/collect_mounts()
return errors" missed - originally collect_mounts() was expected to return
NULL on failure]
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
iget_locked() returns a NULL on error, it doesn't return an ERR_PTR.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The iget_locked() function returns NULL on error and never an ERR_PTR.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
proc_readfd_common() does dir_emit_dots() twice in a row,
we need to do this only once.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
dynamic_dname() is both too much and too little for those - the
output may be well in excess of 64 bytes dynamic_dname() assumes
to be enough (thanks to ashmem feeding really long names to
shmem_file_setup()) and vsnprintf() is an overkill for those
guys.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
It can take a long time to run log recovery operation because it is
single threaded and is bound by read latency. We can find that it took
most of the time to wait for the read IO to occur, so if one object
readahead is introduced to log recovery, it will obviously reduce the
log recovery time.
Log recovery time stat:
w/o this patch w/ this patch
real: 0m15.023s 0m7.802s
user: 0m0.001s 0m0.001s
sys: 0m0.246s 0m0.107s
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
At xfs_ail_min(), we do check if the AIL list is empty or not before
returning the first item in it with list_empty() and list_first_entry().
This can be simplified a bit with a new list operation routine that is
the list_first_entry_or_null() which has been introduced by:
commit 6d7581e62f
list: introduce list_first_entry_or_null
v2: make xfs_ail_min() as a static inline function and move it to
xfs_trans_priv.h as per Dave Chinner's comments.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Fix the issue with improper counting number of flying bio requests for
BIO_EOPNOTSUPP error detection case.
The sb_nbio must be incremented exactly the same number of times as
complete() function was called (or will be called) because
nilfs_segbuf_wait() will call wail_for_completion() for the number of
times set to sb_nbio:
do {
wait_for_completion(&segbuf->sb_bio_event);
} while (--segbuf->sb_nbio > 0);
Two functions complete() and wait_for_completion() must be called the
same number of times for the same sb_bio_event. Otherwise,
wait_for_completion() will hang or leak.
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove double call of bio_put() in nilfs_end_bio_write() for the case of
BIO_EOPNOTSUPP error detection. The issue was found by Dan Carpenter
and he suggests first version of the fix too.
Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the code initializizes mp->m_icsb_mutex and other things
_after_ register_hotcpu_notifier().
As the notifier takes mp->m_icsb_mutex it can happen
that it takes the lock before it's initialization.
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
After reclaiming state that was lost, the NFS client tries to reclaim
any locks, and then checks that each one has NFS_LOCK_INITIALIZED set
(which means that the server has confirmed the lock).
However if the client holds a delegation, nfs_reclaim_locks() simply aborts
(or more accurately it called nfs_lock_reclaim() and that returns without
doing anything).
This is because when a delegation is held, the server doesn't need to
know about locks.
So if a delegation is held, NFS_LOCK_INITIALIZED is not expected, and
its absence is certainly not an error.
So don't print the warnings if NFS_DELGATED_STATE is set.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fix up the wording of sysfs_create/remove_groups() a bit.
Reported-by: Anthony Foiani <tkil@scrye.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Add XFS superblock v4 support for the file type field in the
directory entry feature.
This support adds a feature bit for version 4 superblocks and
leaves the original superblock 5 incompatibility bit.
Signed-off-by: Mark Tinguely <tinguely@sgi.com>
Reviewed-by: Geoffrey Wehrman <gwehrman@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add support to propagate and add filetype values into the on-disk
directs. This involves passing the filetype into the xfs_da_args
structure along with the name and namelength for direct operations,
and encoding it into the dirent at the same time we write the inode
number into the dirent.
With write support, add the feature flag to the
XFS_SB_FEAT_INCOMPAT_ALL mask so we can now mount filesystems with
this feature set.
Performance of directory recursion is now much improved. Parallel
walk of ~50 million directory entries across hundreds of directories
improves significantly. Unpatched, no CRCs:
Walking via ls -R
real 3m19.886s
user 6m36.960s
sys 28m19.087s
THis is doing roughly 500 getdents() calls per second, and 250,000
inode lookups per second to determine the inode type at roughly
17,000 read IOPS. CPU usage is 90% kernel space.
With dtype support patched in and the fileset recreated with CRCs
enabled:
Walking via ls -R
real 0m31.316s
user 6m32.975s
sys 0m21.111s
This is doing roughly 3500 getdents() calls per second at 16,000
IOPS. There are no inode lookups at all. CPU usages is almost 100%
userspace.
This is a big win for recursive directory walks that only need to
find file names and file types.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add support for the file type field in directory entries so that
readdir can return the type of the inode the dirent points to to
userspace without first having to read the inode off disk.
The encoding of the type field is a single byte that is added to the
end of the directory entry name length. For all intents and
purposes, it appends a "hidden" byte to the name field which
contains the type information. As the directory entry is already of
dynamic size, helpers are already required to access and decode the
direct entry structures.
Hence the relevent extraction and iteration helpers are updated to
understand the hidden byte. Helpers for reading and writing the
filetype field from the directory entries are also added. Only the
read helpers are used by this patch. It also adds all the code
necessary to read the type information out of the dirents on disk.
Further we add the superblock feature bit and helpers to indicate
that we understand the on-disk format change. This is not a
compatible change - existing kernels cannot read the new format
successfully - so an incompatible feature flag is added. We don't
yet allow filesystems to mount with this flag yet - that will be
added once write support is added.
Finally, the code to take the type from the VFS, convert it to an
XFS on-disk type and put it into the xfs_name structures passed
around is added, but the directory code does not use this field yet.
That will be in the next patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add tracepoints to nfs41_setup_sequence and nfs41_sequence_done
to track session and slot table state changes.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set up tracepoints to track read, write and commit, as well as
pNFS reads and writes and commits to the data server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set up tracepoints to track when delegations are set, reclaimed,
returned by the client, or recalled by the server.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set up basic tracepoints for debugging NFSv4 setattr, access,
readlink, readdir, get_acl set_acl get_security_label,
and set_security_label.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set up basic tracepoints for debugging NFSv4 lookup, unlink/remove,
symlink, mkdir, mknod, fs_locations and secinfo.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Set up basic tracepoints for debugging client id creation/destruction
and session creation/destruction.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When doing an open of a directory, ensure that we do pass the lookup flags
from nfs_atomic_open into nfs_lookup.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Add tracepoints for inode attribute updates, attribute revalidation,
writeback start/end fsync start/end, attribute change start/end,
permission check start/end.
The intention is to enable performance tracing using 'perf'as well as
improving debugging.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Optimise for the case where we only do one lookup.
Clean up the code so it is obvious that silly[] is not a dynamic array.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
We always encode to __be32 format in XDR: silences a sparse warning.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Andy Adamson <andros@netapp.com>
Technically, we don't really need to convert these time stamps,
since they are actually cookies.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Chuck Lever <Chuck.Lever@oracle.com>
This fixes the coding style warnings in fs/sysfs/file.c for broken
strings across lines.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The export should happen after the function, not at the bottom of the
file, so fix that up.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sysfs_remove_group() never had kerneldoc, so add it, and fix up the
kerneldoc for sysfs_remove_groups() which didn't specify the parameters
properly.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
checkpatch complains about the broken string in the file, and it's
correct, so fix it up.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This fixes up the coding style issue of incorrectly placing the
EXPORT_SYMBOL_GPL() macro, it should be right after the function itself,
not at the end of the file.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
These functions are being open-coded in 3 different places in the driver
core, and other driver subsystems will want to start doing this as well,
so move it to the sysfs core to keep it all in one place, where we know
it is written properly.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There is a nasty bug in the SCSI SG_IO ioctl that in some circumstances
leads to one process writing data into the address space of some other
random unrelated process if the ioctl is interrupted by a signal.
What happens is the following:
- A process issues an SG_IO ioctl with direction DXFER_FROM_DEV (ie the
underlying SCSI command will transfer data from the SCSI device to
the buffer provided in the ioctl)
- Before the command finishes, a signal is sent to the process waiting
in the ioctl. This will end up waking up the sg_ioctl() code:
result = wait_event_interruptible(sfp->read_wait,
(srp_done(sfp, srp) || sdp->detached));
but neither srp_done() nor sdp->detached is true, so we end up just
setting srp->orphan and returning to userspace:
srp->orphan = 1;
write_unlock_irq(&sfp->rq_list_lock);
return result; /* -ERESTARTSYS because signal hit process */
At this point the original process is done with the ioctl and
blithely goes ahead handling the signal, reissuing the ioctl, etc.
- Eventually, the SCSI command issued by the first ioctl finishes and
ends up in sg_rq_end_io(). At the end of that function, we run through:
write_lock_irqsave(&sfp->rq_list_lock, iflags);
if (unlikely(srp->orphan)) {
if (sfp->keep_orphan)
srp->sg_io_owned = 0;
else
done = 0;
}
srp->done = done;
write_unlock_irqrestore(&sfp->rq_list_lock, iflags);
if (likely(done)) {
/* Now wake up any sg_read() that is waiting for this
* packet.
*/
wake_up_interruptible(&sfp->read_wait);
kill_fasync(&sfp->async_qp, SIGPOLL, POLL_IN);
kref_put(&sfp->f_ref, sg_remove_sfp);
} else {
INIT_WORK(&srp->ew.work, sg_rq_end_io_usercontext);
schedule_work(&srp->ew.work);
}
Since srp->orphan *is* set, we set done to 0 (assuming the
userspace app has not set keep_orphan via an SG_SET_KEEP_ORPHAN
ioctl), and therefore we end up scheduling sg_rq_end_io_usercontext()
to run in a workqueue.
- In workqueue context we go through sg_rq_end_io_usercontext() ->
sg_finish_rem_req() -> blk_rq_unmap_user() -> ... ->
bio_uncopy_user() -> __bio_copy_iov() -> copy_to_user().
The key point here is that we are doing copy_to_user() on a
workqueue -- that is, we're on a kernel thread with current->mm
equal to whatever random previous user process was scheduled before
this kernel thread. So we end up copying whatever data the SCSI
command returned to the virtual address of the buffer passed into
the original ioctl, but it's quite likely we do this copying into a
different address space!
As suggested by James Bottomley <James.Bottomley@hansenpartnership.com>,
add a check for current->mm (which is NULL if we're on a kernel thread
without a real userspace address space) in bio_uncopy_user(), and skip
the copy if we're on a kernel thread.
There's no reason that I can think of for any caller of bio_uncopy_user()
to want to do copying on a kernel thread with a random active userspace
address space.
Huge thanks to Costa Sapuntzakis <costa@purestorage.com> for the
original pointer to this bug in the sg code.
Signed-off-by: Roland Dreier <roland@purestorage.com>
Tested-by: David Milburn <dmilburn@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
For XFS, add support for Q_XGETQSTATV quotactl command.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
XFS now supports three types of quotas (user, group and project).
Current version of Q_XGETSTAT has support for only two types of quotas.
In order to support three types of quotas, the interface, specifically
struct fs_quota_stat, need to be expanded. Current version of fs_quota_stat
does not allow expansion without breaking backward compatibility.
So, a quotactl command and new fs_quota_stat structure need to be added.
This patch adds a new command Q_XGETQSTATV to quotactl() which takes
a new data structure fs_quota_statv. This new data structure provides
support for future expansion and backward compatibility.
Callers of the new quotactl command have to set the version of the data
structure being passed, and kernel will fill as much data as requested.
If the kernel does not support the user-space provided version, EINVAL
will be returned. User-space can reduce the version number and call the same
quotactl again.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Rich Johnston <rjohnston@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
[v2: Applied rjohnston's suggestions as per Chandra's request. -bpm]
Follow up with xfs naming style.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Ever since commit 6168f62cb (Add ACCESS operation to OPEN compound)
the NFSv4 atomic open has primed the access cache, and so nfs_permission
will no longer do an RPC call on the wire.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Don't emit OOM warnings when k.alloc calls fail when
there there is a v.alloc immediately afterwards.
Converted a kmalloc/vmalloc with memset to kzalloc/vzalloc.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This patch removes a false-alaramed BUG_ON.
The previous BUG_ON condition didn't cover the following true scenario.
In f2fs_add_link, 1) get_new_data_page gives an uptodate page successfully,
and then, 2) init_inode_metadata returns -ENOSPC.
At this moment, a new clean data page is remained in the page cache, but its
block address still indicates NEW_ADDR.
After then, even if sync is called, this clean data page cannot be written to
the disk due to the clean state.
So this means that get_lock_data_page should make a new empty page when its
block address is NEW_ADDR and its page is not uptodated.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When any of the caches create fails in init_f2fs_fs(), the other caches which are
create successful should be free.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
We need to check the glock ref counter in a race free way
in order to ensure that the gfs2_glock_hold() call will
succeed. The easiest way to do that is to simply take the
reference count early in the common code of examine_bucket,
skipping any glocks with zero ref count.
That means that the examiner functions all need to put their
reference on the glock once they've performed their function.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>
In the previous commit, Richard Genoud fixed proc_root_readdir(), which
had lost the check for whether all of the non-process /proc entries had
been returned or not.
But that in turn exposed _another_ bug, namely that the original readdir
conversion patch had yet another problem: it had lost the return value
of proc_readdir_de(), so now checking whether it had completed
successfully or not didn't actually work right anyway.
This reinstates the non-zero return for the "end of base entries" that
had also gotten lost in commit f0c3b5093a ("[readdir] convert
procfs"). So now you get all the base entries *and* you get all the
process entries, regardless of getdents buffer size.
(Side note: the Linux "getdents" manual page actually has a nice example
application for testing getdents, which can be easily modified to use
different buffers. Who knew? Man-pages can be useful)
Reported-by: Emmanuel Benisty <benisty.e@gmail.com>
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Cc: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In pstore write, add character 'C'(compressed) or 'D'(decompressed)
in the header while writing to Ram persistent buffer. In pstore read,
read the header and update the 'compressed' flag accordingly.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
In case decompression fails, add a ".enc.z" to indicate the file has
compressed data. This will help user space utilities to figure
out the file contents.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Based on the flag 'compressed' set or not, pstore will decompress the
data returning a plain text file. If decompression fails for a particular
record it will have the compressed data in the file which can be
decompressed with 'openssl' command line tool.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Backends will set the flag 'compressed' after reading the log from
persistent store to indicate the data being returned to pstore is
compressed or not.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Add compression support to pstore which will help in capturing more data.
Initially, pstore will make a call to kmsg_dump with a bigger buffer
and will pass the size of bigger buffer to kmsg_dump and then compress
the data to registered buffer of registered size.
In case compression fails, pstore will capture the uncompressed
data by making a call again to kmsg_dump with registered_buffer
of registered size.
Pstore will indicate the data is compressed or not with a flag
in the write callback.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pstore will make use of deflate and inflate algorithm to compress and decompress
the data. So when Pstore is enabled select zlib_deflate and zlib_inflate.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Addition of new argument 'compressed' in the write call back will
help the backend to know if the data passed from pstore is compressed
or not (In case where compression fails.). If compressed, the backend
can add a tag indicating the data is compressed while writing to
persistent store.
Signed-off-by: Aruna Balakrishnaiah <aruna@linux.vnet.ibm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
d_alloc_name() returns NULL on error. Also I changed the error code
from -ENOSPC to -ENOMEM to reflect that we were short on RAM not disk
space.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Commit f0c3b5093a ("[readdir] convert procfs") introduced a bug on the
listing of the proc file-system. The return value of proc_readdir()
isn't tested anymore in the proc_root_readdir function.
This lead to an "interesting" behaviour when we are using the getdents()
system call with a buffer too small: instead of failing, it returns the
first entries of /proc (enough to fill the given buffer), plus the PID
directories.
This is not triggered on glibc (as getdents is called with a 32KB
buffer), but on uclibc, the buffer size is only 1KB, thus some proc
entries are missing.
See https://lkml.org/lkml/2013/8/12/288 for more background.
Signed-off-by: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull gfs2 fixes from Steven Whitehouse:
"Out of these five patches, the one for ensuring that the number of
revokes is not exceeded, and the one for checking the glock is not
already held in gfs2_getxattr are the two most important. The latter
can be triggered by selinux.
The other three patches are very small and fix mostly fairly trivial
issues"
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-fixes:
GFS2: Check for glock already held in gfs2_getxattr
GFS2: alloc_workqueue() doesn't return an ERR_PTR
GFS2: don't overrun reserved revokes
GFS2: WQ_NON_REENTRANT is meaningless and going away
GFS2: Fix typo in gfs2_create_inode()
Since gfs2_sync_meta() is only called from a single file, lets move
it to lops.c where it is used, and mark it static. At the same
time, we can clean up the meta_io.h header too.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Since the introduction of atomic_open, gfs2_getxattr can be
called with the glock already held, so we need to allow for
this.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-by: David Teigland <teigland@redhat.com>
Tested-by: David Teigland <teigland@redhat.com>
alloc_workqueue() returns a NULL on error, it doesn't return an ERR_PTR.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
When run during fsync, a gfs2_log_flush could happen between the
time when gfs2_ail_flush checked the number of blocks to revoke,
and when it actually started the transaction to do those revokes.
This occassionally caused it to need more revokes than it reserved,
causing gfs2 to crash.
Instead of just reserving enough revokes to handle the blocks that
currently need them, this patch makes gfs2_ail_flush reserve the
maximum number of revokes it can, without increasing the total number
of reserved log blocks. This patch also passes the number of reserved
revokes to __gfs2_ail_flush() so that it doesn't go over its limit
and cause a crash like we're seeing. Non-fsync calls to __gfs2_ail_flush
will still cause a BUG() necessary revokes are skipped.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.
This patch doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: cluster-devel@redhat.com
PTR_RET should be PTR_ERR
Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
An error "label at end of compound statement" will occur if CONFIG_F2FS_STAT_FS
disabled.
fs/f2fs/segment.c:556:1: error: label at end of compound statement
So clean up the 'out' label to fix it.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In f2fs_write_inode, updating inode after f2fs_balance_fs is not
a optimized way in the case that f2fs_gc is performed ahead. The
inode page will be unnecessarily written out twice, one of which
is in f2fs_gc->...->sync_node_pages and the other is in
update_inode_page.
Let's update the inode page in prior to f2fs_balance_fs to avoid
this.
To reproduce it,
$ touch file (before this step, should make the device need f2fs_gc)
$ sync (or wait the bdi to write dirty inode)
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
alloc_page() returns a NULL on failure, it never returns an ERR_PTR.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJSD2uJAAoJENNvdpvBGATwPwEP/R1Yu0UReTDVSmRwn8MGbk5O
M7QwPgUVnEdqr86v4qx7kqOmKtwSQKfSr7+07UGwEBVasPPI1ezXT5c2VAwz146z
5EClKIJoZoKIdWwM0NvUYxb7p5HgTLubho5RTsHmjOGIy37ju+5rFKZ44aHt9siY
J3w5v1EruHdDXwjyiZcIeHFUBGV34x9zPtIGy3yY06gEPAUMeEXShiH/Uj3EroKH
50PkwECBvqFuO4HpunhSYfNc3csTir2ZAx4judaV0s5ebyxzyZ398ql581q73mvP
m5KREYYVLQnN9VzCnsjLsnyZboh6XoTxmlmblNVOr0LihyKLS/Nof6pvrWEqgnBf
6NpR6Y8cPVUq+rhhXwFx/RIkVh2EIME0T5dZplr7fVaPxzI1ZKud8p18WifxMAix
y3rbdZiEA//OuH5u5XqV0/zv3aJUP7XjdT8dia5iq3FazFOdyT1VfVXzAfaXyqZk
JlFguSrOJbQe+vgk2gFLWLH8L4wsnGlSUv1AeDEboriushmZqYlAn10y6ENnM5yW
IMFNA6Wyz35MghsKMKV57H7B05VS8em6NGtHZVhTBt7JoiFn7PBvic/KcK/czpDy
W2Y8K8azhuscmQfa/RjuVDtwakbNwTu341e7hIUOy41qL/x9nNr07ZRAeF68Nesd
p4uRmJxLSW9PReYp9n4V
=OroQ
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull jbd2 bug fixes from Ted Ts'o:
"Two jbd2 bug fixes, one of which is a regression fix"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd2: Fix oops in jbd2_journal_file_inode()
jbd2: Fix use after free after error in jbd2_journal_dirty_metadata()
The following race can lead to a loss of i_disksize update from truncate
thus resulting in a wrong inode size if the inode size isn't updated
again before inode is reclaimed:
ext4_setattr() mpage_map_and_submit_extent()
EXT4_I(inode)->i_disksize = attr->ia_size;
... ...
disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
/* False because i_size isn't
* updated yet */
if (disksize > i_size_read(inode))
/* True, because i_disksize is
* already truncated */
if (disksize > EXT4_I(inode)->i_disksize)
/* Overwrite i_disksize
* update from truncate */
ext4_update_i_disksize()
i_size_write(inode, attr->ia_size);
For other places updating i_disksize such race cannot happen because
i_mutex prevents these races. Writeback is the only place where we do
not hold i_mutex and we cannot grab it there because of lock ordering.
We fix the race by doing both i_disksize and i_size update in truncate
atomically under i_data_sem and in mpage_map_and_submit_extent() we move
the check against i_size under i_data_sem as well.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Merge conditions in ext4_setattr() handling inode size changes, also
move ext4_begin_ordered_truncate() call somewhat earlier because it
simplifies error recovery in case of failure. Also add error handling in
case i_disksize update fails.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Inode size can arbitrarily change while writeback is in progress. When
ext4_writepages() has prepared a long extent for mapping and truncate
then reduces i_size, mpage_map_and_submit_buffers() will always map just
one buffer in a page instead of all of them due to lblk < blocks check.
So we end up not using all blocks we've allocated (thus leaking them)
and also delalloc accounting goes wrong manifesting as a warning like:
ext4_da_release_space:1333: ext4_da_release_space: ino 12, to_free 1
with only 0 reserved data blocks
Note that the problem can happen only when blocksize < pagesize because
otherwise we have only a single buffer in the page.
Fix the problem by removing the size check from the mapping loop. We
have an extent allocated so we have to use it all before checking for
i_size. We also rename add_page_bufs_to_extent() to
mpage_process_page_bufs() and make that function submit the page for IO
if all buffers (upto EOF) in it are mapped.
Reported-by: Dave Jones <davej@redhat.com>
Reported-by: Zheng Liu <gnehzuil.liu@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Currently the logic whether the current buffer can be added to an extent
of buffers to map is split between mpage_add_bh_to_extent() and
add_page_bufs_to_extent(). Move the whole logic to
mpage_add_bh_to_extent() which makes things a bit more straightforward
and make following i_size fixes easier.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
reaim workfile.dbase test easily triggers warning in
ext4_da_update_reserve_space():
EXT4-fs warning (device ram0): ext4_da_update_reserve_space:365:
ino 12, allocated 1 with only 0 reserved metadata blocks (releasing 1
blocks with reserved 9 data blocks)
The problem is that (one of) tests creates file and then randomly writes
to it with O_SYNC. That results in writing back pages of the file in
random order so we create extents for written blocks say 0, 2, 4, 6, 8
- this last allocation also allocates new block for extents. Then we
writeout block 1 so we have extents 0-2, 4, 6, 8 and we release
indirect extent block because extents fit in the inode again. Then we
writeout block 10 and we need to allocate indirect extent block again
which triggers the warning because we don't have the reservation
anymore.
Fix the problem by giving back freed metadata blocks resulting from
extent merging into inode's reservation pool.
Signed-off-by: Jan Kara <jack@suse.cz>
ext4 needs to convert allocated (metadata) blocks back into blocks
reserved for delayed allocation. Add functions into quota code for
supporting such operation.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In no journal mode, if an inode has recently been deleted, we
shouldn't reuse it right away. Otherwise it's possible, after an
unclean shutdown, to hit a situation where a recently deleted inode
gets reused for some other purpose before the inode table block has
been written to disk. However, if the directory entry has been
updated, then the directory entry will be pointing at the old inode
contents.
E2fsck will make sure the file system is consistent after the
unclean shutdown. However, if the recently deleted inode is a
character mode device, or an inode with the immutable bit set, even
after the file system has been fixed up by e2fsck, it can be
possible for a *.pyc file to be pointing at a character mode
device, and when python tries to open the *.pyc file, Hilarity
Ensues. We could change all of userspace to be very suspicious
about stat'ing files before opening them, and clearing the
immutable flag if necessary --- or we can just avoid reusing an
inode number if it has been recently deleted.
Google-Bug-Id: 10017573
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When ext4_rename() overwrites an already existing file, call
ext4_alloc_da_blocks() before starting the journal handle which
actually does the rename, instead of doing this afterwards. This
improves the likelihood that the contents will survive a crash if an
application replaces a file using the sequence:
1) write replacement contents to foo.new
2) <omit fsync of foo.new>
3) rename foo.new to foo
It is still not a guarantee, since ext4_alloc_da_blocks() is *not*
doing a file integrity sync; this means if foo.new is a very large
file, it may not be completely flushed out to disk.
However, for files smaller than a megabyte or so, any dirty pages
should be flushed out before we do the rename operation, and so at the
next journal commit, the CACHE FLUSH command will make sure al of
these pages are safely on the disk platter.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
In ext4_rename(), don't start the journal handle until the the
directory entries have been successfully looked up.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Add a new fiemap flag which forces the all of the extents in an inode
to be cached in the extent_status tree. This is critically important
when using AIO to a preallocated file, since if we need to read in
blocks from the extent tree, the io_submit(2) system call becomes
synchronous, and the AIO is no longer "A", which is bad.
In addition, for most files which have an external leaf tree block,
the cost of caching the information in the extent status tree will be
less than caching the entire 4k block in the buffer cache. So it is
generally a win to keep the extent information cached.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
When we read in an extent tree leaf block from disk, arrange to have
all of its entries cached. In nearly all cases the in-memory
representation will be more compact than the on-disk representation in
the buffer cache, and it allows us to get the information without
having to traverse the extent tree for successive extents.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Don't use an unsigned long long for the es_status flags; this requires
that we pass 64-bit values around which is painful on 32-bit systems.
Instead pass the extent status flags around using the low 4 bits of an
unsigned int, and shift them into place when we are reading or writing
es_pblk.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
When we find an invalid extent tree block, report the block number of
the bad block for debugging purposes.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Refactor out the code needed to read the extent tree block into a
single read_extent_tree_block() function. In addition to simplifying
the code, it also makes sure that we call the ext4_ext_load_extent
tracepoint whenever we need to read an extent tree block from disk.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
Commit 0713ed0cde added
jbd2_journal_file_inode() call into ext4_block_zero_page_range().
However that function gets called from truncate path and thus inode
needn't have jinode attached - that happens in ext4_file_open() but
the file needn't be ever open since mount. Calling
jbd2_journal_file_inode() without jinode attached results in the oops.
We fix the problem by attaching jinode to inode also in ext4_truncate()
and ext4_punch_hole() when we are going to zero out partial blocks.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Ben Tebulin reported:
"Since v3.7.2 on two independent machines a very specific Git
repository fails in 9/10 cases on git-fsck due to an SHA1/memory
failures. This only occurs on a very specific repository and can be
reproduced stably on two independent laptops. Git mailing list ran
out of ideas and for me this looks like some very exotic kernel issue"
and bisected the failure to the backport of commit 53a59fc67f ("mm:
limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").
That commit itself is not actually buggy, but what it does is to make it
much more likely to hit the partial TLB invalidation case, since it
introduces a new case in tlb_next_batch() that previously only ever
happened when running out of memory.
The real bug is that the TLB gather virtual memory range setup is subtly
buggered. It was introduced in commit 597e1c3580 ("mm/mmu_gather:
enable tlb flush range in generic mmu_gather"), and the range handling
was already fixed at least once in commit e6c495a96c ("mm: fix the TLB
range flushed when __tlb_remove_page() runs out of slots"), but that fix
was not complete.
The problem with the TLB gather virtual address range is that it isn't
set up by the initial tlb_gather_mmu() initialization (which didn't get
the TLB range information), but it is set up ad-hoc later by the
functions that actually flush the TLB. And so any such case that forgot
to update the TLB range entries would potentially miss TLB invalidates.
Rather than try to figure out exactly which particular ad-hoc range
setup was missing (I personally suspect it's the hugetlb case in
zap_huge_pmd(), which didn't have the same logic as zap_pte_range()
did), this patch just gets rid of the problem at the source: make the
TLB range information available to tlb_gather_mmu(), and initialize it
when initializing all the other tlb gather fields.
This makes the patch larger, but conceptually much simpler. And the end
result is much more understandable; even if you want to play games with
partial ranges when invalidating the TLB contents in chunks, now the
range information is always there, and anybody who doesn't want to
bother with it won't introduce subtle bugs.
Ben verified that this fixes his problem.
Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com>
Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au>
Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NFSv4 reserves readdir cookie values 0-2 for special entries (. and ..),
but jfs allows a value of 2 for a non-special entry. This incompatibility
can result in the nfs client reporting a readdir loop.
This patch doesn't change the value stored internally, but adds one to
the value exposed to the iterate method.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Tested-by: Christian Kujau <lists@nerdbynature.de>
When a transaction is cancelled and the buffer log item is clean in
the transaction, the buffer log item is unconditionally freed. If
the log item is in the AIL, however, this leads to a use after free
condition as the item still has other users.
In this case, xfs_buf_item_relse() should only be called on clean
buffer items if the reference count has dropped to zero. This
ensures only the last user frees the item.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Check for CAP_SYS_ADMIN since the caller can truncate preallocated
blocks from files they do not own nor have write access to. A more
fine grained access check was considered: require the caller to
specify their own uid/gid and to use inode_permission to check for
write, but this would not catch the case of an inode not reachable
via path traversal from the callers mount namespace.
Add check for read-only filesystem to free eofblocks ioctl.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Have eofblocks ioctl convert uid_t to kuid_t into internal structure.
Update internal filter matching to compare ids with kuid_t types.
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Use uint32 from init_user_ns for xfs internal uid/gid
representation in xfs_icdinode, xfs_dqid_t.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Use inode_capable() to check if SUID|SGID bits should be cleared to match
similar check in inode_change_ok().
The check for CAP_LINUX_IMMUTABLE was not modified since all other file
systems also check against init_user_ns rather than current_user_ns.
Only allow changing of projid from init_user_ns.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Change permission check for setting ACL to use inode_owner_or_capable()
which will additionally allow a CAP_FOWNER user in a user namespace to
be able to set an ACL on an inode covered by the user namespace mapping.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Dwight Engen <dwight.engen@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This patch implements fallocate and punch hole support for Ceph kernel client.
Signed-off-by: Li Wang <liwang@ubuntukylin.com>
Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
ceph_check_caps() requests new max size only when there is Fw cap.
If we call check_max_size() while there is no Fw cap. It updates
i_wanted_max_size and calls ceph_check_caps(), but ceph_check_caps()
does nothing. Later when Fw cap is issued, we call check_max_size()
again. But i_wanted_max_size is equal to 'endoff' at this time, so
check_max_size() doesn't call ceph_check_caps() and we end up with
waiting for the new max size forever.
The fix is duplicate ceph_check_caps()'s "request max size" code in
check_max_size(), and make try_get_cap_refs() wait for the Fw cap
before retry requesting new max size.
This patch also removes the "endoff > (inode->i_size << 1)" check
in check_max_size(). It's useless because there is no corresponding
logic in ceph_check_caps().
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
I encountered below deadlock when running fsstress
wmtruncate work truncate MDS
--------------- ------------------ --------------------------
lock i_mutex
<- truncate file
lock i_mutex (blocked)
<- revoking Fcb (filelock to MIX)
send request ->
handle request (xlock filelock)
At the initial time, there are some dirty pages in the page cache.
When the kclient receives the truncate message, it reduces inode size
and creates some 'out of i_size' dirty pages. wmtruncate work can't
truncate these dirty pages because it's blocked by the i_mutex. Later
when the kclient receives the cap message that revokes Fcb caps, It
can't flush all dirty pages because writepages() only flushes dirty
pages within the inode size.
When the MDS handles the 'truncate' request from kclient, it waits
for the filelock to become stable. But the filelock is stuck in
unstable state because it can't finish revoking kclient's Fcb caps.
The truncate pagecache locking has already caused lots of trouble
for use. I think it's time simplify it by introducing a new mutex.
We use the new mutex to prevent concurrent truncate_inode_pages().
There is no need to worry about race between buffered write and
truncate_inode_pages(), because our "get caps" mechanism prevents
them from concurrent execution.
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
The invalidatepage code bails if it encounters a non-zero page offset. The
current logic that does is non-obvious with multiple if statements.
This should be logically and functionally equivalent.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Recently we met quite a lot of random kernel panic issues after enabling
CONFIG_PROC_PAGE_MONITOR. After debuggind we found this has something
to do with following bug in pagemap:
In struct pagemapread:
struct pagemapread {
int pos, len;
pagemap_entry_t *buffer;
bool v2;
};
pos is number of PM_ENTRY_BYTES in buffer, but len is the size of
buffer, it is a mistake to compare pos and len in add_page_map() for
checking buffer is full or not, and this can lead to buffer overflow and
random kernel panic issue.
Correct len to be total number of PM_ENTRY_BYTES in buffer.
[akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition]
Signed-off-by: Yonghua Zheng <younghua.zheng@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with a NULL as
the struct file pointer, it finally result in a null pointer dereference
in ocfs2_duplicate_clusters_by_page.
This patch replace file pointer with inode pointer in
cow_duplicate_clusters to fix this issue.
[jeff.liu@oracle.com: rebased patch against linux-next tree]
Signed-off-by: Tiger Yang <tiger.yang@oracle.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Acked-by: Tao Ma <tm@tao.ma>
Tested-by: David Weber <wb@munzinger.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit 40bd62eb7f ("fs/ocfs2/journal.h: add bits_wanted while
calculating credits in ocfs2_calc_extend_credits").
Unfortunately this change broke fallocate even if there is insufficient
disk space for the preallocation, which is a serious problem.
# df -h
/dev/sda8 22G 1.2G 21G 6% /ocfs2
# fallocate -o 0 -l 200M /ocfs2/testfile
fallocate: /ocfs2/test: fallocate failed: No space left on device
and a kernel warning:
CPU: 3 PID: 3656 Comm: fallocate Tainted: G W O 3.11.0-rc3 #2
Call Trace:
dump_stack+0x77/0x9e
warn_slowpath_common+0xc4/0x110
warn_slowpath_null+0x2a/0x40
start_this_handle+0x6c/0x640 [jbd2]
jbd2__journal_start+0x138/0x300 [jbd2]
jbd2_journal_start+0x23/0x30 [jbd2]
ocfs2_start_trans+0x166/0x300 [ocfs2]
__ocfs2_extend_allocation+0x38f/0xdb0 [ocfs2]
ocfs2_allocate_unwritten_extents+0x3c9/0x520
__ocfs2_change_file_space+0x5e0/0xa60 [ocfs2]
ocfs2_fallocate+0xb1/0xe0 [ocfs2]
do_fallocate+0x1cb/0x220
SyS_fallocate+0x6f/0xb0
system_call_fastpath+0x16/0x1b
JBD2: fallocate wants too many credits (51216 > 4381)
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dave has reported the following lockdep splat:
=================================
[ INFO: inconsistent lock state ]
3.11.0-rc1+ #9 Not tainted
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
{RECLAIM_FS-ON-W} state was registered at:
mark_held_locks+0x81/0xe7
lockdep_trace_alloc+0x5e/0xbc
__alloc_pages_nodemask+0x8b/0x9b6
__get_free_pages+0x20/0x31
get_zeroed_page+0x12/0x14
__pmd_alloc+0x1c/0x6b
huge_pmd_share+0x265/0x283
huge_pte_alloc+0x5d/0x71
hugetlb_fault+0x7c/0x64a
handle_mm_fault+0x255/0x299
__do_page_fault+0x142/0x55c
do_page_fault+0xd/0x16
error_code+0x6c/0x74
irq event stamp: 3136917
hardirqs last enabled at (3136917): _raw_spin_unlock_irq+0x27/0x50
hardirqs last disabled at (3136916): _raw_spin_lock_irq+0x15/0x78
softirqs last enabled at (3136180): __do_softirq+0x137/0x30f
softirqs last disabled at (3136175): irq_exit+0xa8/0xaa
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&mapping->i_mmap_mutex);
<Interrupt>
lock(&mapping->i_mmap_mutex);
*** DEADLOCK ***
no locks held by kswapd0/49.
stack backtrace:
CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
Call Trace:
dump_stack+0x4b/0x79
print_usage_bug+0x1d9/0x1e3
mark_lock+0x1e0/0x261
__lock_acquire+0x623/0x17f2
lock_acquire+0x7d/0x195
mutex_lock_nested+0x6c/0x3a7
page_referenced+0x87/0x5e3
shrink_page_list+0x3d9/0x947
shrink_inactive_list+0x155/0x4cb
shrink_lruvec+0x300/0x5ce
shrink_zone+0x53/0x14e
kswapd+0x517/0xa75
kthread+0xa8/0xaa
ret_from_kernel_thread+0x1b/0x28
which is a false positive caused by hugetlb pmd sharing code which
allocates a new pmd from withing mapping->i_mmap_mutex. If this
allocation causes reclaim then the lockdep detector complains that we
might self-deadlock.
This is not correct though, because hugetlb pages are not reclaimable so
their mapping will be never touched from the reclaim path.
The patch tells lockup detector that hugetlb i_mmap_mutex is special by
assigning it a separate lockdep class so it won't report possible
deadlocks on unrelated mappings.
[peterz@infradead.org: comment for annotation]
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy reported that if file page get reclaimed we lose the soft-dirty bit
if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get
encoded into pte entry. Thus when #pf happens on such non-present pte
we can restore it back.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set
get swapped out, the bit is getting lost and no longer available when
pte read back.
To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in
pte entry for the page being swapped out. When such page is to be read
back from a swap cache we check for bit presence and if it's there we
clear it and restore the former _PAGE_SOFT_DIRTY bit back.
One of the problem was to find a place in pte entry where we can save
the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was
chosen for that, it doesn't intersect with swap entry format stored in
pte.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The xc_cil_lock is used for two purposes - to protect the CIL
itself, and to protect the push/commit state and lists. These are
two logically separate structures and operations, so can have their
own locks. This means that pushing on the CIL and the commit wait
ordering won't contend for a lock with other transactions that are
completing concurrently. As the CIL insertion is the hottest path
throught eh CIL, this is a big win.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Now that all the log item preparation and formatting is done under
the CIL lock, we can get rid of the intermediate log vector chain
used to track items to be inserted into the CIL.
We can already find all the items to be committed from the
transaction handle, so as long as we attach the log vectors to the
item before we insert the items into the CIL, we don't need to
create a log vector chain to pass around.
This means we can move all the item insertion code into and optimise
it into a pair of simple passes across all the items in the
transaction. The first pass does the formatting and accounting, the
second inserts them all into the CIL.
We keep this two pass split so that we can separate the CIL
insertion - which must be done under the CIL spinlock - from the
formatting. We could insert each item into the CIL with a single
pass, but that massively increases the number of times we have to
grab the CIL spinlock. It is much more efficient (and hence
scalable) to do a batch operation and insert all objects in a single
lock grab.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Now that we have the size of the log vector that has been allocated,
we can determine if we need to allocate a new log vector for
formatting and insertion. We only need to allocate a new vector if
it won't fit into the existing buffer.
However, we need to hold the CIL context lock while we do this so
that we can't race with a push draining the currently queued log
vectors. It is safe to do this as long as we do GFP_NOFS allocation
to avoid avoid memory allocation recursing into the filesystem.
Hence we can safely overwrite the existing log vector on the CIL if
it is large enough to hold all the dirty regions of the current
item.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Now that we have the size of the object before the formatting pass
is called, we can allocation the log vector and it's buffer in a
single allocation rather than two separate allocations.
Store the size of the allocated buffer in the log vector so that
we potentially avoid allocation for future modifications of the
object.
While touching this code, remove the IOP_FORMAT definition.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
To begin optimising the CIL commit process, we need to have IOP_SIZE
return both the number of vectors and the size of the data pointed
to by the vectors. This enables us to calculate the size ofthe
memory allocation needed before the formatting step and reduces the
number of memory allocations per item by one.
While there, kill the IOP_SIZE macro.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xlog_find_tail() currently leaks a bp on one error path.
There is no error target, so manually free the bp before
returning the error.
Found by Coverity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xlog_find_zeroed() currently leaks a bp on one error path.
Using the bp_err: target resolves this.
Found by Coverity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfs_attr_node_addname()'s error handling tests whether it
should free "state" in the out: error handling label:
out:
if (state)
xfs_da_state_free(state);
but an earlier free doesn't set state to NULL afterwards; this
could lead to a double free. Fix it by setting state to NULL
after it's freed.
This was found by Coverity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Replace roundup() with roundup_64() as we calculate min_logblks
with 64-bit divisions. Hence, call roundup() will cause the
following error while compiling a 32-bit kernel:
fs/built-in.o: In function `xfs_log_calc_minimum_size':
fs/xfs/xfs_log_rlimit.c:140: undefined reference to `__udivdi3'
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Validate log space during log mount stage, the underlying function
will drop a warning message via syslog in critical level if the log
space is too small or too large.
[ dchinner: For CRC enable filesystems, abort the mounting of the
filesystem as mkfs should never make a log too small for the given
filesystem configuration. ]
[ dchinner: make a note of the fact that the log size limits in
block counts are in units of filesystem blocks, not basic blocks. ]
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Add source files for xfs_log_rlimit.c The new file is used for log
size calculations and validation shared with userspace.
[dchinner: xfs_log_calc_max_attrsetm_res() does not modify the
tr_attrsetm reservation, just calculates the maximum. ]
[dchinner: rework loop in xfs_log_get_max_trans_res() ]
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Refactor xlog_ticket_alloc() to extract a new helper, i.e.
xfs_log_calc_unit_res().
This helper would be used to calculate the total log reservation
size by adding extra log operation/transation headers for a new
log ticket.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Get rid of all XFS_XXX_LOG_RES() macros since they are obsoleted now.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
With the new xfs_trans_res structure has been introduced, the log
reservation size, log count as well as log flags are pre-initialized
at mount time. So it's time to refine xfs_trans_reserve() interface
to be more neat.
Also, introduce a new helper M_RES() to return a pointer to the
mp->m_resv structure to simplify the input.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
tr_writeid is defined at mp->m_resv structure, however, it does not
really being used when it should be..
This patch changes it to tr_writeid to fetch the correct log
reservation size.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
A preparation step.
For now fsync_ts transaction use the pre-calculated log reservation
size of tr_swrite. This patch introduce a new item tr_fsyncts to
mp->m_reservations structure so that we can fetch the log
reservation value for it in a same manner to others.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Introduce a new structure xfs_trans_res to hold transaction
reservation item info per log ticket.
We also need to improve xfs_trans_resv_calc() by initializing the
log count as well as log flags for permanent log reservation.
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The struct xfs_perag has many kernel-only definitions in it,
requiring a __KERNEL__ guard so userspace can use it to. Move it to
xfs_mount.h so that it it kernel-only, and let userspace redefine
it's own version of the structure containing only what it needs.
This gets rid of another __KERNEL__ check in the XFS header files.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfs_types.h is shared with userspace, so having kernel specific
types defined in it is problematic. Move all the kernel specific
defines to xfs_linux.h so we can remove the __KERNEL__ guards from
xfs_types.h.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Pull CIFS fixes from Steve French:
"A set of small cifs fixes, including 3 relating to symlink handling"
* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
cifs: don't instantiate new dentries in readdir for inodes that need to be revalidated immediately
cifs: set sb->s_d_op before calling d_make_root()
cifs: fix bad error handling in crypto code
cifs: file: initialize oparms.reconnect before using it
Do not attempt to do cifs operations reading symlinks with SMB2
cifs: extend the buffer length enought for sprintf() using
so that if ext4 is built as a module, to allow it to be unloaded.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJSCOS0AAoJENNvdpvBGATwOYkQALgOTX2ma7hgCr91ldg3Q69e
bejRm2Prdjs3HXa3MwC5yZ0i2nDXxTjMXidnVcD2QgLwS6rPGQQfrGhQ1OtJqLxA
Gzf6mHIWIaRG9/JtTkzp3ucONbhc13LnaJd9B2bHASlq4+5aJhlziblwapMSvVbI
YixhejVREmXdZ/typvnQOsoPfXzH9NLOTY/asT9IcZK6flhczNIlMC0/zRZQw0+I
pDK/akmdOLpkd57UvJcwdcGIgcioSrg+JXk++EsDe82v2bpxleG2IdLbVZ76whIS
A3FHEJvUfiFPj+GhKMQJyI1sa1ZGmD7nH8n2Yf6TVqZ4jgFx4ox9d83c0PpSLn0+
VTZD601lfAoThUwKRTk45FUI4siZ1zBP13cbTH1DJXvGVllehhC9zBCezlNma4vr
cI+K3d8VBHV7bKSRSgRVcB2pY+bJE1qP5XJXGnfDhjEiecNgXFRm0i2UtYSEl6R7
aQGAdfjxp6tf+gme6c13zFK43FEJ9/DMaC8l0eamiQyH5U4OhOD8Q4ExnHjrfN69
V3yZc4AZq46J9+2H/Jc457NYiFH2YK74jpAcpsVV0fXKFDR8Ss+rKXmfB+TGS45f
Q1eHbKLuA3fHzHfwJCozqnUBxQO41M4vEd/ktqenwsKzeOw0qH3qynxEUd6xgka5
1pcnJZnSGbm1d8pwYLMc
=NuDG
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull more ext4 bugfixes from Ted Ts'o:
"A number of miscellaneous ext4 bugs fixes for v3.11, including a fix
so that if ext4 is built as a module, to allow it to be unloaded"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT
ext4: fix mount/remount error messages for incompatible mount options
ext4: allow the mount options nodelalloc and data=journal
Because it is only used within the kernel.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
It's actually an ifndef section, which means it is only included in
userspace. however, it's deep within the libxfs code, so it's
unlikely that the condition checked in userspace can actually occur
(search an empty leaf) through the libxfs interfaces. i.e. if it can
happen in usrspace, it can happen in the kernel, so remove it from
userspace too....
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There is no reason the remaining kernel-only debug code needs to
remain kernel-only. Kill the __KERNEL__ part of the defines, and let
userspace handle the debug code appropriately.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Userspace running debug builds is relatively rare, so there's need
to special case the allocation algorithm code coverage debug switch.
As it is, userspace defines random numbers to 0, so invert the
logic of the switch so it is effectively a no-op in userspace.
This kills another couple of __KERNEL__ users.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Neither kernel or userspace support shared read-only mounts, so
don't bother special casing the support check to be different
between kernel and userspace. The same check can be used as neither
like it...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
So we don't need xfs_dfrag.h in userspace anymore, move the extent
swap ioctl structure definition to xfs_fs.h where most of the other
ioctl structure definitions are.
Now that we don't need separate files for extent swapping, separate
the basic file descriptor checking code to xfs_ioctl.c, and the code
that does the extent swap operation to xfs_bmap_util.c. This
cleanly separates the user interface code from the physical
mechanism used to do the extent swap.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There are a few small helper functions in xfs_util, all related to
xfs_inode modifications. Move them all to xfs_inode.c so all
xfs_inode operations are consiolidated in the one place.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Move the rename code to xfs_inode.c to continue consolidating
all the kernel xfs_inode operations in the one place.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Now we have xfs_inode.c for holding kernel-only XFS inode
operations, move all the inode operations from xfs_vnodeops.c to
this new file as it holds another set of kernel-only inode
operations. The name of this file traces back to the days of Irix
and it's vnodes which we don't have anymore.
Essentially this move consolidates the inode locking functions
and a bunch of XFS inode operations into the one file. Eventually
the high level functions will be merged into the VFS interface
functions in xfs_iops.c.
This leaves only internal preallocation, EOF block manipulation and
hole punching functions in vnodeops.c. Move these to xfs_bmap_util.c
where we are already consolidating various in-kernel physical extent
manipulation and querying functions.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Some of the code shared with userspace causes compilation warnings
from things turned off in the kernel code, such as differences in
variable signedness. Fix those issues.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
These come from syncing the shared userspace and kernel code. Small
whitespace and trivial cleanups.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There is a bunch of code in xfs_bmap.c that is kernel specific and
not shared with userspace. To minimise the difference between the
kernel and userspace code, shift this unshared code to
xfs_bmap_util.c, and the declarations to xfs_bmap_util.h.
The biggest issue here is xfs_bmap_finish() - userspace has it's own
definition of this function, and so we need to move it out of
xfs_bmap.[ch]. This means several other files need to include
xfs_bmap_util.h as well.
It also introduces and interesting dance for the stack switching
code in xfs_bmapi_allocate(). The stack switching/workqueue code is
actually moved to xfs_bmap_util.c, so that userspace can simply use
a #define in a header file to connect the dots without needing to
know about the stack switch code at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
xfs_mount.c is shared with userspace, but the only functions that
are shared are to do with physical superblock manipulations. This
means that less than 25% of the xfs_mount.c code is actually shared
with userspace. Move all the superblock functions to xfs_sb.c and
share that instead with libxfs.
Note that this will leave all the in-core transaction related
superblock counter modifications in xfs_mount.c as none of that is
shared with userspace. With a few more small changes, xfs_mount.h
won't need to be shared with userspace anymore, either.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The remote symlink format definition and manipulation needs to be
shared with userspace, but the in-kernel interfaces do not. Split
the remote symlink format handling out into xfs_symlink_remote.[ch]
fo it can easily be shared with userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The attribute inactivation code is not used by userspace, so like
the attribute listing, split it out into a separate file to minimise
the differences between the filesystem shared with libxfs in
userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The attribute listing code is not used by userspace, so like the
directory readdir code, split it out into a separate file to
minimise the differences between the filesystem shared with libxfs
in userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Many of the definitions within xfs_dir2_priv.h are needed in
userspace outside libxfs. Definitions within xfs_dir2_priv.h are
wholly contained within libxfs, so we need to shuffle some of the
definitions around to keep consistency across files shared between
user and kernel space.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The directory readdir code is not used by userspace, but it is
intermingled with files that are shared with userspace. This makes
it difficult to compare the differences between the userspac eand
kernel files are the userspace files don't have the getdents code in
them. Move all the kernel getdents code to a separate file to bring
the shared content between userspace and kernel files closer
together.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The only thing remaining in xfs_inode.[ch] are the operations that
read, write or verify physical inodes in their underlying buffers.
Move all this code to xfs_inode_buf.[ch] and so we can stop sharing
xfs_inode.[ch] with userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The inode fork definitions are a combination of on-disk format
definition and in-memory tracking and manipulation. They are both
shared with userspace, so move them all into their own file so
sharing is easy to do and track. This removes all inode fork
related information from xfs_inode.h.
Do the same for the all the C code that currently resides in
xfs_inode.c for the same reason.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The transaction reservation size calculations is used by both kernel
and userspace, but most of the transaction code in xfs_trans.c is
kernel specific. Split all the transaction reservation code out into
it's own files to make sharing with userspace simpler. This just
leaves kernel-only definitions in xfs_trans.h, so it doesn't need to
be shared with userspace anymore, either.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Little things like exported functions, __KERNEL__ protections, and
so on that ensure user and kernel shared headers are identical.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There are a lot of quota flag definitions that are shared by user
and kernel space. Move them all to xfs_quota_defs.h so we can
unshare xfs_quota.h and remove the __KERNEL__ regions from it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There are quite a few realtime device definitions shared with
userspace. Move them from xfs_rtalloc.h to xfs_rt_alloc_defs.h
so we don't need to share xfs_rtalloc.h with userspace anymore.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
There's a bunch of definitions in xfs_trans.h that define on-disk
formats - transaction headers that get written into the log, log
item type definitions, etc. Split out everything into a separate
file so that all which remains in xfs_trans.h are kernel only
definitions.
Also, remove the duplicate magic number definitions for
XFS_TRANS_MAGIC...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The on disk log format definitions for the icreate log item are
intertwined with the kernel-only in-memory log item definitions.
Separate the log format definitions out into their own header file
so they can easily be shared with userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The on disk format definitions of the on-disk dquot, log formats and
quota off log formats are all intertwined with other definitions for
quotas. Separate them out into their own header file so they can
easily be shared with userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The EFI/EFD item format definitions are shared with userspace. Split
the out of header files that contain kernel only defintions to make
it simple to shared them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The log item format definitions are shared with userspace. Split
them out of header files that contain kernel only defintions to make
it simple to shared them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
The on-disk format definitions for the log are spread randoms
through a couple of header files. Consolidate it all in a single
file that can be shared easily with userspace. This means that
xfs_log.h and xfs_log_priv.h no longer need to be shared with
userspace.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When jbd2_journal_dirty_metadata() returns error,
__ext4_handle_dirty_metadata() stops the handle. However callers of this
function do not count with that fact and still happily used now freed
handle. This use after free can result in various issues but very likely
we oops soon.
The motivation of adding __ext4_journal_stop() into
__ext4_handle_dirty_metadata() in commit 9ea7a0df seems to be only to
improve error reporting. So replace __ext4_journal_stop() with
ext4_journal_abort_handle() which was there before that commit and add
WARN_ON_ONCE() to dump stack to provide useful information.
Reported-by: Sage Weil <sage@inktank.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org # 3.2+
Previously we weren't swapping only some of the extent_status LRU
fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl. The
much safer thing to do is to just completely flush the extent status
tree when doing the swap.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
Cc: stable@vger.kernel.org
Previously, f2fs_setxattr assigns i_xattr_nid in the inode page inconsistently.
The scenario is:
= Thread 1 = = Thread 2 = = fi->i_xattr_nid = = on-disk nid =
f2fs_setxattr 0 0
new_node_page X 0
sync_inode_page X X
checkpoint X X -.
grab_cache_page X X |
--> allocate a new xattr node block or -ENOSPC <----------------'
At this moment, the checkpoint stores inconsistent data where the inode has
i_xattr_nid but actual xattr node block is not allocated yet.
So, we should assign the real i_xattr_nid only after its xattr node block is
allocated.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Pull btrfs fixes from Chris Mason:
"These are assorted fixes, mostly from Josef nailing down xfstests
runs. Zach also has a long standing fix for problems with readdir
wrapping f_pos (or ctx->pos)
These patches were spread out over different bases, so I rebased
things on top of rc4 and retested overnight"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: don't loop on large offsets in readdir
Btrfs: check to see if root_list is empty before adding it to dead roots
Btrfs: release both paths before logging dir/changed extents
Btrfs: allow splitting of hole em's when dropping extent cache
Btrfs: make sure the backref walker catches all refs to our extent
Btrfs: fix backref walking when we hit a compressed extent
Btrfs: do not offset physical if we're compressed
Btrfs: fix extent buffer leak after backref walking
Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents
btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified
- Stable patch for lockd to fix Oopses due to inappropriate calls to
utsname()->nodename
- Stable patches for sunrpc to fix Oopses on shutdown when using
AF_LOCAL sockets with rpcbind
- Fix memory leak and error checking issues in nfs4_proc_lookup_mountpoint
- Fix a regression with the sync mount option failing to work for nfs4 mounts
- Fix a writeback performance issue when doing cache invalidation
- Remove an incorrect call to nfs_setsecurity in nfs_fhget
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJSBRqqAAoJEGcL54qWCgDyejoP/RNfkebEex2ugAAfymMOUNjA
qc8X/YO8KDQD5T0X8P9MFXzgSSMPT5YfemxfKDpkjJlYwij8O9D8r1BAfgmhhKFI
bVp9J8pH5Y3jXgLFVsubEc9N9CXLI0yRzcOFfzapYmGJV/ZO4cBAi8HMowFtbREj
l2uk9H8XQ1p8KqVWXhFKoXVL2b9q0CGj5z3+RkE/rD5uuZyusbTB6OhNFArPnwXk
aCx373mlvpIAhU6DkueCiXaHH06Mff2Vlu6eBkGrdfBGC6l+x1nwevxYH750fqSU
8O94rQkw9++qnIIvBJF/g1NzqyDhychJcXtgGLdxYUdWH3c8tevJZxCEj7U/dIJQ
ndEaZGFxSFfdnxrwJBWtB+xsEfe9K4no9JwlkyVi8oZ5j2NUv7cJpA5cdQ4IUf/1
uqTlIxtPHQhHWCUUKpGLlhLiZyvwOPtJvuBl/Pc9UYQbyNtSqjYBk9mamrrIC6FK
mF6jXgWe9x+miBqWYrEdPNLGdx/hUhhqGweYPJa6jTcxif+2l2xGfscWYI89Io/e
qy8YNcHUrRci+o+YfY4lLhk88WBZogzFOYc4jDaLCEL1TE5B2k/jHr9v39V5S7Ks
63bhmfCTxB82uMzFeUWbiPzWQzt030pXPYy/PUPaucy/+QaZC/0lZ3HJKRKI37JJ
ygSD7ndCZJCG+PxLkk49
=W12x
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Stable patch for lockd to fix Oopses due to inappropriate calls to
utsname()->nodename
- Stable patches for sunrpc to fix Oopses on shutdown when using
AF_LOCAL sockets with rpcbind
- Fix memory leak and error checking issues in nfs4_proc_lookup_mountpoint
- Fix a regression with the sync mount option failing to work for nfs4
mounts
- Fix a writeback performance issue when doing cache invalidation
- Remove an incorrect call to nfs_setsecurity in nfs_fhget
* tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix up nfs4_proc_lookup_mountpoint
NFS: Remove unnecessary call to nfs_setsecurity in nfs_fhget()
NFSv4: Fix the sync mount option for nfs4 mounts
NFS: Fix writeback performance issue on cache invalidation
SUNRPC: If the rpcbind channel is disconnected, fail the call to unregister
SUNRPC: Don't auto-disconnect from the local rpcbind socket
LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs
Pull nfsd fixes from Bruce Fields:
"Some fixes for a 4.1 feature that in retrospect probably should have
waited for 3.12.... But it appears to be working now"
* 'for-3.11' of git://linux-nfs.org/~bfields/linux:
nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID
nfsd4: Fix MACH_CRED NULL dereference
The early bug checks are moot because the VMA layer ensures those things.
1. It will not call invalidatepage unless PagePrivate (or PagePrivate2) are set
2. It will not call invalidatepage without taking a PageLock first.
3. Guantrees that the inode page is mapped.
Signed-off-by: Milosz Tanski <milosz@adfin.com>
Reviewed-by: Sage Weil <sage@inktank.com>
All of the early exit paths need to drop the mutex; it is only the normal
path through the function that does not. Skip the unlock in that case
with a goto out_unlocked.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
Only for ceph_sync_write, the osd can return EOLDSNAPC.so move the
related codes after the call ceph_sync_write.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
remove_session_caps() uses iterate_session_caps() to remove caps,
but iterate_session_caps() skips inodes that are being deleted.
So session->s_nr_caps can be non-zero after iterate_session_caps()
return.
We can fix the issue by waiting until deletions are complete.
__wait_on_freeing_inode() is designed for the job, but it is not
exported, so we use lookup inode function to access it.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Func ceph_calc_ceph_pg maybe failed.So add check for returned value.
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Sending reads and writes through the sync read/write paths bypasses the
page cache, which is not expected or generally a good idea. Removing
the write check is safe as there is a conditional vfs_fsync_range() later
in ceph_aio_write that already checks for the same flag (via
IS_SYNC(inode)).
Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
We pass in a u64 value for "len" and then immediately truncate away the
upper 32 bits.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <alex.elder@linaro.org>
The MDS uses caps message to notify clients about deleted inode.
when receiving a such message, invalidate any alias of the inode.
This makes the kernel release the inode ASAP.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
To write data, the writer first acquires the i_mutex, then try getting
caps. The writer may sleep while holding the i_mutex. If the MDS revokes
Fb cap in this case, vmtruncate work can't do its job because i_mutex
is locked. We should wake up the writer and let it truncate the pages.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
To handle "link" request, the MDS need to xlock inode's linklock,
which requires revoking any CAP_LINK_SHARED.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When register_session() is given an out-of-range argument for mds,
ceph_mdsmap_get_addr() will return a null pointer, which would be given to
ceph_con_open() & be dereferenced, causing a kernel oops. This fixes bug #4685
in the Ceph bug tracker <http://tracker.ceph.com/issues/4685>.
Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When btrfs readdir() hits the last entry it sets the readdir offset to a
huge value to stop buggy apps from breaking when the same name is
returned by readdir() with concurrent rename()s.
But unconditionally setting the offset to INT_MAX causes readdir() to
loop returning any entries with offsets past INT_MAX. It only takes a
few hours of constant file creation and removal to create entries past
INT_MAX.
So let's set the huge offset to LLONG_MAX if the last entry has already
overflowed 32bit loff_t. Without large offsets behaviour is identical.
With large offsets 64bit apps will work and 32bit apps will be no more
broken than they currently are if they see large offsets.
Signed-off-by: Zach Brown <zab@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
A user reported a panic when running with autodefrag and deleting snapshots.
This is because we could end up trying to add the root to the dead roots list
twice. To fix this check to see if we are empty before adding ourselves to the
dead roots list. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
The ceph guys tripped over this bug where we were still holding onto the
original path that we used to copy the inode with when logging. This is based
on Chris's fix which was reported to fix the problem. We need to drop the paths
in two cases anyway so just move the drop up so that we don't have duplicate
code. Thanks,
Cc: stable@vger.kernel.org
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
I noticed while running multi-threaded fsync tests that sometimes fsck would
complain about an improper gap. This happens because we fail to add a hole
extent to the file, which was happening when we'd split a hole EM because
btrfs_drop_extent_cache was just discarding the whole em instead of splitting
it. So this patch fixes this by allowing us to split a hole em properly, which
means that added holes actually get logged properly and we no longer see this
fsck error. Thankfully we're tolerant of these sort of problems so a user would
not see any adverse effects of this bug, other than fsck complaining. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Because we don't mess with the offset into the extent for compressed we will
properly find both extents for this case
[extent a][extent b][rest of extent a]
but because we already added a ref for the front half we won't add the inode
information for the second half. This causes us to leak that memory and not
print out the other offset when we do logical-resolve. So fix this by calling
ulist_add_merge and then add our eie to the existing entry if there is one.
With this patch we get both offsets out of logical-resolve. With this and the
other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
set. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
If you do btrfs inspect-internal logical-resolve on a compressed extent that has
been partly overwritten it won't find anything. This is because we try and
match the extent offset we've searched for based on the extent offset in the
data extent entry. However this doesn't work for compressed extents because the
offsets are for the uncompressed size, not the compressed size. So instead only
do this check if we are not compressed, that way we can get an actual entry for
the physical offset rather than nothing for compressed. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
offsetting the physical based on the extent offset. This is perfectly fine with
uncompressed extents, however the extent offset is into the uncompressed area,
not the compressed. So we can return a physical value that isn't at all within
the area we have allocated on disk. Fix this by returning the start of the
extent if it is compressed no matter what the offset. Thanks,
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks)
takes an extra increment on the reference of allocated dummy extent buffer, so now we
cannot free this dummy one, and end up with extent buffer leak.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
For partial extents, snapshot-aware defrag does not work as expected,
since
a) we use the wrong logical offset to search for parents, which should be
disk_bytenr + extent_offset, not just disk_bytenr,
b) 'offset' returned by the backref walking just refers to key.offset, not
the 'offset' stored in btrfs_extent_data_ref which is
(key.offset - extent_offset).
The reproducer:
$ mkfs.btrfs sda
$ mount sda /mnt
$ btrfs sub create /mnt/sub
$ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
$ btrfs sub snap /mnt/sub /mnt/snap1
$ btrfs sub snap /mnt/sub /mnt/snap2
$ sync; btrfs filesystem defrag /mnt/sub/foo;
$ umount /mnt
$ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared.
This addresses the above two problems.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
Create a small file and fallocate it to a big size with
FALLOC_FL_KEEP_SIZE option, then truncate it back to the
small size again, the disk free space is not changed back
in this case. i.e,
total 4
-rw-r--r-- 1 root root 512 Jun 28 11:35 test
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 56K 7.2G 1% /mnt
-rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 5.1G 2.2G 70% /mnt
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 5.1G 2.2G 70% /mnt
With this fix, the truncated up space is back as:
Filesystem Size Used Avail Use% Mounted on
....
/dev/sdb1 8.0G 56K 7.2G 1% /mnt
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
device_close()->recalc_sigpending() is not needed, sigprocmask() takes
care of TIF_SIGPENDING correctly.
And without ->siglock it is racy and wrong, it can wrongly clear
TIF_SIGPENDING and miss a signal.
But even with this patch device_close() is still buggy:
1. sigprocmask() should not be used, we have set_task_blocked(),
but this is minor.
2. We should never block SIGKILL or SIGSTOP, and this is what
the code tries to do.
3. This can't protect against SIGKILL or SIGSTOP anyway. Another
thread can do signal_wake_up(), say, do_signal_stop() or
complete_signal() or debugger.
4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clears
TIF_SIGPENDING, say, freezing() or ->jobctl.
5. device_write() looks equally wrong by the same reason.
Looks like, this tries to protect some wait_event_interruptible() logic
from signals, it should be turned into uninterruptible wait. Or we need
to implement something like signals_stop/start for such a use-case.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Backport of jbd2 commit 169f1a2a87
("jbd2: use a single printk for jbd_debug()")
Since the jbd_debug() is implemented with two separate printk()
calls, it can lead to corrupted and misleading debug output like
the following (see lines marked with "*"):
[ 290.339362] (fs/jbd2/journal.c, 203): kjournald2: kjournald2 wakes
[ 290.339365] (fs/jbd2/journal.c, 155): kjournald2: commit_sequence=42103, commit_request=42104
[ 290.339369] (fs/jbd2/journal.c, 158): kjournald2: OK, requests differ
[* 290.339376] (fs/jbd2/journal.c, 648): jbd2_log_wait_commit:
[* 290.339379] (fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: want 42104, j_commit_sequence=42103
[* 290.339382] JBD2: starting commit of transaction 42104
[ 290.339410] (fs/jbd2/revoke.c, 566): jbd2_journal_write_revoke_records: Wrote 0 revoke records
[ 290.376555] (fs/jbd2/commit.c, 1088): jbd2_journal_commit_transaction: JBD2: commit 42104 complete, head 42079
i.e. the debug output from log_wait_commit and journal_commit_transaction
have become interleaved. The output should have been:
(fs/jbd2/journal.c, 648): jbd2_log_wait_commit: JBD2: want 42104, j_commit_sequence=42103
(fs/jbd2/commit.c, 370): jbd2_journal_commit_transaction: JBD2: starting commit of transaction 42104
It is expected that this is not easy to replicate -- I was only able
to cause it on preempt-rt kernels, and even then only under heavy
I/O load.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Previously xattr node blocks are stored to the COLD_NODE log, which means that
our roll-forward mechanism doesn't recover the xattr node blocks at all.
Only the direct node blocks in the WARM_NODE log can be recovered.
So, let's resolve the issue simply by conducting checkpoint during fsync when a
file has a modified xattr node block.
This approach is able to degrade the performance, but normally the checkpoint
overhead is shown at the initial fsync call after the xattr entry changes.
Once the checkpoint is done, no additional overhead would be occurred.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch fixes the use of XATTR_NODE_OFFSET.
o The offset should not use several MSB bits which are used by marking node
blocks.
o IS_DNODE should handle XATTR_NODE_OFFSET to avoid potential abnormality
during the fsync call.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Commit 5688978 ("ext4: improve handling of conflicting mount options")
introduced incorrect messages shown while choosing wrong mount options.
First of all, both cases of incorrect mount options,
"data=journal,delalloc" and "data=journal,dioread_nolock" result in
the same error message.
Secondly, the problem above isn't solved for remount option: the
mismatched parameter is simply ignored. Moreover, ext4_msg states
that remount with options "data=journal,delalloc" succeeded, which is
not true.
To fix it up, I added a simple check after parse_options() call to
ensure that data=journal and delalloc/dioread_nolock parameters are
not present at the same time.
Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Commit 26092bf ("ext4: use a table-driven handler for mount options")
wrongly disallows the specifying the mount options nodelalloc and
data=journal simultaneously. This is incorrect; it should have only
disallowed the combination of delalloc and data=journal
simultaneously.
Reported-by: Piotr Sarna <p.sarna@partner.samsung.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
The names of the two struct cgroup_subsys_state accessors -
cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
The former clashes with the type name and the latter doesn't even
indicate it's somehow related to cgroup.
We're about to revamp large portion of cgroup API, so, let's rename
them so that they're less awkward. Most per-controller usages of the
accessors are localized in accessor wrappers and given the amount of
scheduled changes, this isn't gonna add any noticeable headache.
Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
to task_css(). This patch is pure rename.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Previous commits released the write lock across quota operations but
missed several places. In particular, the free operations can also
call into the file system code and take the write lock, causing
deadlocks.
This patch introduces some more helpers and uses them for quota call
sites. Without this patch applied, reiserfs + quotas runs into deadlocks
under anything more than trivial load.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The reiserfs write lock replaced the BKL and uses similar semantics.
Frederic's locking code makes a distinction between when the lock is nested
and when it's being acquired/released, but I don't think that's the right
distinction to make.
The right distinction is between the lock being released at end-of-use and
the lock being released for a schedule. The unlock should return the depth
and the lock should restore it, rather than the other way around as it is now.
This patch implements that and adds a number of places where the lock
should be dropped.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
The reiserfs xattr code doesn't need the write lock and sleeps all over
the place. We can simplify the locking by releasing it and reacquiring
after the xattr call.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJR+YUoAAoJENNvdpvBGATwxsgQAJzxe8x1SWW8jrvasMELW3bS
I7gl3lu789GROmZo5MI6+XI1DSCjHcMTisoPnPGx8q9LNhiYQGjp9Loxgl+L+Cem
wbDfkSHrhxtDoEw2rdpPb/hFHOpsv8t6TT53GVbdc5l/lxwyvdrblXxhWbLuTRCQ
G0X2VoxvTph7SyVPpRqoU4U99vW3g6LRI0dsxyU8kLc/kp7/t89gwh6Gj0wlb6tb
4pswvEuxK4HauMPjajCIaEeGpQoHbvCVpB4P9HvEtL/wklc5hNIAWy5+KYjMFWil
8Vc6RsJv6DZc66z+X4PT1oYrx6bkhNKNRGBETV2/dzeLU/7o2ttUIQ0vlOlN9gKV
xSne19OKwqhhRMuWrc3LDzunlFiLex2zHbZJVDOeKNXmEJVpO9ZPVkvpoi/g0vOE
DPwcc89J+763CVmXu8uyPqgQRfJTp2dXNggP3lINCDi1mFayrBqR0S6rQ1B2m/6R
8HYDjJ6gBpXBcs0FTGh1cfbpI8GpPx9Xu3Q/FtjvczEXUO3831KCm+k7O3cUjB72
I7I3wmyYQvUanOc39fg+EZ0cXpiSIZG4Mjv+8vWgADFlCMUKMaC+vp2PuKmBxTuk
7dAkxKr2kXfMirGWF1R/vo/CaKnCWmJbAJDFJ8Mt/Nk3zn3NOl/kCRtxndlj7k+m
/s8DHlVLyD8dgvzRO9gp
=xqB3
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o.
Misc ext4 fixes, delayed by Ted moving mail servers and email getting
marked as spam due to bad spf records.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: add WARN_ON to check the length of allocated blocks
ext4: fix retry handling in ext4_ext_truncate()
ext4: destroy ext4_es_cachep on module unload
ext4: make sure group number is bumped after a inode allocation race
As per RFC 5661 Security Considerations
Commit 4edaa308 "NFS: Use "krb5i" to establish NFSv4 state whenever possible"
uses the nfs_client cl_rpcclient for all clientid management operations.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
As per RFC 3530 and RFC 5661 Security Considerations
Commit 4edaa308 "NFS: Use "krb5i" to establish NFSv4 state whenever possible"
uses the nfs_client cl_rpcclient for all clientid management operations.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
This patch should resolve the following error reported by kbuild test robot.
All error/warnings:
In file included from fs/f2fs/dir.c:13:0:
>> fs/f2fs/f2fs.h:435:17: error: field 's_kobj' has incomplete type
struct kobject s_kobj;
The failure was caused by missing the kobject header file in dir.c.
So, this patch move the header file to the right location, f2fs.h.
CC: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Currently, we do not check the return value of client = rpc_clone_client(),
nor do we shut down the resulting cloned rpc_clnt in the case where a
NFS4ERR_WRONGSEC has caused nfs4_proc_lookup_common() to replace the
original value of 'client' (causing a memory leak).
Fix both issues and simplify the code by moving the call to
rpc_clone_client() until after nfs4_proc_lookup_common() has
done its business.
Reported-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Another shortcoming of the table lookup patch was revealed where the pointer
was not being tested before being dereferenced. Verify this to avoid the
NULL pointer dereference.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
We only need to call it on the creation of the inode.
Reported-by: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Steve Dickson <SteveD@redhat.com>
Cc: Dave Quigley <dpquigl@davequigley.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The sync mount option stopped working for NFSv4 mounts after commit
c02d7adf8c (NFSv4: Replace nfs4_path_walk() with
FS path lookup in a private namespace). If MS_SYNCHRONOUS is set in the
super_block that we're cloning from, then it should be set in the new
super_block as well.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
If a cache invalidation is triggered, and we happen to have a lot of
writebacks cached at the time, then the call to invalidate_inode_pages2()
will end up calling ->launder_page() on each and every dirty page in order
to sync its contents to disk, thus defeating write coalescing.
The following patch ensures that we try to sync the inode to disk before
calling invalidate_inode_pages2() so that we do the writeback as efficiently
as possible.
Reported-by: William Dauchy <william@gandi.net>
Reported-by: Pascal Bouchareine <pascal@gandi.net>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: William Dauchy <william@gandi.net>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
lead to race conditions between deleting an event and accessing an event
debugfs file. This included a fix to the debugfs system (acked by
Greg Kroah-Hartman). We think that all the holes have been patched and
hopefully we don't find more. I haven't marked all of them for stable
because I need to examine them more to figure out how far back some of
the changes need to go.
Along the way, some other fixes have been made. Alexander Z Lam fixed
some logic where the wrong buffer was being modifed.
Andrew Vagin found a possible corruption for machines that actually
allocate cpumask, as a reference to one was being zeroed out by mistake.
Dhaval Giani found a bad prototype when tracing is not configured.
And I not only had some changes to help Oleg, but also finally fixed
a long standing bug that Dave Jones and others have been hitting, where
a module unload and reload can cause the function tracing accounting
to get screwed up.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJR/7FvAAoJEOdOSU1xswtMCAgH/RpxS5mQcQnhrGfu1qnSk+3v
CyL1X9IkvgP5f+N59tbU3ZuE8Q3YQvTII35na38fgbHbwWrhYjLSvlf3Pwtre8zr
Rjm9b8zA6iSobHu3DyuuupzMsLE+SpBakVSmG6mi8izZyuCi0YHMZMnHTGNnW9Vv
YFqkfGgQQWMqiUHLcueetOqWAjR2JiJaLfr47rsRLNflF6dB2MbG/7YbYJ0TBk7E
hemVmR7JH7606eJZ6nR7bh/hYv0sL2SSqVVaOHO5nWFLFbI8CybpNVGcoD+nihZY
M7LrZT1C1mn3nKrfDhyXy4dwwx6wBON0fY23eJTHMVNnc0tvY1xqH52vTdPILWg=
=UfsE
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Oleg Nesterov has been working hard in closing all the holes that can
lead to race conditions between deleting an event and accessing an
event debugfs file. This included a fix to the debugfs system (acked
by Greg Kroah-Hartman). We think that all the holes have been patched
and hopefully we don't find more. I haven't marked all of them for
stable because I need to examine them more to figure out how far back
some of the changes need to go.
Along the way, some other fixes have been made. Alexander Z Lam fixed
some logic where the wrong buffer was being modifed.
Andrew Vagin found a possible corruption for machines that actually
allocate cpumask, as a reference to one was being zeroed out by
mistake.
Dhaval Giani found a bad prototype when tracing is not configured.
And I not only had some changes to help Oleg, but also finally fixed a
long standing bug that Dave Jones and others have been hitting, where
a module unload and reload can cause the function tracing accounting
to get screwed up"
* tag 'trace-fixes-3.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix reset of time stamps during trace_clock changes
tracing: Make TRACE_ITER_STOP_ON_FREE stop the correct buffer
tracing: Fix trace_dump_stack() proto when CONFIG_TRACING is not set
tracing: Fix fields of struct trace_iterator that are zeroed by mistake
tracing/uprobes: Fail to unregister if probe event files are in use
tracing/kprobes: Fail to unregister if probe event files are in use
tracing: Add comment to describe special break case in probe_remove_event_call()
tracing: trace_remove_event_call() should fail if call/file is in use
debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs)
ftrace: Check module functions being traced on reload
ftrace: Consolidate some duplicate code for updating ftrace ops
tracing: Change remove_event_file_dir() to clear "d_subdirs"->i_private
tracing: Introduce remove_event_file_dir()
tracing: Change f_start() to take event_mutex and verify i_private != NULL
tracing: Change event_filter_read/write to verify i_private != NULL
tracing: Change event_enable/disable_read() to verify i_private != NULL
tracing: Turn event/id->i_private into call->event.type
Increase NFS4_DEF_SLOT_TABLE_SIZE which is used as the client ca_maxreequests
value in CREATE_SESSION. Current non-dynamic session slot server
implementations use the client ca_maxrequests as a maximum slot number: 64
session slots can handle most workloads.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Never try to use a non-UID 0 user credential for lease management,
as that credential can change out from under us. The server will
block NFSv4 lease recovery with NFS4ERR_CLID_INUSE.
Since the mechanism to acquire a credential for lease management
is now the same for all minor versions, replace the minor version-
specific callout with a single function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Commit 05f4c350 "NFS: Discover NFSv4 server trunking when mounting"
Fri Sep 14 17:24:32 2012 introduced Uniform Client String support,
which forces our NFS client to establish a client ID immediately
during a mount operation rather than waiting until a user wants to
open a file.
Normally machine credentials (eg. from a keytab) are used to perform
a mount operation that is protected by Kerberos. Before 05fc350,
SETCLIENTID used a machine credential, or fell back to a regular
user's credential if no keytab is available.
On clients that don't have a keytab, performing SETCLIENTID early
means there's no user credential to fall back on, since no regular
user has kinit'd yet. 05f4c350 seems to have broken the ability
to mount with sec=krb5 on clients that don't have a keytab in
kernels 3.7 - 3.10.
To address this regression, commit 4edaa308 (NFS: Use "krb5i" to
establish NFSv4 state whenever possible), Sat Mar 16 15:56:20 2013,
was merged in 3.10. This commit forces the NFS client to fall back
to AUTH_SYS for lease management operations if no keytab is
available.
Neil Brown noticed that, since root is required to kinit to do a
sec=krb5 mount when a client doesn't have a keytab, we can try to
use root's Kerberos credential before AUTH_SYS.
Now, when determining a principal and flavor to use for lease
management, the NFS client tries in this order:
1. Flavor: AUTH_GSS, krb5i
Principal: service principal (via keytab)
2. Flavor: AUTH_GSS, krb5i
Principal: user principal established for UID 0 (via kinit)
3. Flavor: AUTH_SYS
Principal: UID 0 / GID 0
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Currently, you can open a NFSv4 file with O_APPEND|O_DIRECT, but cannot
fcntl(F_SETFL,...) with those flags. This flag combination is explicitly
forbidden on NFSv3 opens, and it seems like it should also be on NFSv4.
Reported-by: Chao Ye <cye@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
- don't BUG_ON() when not SP4_NONE
- calculate recv and send reserve sizes correctly
Signed-off-by: Weston Andros Adamson <dros@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
David reported that commit c2b93e06 (cifs: only set ops for inodes in
I_NEW state) caused a regression with mfsymlinks. Prior to that patch,
if a mfsymlink dentry was instantiated at readdir time, the inode would
get a new set of ops when it was revalidated. After that patch, this
did not occur.
This patch addresses this by simply skipping instantiating dentries in
the readdir codepath when we know that they will need to be immediately
revalidated. The next attempt to use that dentry will cause a new lookup
to occur (which is basically what we want to happen anyway).
Cc: <stable@vger.kernel.org>
Cc: "Stefan (metze) Metzmacher" <metze@samba.org>
Cc: Sachin Prabhu <sprabhu@redhat.com>
Reported-and-Tested-by: David McBride <dwm37@cam.ac.uk>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
This patch fixes a deadlock bug that occurs quite often when there are
concurrent write and fsync on a same file.
Following is the simplified call trace when tasks get hung.
fsync thread:
- f2fs_sync_file
...
- f2fs_write_data_pages
...
- update_extent_cache
...
- update_inode
- wait_on_page_writeback
bdi writeback thread
- __writeback_single_inode
- f2fs_write_data_pages
- mutex_lock(sbi->writepages)
The deadlock happens when the fsync thread waits on a inode page that has
been added to the f2fs' cached bio sbi->bio[NODE], and unfortunately,
no one else could be able to submit the cached bio to block layer for
writeback. This is because the fsync thread already hold a sbi->fs_lock and
the sbi->writepages lock, causing the bdi thread being blocked when attempt
to write data pages for the same inode. At the same time, f2fs_gc thread
does not notice the situation and could not help. Even the sync syscall
gets blocked.
To fix it, we could submit the cached bio first before waiting on a inode page
that is being written back.
Signed-off-by: Jin Xu <jinuxstyle@gmail.com>
[Jaegeuk Kim: add more cases to use f2fs_wait_on_page_writeback]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This code is being used for nobh_write_end() function.
But since now f2fs_write_end function is added so
there is no need for this code.
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add sysfs entry gc_idle to control the gc policy. Where
gc_idle = 1 corresponds to selecting a cost benefit approach,
while gc_idle = 2 corresponds to selecting a greedy approach
to garbage collection. The selection is mutually exclusive one
approach will work at any point. If gc_idle = 0, then this
option is disabled.
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: change the select_gc_type() flow slightly]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add sysfs entries to control the timing parameters for
f2fs gc thread.
Various Sysfs options introduced are:
gc_min_sleep_time: Min Sleep time for GC in ms
gc_max_sleep_time: Max Sleep time for GC in ms
gc_no_gc_sleep_time: Default Sleep time for GC in ms
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: Pankaj Kumar <pankaj.km@samsung.com>
Reviewed-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: fix an umount bug and some minor changes]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Firstly, nlmclnt_setlockargs can be called from a reclaimer thread, in
which case we're in entirely the wrong namespace.
Secondly, commit 8aac62706a (move
exit_task_namespaces() outside of exit_notify()) now means that
exit_task_work() is called after exit_task_namespaces(), which
triggers an Oops when we're freeing up the locks.
Fix this by ensuring that we initialise the nlm_host's rpc_client at mount
time, so that the cl_nodename field is initialised to the value of
utsname()->nodename that the net namespace uses. Then replace the
lockd callers of utsname()->nodename.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Toralf Förster <toralf.foerster@gmx.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Nix <nix@esperi.org.uk>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: stable@vger.kernel.org # 3.10.x
In the patch "aio: convert the ioctx list to table lookup v3", incorrect
handling in the ioctx_alloc() error path was introduced that lead to an
ioctx being added via ioctx_add_table() while freed when the ioctx_alloc()
call returned -EAGAIN due to hitting the aio_max_nr limit. Fix this by
only calling ioctx_add_table() as the last step in ioctx_alloc().
Also, several unnecessary rcu_dereference() calls were added that lead to
RCU warnings where the system was already protected by a spin lock for
accessing mm->ioctx_table.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
As comment in include/uapi/asm-generic/fcntl.h described, when
introducing new O_* bits, we need to check its uniqueness in
fcntl_init(). But __O_TMPFILE bit is missing. So fix it.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Every now and then someone proposes a new flink syscall, and this spawns
a long discussion of whether it would be a security problem. I think
that this is missing the point: flink is *already* allowed without
privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW.
Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE,
write it, and link it in is very convenient. The only problem is that
it requires that /proc be mounted so that you can do:
linkat(AT_FDCWD, "/proc/self/fd/<tmpfd>", dfd, path, AT_SYMLINK_NOFOLLOW)
This sucks -- it's much nicer to do:
linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH)
Let's allow it.
If this turns out to be excessively scary, it we could instead require
that the inode in question be I_LINKABLE, but this seems pointless given
the /proc situation
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
O_TMPFILE, like O_CREAT, should respect the requested mode and should
create regular files.
This fixes two bugs: O_TMPFILE required privilege (because the mode
ended up as 000) and it produced bogus inodes with no type.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since remove_proc_entry() started to wait for IO in progress (i.e.
since 2007 or so), the locking in fs/reiserfs/proc.c became wrong;
if procfs read happens between the moment when umount() locks the
victim superblock and removal of /proc/fs/reiserfs/<device>/*,
we'll get a deadlock - read will wait for s_umount (in sget(),
called by r_start()), while umount will wait in remove_proc_entry()
for that read to finish, holding s_umount all along.
Fortunately, the same change allows a much simpler race avoidance -
all we need to do is remove the procfs entries in the very beginning
of reiserfs ->kill_sb(); that'll guarantee that pointer to superblock
will remain valid for the duration for procfs IO, so we don't need
sget() to keep the sucker alive. As the matter of fact, we can
get rid of the home-grown iterator completely, and use single_open()
instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The state lock is taken after we are doing an assert on the state
value, not before. So we might in fact be doing an assert on a
transient value. Ensure the state check is within the scope of
the state lock being taken.
Backport of jbd2 commit 3ca841c106
("jbd2: relocate assert after state lock in journal_commit_transaction()")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Add the missing NULL check of the return value of find_or_create_page() in
function ocfs2_duplicate_clusters_by_page().
[akpm@linux-foundation.org: fix layout, per Joel]
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Acked-by: Joel Becker <jlbec@evilplan.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Refuse RW mount of udf filesystem. So far we just silently changed it
to RO mount but when the media is writeable, block layer won't notice
this change and thus will think device is used RW and will block eject
button of the drive. That is unexpected by users because for
non-writeable media eject button works just fine.
Userspace mount(8) command handles this just fine and retries mounting
with MS_RDONLY set so userspace shouldn't see any regression. Plus any
tool mounting udf is likely confronted with the case of read-only
media where block layer already refuses to mount the filesystem without
MS_RDONLY set so our behavior shouldn't be anything new for it.
Reported-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Change all function used in filesystem discovery during mount to user
standard kernel return values - -errno on error, 0 on success instead
of 1 on failure and 0 on success. This allows us to pass error number
(not just failure / success) so we can abort device scanning earlier
in case of errors like EIO or ENOMEM . Also we will be able to return
EROFS in case writeable mount is requested but writing isn't supported.
Signed-off-by: Jan Kara <jack@suse.cz>
Refuse RW mount of isofs filesystem. So far we just silently changed it
to RO mount but when the media is writeable, block layer won't notice
this change and thus will think device is used RW and will block eject
button of the drive. That is unexpected by users because for
non-writeable media eject button works just fine.
Userspace mount(8) command handles this just fine and retries mounting
with MS_RDONLY set so userspace shouldn't see any regression. Plus any
tool mounting isofs is likely confronted with the case of read-only
media where block layer already refuses to mount the filesystem without
MS_RDONLY set so our behavior shouldn't be anything new for it.
Reported-by: Hui Wang <hui.wang@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
It's always been a hassle that if an external journal's
device number changes, the filesystem won't mount.
And since boot-time enumeration can change, device number
changes aren't unusual.
The current mechanism to update the journal location is by
passing in a mount option w/ a new devnum, but that's a hassle;
it's a manual approach, fixing things after the fact.
Adding a mount option, "-o journal_path=/dev/$DEVICE" would
help, since then we can do i.e.
# mount -o journal_path=/dev/disk/by-label/$JOURNAL_LABEL ...
and it'll mount even if the devnum has changed, as shown here:
# losetup /dev/loop0 journalfile
# mke2fs -L mylabel-journal -O journal_dev /dev/loop0
# mkfs.ext3 -L mylabel -J device=/dev/loop0 /dev/sdb1
Change the journal device number:
# losetup -d /dev/loop0
# losetup /dev/loop1 journalfile
And today it will fail:
# mount /dev/sdb1 /mnt/test
mount: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
# dmesg | tail -n 1
[17343.240702] EXT3-fs (sdb1): error: couldn't read superblock of external journal
But with this new mount option, we can specify the new path:
# mount -o journal_path=/dev/loop1 /dev/sdb1 /mnt/test
#
(which does update the encoded device number, incidentally):
# umount /dev/sdb1
# dumpe2fs -h /dev/sdb1 | grep "Journal device"
dumpe2fs 1.41.12 (17-May-2010)
Journal device: 0x0701
But best of all we can just always mount by journal-path, and
it'll always work:
# mount -o journal_path=/dev/disk/by-label/mylabel-journal /dev/sdb1 /mnt/test
#
So the journal_path option can be specified in fstab, and as long as
the disk is available somewhere, and findable by label (or by UUID),
we can mount.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently, the s_root dentry doesn't get its d_op pointer set to
anything. This breaks lookups in the root of case-insensitive mounts
since that relies on having d_hash and d_compare routines that know to
treat the filename as case-insensitive.
cifs.ko has been broken this way for a long time, but commit 1c929cfe6
("switch cifs"), added a cryptic comment which is removed in the patch
below, which makes me wonder if this was done deliberately for some
reason. It's not clear to me why we'd want the s_root not to have d_op
set properly.
It may have something to do with d_automount or d_revalidate on the
root, but my suspicion in looking over the code is that Al was just
trying to preserve the existing behavior when changing this code over to
use s_d_op.
This patch changes it so that we set s_d_op before calling d_make_root
and removes the comment. I tested mounting, accessing and unmounting
several types of shares (including DFS referrals) and everything still
seemed to work OK afterward. I could be missing something however, so
please do let me know if I am.
Reported-by: Jan-Marek Glogowski <glogow@fbihome.de>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
debugfs_remove_recursive() is wrong,
1. it wrongly assumes that !list_empty(d_subdirs) means that this
dir should be removed.
This is not that bad by itself, but:
2. if d_subdirs does not becomes empty after __debugfs_remove()
it gives up and silently fails, it doesn't even try to remove
other entries.
However ->d_subdirs can be non-empty because it still has the
already deleted !debugfs_positive() entries.
3. simple_release_fs() is called even if __debugfs_remove() fails.
Suppose we have
dir1/
dir2/
file2
file1
and someone opens dir1/dir2/file2.
Now, debugfs_remove_recursive(dir1/dir2) succeeds, and dir1/dir2 goes
away.
But debugfs_remove_recursive(dir1) silently fails and doesn't remove
this directory. Because it tries to delete (the already deleted)
dir1/dir2/file2 again and then fails due to "Avoid infinite loop"
logic.
Test-case:
#!/bin/sh
cd /sys/kernel/debug/tracing
echo 'p:probe/sigprocmask sigprocmask' >> kprobe_events
sleep 1000 < events/probe/sigprocmask/id &
echo -n >| kprobe_events
[ -d events/probe ] && echo "ERR!! failed to rm probe"
And after that it is not possible to create another probe entry.
With this patch debugfs_remove_recursive() skips !debugfs_positive()
files although this is not strictly needed. The most important change
is that it does not try to make ->d_subdirs empty, it simply scans
the whole list(s) recursively and removes as much as possible.
Link: http://lkml.kernel.org/r/20130726151256.GC19472@redhat.com
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
In the event that an overflow/underflow occurs while calculating req_batch,
clamp the minimum at 1 request instead of doing a BUG_ON().
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
This kfree() is no longer needed after a79dc083d7 "f2fs: move
bio_private allocation out of f2fs_bio_alloc()". The "bio->bi_private"
is NULL here so it's a no-op.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In the cifs_reopen_file function, if the following statement is
asserted:
(tcon->unix_ext && cap_unix(tcon->ses) &&
(CIFS_UNIX_POSIX_PATH_OPS_CAP &
(tcon->fsUnixInfo.Capability)))
and we succeed to open with cifs_posix_open, the function jumps
to the label reopen_success and checks for oparms.reconnect
which is not initialized.
This issue has been reported by scan.coverity.com
Signed-off-by: Andi Shyti <andi@etezian.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
When use of symlinks is enabled (mounting with mfsymlinks option) to
non-Samba servers, we always tried to use cifs, even when we
were mounted with SMB2 or SMB3, which causes the server to drop the
network connection.
This patch separates out the protocol specific operations for cifs from
the code which recognizes symlinks, and fixes the problem where
with SMB2 mounts we attempt cifs operations to open and read
symlinks. The next patch will add support for SMB2 for opening
and reading symlinks. Additional followon patches will address
the similar problem creating symlinks.
Signed-off-by: Steve French <smfrench@gmail.com>
For cifs_set_cifscreds() in "fs/cifs/connect.c", 'desc' buffer length
is 'CIFSCREDS_DESC_SIZE' (56 is less than 256), and 'ses->domainName'
length may be "255 + '\0'".
The related sprintf() may cause memory overflow, so need extend related
buffer enough to hold all things.
It is also necessary to be sure of 'ses->domainName' must be less than
256, and define the related macro instead of hard code number '256'.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Reviewed-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Scott Lovenberg <scott.lovenberg@gmail.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.
This patch doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ben Myers <bpm@sgi.com>
Cc: Alex Elder <elder@kernel.org>
Cc: xfs@oss.sgi.com
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
On Wed, Jun 12, 2013 at 11:14:40AM -0700, Kent Overstreet wrote:
> On Mon, Apr 15, 2013 at 02:40:55PM +0300, Octavian Purdila wrote:
> > When using a large number of threads performing AIO operations the
> > IOCTX list may get a significant number of entries which will cause
> > significant overhead. For example, when running this fio script:
> >
> > rw=randrw; size=256k ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=512; thread; loops=100
> >
> > on an EXT2 filesystem mounted on top of a ramdisk we can observe up to
> > 30% CPU time spent by lookup_ioctx:
> >
> > 32.51% [guest.kernel] [g] lookup_ioctx
> > 9.19% [guest.kernel] [g] __lock_acquire.isra.28
> > 4.40% [guest.kernel] [g] lock_release
> > 4.19% [guest.kernel] [g] sched_clock_local
> > 3.86% [guest.kernel] [g] local_clock
> > 3.68% [guest.kernel] [g] native_sched_clock
> > 3.08% [guest.kernel] [g] sched_clock_cpu
> > 2.64% [guest.kernel] [g] lock_release_holdtime.part.11
> > 2.60% [guest.kernel] [g] memcpy
> > 2.33% [guest.kernel] [g] lock_acquired
> > 2.25% [guest.kernel] [g] lock_acquire
> > 1.84% [guest.kernel] [g] do_io_submit
> >
> > This patchs converts the ioctx list to a radix tree. For a performance
> > comparison the above FIO script was run on a 2 sockets 8 core
> > machine. This are the results (average and %rsd of 10 runs) for the
> > original list based implementation and for the radix tree based
> > implementation:
> >
> > cores 1 2 4 8 16 32
> > list 109376 ms 69119 ms 35682 ms 22671 ms 19724 ms 16408 ms
> > %rsd 0.69% 1.15% 1.17% 1.21% 1.71% 1.43%
> > radix 73651 ms 41748 ms 23028 ms 16766 ms 15232 ms 13787 ms
> > %rsd 1.19% 0.98% 0.69% 1.13% 0.72% 0.75%
> > % of radix
> > relative 66.12% 65.59% 66.63% 72.31% 77.26% 83.66%
> > to list
> >
> > To consider the impact of the patch on the typical case of having
> > only one ctx per process the following FIO script was run:
> >
> > rw=randrw; size=100m ;directory=/mnt/fio; ioengine=libaio; iodepth=1
> > blocksize=1024; numjobs=1; thread; loops=100
> >
> > on the same system and the results are the following:
> >
> > list 58892 ms
> > %rsd 0.91%
> > radix 59404 ms
> > %rsd 0.81%
> > % of radix
> > relative 100.87%
> > to list
>
> So, I was just doing some benchmarking/profiling to get ready to send
> out the aio patches I've got for 3.11 - and it looks like your patch is
> causing a ~1.5% throughput regression in my testing :/
... <snip>
I've got an alternate approach for fixing this wart in lookup_ioctx()...
Instead of using an rbtree, just use the reserved id in the ring buffer
header to index an array pointing the ioctx. It's not finished yet, and
it needs to be tidied up, but is most of the way there.
-ben
--
"Thought is the essence of where you are now."
--
kmo> And, a rework of Ben's code, but this was entirely his idea
kmo> -Kent
bcrl> And fix the code to use the right mm_struct in kill_ioctx(), actually
free memory.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
With the changes to use percpu counters for aio event ring size calculation,
existing increases to aio_max_nr are now insufficient to allow for the
allocation of enough events. Double the value used for aio_max_nr to account
for the doubling introduced by the percpu slack.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
sock_aio_dtor() is dead code - and stuff that does need to do cleanup
can simply do it before calling aio_complete().
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
The kiocb refcount is only needed for cancellation - to ensure a kiocb
isn't freed while a ki_cancel callback is running. But if we restrict
ki_cancel callbacks to not block (which they currently don't), we can
simply drop the refcount.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
The old aio retry infrastucture needed to save the various arguments to
to aio operations. But with the retry infrastructure gone, we can trim
struct kiocb quite a bit.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
This code doesn't serve any purpose anymore, since the aio retry
infrastructure has been removed.
This change should be safe because aio_read/write are also used for
synchronous IO, and called from do_sync_read()/do_sync_write() - and
there's no looping done in the sync case (the read and write syscalls).
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
aio_complete() (arguably) needs to keep its own trusted copy of the tail
pointer, but io_getevents() doesn't have to use it - it's already using
the head pointer from the ring buffer.
So convert it to use the tail from the ring buffer so it touches fewer
cachelines and doesn't contend with the cacheline aio_complete() needs.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Originally, io_event() was documented to return the io_event if
cancellation succeeded - the io_event wouldn't be delivered via the ring
buffer like it normally would.
But this isn't what the implementation was actually doing; the only
driver implementing cancellation, the usb gadget code, never returned an
io_event in its cancel function. And aio_complete() was recently changed
to no longer suppress event delivery if the kiocb had been cancelled.
This gets rid of the unused io_event argument to kiocb_cancel() and
kiocb->ki_cancel(), and changes io_cancel() to return -EINPROGRESS if
kiocb->ki_cancel() returned success.
Also tweak the refcounting in kiocb_cancel() to make more sense.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
This just converts the ioctx refcount to the new generic dynamic percpu
refcount code.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
See the previous patch ("aio: reqs_active -> reqs_available") for why we
want to do this - this basically implements a per cpu allocator for
reqs_available that doesn't actually allocate anything.
Note that we need to increase the size of the ringbuffer we allocate,
since a single thread won't necessarily be able to use all the
reqs_available slots - some (up to about half) might be on other per cpu
lists, unavailable for the current thread.
We size the ringbuffer based on the nr_events userspace passed to
io_setup(), so this is a slight behaviour change - but nr_events wasn't
being used as a hard limit before, it was being rounded up to the next
page before so this doesn't change the actual semantics.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
The number of outstanding kiocbs is one of the few shared things left that
has to be touched for every kiocb - it'd be nice to make it percpu.
We can make it per cpu by treating it like an allocation problem: we have
a maximum number of kiocbs that can be outstanding (i.e. slots) - then we
just allocate and free slots, and we know how to write per cpu allocators.
So as prep work for that, we convert reqs_active to reqs_available.
Signed-off-by: Kent Overstreet <koverstreet@google.com>
Cc: Zach Brown <zab@redhat.com>
Cc: Felipe Balbi <balbi@ti.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Asai Thambi S P <asamymuthupa@micron.com>
Cc: Selvan Mani <smani@micron.com>
Cc: Sam Bradshaw <sbradshaw@micron.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Reviewed-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
For 'NULL' terminated string, recommend always to be ended by zero.
Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op and the flag is going away. Remove its usages.
This patch doesn't introduce any behavior changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David Teigland <teigland@redhat.com>
This patch fixes mishandling of the sbi->n_orphans variable.
If users request lots of f2fs_unlink(), check_orphan_space() could be contended.
In such the case, sbi->n_orphans can be read incorrectly so that f2fs_unlink()
would fall into the wrong state which results in the failure of
add_orphan_inode().
So, let's increment sbi->n_orphans virtually prior to the actual orphan inode
stuffs. After that, let's release sbi->n_orphans by calling release_orphan_inode
or remove_orphan_inode.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
bio->bi_private is not always needed. As in the reading data path,
end_read_io does not need bio_private for further using, so moving
bio_private allocation out of f2fs_bio_alloc(). Alloc it in the
submit_write_page(), and ignore it in the f2fs_readpage().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As we remove the target single node, so list_for_each is enought, in order to
clean up, we use list_for_each_entry instead.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
For string without format specifiers, using seq_puts()/seq_putc()
instead of seq_printf().
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
As similar as the i_pino fix, i_name also should be fixed when i_nlink is 1.
The errorneous scenario is like this.
1. touch test1
2. link test1 test2
3. unlink test2
4. fsync test1
After this, i_name should be test1.
CC: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The error is reproducible by:
0. mkfs.f2fs /dev/sdb1 & mount
1. touch test1
2. touch test2
3. mv test1 test2
4. umount
5. dumpt.f2fs -i 4 /dev/sdb1
After this, when we retrieve the inode->i_name of test2 by dump.f2fs, we get
test1 instead of test2.
This is because f2fs didn't update the file name during the f2fs_rename.
So, this patch fixes that.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce help function F2FS_NODE() to simplify the conversion of node_page to
f2fs_node.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Add a help func F2FS_STAT() to get the f2fs_stat_info.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In order to support SQLite that uses fdatasync instead of fsync, we should
guarantee the data requested by fdatasync can be recovered after sudden-power-
off.
So, let's remove the fdatasync condition in f2fs_sync_file.
Otherwise, we can restore the data after sudden-power-off due to nonexistence
of any fsync mark'ed node blocks.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
In commit 921f266b: ext4: add self-testing infrastructure to do a
sanity check, some sanity checks were added in map_blocks to make sure
'retval == map->m_len'.
Enable these checks by default and report any assertion failures using
ext4_warning() and WARN_ON() since they can help us to figure out some
bugs that are otherwise hard to hit.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The dev_attrs field of struct class is going away soon, dev_groups
should be used instead. This converts the cuse class code to use the
correct field.
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This actually makes a difference in the 4.1 case, since we use the
status to decide what reason to give the client for the delegation
refusal (see nfsd4_open_deleg_none_ext), and in theory a client might
choose suboptimal behavior if we give the wrong answer.
Reported-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
When we try to allocate an inode, and there is a race between two
CPU's trying to grab the same inode, _and_ this inode is the last free
inode in the block group, make sure the group number is bumped before
we continue searching the rest of the block groups. Otherwise, we end
up searching the current block group twice, and we end up skipping
searching the last block group. So in the unlikely situation where
almost all of the inodes are allocated, it's possible that we will
return ENOSPC even though there might be free inodes in that last
block group.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
- fix for regression in commit cca9f93a52, recovery causing filesystem
corruption after a crash
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJR8ZWnAAoJENaLyazVq6ZOb7cP/iLwa59qX5sHAoYRGterE4Li
34xPkihJEjcbvNCtK+rznXT9ohSvwTahnrdlLy/bQ6d0K1gBX3j1DD4cpTvGRWJR
hEBbQU0PXhXjRL6ixgxfeNPfbEfNMYhiTFfjPhBjKVgzYN3NnJZ1lv8zTHaeQ8JP
m7dEKrrg/J8LsW18fq2E0p/SjKi7cT1mEf8jkcYu0UGYd7yDtQSukMbEjfsIJq9L
DpB3QXHQkUf1UlVdUvLncGmcDUPAEt+8/ae9uUpY2nxHv+7jmzAoCyRUCTDYsIh2
gznQjsns56B2FWfnkyzXC3nMaoyIZpT8Fy3FRBsQRKGboOeOPS+/Yyzf/FcLQ8Jl
yMXA0oR+3Ft7wJ62+aSuP3/dug8TbBk09bI+RqV4D+GwM7n7kLE/Fo3kQLva5Aqf
rZIhwzfBDl51vxRzm4I29wOkfvQRXndy4c0hYtfeVy0lBA2yCFLSlzGha5EX+CxM
s1kbpOkuOOE5k5Mgjve/iIKbwG3OKEuPCrESJPG+sTREAkkXkycnVQft2ihJYgg8
yIgPG4fxpIIpwTdC016YAa/raOm/unIG6ko+ec3m2rB2lmo8j3vQOjoIFuV0KYV1
enzhK5F+sQJl9evQOgfJc+uOMgjs1DrE38hnlQ8rc3LXa5Dtb7ReMRAT7z2FxicF
keAPwJNrMlwgIyYi+3B+
=hrYY
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v3.11-rc3' of git://oss.sgi.com/xfs/xfs
Pull xfs fix from Ben Myers:
"Fix for regression in commit cca9f93a52 ("xfs: don't do IO when
creating an new inode"), recovery causing filesystem corruption after
a crash"
* tag 'for-linus-v3.11-rc3' of git://oss.sgi.com/xfs/xfs:
xfs: di_flushiter considered harmful
Pull nfsd fix from Bruce Fields:
"One more nfsd bugfix for 3.11"
* 'for-3.11' of git://linux-nfs.org/~bfields/linux:
nfsd: nfsd_open: when dentry_open returns an error do not propagate as struct file
When we made all inode updates transactional, we no longer needed
the log recovery detection for inodes being newer on disk than the
transaction being replayed - it was redundant as replay of the log
would always result in the latest version of the inode would be on
disk. It was redundant, but left in place because it wasn't
considered to be a problem.
However, with the new "don't read inodes on create" optimisation,
flushiter has come back to bite us. Essentially, the optimisation
made always initialises flushiter to zero in the create transaction,
and so if we then crash and run recovery and the inode already on
disk has a non-zero flushiter it will skip recovery of that inode.
As a result, log recovery does the wrong thing and we end up with a
corrupt filesystem.
Because we have to support old kernel to new kernel upgrades, we
can't just get rid of the flushiter support in log recovery as we
might be upgrading from a kernel that doesn't have fully transactional
inode updates. Unfortunately, for v4 superblocks there is no way to
guarantee that log recovery knows about this fact.
We cannot add a new inode format flag to say it's a "special inode
create" because it won't be understood by older kernels and so
recovery could do the wrong thing on downgrade. We cannot specially
detect the combination of zero mode/non-zero flushiter on disk to
non-zero mode, zero flushiter in the log item during recovery
because wrapping of the flushiter can result in false detection.
Hence that makes this "don't use flushiter" optimisation limited to
a disk format that guarantees that we don't need it. And that means
the only fix here is to limit the "no read IO on create"
optimisation to version 5 superblocks....
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
(cherry picked from commit e60896d8f2)
Since everybody sets kstrdup()ed constant string to "struct xattr"->name but
nobody modifies "struct xattr"->name , we can omit kstrdup() and its failure
checking by constifying ->name member of "struct xattr".
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Joel Becker <jlbec@evilplan.org> [ocfs2]
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Tested-by: Paul Moore <paul@paul-moore.com>
Acked-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>
Commit 6f2ea7f2a (NFS: Add nfs4_unique_id boot parameter) introduces a
boot parameter that allows client administrators to set a string
identifier for use by the EXCHANGE_ID and SETCLIENTID arguments in order
to make them more globally unique.
Unfortunately, that uniquifier is no longer globally unique in the presence
of net namespaces, since each container expects to be able to set up their
own lease when mounting a new NFSv4/4.1 partition.
The fix is to add back in the container-specific hostname in addition to
the unique id.
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When we made all inode updates transactional, we no longer needed
the log recovery detection for inodes being newer on disk than the
transaction being replayed - it was redundant as replay of the log
would always result in the latest version of the inode would be on
disk. It was redundant, but left in place because it wasn't
considered to be a problem.
However, with the new "don't read inodes on create" optimisation,
flushiter has come back to bite us. Essentially, the optimisation
made always initialises flushiter to zero in the create transaction,
and so if we then crash and run recovery and the inode already on
disk has a non-zero flushiter it will skip recovery of that inode.
As a result, log recovery does the wrong thing and we end up with a
corrupt filesystem.
Because we have to support old kernel to new kernel upgrades, we
can't just get rid of the flushiter support in log recovery as we
might be upgrading from a kernel that doesn't have fully transactional
inode updates. Unfortunately, for v4 superblocks there is no way to
guarantee that log recovery knows about this fact.
We cannot add a new inode format flag to say it's a "special inode
create" because it won't be understood by older kernels and so
recovery could do the wrong thing on downgrade. We cannot specially
detect the combination of zero mode/non-zero flushiter on disk to
non-zero mode, zero flushiter in the log item during recovery
because wrapping of the flushiter can result in false detection.
Hence that makes this "don't use flushiter" optimisation limited to
a disk format that guarantees that we don't need it. And that means
the only fix here is to limit the "no read IO on create"
optimisation to version 5 superblocks....
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
When creating a less privileged mount namespace or propogating mounts
from a more privileged to a less privileged mount namespace lock the
submounts so they may not be unmounted individually in the child mount
namespace revealing what is under them.
This enforces the reasonable expectation that it is not possible to
see under a mount point. Most of the time mounts are on empty
directories and revealing that does not matter, however I have seen an
occassionaly sloppy configuration where there were interesting things
concealed under a mount point that probably should not be revealed.
Expirable submounts are not locked because they will eventually
unmount automatically so whatever is under them already needs
to be safe for unprivileged users to access.
From a practical standpoint these restrictions do not appear to be
significant for unprivileged users of the mount namespace. Recursive
bind mounts and pivot_root continues to work, and mounts that are
created in a mount namespace may be unmounted there. All of which
means that the common idiom of keeping a directory of interesting
files and using pivot_root to throw everything else away continues to
work just fine.
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Should not use the clientid maintenance rpc_clnt.
Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Clean up: when NFSv4.1 support is compiled out,
nfs4_end_drain_session() becomes a stub. Make the synopsis of the
stub match the synopsis of the real version of the function.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
nfs4_proc_setattr removes ATTR_OPEN from sattr->ia_valid, but later
nfs4_do_setattr checks for it
Signed-off-by: Nadav Shemer <nadav@tonian.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The attribute length is already calculated in advance. There is no
reason why we cannot calculate the bitmap in advance too so that
we don't have to play pointer games.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
The calculation of the attribute length was 4 bytes off.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Tested-by: Andre Heider <a.heider@gmail.com>
Reported-and-tested-by: Henrik Rydberg <rydberg@euromail.se>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If fi_fds = {non-NULL, NULL, non-NULL} and oflag = O_WRONLY
the WARN_ON_ONCE(!(fp->fi_fds[oflag] || fp->fi_fds[O_RDWR]))
doesn't trigger when it should.
Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
The following call chain:
------------------------------------------------------------
nfs4_get_vfs_file
- nfsd_open
- dentry_open
- do_dentry_open
- __get_file_write_access
- get_write_access
- return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
------------------------------------------------------------
can result in the following state:
------------------------------------------------------------
struct nfs4_file {
...
fi_fds = {0xffff880c1fa65c80, 0xffffffffffffffe6, 0x0},
fi_access = {{
counter = 0x1
}, {
counter = 0x0
}},
...
------------------------------------------------------------
1) First time around, in nfs4_get_vfs_file() fp->fi_fds[O_WRONLY] is
NULL, hence nfsd_open() is called where we get status set to an error
and fp->fi_fds[O_WRONLY] to -ETXTBSY. Thus we do not reach
nfs4_file_get_access() and fi_access[O_WRONLY] is not incremented.
2) Second time around, in nfs4_get_vfs_file() fp->fi_fds[O_WRONLY] is
NOT NULL (-ETXTBSY), so nfsd_open() is NOT called, but
nfs4_file_get_access() IS called and fi_access[O_WRONLY] is incremented.
Thus we leave a landmine in the form of the nfs4_file data structure in
an incorrect state.
3) Eventually, when __nfs4_file_put_access() is called it finds
fi_access[O_WRONLY] being non-zero, it decrements it and calls
nfs4_file_put_fd() which tries to fput -ETXTBSY.
------------------------------------------------------------
...
[exception RIP: fput+0x9]
RIP: ffffffff81177fa9 RSP: ffff88062e365c90 RFLAGS: 00010282
RAX: ffff880c2b3d99cc RBX: ffff880c2b3d9978 RCX: 0000000000000002
RDX: dead000000100101 RSI: 0000000000000001 RDI: ffffffffffffffe6
RBP: ffff88062e365c90 R8: ffff88041fe797d8 R9: ffff88062e365d58
R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000007 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88062e365c98] __nfs4_file_put_access at ffffffffa0562334 [nfsd]
#10 [ffff88062e365cc8] nfs4_file_put_access at ffffffffa05623ab [nfsd]
#11 [ffff88062e365ce8] free_generic_stateid at ffffffffa056634d [nfsd]
#12 [ffff88062e365d18] release_open_stateid at ffffffffa0566e4b [nfsd]
#13 [ffff88062e365d38] nfsd4_close at ffffffffa0567401 [nfsd]
#14 [ffff88062e365d88] nfsd4_proc_compound at ffffffffa0557f28 [nfsd]
#15 [ffff88062e365dd8] nfsd_dispatch at ffffffffa054543e [nfsd]
#16 [ffff88062e365e18] svc_process_common at ffffffffa04ba5a4 [sunrpc]
#17 [ffff88062e365e98] svc_process at ffffffffa04babe0 [sunrpc]
#18 [ffff88062e365eb8] nfsd at ffffffffa0545b62 [nfsd]
#19 [ffff88062e365ee8] kthread at ffffffff81090886
#20 [ffff88062e365f48] kernel_thread at ffffffff8100c14a
------------------------------------------------------------
Cc: stable@vger.kernel.org
Signed-off-by: Harshula Jayasuriya <harshula@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Start using pquotino and define a macro to check if the
superblock has pquotino.
Keep backward compatibilty by alowing mount of older superblock
with no separate pquota inode.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
mkfs doesn't initialize the quota inodes to NULLFSINO as it does for the
other internal inodes. This leads to two in-core values (0 and NULLFSINO)
to be checked against, to make sure if a quota inode is valid.
Solve that problem by initializing the in-core values of all quotaino
values to NULLFSINO if they are 0 in the disk.
Note that these values are not written back to on-disk superblock unless
some quota is enabled on the filesystem. Even in that case sb_pquotino is
written to disk only if the on-disk superblock supports pquotino
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
While testing and rearranging pquota/gquota code, I stumbled
on a xfs_shutdown() during a mount. But the mount just hung.
Debugged and found that there is a deadlock involving
&log->l_cilp->xc_ctx_lock.
It is in a code path where &log->l_cilp->xc_ctx_lock is first
acquired in read mode and some levels down the same semaphore
is being acquired in write mode causing a deadlock.
This is the stack:
xfs_log_commit_cil -> acquires &log->l_cilp->xc_ctx_lock in read mode
xlog_print_tic_res
xfs_force_shutdown
xfs_log_force_umount
xlog_cil_force
xlog_cil_force_lsn
xlog_cil_push_foreground
xlog_cil_push - tries to acquire same semaphore in write mode
This patch fixes the deadlock by changing the reason code for
xfs_force_shutdown in xlog_print_tic_res() to SHUTDOWN_LOG_IO_ERROR.
SHUTDOWN_LOG_IO_ERROR is the right reason code to be set since
we are in the log path.
Thanks to Dave for suggesting this solution.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In xfs_vm_write_failed(), we evaluate the block_offset of pos with
PAGE_MASK which is an unsigned long. That is fine on 64-bit platforms
regardless of whether the request pos is 32-bit or 64-bit. However, on
32-bit platforms the value is 0xfffff000 and so the high 32 bits in it
will be masked off with (pos & PAGE_MASK) for a 64-bit pos.
As a result, the evaluated block_offset is incorrect which will cause
this failure ASSERT(block_offset + from == pos); and potentially pass
the wrong block to xfs_vm_kill_delalloc_range().
In this case, we can get a kernel panic if CONFIG_XFS_DEBUG is enabled:
XFS: Assertion failed: block_offset + from == pos, file: fs/xfs/xfs_aops.c, line: 1504
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:100!
invalid opcode: 0000 [#1] SMP
........
Pid: 4057, comm: mkfs.xfs Tainted: G O 3.9.0-rc2 #1
EIP: 0060:[<f94a7e8b>] EFLAGS: 00010282 CPU: 0
EIP is at assfail+0x2b/0x30 [xfs]
EAX: 00000056 EBX: f6ef28a0 ECX: 00000007 EDX: f57d22a4
ESI: 1c2fb000 EDI: 00000000 EBP: ea6b5d30 ESP: ea6b5d1c
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: 094f3ff4 CR3: 2bcb4000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
Process mkfs.xfs (pid: 4057, ti=ea6b4000 task=ea5799e0 task.ti=ea6b4000)
Stack:
00000000 f9525c48 f951fa80 f951f96b 000005e4 ea6b5d7c f9494b34 c19b0ea2
00000066 f3d6c620 c19b0ea2 00000000 e9a91458 00001000 00000000 00000000
00000000 c15c7e89 00000000 1c2fb000 00000000 00000000 1c2fb000 00000080
Call Trace:
[<f9494b34>] xfs_vm_write_failed+0x74/0x1b0 [xfs]
[<c15c7e89>] ? printk+0x4d/0x4f
[<f9494d7d>] xfs_vm_write_begin+0x10d/0x170 [xfs]
[<c110a34c>] generic_file_buffered_write+0xdc/0x210
[<f949b669>] xfs_file_buffered_aio_write+0xf9/0x190 [xfs]
[<f949b7f3>] xfs_file_aio_write+0xf3/0x160 [xfs]
[<c115e504>] do_sync_write+0x94/0xd0
[<c115ed1f>] vfs_write+0x8f/0x160
[<c115e470>] ? wait_on_retry_sync_kiocb+0x50/0x50
[<c115f017>] sys_write+0x47/0x80
[<c15d860d>] sysenter_do_call+0x12/0x28
.............
EIP: [<f94a7e8b>] assfail+0x2b/0x30 [xfs] SS:ESP 0068:ea6b5d1c
---[ end trace cdd9af4f4ecab42f ]---
Kernel panic - not syncing: Fatal exception
In order to avoid this, we can evaluate the block_offset of the start
of the page by using shifts rather than masks the mismatch problem.
Thanks Dave Chinner for help finding and fixing this bug.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Mark Tinguely <tinguely@sgi.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
PTR_RET should be PTR_ERR
Reported-by: Sachin Kamat <sachin.kamat@linaro.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Pull vfs fixes from Al Viro:
"The sget() one is a long-standing bug and will need to go into -stable
(in fact, it had been originally caught in RHEL6), the other two are
3.11-only"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: constify dentry parameter in d_count()
livelock avoidance in sget()
allow O_TMPFILE to work with O_WRONLY
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJR6cmlAAoJENNvdpvBGATwZF0P/0a7ET511UJwQbgAIq5ftFlj
86Bzvy28xo2T85t64L+Ib2XDehWHk0sZlQpB/gK8MLYn4rCRWCxkQAshKwoequsC
AhuvQ7NtX9vJNCSR30+RrLhkvj6UKsMuM724adARLBUgMBoScABzZImR1e14ELah
bN27a4Bk2aNUpNX68QYdQX3TGiHGZy//lNmh81JTxFS3Moqm6bIZAJbYpOslATsI
Q5nti/TjQJKso2gF7Jx7NffXv0g5rGxaVQEZJPpfIv1Vs0b6vabK/sYp608ayM0K
qKyjJABaHR1Pzb16V82ZqvSlsHm/ARhCF1nMM6gQ8nwl/plxcQ6Jvd/qJsNej3b/
7Jfm86xLe+G0G5oeNEJXsoEFAsvxug6ZRMfyoRHaPlGIksmz+Jc9kzTtM3qzdzOB
5OPJwlONlM4dRVA6rgb7KiuE3h/sRt4CctFejD0f6mUqKa+B+zyHq/a/8a+60IqQ
/sDiTQrqrI6LWxECFasDNoGxtnvVtKC21jbg+MTzumZDvjgnJIFFe5NrinI6SB9x
VQYVq/vVkE576VTwGAttTg3s4sRwQKd/iuQjuoP76iFFHvq/sNX6fBq0NW5gpsj2
WAfH+fLQsMcVJ2MAcc3DwdBT1wQbLu+Y19hv4TDOZRmnKGhq9K08hzWR4tIUKdFJ
UcjWk35Wuoz1IGpVlHJ5
=ngfz
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"Fixes for 3.11-rc2, sent at 5pm, in the professoinal style. :-)"
I'm not sure I like this new level of "professionalism".
9-5, people, 9-5.
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: call ext4_es_lru_add() after handling cache miss
ext4: yield during large unlinks
ext4: make the extent_status code more robust against ENOMEM failures
ext4: simplify calculation of blocks to free on error
ext4: fix error handling in ext4_ext_truncate()
- Fix a regression against NFSv4 FreeBSD servers when creating a new file
- Fix another regression in rpc_client_register()
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJR6ZKMAAoJEGcL54qWCgDyQX8P/19LKLNKcL+y2zVGjLbXMTq0
TpyWdBO0ux7QcqnPEDg+Jpvu62IowYiKTtaSOXtHb5BNjQMBo2RKw3B0eMBoCp/z
6gHmQRD2hMgqwBxBwHceV+dNwueCUiZW7GqaaNh6/3bpGQefegdONnLEifuPogEu
oZmEuiVrGDfITEF7D4k5+shXCQN4eNH0LFuIQo4XXdCqmK6PwvOsidZ7YwHVC3Mg
/Jzda2YsCxHj8kPi1xb9skPPAn6g4kdfYfyr/xSY7IviPixrkg/nEEK1b8xHU81e
a0dd0Yx5kq6fR8LsBvQCHdj2m7doHM15jf5Np5G7VnnaWEjB2y+QftkxWc9lCNU3
t2fr9YVD7ZG/GGNSFePHAHmBY0OqDB1Htp4vcwEQfzX6CAR3Hel82WVvut62Z6m4
G5qHjwdqUFhmRN//SWlDpEqSn+pbeCvPhQS60ayN0TLivRsscm/I4yA75odAnn9b
4su1IcUpqeJGeV6yDyMUqbx4kYZFyCZg/DNkThXiTKOs47A7ogSS9ev2fTB/V+jd
rroNHNd/U508ze9D6D4ai9vR78uUp4wKNSSBZMCkBtNh0uSApOTgyGVhertB1EKS
vgAr4T1tc+9t+0qg1Sb+hbKyBM/KaS5zUrPn+APHPoBXPh5PSVBzeNJkpxHRw/V0
ZxkEgSQKLZSXYb5ab770
=XE+7
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
- Fix a regression against NFSv4 FreeBSD servers when creating a new
file
- Fix another regression in rpc_client_register()
* tag 'nfs-for-3.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4: Fix a regression against the FreeBSD server
SUNRPC: Fix another issue with rpc_client_register()
Pull btrfs fixes from Josef Bacik:
"I'm playing the role of Chris Mason this week while he's on vacation.
There are a few critical fixes for btrfs here, all regressions and
have been tested well"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next:
Btrfs: fix wrong write offset when replacing a device
Btrfs: re-add root to dead root list if we stop dropping it
Btrfs: fix lock leak when resuming snapshot deletion
Btrfs: update drop progress before stopping snapshot dropping
Eric Sandeen has found a nasty livelock in sget() - take a mount(2) about
to fail. The superblock is on ->fs_supers, ->s_umount is held exclusive,
->s_active is 1. Along comes two more processes, trying to mount the same
thing; sget() in each is picking that superblock, bumping ->s_count and
trying to grab ->s_umount. ->s_active is 3 now. Original mount(2)
finally gets to deactivate_locked_super() on failure; ->s_active is 2,
superblock is still ->fs_supers because shutdown will *not* happen until
->s_active hits 0. ->s_umount is dropped and now we have two processes
chasing each other:
s_active = 2, A acquired ->s_umount, B blocked
A sees that the damn thing is stillborn, does deactivate_locked_super()
s_active = 1, A drops ->s_umount, B gets it
A restarts the search and finds the same superblock. And bumps it ->s_active.
s_active = 2, B holds ->s_umount, A blocked on trying to get it
... and we are in the earlier situation with A and B switched places.
The root cause, of course, is that ->s_active should not grow until we'd
got MS_BORN. Then failing ->mount() will have deactivate_locked_super()
shut the damn thing down. Fortunately, it's easy to do - the key point
is that grab_super() is called only for superblocks currently on ->fs_supers,
so it can bump ->s_count and grab ->s_umount first, then check MS_BORN and
bump ->s_active; we must never increment ->s_count for superblocks past
->kill_sb(), but grab_super() is never called for those.
The bug is pretty old; we would've caught it by now, if not for accidental
exclusion between sget() for block filesystems; the things like cgroup or
e.g. mtd-based filesystems don't have anything of that sort, so they get
bitten. The right way to deal with that is obviously to fix sget()...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull s390 fixes from Martin Schwidefsky:
"An update for the BFP jit to the latest and greatest, two patches to
get kdump working again, the random-abort ptrace extention for
transactional execution, the z90crypt module alias for ap and a tiny
cleanup"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: Alias for new zcrypt device driver base module
s390/kdump: Allow copy_oldmem_page() copy to virtual memory
s390/kdump: Disable mmap for s390
s390/bpf,jit: add pkt_type support
s390/bpf,jit: address randomize and write protect jit code
s390/bpf,jit: use generic jit dumper
s390/bpf,jit: call module_free() from any context
s390/qdio: remove unused variable
s390/ptrace: PTRACE_TE_ABORT_RAND
Miao Xie reported the following issue:
The filesystem was corrupted after we did a device replace.
Steps to reproduce:
# mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
# mount <device0> <mnt>
# btrfs replace start -rfB 1 <device4> <mnt>
# umount <mnt>
# btrfsck <device4>
The reason for the issue is that we changed the write offset by mistake,
introduced by commit 625f1c8dc.
We read the data from the source device at first, and then write the
data into the corresponding place of the new device. In order to
implement the "-r" option, the source location is remapped using
btrfs_map_block(). The read takes place on the mapped location, and
the write needs to take place on the unmapped location. Currently
the write is using the mapped location, and this commit changes it
back by undoing the change to the write address that the aforementioned
commit added by mistake.
Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
If we stop dropping a root for whatever reason we need to add it back to the
dead root list so that we will re-start the dropping next transaction commit.
The other case this happens is if we recover a drop because we will add a root
without adding it to the fs radix tree, so we can leak it's root and commit root
extent buffer, adding this to the dead root list makes this cleanup happen.
Thanks,
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
We aren't setting path->locks[level] when we resume a snapshot deletion which
means we won't unlock the buffer when we free the path. This causes deadlocks
if we happen to re-allocate the block before we've evicted the extent buffer
from cache. Thanks,
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Alex pointed out a problem and fix that exists in the drop one snapshot at a
time patch. If we decide we need to exit for whatever reason (umount for
example) we will just exit the snapshot dropping without updating the drop
progress. So the next time we go to resume we will BUG_ON() because we can't
find the extent we left off at because we never updated it. This patch fixes
the problem.
Cc: stable@vger.kernel.org
Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
Here are some driver core patches for 3.11-rc2. They aren't really
bugfixes, but a bunch of new helper macros for drivers to properly
create attribute groups, which drivers and subsystems need to fix up a
ton of race issues with incorrectly creating sysfs files (binary and
normal) after userspace has been told that the device is present.
Also here is the ability to create binary files as attribute groups, to
solve that race condition, which was impossible to do before this, so
that's my fault the drivers were broken.
The majority of the .c changes is indenting and moving code around a
bit. It affects no existing code, but allows the large backlog of 70+
patches that I already have created to start flowing into the different
subtrees, instead of having to live in my driver-core tree, causing
merge nightmares in linux-next for the next few months.
These were finalized too late for the -rc1 merge window, which is why
they were didn't make that pull request, testing and review from others
didn't happen until a few weeks ago, and then there's the whole
distraction of the past few days, which prevented these from getting to
you sooner, sorry about that.
Oh, and there's a bugfix for the documentation build warning in here as
well. All of these have been in linux-next this week, with no reported
problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.20 (GNU/Linux)
iEYEABECAAYFAlHoRUUACgkQMUfUDdst+ymkNACdHAjEXZZmXohDuCb2SqyMeQsz
AZcAn3qqJa/NoPEgTCgOkDlAQZM6BnC5
=+Gqk
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg KH:
"Here are some driver core patches for 3.11-rc2. They aren't really
bugfixes, but a bunch of new helper macros for drivers to properly
create attribute groups, which drivers and subsystems need to fix up a
ton of race issues with incorrectly creating sysfs files (binary and
normal) after userspace has been told that the device is present.
Also here is the ability to create binary files as attribute groups,
to solve that race condition, which was impossible to do before this,
so that's my fault the drivers were broken.
The majority of the .c changes is indenting and moving code around a
bit. It affects no existing code, but allows the large backlog of 70+
patches that I already have created to start flowing into the
different subtrees, instead of having to live in my driver-core tree,
causing merge nightmares in linux-next for the next few months.
These were finalized too late for the -rc1 merge window, which is why
they were didn't make that pull request, testing and review from
others didn't happen until a few weeks ago, and then there's the whole
distraction of the past few days, which prevented these from getting
to you sooner, sorry about that.
Oh, and there's a bugfix for the documentation build warning in here
as well. All of these have been in linux-next this week, with no
reported problems"
* tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
driver-core: fix new kernel-doc warning in base/platform.c
sysfs: use file mode defines from stat.h
sysfs: add more helper macro's for (bin_)attribute(_groups)
driver core: add default groups to struct class
driver core: Introduce device_create_groups
sysfs: prevent warning when only using binary attributes
sysfs: add support for binary attributes in groups
driver core: device.h: add RW and RO attribute macros
sysfs.h: add BIN_ATTR macro
sysfs.h: add ATTRIBUTE_GROUPS() macro
sysfs.h: add __ATTR_RW() macro
The kdump mmap patch series (git commit 83086978c6) directly
map the PT_LOADs to memory. On s390 this does not work because the
copy_from_oldmem() function swaps [0,crashkernel size] with
[crashkernel base, crashkernel base+crashkernel size]. The swap
int copy_from_oldmem() was done in order correctly implement /dev/oldmem.
See: http://marc.info/?l=kexec&m=136940802511603&w=2
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Technically, the Linux client is allowed by the NFSv4 spec to send
3 word bitmaps as part of an OPEN request. However, this causes the
current FreeBSD server to return NFS4ERR_ATTRNOTSUPP errors.
Fix the regression by making the Linux client use a 2 word bitmap unless
doing NFSv4.2 with labeled NFS.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Pull nfsd bugfixes from Bruce Fields:
"Just three minor bugfixes"
* 'for-3.11' of git://linux-nfs.org/~bfields/linux:
svcrdma: underflow issue in decode_write_list()
nfsd4: fix minorversion support interface
lockd: protect nlm_blocked access in nlmsvc_retry_blocked
When "fs/aio: Add support to aio ring pages migration" was applied, it
broke the build when CONFIG_MIGRATION was disabled. Wrap the migration
code with a test for CONFIG_MIGRATION to fix this and save a few bytes
when migration is disabled.
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Fuse does instantiation slightly differently from NFS/CIFS which use
d_materialise_unique().
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
Add sanity checks before adding or updating an entry with data received
from readdirplus.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
In case d_lookup() returns a dentry with d_inode == NULL, the dentry is not
returned with dput(). This results in triggering a BUG() in
shrink_dcache_for_umount_subtree():
BUG: Dentry ...{i=0,n=...} still in use (1) [unmount of fuse fuse]
[SzM: need to d_drop() as well]
Reported-by: Justin Clift <jclift@redhat.com>
Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Tested-by: Brian Foster <bfoster@redhat.com>
Tested-by: Niels de Vos <ndevos@redhat.com>
CC: stable@vger.kernel.org
When only using bin_attrs instead of attrs the kernel prints a warning
and refuses to create the sysfs entry. This fixes that.
Signed-off-by: Oliver Schinagl <oliver@schinagl.nl>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
groups should be able to support binary attributes, just like it
supports "normal" attributes. This lets us only handle one type of
structure, groups, throughout the driver core and subsystems, making
binary attributes a "full fledged" part of the driver model, and not
something just "tacked on".
Reported-by: Oliver Schinagl <oliver@schinagl.nl>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
If there are no items in the extent status tree, ext4_es_lru_add() is
a no-op. So it is not sufficient to call ext4_es_lru_add() before we
try to lookup an entry in the extent status tree. We also need to
call it at the end of ext4_ext_map_blocks(), after items have been
added to the extent status tree.
This could lead to inodes with that have extent status trees but which
are not in the LRU list, which means they won't get considered for
eviction by the es_shrinker.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <wenqing.lz@taobao.com>
Cc: stable@vger.kernel.org
As the aio job will pin the ring pages, that will lead to mem migrated
failed. In order to fix this problem we use an anon inode to manage the aio ring
pages, and setup the migratepage callback in the anon inode's address space, so
that when mem migrating the aio ring pages will be moved to other mem node safely.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Introduce a new lib function anon_inode_getfile_private(), it creates a new file
instance by hooking it up to an anonymous inode, and a dentry that describe the
"class" of the file, similar to anon_inode_getfile(), but each file holds a
single inode. Furthermore, anyone who wants to create a private anon file will
benefit from this change.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
PTR_RET is now deprecated. Use PTR_ERR_OR_ZERO instead.
Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
During large unlink operations on files with extents, we can use a lot
of CPU time. This adds a cond_resched() call when starting to examine
the next level of a multi-level extent tree. Multi-level extent trees
are rare in the first place, and this should rarely be executed.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJR43U/AAoJENNvdpvBGATwdl4P+gI23RkFXTHKvd3XtmXLQojT
ncRXVOAARuRZiMbiAOzXv/BDSkLHnOHw6fVLK5buFTLlpQ00tdlrd6ngui4NTe+v
Qo0GUqL09iSMLEgZV0OwxV5EULPpYb/xQwfQNAqG3pQbUFq/JdxptBT7r/go/YnX
bzWSDiMKeFQoIgH1/xDGXRrfcSdEbjewMfT7lXq+XWRlPyyJPjLnxzDGfJDaOLSR
rCZJOsbCfxzwhBd2HFzH55CGGU4yoZ6O7qpsMoF1gjqUSJ2DmVhMV/NSspmTnKRd
EZKDT7LK8c02UNdYzLPzPpRjAQfUWBgnh9R84Ake8Py2UHGommTyz6TqMmNTbW5Q
EMRd461v+8bvIYnbe/tkT+CTTkC7lRapX6AYaq8k+MpLIWE1bmvX+bMRYOejTE4r
jTgYUktzaVzx/4XdgT837vCbsFttixL3x62XelrkZoANw/m0+jgOn9mY5pjDFp8j
Eq5wWJ8IsuxCofk/qQj5rOK7/3tFcdJULCoX8f3AB0vooAUKTXBYxYflfIeSgqeZ
vlp0ymj588pimH3LM0Vs1BT/aGh0JninLIBk+hcb2YxC2NzvLO2pjSV8i+olBU+C
Yq7MoakdT/FDTWp8WbbZm21C95Tj/zCfMCBSgC0k7LpQVM00ts87UdUgfAZPzI1w
ZISZFy6O/zhPMFAZCxfV
=qf2h
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"Various regression and bug fixes for ext4"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: don't allow ext4_free_blocks() to fail due to ENOMEM
ext4: fix spelling errors and a comment in extent_status tree
ext4: rate limit printk in buffer_io_error()
ext4: don't show usrquota/grpquota twice in /proc/mounts
ext4: fix warning in ext4_evict_inode()
ext4: fix ext4_get_group_number()
ext4: silence warning in ext4_writepages()
Some callers of ext4_es_remove_extent() and ext4_es_insert_extent()
may not be completely robust against ENOMEM failures (or the
consequences of reflecting ENOMEM back up to userspace may lead to
xfstest or user application failure).
To mitigate against this, when trying to insert an entry in the extent
status tree, try to shrink the inode's extent status tree before
returning ENOMEM. If there are entries which don't record information
about extents under delayed allocations, freeing one of them is
preferable to returning ENOMEM.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
In ext4_ext_map_blocks(), if we have successfully allocated the data
blocks, but then run into trouble inserting the extent into the extent
tree, most likely due to an ENOSPC condition, determine the arguments
to ext4_free_blocks() in a simpler way which is easier to prove to be
correct.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Previously ext4_ext_truncate() was ignoring potential error returns
from ext4_es_remove_extent() and ext4_ext_remove_space(). This can
lead to the on-diks extent tree and the extent status tree cache
getting out of sync, which is particuarlly bad, and can lead to file
system corruption and potential data loss.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Pull more vfs stuff from Al Viro:
"O_TMPFILE ABI changes, Oleg's fput() series, misc cleanups, including
making simple_lookup() usable for filesystems with non-NULL s_d_op,
which allows us to get rid of quite a bit of ugliness"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
sunrpc: now we can just set ->s_d_op
cgroup: we can use simple_lookup() now
efivarfs: we can use simple_lookup() now
make simple_lookup() usable for filesystems that set ->s_d_op
configfs: don't open-code d_alloc_name()
__rpc_lookup_create_exclusive: pass string instead of qstr
rpc_create_*_dir: don't bother with qstr
llist: llist_add() can use llist_add_batch()
llist: fix/simplify llist_add() and llist_add_batch()
fput: turn "list_head delayed_fput_list" into llist_head
fs/file_table.c:fput(): add comment
Safer ABI for O_TMPFILE
Pull networking fixes from David Miller:
"Just a bunch of small fixes and tidy ups:
1) Finish the "busy_poll" renames, from Eliezer Tamir.
2) Fix RCU stalls in IFB driver, from Ding Tianhong.
3) Linearize buffers properly in tun/macvtap zerocopy code.
4) Don't crash on rmmod in vxlan, from Pravin B Shelar.
5) Spinlock used before init in alx driver, from Maarten Lankhorst.
6) A sparse warning fix in bnx2x broke TSO checksums, fix from Dmitry
Kravkov.
7) Dummy and ifb driver load failure paths can oops, fixes from Tan
Xiaojun and Ding Tianhong.
8) Correct MTU calculations in IP tunnels, from Alexander Duyck.
9) Account all TCP retransmits in SNMP stats properly, from Yuchung
Cheng.
10) atl1e and via-rhine do not handle DMA mapping failures properly,
from Neil Horman.
11) Various equal-cost multipath route fixes in ipv6 from Hannes
Frederic Sowa"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (36 commits)
ipv6: only static routes qualify for equal cost multipathing
via-rhine: fix dma mapping errors
atl1e: fix dma mapping warnings
tcp: account all retransmit failures
usb/net/r815x: fix cast to restricted __le32
usb/net/r8152: fix integer overflow in expression
net: access page->private by using page_private
net: strict_strtoul is obsolete, use kstrtoul instead
drivers/net/ieee802154: don't use devm_pinctrl_get_select_default() in probe
drivers/net/ethernet/cadence: don't use devm_pinctrl_get_select_default() in probe
drivers/net/can/c_can: don't use devm_pinctrl_get_select_default() in probe
net/usb: add relative mii functions for r815x
net/tipc: use %*phC to dump small buffers in hex form
qlcnic: Adding Maintainers.
gre: Fix MTU sizing check for gretap tunnels
pkt_sched: sch_qfq: remove forward declaration of qfq_update_agg_ts
pkt_sched: sch_qfq: improve efficiency of make_eligible
gso: Update tunnel segmentation to support Tx checksum offload
inet: fix spacing in assignment
ifb: fix oops when loading the ifb failed
...
- fix for xfs_fsr returning -EINVAL
- cleanup in xfs_bulkstat
- cleanup in xfs_open_by_handle
- update mount options documentation
- clean up local format handling in xfs_bmapi_write
- fix dquot log reservations which were too small
- fix sgid inheritance for subdirectories when default acls are in use
- add project quota fields to various structures
- fix teardown of quotainfo structures when quotas are turned off
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
iQIcBAABAgAGBQJR4FDjAAoJENaLyazVq6ZOhZIQAMLWC4Wz+3PpRLkIlXZ83wdG
LYDNxdYntSVDXNLCrfIhuavgW1yhseLcQZD34g0hgOLQRsmjvUw2ikO6u7qlyoV1
GZZVwdVhsZNicycvuEE0Sva9Jmjgxe0XUOCxJBNAWN0fm5Jzg7w0OGWyzxn2obRX
L4LNPqnQS/phCrtNYfUnXpCuUcy0KbpX4GYrdt5tThIWHm7AcyRnOArEkFvVuwdf
N3OSN/jSMaEF/GsAqmYFYSXhuL9P1vyBSlyFW82YbyFFd4FKZJbRiFbOsrgbFSkA
Ssum7N+rfMc9DCbJsrztgxFaYpj42JR5eCm+jvTejx8nJWKiGjMVtzjq4QtwQQ6e
vby7MzdjZ+l2oJclA0y8hOjeg0R7sPVP7xZziZRuK4PHsjtBH3N2FtCOeBtlGhyW
14LK+z+5YXU/gEwmxV5LaknODb2mxvWycf70jaQ6bvrQRUiFnPNIxYKvgAx8YJxl
jgYSassHHKtLg0S54P/T91tRsyDVOhy5lqgeogzK5uYa+v+xlMloG+fLzb9GmIgS
hXgUIAo+lNlHZkw1FdD4aRgh3OMiUvLQN6woBMbbXfS5XaNpF1UG30YAXeeIqV5e
cLChzY+jQiCsmcktb3YQs9C5yfxEciFFSYmZOaKCTgQRWWnyI4/lAu1gd2KtTYUt
ZfV0niME4wp0kBWZHOEH
=QiZy
-----END PGP SIGNATURE-----
Merge tag 'for-linus-v3.11-rc1-2' of git://oss.sgi.com/xfs/xfs
Pull more xfs updates from Ben Myers:
"Here are a fix for xfs_fsr, a cleanup in bulkstat, a cleanup in
xfs_open_by_handle, updated mount options documentation, a cleanup in
xfs_bmapi_write, a fix for the size of dquot log reservations, a fix
for sgid inheritance when acls are in use, a fix for cleaning up
quotainfo structures, and some more of the work which allows group and
project quotas to be used together.
We had a few more in this last quota category that we might have liked
to get in, but it looks there are still a few items that need to be
addressed.
- fix for xfs_fsr returning -EINVAL
- cleanup in xfs_bulkstat
- cleanup in xfs_open_by_handle
- update mount options documentation
- clean up local format handling in xfs_bmapi_write
- fix dquot log reservations which were too small
- fix sgid inheritance for subdirectories when default acls are in use
- add project quota fields to various structures
- fix teardown of quotainfo structures when quotas are turned off"
* tag 'for-linus-v3.11-rc1-2' of git://oss.sgi.com/xfs/xfs:
xfs: Fix the logic check for all quotas being turned off
xfs: Add pquota fields where gquota is used.
xfs: fix sgid inheritance for subdirectories inheriting default acls [V3]
xfs: dquot log reservations are too small
xfs: remove local fork format handling from xfs_bmapi_write()
xfs: update mount options documentation
xfs: use get_unused_fd_flags(0) instead of get_unused_fd()
xfs: clean up unused codes at xfs_bulkstat()
xfs: use XFS_BMAP_BMDR_SPACE vs. XFS_BROOT_SIZE_ADJ
Pull cifs fixes from Steve French:
"Fixes for 4 cifs bugs, including a reconnect problem, a problem
parsing responses to SMB2 open request, and setting nlink incorrectly
to some servers which don't report it properly on the wire. Also
improves data integrity on reconnect with series from Pavel which adds
durable handle support for SMB2."
* 'for-linus' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: Fix a deadlock when a file is reopened
CIFS: Reopen the file if reconnect durable handle failed
[CIFS] Fix minor endian error in durable handle patch series
CIFS: Reconnect durable handles for SMB2
CIFS: Make SMB2_open use cifs_open_parms struct
CIFS: Introduce cifs_open_parms struct
CIFS: Request durable open for SMB2 opens
CIFS: Simplify SMB2 create context handling
CIFS: Simplify SMB2_open code path
CIFS: Respect create_options in smb2_open_file
CIFS: Fix lease context buffer parsing
[CIFS] use sensible file nlink values if unprovided
Limit allocation of crypto mechanisms to dialect which requires
fput() and delayed_fput() can use llist and avoid the locking.
This is unlikely path, it is not that this change can improve
the performance, but this way the code looks simpler.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A missed update to "fput: task_work_add() can fail if the caller has
passed exit_task_work()".
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Vagin <avagin@openvz.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[suggested by Rasmus Villemoes] make O_DIRECTORY | O_RDWR part of O_TMPFILE;
that will fail on old kernels in a lot more cases than what I came up with.
And make sure O_CREAT doesn't get there...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The filesystem should not be marked inconsistent if ext4_free_blocks()
is not able to allocate memory. Unfortunately some callers (most
notably ext4_truncate) don't have a way to reflect an error back up to
the VFS. And even if we did, most userspace applications won't deal
with most system calls returning ENOMEM anyway.
Reported-by: Nagachandra P <nagachandra@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
Replace "assertation" with "assertion" in lots and lots of debugging
messages.
Correct the comment stating when ext4_es_insert_extent() is used. It
was no doubt tree at one point, but it is no longer true...
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Zheng Liu <gnehzuil.liu@gmail.com>
You can turn on or off support for minorversions using e.g.
echo "-4.2" >/proc/fs/nfsd/versions
However, the current implementation is a little wonky. For example, the
above will turn off 4.2 support, but it will also turn *on* 4.1 support.
This didn't matter as long as we only had 2 minorversions, which was
true till very recently.
And do a little cleanup here.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If there are a lot of outstanding buffered IOs when a device is
taken offline (due to hardware errors etc), ext4_end_bio prints
out a message for each failed logical block. While this is desirable,
we see thousands of such lines being printed out before the
serial console gets overwhelmed, causing ext4_end_bio() wait for
the printk to complete.
This in itself isn't a disaster, except for the detail that this
function is being called with the queue lock held.
This causes any other function in the block layer
to spin on its spin_lock_irqsave while the serial console is
draining. If NMI watchdog is enabled on this machine then it
eventually comes along and shoots the machine in the head.
The end result is that losing any one disk causes the machine to
go down. This patch rate limits the printk to bandaid around the
problem.
Tested: xfstests
Change-Id: I8ab5690dcf4f3a67e78be147d45e489fdf4a88d8
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If we request reading or writing on a file that needs to be
reopened, it causes the deadlock: we are already holding rw
semaphore for reading and then we try to acquire it for writing
in cifs_relock_file. Fix this by acquiring the semaphore for
reading in cifs_relock_file due to we don't make any changes in
locks and don't need a write access.
CC: <stable@vger.kernel.org>
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
This is a follow-on patch for 8/8 patch from the durable handles
series. It fixes the problem when durable file handle timeout
expired on the server and reopen returns -ENOENT for such files.
Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org>
Signed-off-by: Steve French <smfrench@gmail.com>
We now print mount options in a generic fashion in
ext4_show_options(), so we shouldn't be explicitly printing the
{usr,grp}quota options in ext4_show_quota_options().
Without this patch, /proc/mounts can look like this:
/dev/vdb /vdb ext4 rw,relatime,quota,usrquota,data=ordered,usrquota 0 0
^^^^^^^^ ^^^^^^^^
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
During the review of seperate pquota inode patches, David noticed
that the test to detect all quotas being turned off was
incorrect, and hence the block was not freeing all the quota
information.
The check made sense in Irix, but in Linux, quota is turned off
one at a time, which makes the test invalid for Linux.
This problem existed since XFS was ported to Linux.
David suggested to fix the problem by detecting when all quotas are
turned off by checking m_qflags.
Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com>
Reviewed-by: Ben Myers <bpm@sgi.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
In nlmsvc_retry_blocked, the check that the list is non-empty and acquiring
the pointer of the first entry is unprotected by any lock. This allows a rare
race condition when there is only one entry on the list. A function such as
nlmsvc_grant_callback() can be called, which will temporarily remove the entry
from the list. Between the list_empty() and list_entry(),the list may become
empty, causing an invalid pointer to be used as an nlm_block, leading to a
possible crash.
This patch adds the nlm_block_lock around these calls to prevent concurrent
use of the nlm_blocked list.
This was a regression introduced by
f904be9cc7 "lockd: Mostly remove BKL from
the server".
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Pull core block IO updates from Jens Axboe:
"Here are the core IO block bits for 3.11. It contains:
- A tweak to the reserved tag logic from Jan, for weirdo devices with
just 3 free tags. But for those it improves things substantially
for random writes.
- Periodic writeback fix from Jan. Marked for stable as well.
- Fix for a race condition in IO scheduler switching from Jianpeng.
- The hierarchical blk-cgroup support from Tejun. This is the grunt
of the series.
- blk-throttle fix from Vivek.
Just a note that I'm in the middle of a relocation, whole family is
flying out tomorrow. Hence I will be awal the remainder of this week,
but back at work again on Monday the 15th. CC'ing Tejun, since any
potential "surprises" will most likely be from the blk-cgroup work.
But it's been brewing for a while and sitting in my tree and
linux-next for a long time, so should be solid."
* 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
elevator: Fix a race in elevator switching
block: Reserve only one queue tag for sync IO if only 3 tags are available
writeback: Fix periodic writeback after fs mount
blk-throttle: implement proper hierarchy support
blk-throttle: implement throtl_grp->has_rules[]
blk-throttle: Account for child group's start time in parent while bio climbs up
blk-throttle: add throtl_qnode for dispatch fairness
blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
blk-throttle: make blk_throtl_bio() ready for hierarchy
blk-throttle: make blk_throtl_drain() ready for hierarchy
blk-throttle: dispatch from throtl_pending_timer_fn()
blk-throttle: implement dispatch looping
blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
blk-throttle: add throtl_service_queue->parent_sq
blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
blk-throttle: move bio_lists[] and friends to throtl_service_queue
...
Highlights include:
- Fix an_rpc pipefs regression that causes a deadlock on mount
- Readdir optimisations by Scott Mayhew and Jeff Layton
- clean up the rpc_pipefs dentry operation setup
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iQIcBAABAgAGBQJR3vVIAAoJEGcL54qWCgDyBWEP/0blqSlJId4zZj4xDviRFqJ4
93C7b/Vn7LrAcNCgDQsPkkzTwAX5yTB1H5eNtMuyggAdGj89d4n0jXgBniIMHmqI
Pjrr/XMQ65NddehrO491N01iJSfP9wE3CizJodnAv4VxMRO3xqiJG85lcnoLOFea
V1FnEFUu9oi8e93cQt2fe6KdmTu/SuRqlqR7WPGyTFgS26x1l8nkp2OQgulit5Up
lWuaxg4xbKOdj1jfUDXZhWUnDtkFjxyGxnKR63aA2X1DEGCUTJ6gB3tAl9pvnUb2
RTQF3GVj+Bm/E3gE6ULJvqOjhsgWYjLAZn6hDA3yNAIiFyV7aA6gwK4oKy/B47a6
tFEN2O1EupWzCqGyHhTArk+oEBLfUv/EgFyo7+Y0YIFV4sQTu5RbaZ0nQ2geY6LA
50q2GH57tkXTs859gtBPQgKzgRF1ulkF1FDY9EYQHyGiUbNxBfx+6/2OI04ubQt3
1gKUmm9w1WVzYGmHcHbxsXPT53NtAnHXW4ExcMgpaZ1YOPuIILm78ZuAw78XB/dd
mvXRtbhVt/gs7qZAQQPp1iHIv+vnJ0KgjO62gbuTIRftw5jwWrpWcfYMUUZrMnyM
kn326z3f4gn/vSDZI7J4tOfG1Uc7eNy+cJxStjtiNWTs3UzuWJKzJH0rZnoNZdei
xAkLhjIUEybAqIpXJuGH
=NqQf
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-3.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull second set of NFS client updates from Trond Myklebust:
"This mainly contains some small readdir optimisations that had
dependencies on Al Viro's readdir rewrite. There is also a fix for a
nasty deadlock which surfaced earlier in this merge window.
Highlights include:
- Fix an_rpc pipefs regression that causes a deadlock on mount
- Readdir optimisations by Scott Mayhew and Jeff Layton
- clean up the rpc_pipefs dentry operation setup"
* tag 'nfs-for-3.11-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Fix a deadlock in rpc_client_register()
rpc_pipe: rpc_dir_inode_operations can be static
NFS: Allow nfs_updatepage to extend a write under additional circumstances
NFS: Make nfs_readdir revalidate less often
NFS: Make nfs_attribute_cache_expired() non-static
rpc_pipe: set dentry operations at d_alloc time
nfs: set verifier on existing dentries in nfs_prime_dcache
Several of these patches were rebased in order to correct style issues.
Only stylistic changes were made versus the patches which were in linux-next
for two weeks. The rebases have been in linux-next for 3 days and have
passed my regressions.
The bulk of these are RDMA fixes and improvements. There's also some
additions on the extended attributes front to support some additional
namespaces and a new option for TCP to force allocation of mount requests
from a priviledged port.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
Comment: GPGTools - http://gpgtools.org
iQIcBAABAgAGBQJR3rWXAAoJEDZk62b0Tg6xabIP/12I+SkQ57wRN03EQy5fqUdX
gK/YMHKQ9QuDnZPBvrZ2lypesQNqVU0KINay6VEA86JG1gwzPyUd2MnpQ7F0vV3N
XwVD54IoflV/M74xUnrgGWB8YxaPcdacQQ8yazX+mEgOgYGdWmDAl7FHmAkdKAFB
gSl25f3PNJX1Rjay0dssNVXrVPXuJY/fZXKnNQZKtRwXffRWKsWHd8FU0Eq7F30A
kNQB8tmMSfHBBjP+tzR0My6/kQ09jzHdtZOkH9IgVpNzqrd8tfy0l6tEvFypxqGT
5oQFoxHHL/tUW05V0P3gYany2A7lEhSUifPKS6omqHO+vPlw+pDJw+xWlNq9fnDt
8S8znqVuEHhvqRQW7zFdb9ac2MZi8CHHhC2wGIZ7GYjNG2q5XwE8b/QhdXQeFin7
ibugvoW7+ZdcDewpQW27oO0g7B/8hRt8KC+1lc/8rITKIfGxbNJkGzTDl0F4Co7v
IH7Ew5PHPe6ZiuU0QSdU+NBuvk8g8sWGxx04Xvzl3WicwOg7XvN3ivrKB9oN2U1x
50KZRnYpwQQv/9AxyhroYU+Ufje8SF4v++zsq1eMzUcHsC/C73eatw2m764t+X4S
8yMLrgqY1Nzif4nAMi/SDMnB/R1bXeuc8kXD9xT6XD9d2tf6e+zCHhQklVeC0tuK
RiVRJqGrfanbKMnWIG0Y
=n9rI
-----END PGP SIGNATURE-----
Merge tag 'for-linus-3.11-merge-window-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull second round of 9p patches from Eric Van Hensbergen:
"Several of these patches were rebased in order to correct style
issues. Only stylistic changes were made versus the patches which
were in linux-next for two weeks. The rebases have been in linux-next
for 3 days and have passed my regressions.
The bulk of these are RDMA fixes and improvements. There's also some
additions on the extended attributes front to support some additional
namespaces and a new option for TCP to force allocation of mount
requests from a priviledged port"
* tag 'for-linus-3.11-merge-window-part-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
fs/9p: Remove the unused variable "err" in v9fs_vfs_getattr()
9P: Add cancelled() to the transport functions.
9P/RDMA: count posted buffers without a pending request
9P/RDMA: Improve error handling in rdma_request
9P/RDMA: Do not free req->rc in error handling in rdma_request()
9P/RDMA: Use a semaphore to protect the RQ
9P/RDMA: Protect against duplicate replies
9P/RDMA: increase P9_RDMA_MAXSIZE to 1MB
9pnet: refactor struct p9_fcall alloc code
9P/RDMA: rdma_request() needs not allocate req->rc
9P: Fix fcall allocation for rdma
fs/9p: xattr: add trusted and security namespaces
net/9p: add privport option to 9p tcp transport
- Remove redundant code by merging some encrypt and decrypt functions
- Get rid of a helper page allocation during page decryption by using in-place
decryption
- Better use of entire pages during page crypto operations
- Several code cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIbBAABCgAGBQJR3Z1JAAoJENaSAD2qAscK+jAP92NR3W17njuPFJBHCROsS481
tyhzy2W/LlNK1njnS6SRP3O3Icv3adiJRtMZePV7bpCH3yD/JoYZ6VHQBNV7msHW
VuEwE6eqCkKl3NLunhwd4+m5R9qijJYGfYEzTve/RNASDU7/LPcTUF8OQ5dIB0wA
J6IMGGZSpsFa8ymN01YzmUEmOUx1IBR2aYBT8Og4Ke117ywDYqxS0ghd1rb953sS
7H8wnNcijs1DLGe71SZnKMVCYwO32GWarxDBAa0KsabcyY4Yr43O3ov/CsIAAA7B
Q1Dn7KNNSOCu9G6fHrnuMOTncGnNPLhIe6Yc0PCZ7ykVstpzlNkKJ628IEonsJaJ
4bYc3bqq4KH7rqMxjA+1GoLehpJWJzqwfiFI1fWLlYMmO2ky126rJUgSNBHQe9+M
iWl+ZrYokSsNWBcUsIq7SJFaLIhWDNcb+Wl7RiTNBBwoBaZclrNuWKIyeWPhH+9/
+/K3LBaggujzVpE743wgJhY60sfdHZmaRAD9agEbcG773JePXBg9OkiUp/hKSe8s
UaGkfmwAlz8u6mR1eJCuFDCqwJKByyT4vObuOFroh7NgOHaQZghlnO4HwuOjzp6U
wTUiMVslFY9WAsEWxdDhaCXxB8IrjHz3YZGIt8PU2eT6ucQU+HLvkkxqtzSYm9/7
BBfryWZKwR7T50JdehI=
=mRpN
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-3.11-rc1-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs updates from Tyler Hicks:
"Code cleanups and improved buffer handling during page crypto
operations:
- Remove redundant code by merging some encrypt and decrypt functions
- Get rid of a helper page allocation during page decryption by using
in-place decryption
- Better use of entire pages during page crypto operations
- Several code cleanups"
* tag 'ecryptfs-3.11-rc1-cleanup' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
Use ecryptfs_dentry_to_lower_path in a couple of places
eCryptfs: Make extent and scatterlist crypt function parameters similar
eCryptfs: Collapse crypt_page_offset() into crypt_extent()
eCryptfs: Merge ecryptfs_encrypt_extent() and ecryptfs_decrypt_extent()
eCryptfs: Combine page_offset crypto functions
eCryptfs: Combine encrypt_scatterlist() and decrypt_scatterlist()
eCryptfs: Decrypt pages in-place
eCryptfs: Accept one offset parameter in page offset crypto functions
eCryptfs: Simplify lower file offset calculation
eCryptfs: Read/write entire page during page IO
eCryptfs: Use entire helper page during page crypto operations
eCryptfs: Cocci spatch "memdup.spatch"