The estimate of necessary transaction credits in ext4_flex_group_add()
is too pessimistic. It reserves credit for sb, resize inode, and resize
inode dindirect block for each group added in a flex group although they
are always the same block and thus it is enough to account them only
once. Also the number of modified GDT block is overestimated since we
fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.
Make the estimation more precise. That reduces number of requested
credits enough that we can grow 20 MB filesystem (which has 1 MB
journal, 79 reserved GDT blocks, and flex group size 16 by default).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
fallocate() checks that the file is extent-based and returns
EOPNOTSUPP in case is not. Other tasks can convert from and to
indirect and extent so it's safe to check only after grabbing
the inode mutex.
Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Currently it is possible to lose whole file system block worth of data
when we hit the specific interaction with unwritten and delayed extents
in status extent tree.
The problem is that when we insert delayed extent into extent status
tree the only way to get rid of it is when we write out delayed buffer.
However there is a limitation in the extent status tree implementation
so that when inserting unwritten extent should there be even a single
delayed block the whole unwritten extent would be marked as delayed.
At this point, there is no way to get rid of the delayed extents,
because there are no delayed buffers to write out. So when a we write
into said unwritten extent we will convert it to written, but it still
remains delayed.
When we try to write into that block later ext4_da_map_blocks() will set
the buffer new and delayed and map it to invalid block which causes
the rest of the block to be zeroed loosing already written data.
For now we can fix this by simply not allowing to set delayed status on
written extent in the extent status tree. Also add WARN_ON() to make
sure that we notice if this happens in the future.
This problem can be easily reproduced by running the following xfs_io.
xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
-c "falloc 0 131072" \
-c "pwrite -S 0xbb 65536 2048" \
-c "fsync" /mnt/test/fff
echo 3 > /proc/sys/vm/drop_caches
xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
This can be theoretically also reproduced by at random by running fsx,
but it's not very reliable, though on machines with bigger page size
(like ppc) this can be seen more often (especially xfstest generic/127)
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
This patch removes duplicated encryption modes which were already in
ext4.h. They were duplicated from commit 3edc18d and commit f542fb.
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Michael Halcrow <mhalcrow@google.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Signed-off-by: Chanho Park <chanho61.park@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This patch adds a tristate EXT4_ENCRYPTION to do the selections
for EXT4_FS_ENCRYPTION because selecting from a bool causes all
the selected options to be built-in, even if EXT4 itself is a
module.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
This obscures the length of the filenames, to decrease the amount of
information leakage. By default, we pad the filenames to the next 4
byte boundaries. This costs nothing, since the directory entries are
aligned to 4 byte boundaries anyway. Filenames can also be padded to
8, 16, or 32 bytes, which will consume more directory space.
Change-Id: Ibb7a0fb76d2c48e2061240a709358ff40b14f322
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Avoid using SHA-1 when calculating the user-visible filename when the
encryption key is available, and avoid decrypting lots of filenames
when searching for a directory entry in a directory block.
Change-Id: If4655f144784978ba0305b597bfa1c8d7bb69e63
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull btrfs fixes from Chris Mason:
"A few more btrfs fixes.
These range from corners Filipe found in the new free space cache
writeback to a grab bag of fixes from the list"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
btrfs: unlock i_mutex after attempting to delete subvolume during send
btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
btrfs: fix race on ENOMEM in alloc_extent_buffer
btrfs: handle ENOMEM in btrfs_alloc_tree_block
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Btrfs: don't check for delalloc_bytes in cache_save_setup
Btrfs: fix deadlock when starting writeback of bg caches
Btrfs: fix race between start dirty bg cache writeout and bg deletion
btrfs_release_extent_buffer_page() can't handle dummy extent that
allocated by btrfs_clone_extent_buffer() properly. That is because
reference count of pages that allocated by btrfs_clone_extent_buffer()
was 2, 1 by alloc_page(), and another by attach_extent_buffer_page().
Running following command repeatly can check this memory leak problem
btrfs inspect-internal inode-resolve 256 /mnt/btrfs
Signed-off-by: Chien-Kuan Yeh <ckya@synology.com>
Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Tested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull btrfs fixes from Chris Mason:
"Filipe hit two problems in my block group cache patches. We finalized
the fixes last week and ran through more tests"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
Btrfs: prevent list corruption during free space cache processing
Btrfs: fix inode cache writeout
Highlights include:
Stable patches:
- Fix a regression in /proc/self/mountstats
- Fix the pNFS flexfiles O_DIRECT support
- Fix high load average due to callback thread sleeping
Bugfixes:
- Various patches to fix the pNFS layoutcommit support
- Do not cache pNFS deviceids unless server notifications are enabled
- Fix a SUNRPC transport reconnection regression
- make debugfs file creation failure non-fatal in SUNRPC
- Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints
- Fix locking around NFSv4.2 fallocate() support
- Truncating NFSv4 file opens should also sync O_DIRECT writes
- Prevent infinite loop in rpcrdma_ep_create()
Features:
- Various improvements to the RDMA transport code's handling of memory
registration
- Various code cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVOmT6AAoJEGcL54qWCgDyrhYQAMPKXB55jrdOR/7UVSF/xPML
7OjMGHvBnTn/y0pNIyLyS1PjTZZsD/WZjoW9EFGpTv727qQNVoFxFRLNUcgi3NoL
1YledCkLf7Q32aqod93SRRFPc9hzBoKhOZpOzBuWaAviyAB3KLi70DWAq9qRReYM
prXUQQjpW5FLU+B2ifaVc2RCnu/rZ2c02YdR2XdtkBaAJxuhB2vR8IY1evwjCv3R
5zyLDd9zSDDoArdpUzM97cxZPcYRSqbOwgTKvaaRnDDq/mKbKMZaqmEfjblwzNFt
b43FbveJzZ3hlPADIpmaiMHjRTbxWjIKc9K1sOF2FPfcuPe2yM3DMAxDegUkEveS
7fkbv/qRZ30NqfchGanX/pmBlLOcdI76qe/bwhN19wCnw48O1eeHi1HK8rWGhU+E
qcrRZ3ZS2ufP/MVBuhauy0qU9Q4wcEtm7NGGP1231ZtmfjHKyBa4pLirNfG1AlJt
dK7tBrknVx+WVm/UddJp/fEsxbP0+fki6TwzioHUSWcz8rDVYF6PFT/QPM54SX2h
0oqwvu6d/uShpkVRm+fbje8FHmUxKdgqDsCYX2fNjWskh1oXSPsItvjqmTmTlE0i
EBmBwVwI0uB1ZQ3PrJLadhRcO3ZJmLQ5gNj456dstvWy6UQds1xyIQ/DgvmlzxWO
E9t0l18xHGRwbndsDa8f
=j5dP
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client updates from Trond Myklebust:
"Another set of mainly bugfixes and a couple of cleanups. No new
functionality in this round.
Highlights include:
Stable patches:
- Fix a regression in /proc/self/mountstats
- Fix the pNFS flexfiles O_DIRECT support
- Fix high load average due to callback thread sleeping
Bugfixes:
- Various patches to fix the pNFS layoutcommit support
- Do not cache pNFS deviceids unless server notifications are enabled
- Fix a SUNRPC transport reconnection regression
- make debugfs file creation failure non-fatal in SUNRPC
- Another fix for circular directory warnings on NFSv4 "junctioned"
mountpoints
- Fix locking around NFSv4.2 fallocate() support
- Truncating NFSv4 file opens should also sync O_DIRECT writes
- Prevent infinite loop in rpcrdma_ep_create()
Features:
- Various improvements to the RDMA transport code's handling of
memory registration
- Various code cleanups"
* tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
fs/nfs: fix new compiler warning about boolean in switch
nfs: Remove unneeded casts in nfs
NFS: Don't attempt to decode missing directory entries
Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
NFS: Rename idmap.c to nfs4idmap.c
NFS: Move nfs_idmap.h into fs/nfs/
NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
NFS: Add a stub for GETDEVICELIST
nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
nfs: fix DIO good bytes calculation
nfs: Fetch MOUNTED_ON_FILEID when updating an inode
sunrpc: make debugfs file creation failure non-fatal
nfs: fix high load average due to callback thread sleeping
NFS: Reduce time spent holding the i_mutex during fallocate()
NFS: Don't zap caches on fallocate()
xprtrdma: Make rpcrdma_{un}map_one() into inline functions
xprtrdma: Handle non-SEND completions via a callout
xprtrdma: Add "open" memreg op
xprtrdma: Add "destroy MRs" memreg op
xprtrdma: Add "reset MRs" memreg op
...
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
We need to fill inode when we found a node for it in delayed_nodes_tree.
But we did not fill the ->last_trans currently, it will cause the test
of xfstest/generic/311 fail. Scenario of the 311 is shown as below:
Problem:
(1). test_fd = open(fname, O_RDWR|O_DIRECT)
(2). pwrite(test_fd, buf, 4096, 0)
(3). close(test_fd)
(4). drop_all_caches() <-------- "echo 3 > /proc/sys/vm/drop_caches"
(5). test_fd = open(fname, O_RDWR|O_DIRECT)
(6). fsync(test_fd);
<-------- we did not get the correct log entry for the file
Reason:
When we re-open this file in (5), we would find a node
in delayed_nodes_tree and fill the inode we are lookup with the
information. But the ->last_trans is not filled, then the fsync()
will check the ->last_trans and found it's 0 then say this inode
is already in our tree which is commited, not recording the extents
for it.
Fix:
This patch fill the ->last_trans properly and set the
runtime_flags if needed in this situation. Then we can get the
log entries we expected after (6) and generic/311 passed.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com>
Reviewed-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Chris Mason <clm@fb.com>
Whenever the check for a send in progress introduced in commit
521e0546c9 (btrfs: protect snapshots from deleting during send) is
hit, we return without unlocking inode->i_mutex. This is easy to see
with lockdep enabled:
[ +0.000059] ================================================
[ +0.000028] [ BUG: lock held when returning to user space! ]
[ +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
[ +0.000026] ------------------------------------------------
[ +0.000029] btrfs/211 is leaving the kernel with locks still held!
[ +0.000029] 1 lock held by btrfs/211:
[ +0.000023] #0: (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0
Make sure we unlock it in the error path.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid.
When we try to access them later, things will blow up in various ways.
Also fix the comment about the return value, which is an errno on error,
not -1, and update the cases where it was not.
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Chris Mason <clm@fb.com>
Consider the following interleaving of overlapping calls to
alloc_extent_buffer:
Call 1:
- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages
Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls
mark_extent_buffer_accessed on it
mark_extent_buffer_accessed will then try to call mark_page_accessed on
a null page and panic.
The fix is to decrement the reference count on the half-written
extent_buffer before unlocking the pages so call 2 won't use it. We
should also set exists = NULL in the case that we don't use exists to
avoid accidentally returning a freed extent_buffer in an error case.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
This is one of the first places to give out when memory is tight. Handle
it properly rather than with a BUG_ON.
Also fix the comment about the return value, which is an ERR_PTR, not
NULL, on error.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Signed-off-by: Chris Mason <clm@fb.com>
If device tree has hole, find_free_dev_extent() cannot find available
address properly.
The problem can be reproduce by following script.
mntpath=/btrfs
loopdev=/dev/loop0
filepath=/home/forrest/image
umount $mntpath
losetup -d $loopdev
truncate --size 100g $filepath
losetup $loopdev $filepath
mkfs.btrfs -f $loopdev
mount $loopdev $mntpath
# make device tree with one big hole
for i in `seq 1 1 100`; do
fallocate -l 1g $mntpath/$i
done
sync
for i in `seq 1 1 95`; do
rm $mntpath/$i
done
sync
# wait cleaner thread remove unused block group
sleep 300
fallocate -l 1g $mntpath/aaa
# failed to allocate new chunk
fallocate -l 1g $mntpath/bbb
Above script will make device tree with one big hole, and can only allocate
just one chunk in a transaction, so failed to allocate new chunk for $mntpath/bbb
item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 106292051968 length 1073741824
item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
dev extent chunk_tree 3
chunk objectid 256 chunk offset 103108575232 length 1073741824
Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Now that we're doing free space cache writeback outside the critical
section in the commit, there is a bigger window for delalloc_bytes to
be added after a cache has been written. find_free_extent may do this
without putting the block group back into the dirty list, and also
without a transaction running.
Checking for delalloc_bytes in cache_save_setup means we might leave the
cache marked as written without invalidating it. Consistency checks
during mount will toss the cache, but it's better to get rid of the
check in cache_save_setup and let it get invalidated by the checks
already done during cache write out.
Signed-off-by: Chris Mason <clm@fb.com>
While running xfstests I ran into the following:
[20892.242791] ------------[ cut here ]------------
[20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]()
[20892.245874] BTRFS: Transaction aborted (error -2)
[20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$
[20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2
[20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20892.264738] 0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2
[20892.266244] ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48
[20892.267761] ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0
[20892.269378] Call Trace:
[20892.269915] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[20892.271097] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[20892.272173] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[20892.273386] [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.274857] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[20892.275851] [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs]
[20892.277341] [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs]
[20892.278628] [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs]
[20892.280191] [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs]
[20892.281781] [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf
[20892.282873] [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs]
[20892.284111] [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4
[20892.285203] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28
[20892.286290] [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[20892.287469] [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
[20892.288412] [<ffffffff8117ae54>] do_fsync+0x34/0x4e
[20892.289348] [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
[20892.290255] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[20892.291316] ---[ end trace 597f77e664245373 ]---
[20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
[20892.297390] BTRFS info (device sdg): forced readonly
This happens because in btrfs_start_dirty_block_groups() we splice the
transaction's list of dirty block groups into a local list and then we
keep extracting the first element of the list without holding the
cache_write_mutex mutex. This means that before we acquire that mutex
the first block group on the list might be removed by a conurrent task
running btrfs_remove_block_group(). So make sure we extract the first
element (and test the list emptyness) while holding that mutex.
Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout
outside critical section in commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Calling unlazy_walk() in walk_component() and do_last() when we find
a symlink that needs to be followed doesn't acquire a reference to vfsmount.
That's fine when the symlink is on the same vfsmount as the parent directory
(which is almost always the case), but it's not always true - one _can_
manage to bind a symlink on top of something. And in such cases we end up
with excessive mntput().
Cc: stable@vger.kernel.org # since 2.6.39
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.
For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:
clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]
After:
clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]
In other setups, Robert Elliott reported seeing good performance
improvements:
https://lkml.org/lkml/2015/4/3/557
The more applications accessing the device, the worse it gets.
Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro's IOV changes broke 9p readdir() because the new code
didn't abort the read when it returned nothing. The original
code checked if the combined error/length was <= 0 but in the
new code that accidentally got changed to just an error check.
Add back the return from the function when nothing is read.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Fixes: e1200fe68f ("9p: switch p9_client_read() to passing struct iov_iter *")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
__btrfs_write_out_cache is holding the ctl->tree_lock while it prepares
a list of bitmaps to record in the free space cache. It was dropping
the lock while it worked on other components, which made a window for
free_bitmap() to free the bitmap struct without removing it from the
list.
This changes things to hold the lock the whole time, and also makes sure
we hold the lock during enospc cleanup.
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Pull nfsd updates from Bruce Fields:
"A quiet cycle this time; this is basically entirely bugfixes.
The few that aren't cc'd to stable are cleanup or seemed unlikely to
affect anyone much"
* 'for-4.1' of git://linux-nfs.org/~bfields/linux:
uapi: Remove kernel internal declaration
nfsd: fix nsfd startup race triggering BUG_ON
nfsd: eliminate NFSD_DEBUG
nfsd4: fix READ permission checking
nfsd4: disallow SEEK with special stateids
nfsd4: disallow ALLOCATE with special stateids
nfsd: add NFSEXP_PNFS to the exflags array
nfsd: Remove duplicate macro define for max sec label length
nfsd: allow setting acls with unenforceable DENYs
nfsd: NFSD_FAULT_INJECTION depends on DEBUG_FS
nfsd: remove unused status arg to nfsd4_cleanup_open_state
nfsd: remove bogus setting of status in nfsd4_process_open2
NFSD: Use correct reply size calculating function
NFSD: Using path_equal() for checking two paths
Pull btrfs updates from Chris Mason:
"I've been running these through a longer set of load tests because my
commits change the free space cache writeout. It fixes commit stalls
on large filesystems (~20T space used and up) that we have been
triggering here. We were seeing new writers blocked for 10 seconds or
more during commits, which is far from good.
Josef and I fixed up ENOSPC aborts when deleting huge files (3T or
more), that are triggered because our metadata reservations were not
properly accounting for crcs and were not replenishing during the
truncate.
Also in this series, a number of qgroup fixes from Fujitsu and Dave
Sterba collected most of the pending cleanups from the list"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits)
btrfs: quota: Update quota tree after qgroup relationship change.
btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.
btrfs: Update btrfs qgroup status item when rescan is done.
btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.
btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created
btrfs: Check qgroup level in kernel qgroup assign.
btrfs: qgroup: allow to remove qgroup which has parent but no child.
btrfs: qgroup: return EINVAL if level of parent is not higher than child's.
btrfs: qgroup: do a reservation in a higher level.
Btrfs: qgroup, Account data space in more proper timings.
Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
Btrfs: qgroup: free reserved in exceeding quota.
Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().
btrfs: qgroup: fix limit args override whole limit struct
btrfs: qgroup: update limit info in function btrfs_run_qgroups().
btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().
btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.
btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.
btrfs: Support busy loop of write and delete
...
This update contains:
o RENAME_WHITEOUT support
o conversion of per-cpu superblock accounting to use generic counters
o new inode mmap lock so that we can lock page faults out of truncate, hole
punch and other direct extent manipulation functions to avoid racing mmap
writes from causing data corruption
o rework of direct IO submission and completion to solve data corruption issue
when running concurrent extending DIO writes. Also solves problem of running
IO completion transactions in interrupt context during size extending AIO
writes.
o FALLOC_FL_INSERT_RANGE support for inserting holes into a file via direct
extent manipulation to avoid needing to copy data within the file
o attribute block header field overflow fix for 64k block size filesystems
o Lots of changes to log messaging to be more informative and concise when
errors occur. Also prevent a lot of unnecessary log spamming due to cascading
failures in error conditions.
o lots of cleanups and bug fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJVOE8oAAoJEK3oKUf0dfodx1kQAIIH8CwqcBrIslOntfHlFPHz
P9aQl5uiI6JcnFqMiHG6mfnjWGpn+Z6XMDGIBwrSTzHj8IEnHTeXqYiS6SDPAnrH
+VmlJEvW01ucAv7vcXKPrfutcc8dxLpy4fs63HOWmXh4rmrTcpel5S+0JSQxyGd6
OriLg1nfD4Sid7R9CFEXAKLghJFK+gbao2CmT0wo6ZrTwiZl2p62Y187ou+d+u3k
BRol99pI/Sp9bKpWZpUv3q2RnfD1v/k4oDP/JG4Ohdt2dx+nDqCjLvL8B5hJu74B
ZI+R+N28sAkMmbtR61kk06F7MS9RZqzBNIZalugaSuspKoenDZzmURZX+i77ogPQ
Ii3XLUMUzdwmi55/tBhpI7VkpFxahaEbWzTT1sMBh/Ka3GXO56BMIYTPvntjoN4w
ElcbFAMAZl8O56ruGBnc/k72CfFbq8qp93KkOfBGIKwwiPN+eCK8bQYL4G3sIZzx
f6k/WLbbShyViX9qoWLiX7qUfvh0NU/EcmGcJBsTmn0NFNOP4WmuojAq6SrvTgEz
No6zYJtnJvEPDa/v5A0dZyYfLqz2cTkEyTM9uwSixcCa1qAS+8IBcCGgTKfQOYkV
hCUWugiHwj4OQ/6WgP6oYLtIYdw6gqXgUKZy1Iy+ThDRwLbg9emYWixQTi4GAuRO
2SEBbFGSk7KIpoPENDUC
=WE6f
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pull xfs update from Dave Chinner:
"This update contains:
- RENAME_WHITEOUT support
- conversion of per-cpu superblock accounting to use generic counters
- new inode mmap lock so that we can lock page faults out of
truncate, hole punch and other direct extent manipulation functions
to avoid racing mmap writes from causing data corruption
- rework of direct IO submission and completion to solve data
corruption issue when running concurrent extending DIO writes.
Also solves problem of running IO completion transactions in
interrupt context during size extending AIO writes.
- FALLOC_FL_INSERT_RANGE support for inserting holes into a file via
direct extent manipulation to avoid needing to copy data within the
file
- attribute block header field overflow fix for 64k block size
filesystems
- Lots of changes to log messaging to be more informative and concise
when errors occur. Also prevent a lot of unnecessary log spamming
due to cascading failures in error conditions.
- lots of cleanups and bug fixes
One thing of note is the direct IO fixes that we merged last week
after the window opened. Even though a little late, they fix a user
reported data corruption and have been pretty well tested. I figured
there was not much point waiting another 2 weeks for -rc1 to be
released just so I could send them to you..."
* tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
xfs: using generic_file_direct_write() is unnecessary
xfs: direct IO EOF zeroing needs to drain AIO
xfs: DIO write completion size updates race
xfs: DIO writes within EOF don't need an ioend
xfs: handle DIO overwrite EOF update completion correctly
xfs: DIO needs an ioend for writes
xfs: move DIO mapping size calculation
xfs: factor DIO write mapping from get_blocks
xfs: unlock i_mutex in xfs_break_layouts
xfs: kill unnecessary firstused overflow check on attr3 leaf removal
xfs: use larger in-core attr firstused field and detect overflow
xfs: pass attr geometry to attr leaf header conversion functions
xfs: disallow ro->rw remount on norecovery mount
xfs: xfs_shift_file_space can be static
xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate
fs: Add support FALLOC_FL_INSERT_RANGE for fallocate
xfs: Fix incorrect positive ENOMEM return
xfs: xfs_mru_cache_insert() should use GFP_NOFS
xfs: %pF is only for function pointers
xfs: fix shadow warning in xfs_da3_root_split()
...
The code to fix stalls during free spache cache IO wasn't using
the correct root when waiting on the IO for inode caches. This
is only a problem when the inode cache is enabled with
mount -o inode_cache
This fixes the inode cache writeout to preserve any error values and
makes sure not to override the root when inode cache writeout is done.
Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
This patch series creates an operation vector for each of the different
memory registration modes. This should make it easier to one day increase
credit limit, rsize, and wsize.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJVIoD8AAoJENfLVL+wpUDrGzMP/3976ixlHREOUxITQaWLUCpE
g55hhfkv1ebu3CiaRaV1Zz9lfZMREyzsFPcM6ZmzJ6M4s1zJLfmA6QtbnHOkKcrP
pfYRVaJtJ4Qw0CXPKHCmJXYKqoPlxIfeNxGP71MNKo3bx18g4kh0eoGvp1kjC7Fd
p0t6/n/GPG4Wx0ll93cH/0/MXvCmngRnpJFM8+j8o1NpBwcgJu1MKtj+UQHWzpII
QATefZzqz2xW4Er9dKt0qoKm1R22sz5GE7AHMdDtBvtnKaliE/pm3W1RMtgO3Fbc
fa+ISfHyeQnlAmye4iEpZc6MAugQm6/av39Qn8OOUDOYd8cpM3FTHVYPcgOoM1cL
SNOPp/YfiZDAgRzV3KmfVeXT0EqJoKmZsmQwaRNO+N0KXuVzIcaURWXoih8ANQ/m
9Rh97lRo1xFgLbDFIJCp/kCT43hN8UDQdDEVLvAXMfEff7rnDAQg8Gw3PeNJpvHJ
VazGJKeTzHYw1qut7TzQXYQicYyuIW9QyEpbsO1rx6iFdK6wdZxG6ingLCit9oIc
zh3/L+/plJBYTWEZtzffskfjFIXDvbgTzMIt8gB9W7Bdpbd28S8tmu/TAJfsuUId
Z4mojHpaksD1dO9dnF3viibYgEAs5E+dFrDoMhceBTbnu7zt5BgU7zbOZPTR6qvt
yC4FYxzfhgYpmEooYFoP
=Fkzr
-----END PGP SIGNATURE-----
Merge tag 'nfs-rdma-for-4.1-1' of git://git.linux-nfs.org/projects/anna/nfs-rdma
NFS: NFSoRDMA Client Changes
This patch series creates an operation vector for each of the different
memory registration modes. This should make it easier to one day increase
credit limit, rsize, and wsize.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
* bugfixes:
NFSv4: Return delegations synchronously in evict_inode
SUNRPC: Fix a regression when reconnecting
NFS: remount with security change should return EINVAL
nfs: do not export discarded symbols
NFSv4.1: don't export static symbol
The brand new GCC 5.1.0 warns by default on using a boolean in the
switch condition. This results in the following warning:
fs/nfs/nfs4proc.c: In function 'nfs4_proc_get_rootfh':
fs/nfs/nfs4proc.c:3100:10: warning: switch condition has boolean value [-Wswitch-bool]
switch (auth_probe) {
^
This code was obviously using switch to make use of the fall-through
semantics (without the usual comment, though).
Rewrite that code using if statements to avoid the warning and make
the code a bit more readable on the way.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This reverts commit 5a254d08b0.
Since commit 5a254d08b0 ("nfs: replace nfs_add_stats with
nfs_inc_stats when add one"), nfs_readpage and nfs_do_writepage use
nfs_inc_stats to increment NFSIOS_READPAGES and NFSIOS_WRITEPAGES
instead of nfs_add_stats.
However nfs_inc_stats does not do the same thing as nfs_add_stats with
value 1 because these functions work on distinct stats:
nfs_inc_stats increments stats from "enum nfs_stat_eventcounters" (in
server->io_stats->events) and nfs_add_stats those from "enum
nfs_stat_bytecounters" (in server->io_stats->bytes).
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Fixes: 5a254d08b0 ("nfs: replace nfs_add_stats with nfs_inc_stats...")
Cc: stable@vger.kernel.org # 3.19+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
I added the nfs4 prefix to make it obvious that this file is built into
the NFS v4 module, and not the generic client.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This file is only used internally to the NFS v4 module, so it doesn't
need to be in the global include path. I also renamed it from
nfs_idmap.h to nfs4idmap.h to emphasize that it's an NFSv4-only include
file.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The idmapper is completely internal to the NFS v4 module, so this macro
will always evaluate to true. This patch also removes unnecessary
includes of this file from the generic NFS client.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
d4b18c3e (pnfs: remove GETDEVICELIST implementation) removed the
GETDEVICELIST operation from the NFS client, but left a "hole" in the
nfs4_procedures array. This caused /proc/self/mountstats to report an
operation named "51" where GETDEVICELIST used to be. This patch adds a
stub to fix mountstats.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fixes: d4b18c3e (pnfs: remove GETDEVICELIST implementation)
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For flexfiles driver, we might choose to read from mirror index other
than 0 while mirror_count is always 1 for read.
Reported-by: Jean Spector <jean@primarydata.com>
Cc: <stable@vger.kernel.org> # v3.19+
Cc: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For direct read that has IO size larger than rsize, we'll split
it into several READ requests and nfs_direct_good_bytes() would
count completed bytes incorrectly by eating last zero count reply.
Fix it by handling mirror and non-mirror cases differently such that
we only count mirrored writes differently.
This fixes 5fadeb47("nfs: count DIO good bytes correctly with mirroring").
Reported-by: Jean Spector <jean@primarydata.com>
Cc: <stable@vger.kernel.org> # v3.19+
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2ef47eb1 (NFS: Fix use of nfs_attr_use_mounted_on_fileid()) was a good
start to fixing a circular directory structure warning for NFS v4
"junctioned" mountpoints. Unfortunately, further testing continued to
generate this error.
My server is configured like this:
anna@nfsd ~ % df
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 9.1G 2.0G 6.5G 24% /
/dev/vdc1 1014M 33M 982M 4% /exports
/dev/vdc2 1014M 33M 982M 4% /exports/vol1
/dev/vdc3 1014M 33M 982M 4% /exports/vol1/vol2
anna@nfsd ~ % cat /etc/exports
/exports/ *(rw,async,no_subtree_check,no_root_squash)
/exports/vol1/ *(rw,async,no_subtree_check,no_root_squash)
/exports/vol1/vol2 *(rw,async,no_subtree_check,no_root_squash)
I've been running chown across the entire mountpoint twice in a row to
hit this problem. The first run succeeds, but the second one fails with
the circular directory warning along with:
anna@client ~ % dmesg
[Apr 3 14:28] NFS: server 192.168.100.204 error: fileid changed
fsid 0:39: expected fileid 0x100080, got 0x80
WHere 0x80 is the mountpoint's fileid and 0x100080 is the mounted-on
fileid.
This patch fixes the issue by requesting an updated mounted-on fileid
from the server during nfs_update_inode(), and then checking that the
fileid stored in the nfs_inode matches either the fileid or mounted-on
fileid returned by the server.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Chuck pointed out a problem that crept in with commit 6ffa30d3f7 (nfs:
don't call blocking operations while !TASK_RUNNING). Linux counts tasks
in uninterruptible sleep against the load average, so this caused the
system's load average to be pinned at at least 1 when there was a
NFSv4.1+ mount active.
Not a huge problem, but it's probably worth fixing before we get too
many complaints about it. This patch converts the code back to use
TASK_INTERRUPTIBLE sleep, simply has it flush any signals on each loop
iteration. In practice no one should really be signalling this thread at
all, so I think this is reasonably safe.
With this change, there's also no need to game the hung task watchdog so
we can also convert the schedule_timeout call back to a normal schedule.
Cc: <stable@vger.kernel.org>
Reported-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: commit 6ffa30d3f7 (“nfs: don't call blocking . . .”)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
At the very least, we should not be taking the i_mutex until after
checking if the server even supports ALLOCATE or DEALLOCATE, allowing
v4.0 or v4.1 to exit without potentially waiting on a lock.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch adds a GETATTR to the end of ALLOCATE and DEALLOCATE
operations so we can set the updated inode size and change attribute
directly. DEALLOCATE will still need to release pagecache pages, so
nfs42_proc_deallocate() now calls truncate_pagecache_range() before
contacting the server.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* Add Kconfig option for keeping both the 'master' and 'partition' MTDs
registered as devices. This would really make a better default if we could
do it over, as it allows a lot more flexibility in (1) determining the flash
topology of the system from user-space and (2) adding temporary partitions
at runtime (ioctl(BLKPG)). Unfortunately, this would possibly cause
user-space breakage, as it will cause renumbering of the /dev/mtdX devices.
We'll see if we can change this in the future, as there have already been a
few people looking for this feature, and I know others have just been
working around our current limitations instead of fixing them this way.
* Along with the previous change, add some additional information to sysfs, so
user-space can read the offset of each partition within its master device
SPI NOR:
* add new device tree compatible binding to represent the mostly-compatible
class of SPI NOR flash which can be detected by their extended JEDEC ID
bytes, cutting down the duplication of our ID tables
* misc. new IDs
Various other miscellaneous fixes and changes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVN9ypAAoJEFySrpd9RFgtaUQQAKmlCVMrxAKtF6U5jpzf07hA
7ZrcMdUTSwS++dBIAgDl6JSuSGT5KRLrS1FOp60p+VAjbD9VFcRLUUQxahXW1tAh
Dr8a3Akwd+lgIp77bZhWBY35dXmjIJ1GSzo7jdbJMDwAeDd3gBeSFTDoePsrCt6K
0/NPOsQzCFDDr1lwuQh1LzkLLQfVAC3ImNCBm5smvyEfhxXqzC02HOLf8Z9VMGnY
OxM9i0T6Ik3xeaaP/vH91sApmdn598gP5DB5cNr61YrZeVZmEoI4EWlOmagcYVC2
Tef9Ng4YmHGXo65k7XcKRykAVWECYAGr4HKCDZ8tsbvpfdbQMS5wHEgxMsAdvb01
aChcBNxf4w/Mh49fzjZppTlPN25FERRMnXt7CkUqQkqet9uDkD/5RNPl65ermeC7
EKx2MoxnpXrfZ0EkSxqrfdzP0oQx0AqAkbCyLIN42Vbxl7ckFMN3WAPQ2NR2Aaoh
SUiKwwaFFiK+C9qEytj0s+cmKPzsTzeQVYgp9NX64EfVQumqpsfbu6XIPV+FGy2i
DvHvmTEvm4SpqMPSnhkmZ6DFSjuzvQdqzKtDyZmRppxHKgWUsXYdftGPMG0+ZbaG
t4zysWfJG897TMVYLKY9pGqvouMuAVJ4kX1+iZbJc8dr4bwIzXIYuEGPLVv58gUO
KjjlYk91/jFNmBW5anxC
=aIsV
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20150422' of git://git.infradead.org/linux-mtd
Pull MTD updates from Brian Norris:
"Common MTD:
- Add Kconfig option for keeping both the 'master' and 'partition'
MTDs registered as devices. This would really make a better
default if we could do it over, as it allows a lot more flexibility
in (1) determining the flash topology of the system from user-space
and (2) adding temporary partitions at runtime (ioctl(BLKPG)).
Unfortunately, this would possibly cause user-space breakage, as it
will cause renumbering of the /dev/mtdX devices. We'll see if we
can change this in the future, as there have already been a few
people looking for this feature, and I know others have just been
working around our current limitations instead of fixing them this
way.
- Along with the previous change, add some additional information to
sysfs, so user-space can read the offset of each partition within
its master device
SPI NOR:
- add new device tree compatible binding to represent the
mostly-compatible class of SPI NOR flash which can be detected by
their extended JEDEC ID bytes, cutting down the duplication of our
ID tables
- misc. new IDs
Various other miscellaneous fixes and changes"
* tag 'for-linus-20150422' of git://git.infradead.org/linux-mtd: (53 commits)
mtd: spi-nor: Add support for Macronix mx25u6435f serial flash
mtd: spi-nor: Add support for Winbond w25q64dw serial flash
mtd: spi-nor: add support for the Winbond W25X05 flash
mtd: spi-nor: support en25s64 device
mtd: m25p80: bind to "nor-jedec" ID, for auto-detection
Documentation: devicetree: m25p80: add "nor-jedec" binding
mtd: Make MTD tests cancelable
mtd: mtd_oobtest: Fix bitflip_limit usage in test case 3
mtd: docg3: remove invalid __exit annotations
mtd: fsl_ifc_nand: use msecs_to_jiffies for time conversion
mtd: atmel_nand: don't map the ROM table if no pmecc table offset in DT
mtd: atmel_nand: add a definition for the oob reserved bytes
mtd: part: Remove partition overlap checks
mtd: part: Add sysfs variable for offset of partition
mtd: part: Create the master device node when partitioned
mtd: ts5500_flash: Fix typo in MODULE_DESCRIPTION in ts5500_flash.c
mtd: denali: Disable sub-page writes in Denali NAND driver
mtd: pxa3xx_nand: cleanup wait_for_completion handling
mtd: nand: gpmi: Check for scan_bbt() error
mtd: nand: gpmi: fixup return type of wait_for_completion_timeout
...
Pull Ceph updates from Sage Weil:
"This time around we have a collection of CephFS fixes from Zheng
around MDS failure handling and snapshots, support for a new CRUSH
straw2 algorithm (to sync up with userspace) and several RBD cleanups
and fixes from Ilya, an error path leak fix from Taesoo, and then an
assorted collection of cleanups from others"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (28 commits)
rbd: rbd_wq comment is obsolete
libceph: announce support for straw2 buckets
crush: straw2 bucket type with an efficient 64-bit crush_ln()
crush: ensuring at most num-rep osds are selected
crush: drop unnecessary include from mapper.c
ceph: fix uninline data function
ceph: rename snapshot support
ceph: fix null pointer dereference in send_mds_reconnect()
ceph: hold on to exclusive caps on complete directories
libceph: simplify our debugfs attr macro
ceph: show non-default options only
libceph: expose client options through debugfs
libceph, ceph: split ceph_show_options()
rbd: mark block queue as non-rotational
libceph: don't overwrite specific con error msgs
ceph: cleanup unsafe requests when reconnecting is denied
ceph: don't zero i_wrbuffer_ref when reconnecting is denied
ceph: don't mark dirty caps when there is no auth cap
ceph: keep i_snap_realm while there are writers
libceph: osdmap.h: Add missing format newlines
...
nfsd triggered a BUG_ON in net_generic(...) when rpc_pipefs_event(...)
in fs/nfsd/nfs4recover.c was called before assigning ntfsd_net_id.
The following was observed on a MIPS 32-core processor:
kernel: Call Trace:
kernel: [<ffffffffc00bc5e4>] rpc_pipefs_event+0x7c/0x158 [nfsd]
kernel: [<ffffffff8017a2a0>] notifier_call_chain+0x70/0xb8
kernel: [<ffffffff8017a4e4>] __blocking_notifier_call_chain+0x4c/0x70
kernel: [<ffffffff8053aff8>] rpc_fill_super+0xf8/0x1a0
kernel: [<ffffffff8022204c>] mount_ns+0xb4/0xf0
kernel: [<ffffffff80222b48>] mount_fs+0x50/0x1f8
kernel: [<ffffffff8023dc00>] vfs_kern_mount+0x58/0xf0
kernel: [<ffffffff802404ac>] do_mount+0x27c/0xa28
kernel: [<ffffffff80240cf0>] SyS_mount+0x98/0xe8
kernel: [<ffffffff80135d24>] handle_sys64+0x44/0x68
kernel:
kernel:
Code: 0040f809 00000000 2e020001 <00020336> 3c12c00d
3c02801a de100000 6442eb98 0040f809
kernel: ---[ end trace 7471374335809536 ]---
Fixed this behaviour by calling register_pernet_subsys(&nfsd_net_ops) before
registering rpc_pipefs_event(...) with the notifier chain.
Signed-off-by: Giuseppe Cantavenera <giuseppe.cantavenera.ext@nokia.com>
Signed-off-by: Lorenzo Restelli <lorenzo.restelli.ext@nokia.com>
Reviewed-by: Kinlong Mee <kinglongmee@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Commit f895b252d4 ("sunrpc: eliminate RPC_DEBUG") introduced
use of IS_ENABLED() in a uapi header which leads to a build
failure for userspace apps trying to use <linux/nfsd/debug.h>:
linux/nfsd/debug.h:18:15: error: missing binary operator before token "("
#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
^
Since this was only used to define NFSD_DEBUG if CONFIG_SUNRPC_DEBUG
is enabled, replace instances of NFSD_DEBUG with CONFIG_SUNRPC_DEBUG.
Cc: stable@vger.kernel.org
Fixes: f895b252d4 "sunrpc: eliminate RPC_DEBUG"
Signed-off-by: Mark Salter <msalter@redhat.com>
Reviewed-by: Jeff Layton <jlayton@primarydata.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
In the case we already have a struct file (derived from a stateid), we
still need to do permission-checking; otherwise an unauthorized user
could gain access to a file by sniffing or guessing somebody else's
stateid.
Cc: stable@vger.kernel.org
Fixes: dc97618ddd "nfsd4: separate splice and readv cases"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
If the client uses a special stateid then we'll pass a NULL file to
vfs_llseek.
Fixes: 24bab49122 " NFSD: Implement SEEK"
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: stable@vger.kernel.org
Reported-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
vfs_fallocate will hit a NULL dereference if the client tries an
ALLOCATE or DEALLOCATE with a special stateid. Fix that. (We also
depend on the open to have broken any conflicting leases or delegations
for us.)
(If it turns out we need to allow special stateid's then we could do a
temporary open here in the special-stateid case, as we do for read and
write. For now I'm assuming it's not necessary.)
Fixes: 95d871f03c "nfsd: Add ALLOCATE support"
Cc: stable@vger.kernel.org
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This reverts commit e2ac55b6a8.
Huang Ying reports that this causes a hang at boot with debugfs disabled.
It is true that the debugfs error checks are kind of confusing, and this
code certainly merits more cleanup and thinking about it, but there's
something wrong with the trivial "check not just for NULL, but for error
pointers too" patch.
Yes, with debugfs disabled, we will end up setting the o2hb_debug_dir
pointer variable to an error pointer (-ENODEV), and then continue as if
everything was fine. But since debugfs is disabled, all the _users_ of
that pointer end up being compiled away, so even though the pointer can
not be dereferenced, that's still fine.
So it's confusing and somewhat questionable, but the "more correct"
error checks end up causing more trouble than they fix.
Reported-by: Huang Ying <ying.huang@intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Chengyu Song <csong84@gatech.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a directory is complete, we want to keep the exclusive
cap. So that MDS does not end up revoking the shared cap
on every create/unlink operation.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Don't pollute /proc/mounts with default options (presently these are
dcache, nofsc and acl). Leave the acl/noacl however - it's a bit of
a special case due to CONFIG_CEPH_FS_POSIX_ACL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Split ceph_show_options() into two pieces and move the piece
responsible for printing client (libceph) options into net/ceph. This
way people adding a libceph option wouldn't have to remember to update
code in fs/ceph.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
remove_session_caps_cb() does not truncate dirty data in page
cache, but zeros i_wrbuffer_ref/i_wrbuffer_ref_head. This will
result negtive i_wrbuffer_ref/i_wrbuffer_ref_head
Signed-off-by: Yan, Zheng <zyan@redhat.com>
when reconnecting to MDS is denied, we remove session caps
forcibly. But it's possible there are ongoing write, the
write code needs to reference i_snap_realm. So if there are
ongoing write, we keep i_snap_realm.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
Currently, there is no check for the kstrdup() for r_path2,
r_path1 and snapdir_name as various locations as there is a
possibility of failure during memory pressure. Therefore,
returning ENOMEM where the checks have been missed.
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
When ceph_update_writeable_page fails (including -EAGAIN), it
unlocks (w/ unlock_page) the page but does not 'release'
(w/ page_cache_release) properly.
Upon error, properly set *pagep to NULL, indicating an error.
Signed-off-by: Taesoo Kim <tsgatesv@gmail.com>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
return type of wait_for_completion_timeout is unsigned long not int. An
appropriately named unsigned long is added and the assignment fixed up.
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
This is only an API consolidation and should make things more readable
it replaces var * HZ / 1000 by msecs_to_jiffies(var).
Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Yan, Zheng <zyan@redhat.com>
These cap releases are stale because MDS will re-establish client
caps according to the cap reconnect messages.
Note: MDS can detect stale cap messages, so these stale cap
releases are harmless even we don't drop them.
Signed-off-by: Yan, Zheng <zyan@redhat.com>
This prevents a race between chown() and execve(), where chowning a
setuid-user binary to root would momentarily make the binary setuid
root.
This patch was mostly written by Linus Torvalds.
Signed-off-by: Jann Horn <jann@thejh.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some accumulated cleanup patches for kerneldoc and unused variables
as well as some lock bug fixes and adding privateport option for RDMA.
A quick check shows some merge-conflicts versus current-tip on
9p: use unsigned integers for nwqid/count
If you would prefer I can rebase, remerge and fix the patch but didn't
want to do that and look the for-next references.
Signed-off-by: Eric Van Hensbergen <ericvh@gmail.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: GPGTools - http://gpgtools.org
iQIcBAABAgAGBQJVMqXZAAoJEDZk62b0Tg6xBD4P/03nCkTxE5qDN9TVUSNdwHQD
Oyq3JvvmfOORDHy7pZMp7wTdU4OLz+78RHYprpgJCk4Vs8Gcnl3hloeZ3L9l/W7J
tz2Ek1noEE9uZLmeH6WPzSaba0sFOlnjbWPsLE8O84/zHOI/qj75s0UDPdrFRt1x
LvMNQlTZqgUx0hogq1yLFKjp49bUzph78gMaJkoKK+30q9B4skPRRV93HLLzlo9j
0dAGd0yhO8xUjtlm/ZkXIKiyeGeQ2XXj6UTnH6/4nwL29yVosWkGNjqIXkgz+ROu
eyPvJqrjaBVtj8ZJkwfyZqM6xPrnsEbuSYUKLT2GcId87Ycebd7Wq1w+vhAO7l0H
N1ZnzMGlQXHTszEhDGVCICCv1QU8b3ifvtA+nQYUly9JnDeIBcZGQ16g0oYQNoes
1L6XKsrX4wdxROHYLqRJoNQ120KcaXAnRE3AmT8emiU8gl0KWW0TJ7WpLs9ICKRg
cwgz1UzeGb/GGRtCv0gTlAE07fe/OjQVrSM3Q+ivTA+juRE2MWvluYh/WAMQHdFV
FnJ5/sPKbcGK+IrHNWktkTLm2ZbbdcDnWHLmtk3egT3IubY5iLVpa5ADV47WsLAa
viDp7N3mK0kZL8BJHgPs+aspRwMAHavme/EWzkuRTL048ABo8uTrM/BXiYsAaBBI
GGh4+vEwcFDQdg2gMbF9
=2sr2
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.1-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs
Pull 9pfs updates from Eric Van Hensbergen:
"Some accumulated cleanup patches for kerneldoc and unused variables as
well as some lock bug fixes and adding privateport option for RDMA"
* tag 'for-linus-4.1-merge-window' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
net/9p: add a privport option for RDMA transport.
fs/9p: Initialize status in v9fs_file_do_lock.
net/9p: Initialize opts->privport as it should be.
net/9p: use memcpy() instead of snprintf() in p9_mount_tag_show()
9p: use unsigned integers for nwqid/count
9p: do not crash on unknown lock status code
9p: fix error handling in v9fs_file_do_lock
9p: remove unused variable in p9_fd_create()
9p: kerneldoc warning fixes
Pull usernamespace mount fixes from Eric Biederman:
"Way back in October Andrey Vagin reported that umount(MNT_DETACH)
could be used to defeat MNT_LOCKED. As I worked to fix this I
discovered that combined with mount propagation and an appropriate
selection of shared subtrees a reference to a directory on an
unmounted filesystem is not necessary.
That MNT_DETACH is allowed in user namespace in a form that can break
MNT_LOCKED comes from my early misunderstanding what MNT_DETACH does.
To avoid breaking existing userspace the conflict between MNT_DETACH
and MNT_LOCKED is fixed by leaving mounts that are locked to their
parents in the mount hash table until the last reference goes away.
While investigating this issue I also found an issue with
__detach_mounts. The code was unnecessarily and incorrectly
triggering mount propagation. Resulting in too many mounts going away
when a directory is deleted, and too many cpu cycles are burned while
doing that.
Looking some more I realized that __detach_mounts by only keeping
mounts connected that were MNT_LOCKED it had the potential to still
leak information so I tweaked the code to keep everything locked
together that possibly could be.
This code was almost ready last cycle but Al invented fs_pin which
slightly simplifies this code but required rewrites and retesting, and
I have not been in top form for a while so it took me a while to get
all of that done. Similiarly this pull request is late because I have
been feeling absolutely miserable all week.
The issue of being able to escape a bind mount has not yet been
addressed, as the fixes are not yet mature"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
mnt: Update detach_mounts to leave mounts connected
mnt: Fix the error check in __detach_mounts
mnt: Honor MNT_LOCKED when detaching mounts
fs_pin: Allow for the possibility that m_list or s_list go unused.
mnt: Factor umount_mnt from umount_tree
mnt: Factor out unhash_mnt from detach_mnt and umount_tree
mnt: Fail collect_mounts when applied to unmounted mounts
mnt: Don't propagate unmounts to locked mounts
mnt: On an unmount propagate clearing of MNT_LOCKED
mnt: Delay removal from the mount hash.
mnt: Add MNT_UMOUNT flag
mnt: In umount_tree reuse mnt_list instead of mnt_hash
mnt: Don't propagate umounts in __detach_mounts
mnt: Improve the umount_tree flags
mnt: Use hlist_move_list in namespace_unlock
Pull f2fs updates from Jaegeuk Kim:
"New features:
- in-memory extent_cache
- fs_shutdown to test power-off-recovery
- use inline_data to store symlink path
- show f2fs as a non-misc filesystem
Major fixes:
- avoid CPU stalls on sync_dirty_dir_inodes
- fix some power-off-recovery procedure
- fix handling of broken symlink correctly
- fix missing dot and dotdot made by sudden power cuts
- handle wrong data index during roll-forward recovery
- preallocate data blocks for direct_io
... and a bunch of minor bug fixes and cleanups"
* tag 'for-f2fs-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (71 commits)
f2fs: pass checkpoint reason on roll-forward recovery
f2fs: avoid abnormal behavior on broken symlink
f2fs: flush symlink path to avoid broken symlink after POR
f2fs: change 0 to false for bool type
f2fs: do not recover wrong data index
f2fs: do not increase link count during recovery
f2fs: assign parent's i_mode for empty dir
f2fs: add F2FS_INLINE_DOTS to recover missing dot dentries
f2fs: fix mismatching lock and unlock pages for roll-forward recovery
f2fs: fix sparse warnings
f2fs: limit b_size of mapped bh in f2fs_map_bh
f2fs: persist system.advise into on-disk inode
f2fs: avoid NULL pointer dereference in f2fs_xattr_advise_get
f2fs: preallocate fallocated blocks for direct IO
f2fs: enable inline data by default
f2fs: preserve extent info for extent cache
f2fs: initialize extent tree with on-disk extent info of inode
f2fs: introduce __{find,grab}_extent_tree
f2fs: split set_data_blkaddr from f2fs_update_extent_cache
f2fs: enable fast symlink by utilizing inline data
...
Some buggy firmware implementations update VariableNameSize on success
such that it does not include the final NUL character which results in
garbage in the efivarfs name entries. Use kzalloc on the efivar_entry
(as is done in efivars.c) to ensure that the name is always
NUL-terminated.
The buggy firmware is:
BIOS Information
Vendor: Intel Corp.
Version: S1200RP.86B.02.02.0005.102320140911
Release Date: 10/23/2014
BIOS Revision: 4.6
System Information
Manufacturer: Intel Corporation
Product Name: S1200RP_SE
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Matthew Garrett <mjg59@coreos.com>
Cc: Jeremy Kerr <jk@ozlabs.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Merge third patchbomb from Andrew Morton:
- various misc things
- a couple of lib/ optimisations
- provide DIV_ROUND_CLOSEST_ULL()
- checkpatch updates
- rtc tree
- befs, nilfs2, hfs, hfsplus, fatfs, adfs, affs, bfs
- ptrace fixes
- fork() fixes
- seccomp cleanups
- more mmap_sem hold time reductions from Davidlohr
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (138 commits)
proc: show locks in /proc/pid/fdinfo/X
docs: add missing and new /proc/PID/status file entries, fix typos
drivers/rtc/rtc-at91rm9200.c: make IO endian agnostic
Documentation/spi/spidev_test.c: fix warning
drivers/rtc/rtc-s5m.c: allow usage on device type different than main MFD type
.gitignore: ignore *.tar
MAINTAINERS: add Mediatek SoC mailing list
tomoyo: reduce mmap_sem hold for mm->exe_file
powerpc/oprofile: reduce mmap_sem hold for exe_file
oprofile: reduce mmap_sem hold for mm->exe_file
mips: ip32: add platform data hooks to use DS1685 driver
lib/Kconfig: fix up HAVE_ARCH_BITREVERSE help text
x86: switch to using asm-generic for seccomp.h
sparc: switch to using asm-generic for seccomp.h
powerpc: switch to using asm-generic for seccomp.h
parisc: switch to using asm-generic for seccomp.h
mips: switch to using asm-generic for seccomp.h
microblaze: use asm-generic for seccomp.h
arm: use asm-generic for seccomp.h
seccomp: allow COMPAT sigreturn overrides
...
Let's show locks which are associated with a file descriptor in
its fdinfo file.
Currently we don't have a reliable way to determine who holds a lock. We
can find some information in /proc/locks, but PID which is reported there
can be wrong. For example, a process takes a lock, then forks a child and
dies. In this case /proc/locks contains the parent pid, which can be
reused by another process.
$ cat /proc/locks
...
6: FLOCK ADVISORY WRITE 324 00:13:13431 0 EOF
...
$ ps -C rpcbind
PID TTY TIME CMD
332 ? 00:00:00 rpcbind
$ cat /proc/332/fdinfo/4
pos: 0
flags: 0100000
mnt_id: 22
lock: 1: FLOCK ADVISORY WRITE 324 00:13:13431 0 EOF
$ ls -l /proc/332/fd/4
lr-x------ 1 root root 64 Mar 5 14:43 /proc/332/fd/4 -> /run/rpcbind.lock
$ ls -l /proc/324/fd/
total 0
lrwx------ 1 root root 64 Feb 27 14:50 0 -> /dev/pts/0
lrwx------ 1 root root 64 Feb 27 14:50 1 -> /dev/pts/0
lrwx------ 1 root root 64 Feb 27 14:49 2 -> /dev/pts/0
You can see that the process with the 324 pid doesn't hold the lock.
This information is required for proper dumping and restoring file
locks.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case of failed memory allocation, the return should be ENOMEM instead
of ENOSPC.
Return -EIO when sb_bread() fails.
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a possibility of kstrdup() failure upon memory pressure.
Therefore, returning ENOMEM even for new_opts.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: Taesoo kim <taesoo@gatech.edu>
Cc: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace mount option test by affs_test_opt().
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace direct mount option assignation by affs_set_opt() macro.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, affs still uses direct access on mount_options. This patch
prepares to use affs_clear/set/test_opt() like other filesystems.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix the wrong values returned by various functions such as EIO and ENOMEM.
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Joe Perches <joe@perches.com>
Cc: Taesoo kim <taesoo@gatech.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We set sig->notify_count = -1 between RELEASE and ACQUIRE operations:
spin_unlock_irq(lock);
...
if (!thread_group_leader(tsk)) {
...
for (;;) {
sig->notify_count = -1;
write_lock_irq(&tasklist_lock);
There are no restriction on it so other processors may see this STORE
mixed with other STOREs in both areas limited by the spinlocks.
Probably, it may be reordered with the above
sig->group_exit_task = tsk;
sig->notify_count = zap_other_threads(tsk);
in some way.
Set it under tasklist_lock locked to be sure nothing will be reordered.
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
of calling set_mm_exe_file() which requires some form of serialization --
mmap_sem in this case. For archs that do not have atomic rmw instructions
we still fallback to a spinlock alternative, so this should always be
safe. As such, we only need the mmap_sem for looking up the backing
vm_file, which can be done sharing the lock. Naturally, this means we
need to manually deal with both the new and old file reference counting,
and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
probably be deleted in the future anyway.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes mm->mmap_sem from mm->exe_file read side.
Also it kills dup_mm_exe_file() and moves exe_file duplication into
dup_mmap() where both mmap_sems are locked.
[akpm@linux-foundation.org: fix comment typo]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'fat.h' includes <linux/buffer_head.h> which includes <linux/fs.h> which
includes all the header files required for all *.c files fat filesystem.
[akpm@linux-foundation.org: fs/fat/iode.c needs seq_file.h]
[sfr@canb.auug.org.au: put one actually necessary include file back]
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'*sb' never used, so let's remote it and pass inode->i_sb directly to the
MSDOS_SB.
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On Mac OS X, HFS+ extended attributes are not namespaced. Since we want
to be compatible with OS X filesystems and yet still support the Linux
namespacing system, the hfsplus driver implements a special "osx"
namespace that is reported for any attribute that is not namespaced
on-disk. However, the current code for getting and setting these
unprefixed attributes is broken.
hfsplus_osx_setattr() and hfsplus_osx_getattr() are passed names that have
already had their "osx." prefixes stripped by the generic functions. The
functions first, quite correctly, check those names to make sure that they
aren't prefixed with a known namespace, which would allow namespace access
restrictions to be bypassed. However, the functions then prepend "osx."
to the name they're given before passing it on to hfsplus_getattr() and
hfsplus_setattr(). Not only does this cause the "osx." prefix to be
stored on-disk, defeating its purpose, it also breaks the check for the
special "com.apple.FinderInfo" attribute, which is reported for all files,
and as a consequence makes some userspace applications (e.g. GNU patch)
fail even when extended attributes are not otherwise in use.
There are five commits which have touched this particular code:
127e5f5ae5 ("hfsplus: rework functionality of getting, setting and deleting of extended attributes")
b168fff721 ("hfsplus: use xattr handlers for removexattr")
bf29e886b2 ("hfsplus: correct usage of HFSPLUS_ATTR_MAX_STRLEN for non-English attributes")
fcacbd95e121 ("fs/hfsplus: move xattr_name allocation in hfsplus_getxattr()")
ec1bbd346f18 ("fs/hfsplus: move xattr_name allocation in hfsplus_setxattr()")
The first commit creates the functions to begin with. The namespace is
prepended by the original code, which I believe was correct at the time,
since hfsplus_?etattr() stripped the prefix if found. The second commit
removes this behavior from hfsplus_?etattr() and appears to have been
intended to also remove the prefixing from hfsplus_osx_?etattr().
However, what it actually does is remove a necessary strncpy() call
completely, breaking the osx namespace entirely. The third commit re-adds
the strncpy() call as it was originally, but doesn't mention it in its
commit message. The final two commits refactor the code and don't affect
its functionality.
This commit does what b168fff attempted to do (prevent the prefix from
being added), but does it properly, instead of passing in an empty buffer
(which is what b168fff actually did).
Fixes: b168fff721 ("hfsplus: use xattr handlers for removexattr")
Signed-off-by: Thomas Hebb <tommyhebb@gmail.com>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Cc: Sergei Antonov <saproj@gmail.com>
Cc: Anton Altaparmakov <anton@tuxera.com>
Cc: Fabian Frederick <fabf@skynet.be>
Cc: Christian Kujau <lists@nerdbynature.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a bug which is reproduced as follows. Create a file:
echo abc > test_file
Try to expand the file beyond available space:
truncate --size=<size exceeding available space> test_file
Since HFS+ does not support file size > allocated size, truncate should
fail. However, it ends successfully. The driver returns success despite
having been unable to allocate the requested space for the file. Also
filesystem check finds an error:
Checking catalog file.
Incorrect size for file test_file
(It should be 469094400 instead of 1000000000)
Add a piece of code analogous to code in the fat driver. Now a proper
error is returned and filesystem remains consistent.
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Hin-Tak Leung <htl10@users.sourceforge.net>
Reviewed-by: Anton Altaparmakov <anton@tuxera.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Sougata Santra <sougata@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In case of memory allocation error, the return should be -ENOMEM, instead
of -ENOSPC.
Signed-off-by: Chengyu Song <csong84@gatech.edu>
Reviewed-by: Sergei Antonov <saproj@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
is_known_namespace() only returns true/false. Also remove inline and let
compiler decide what to do with static functions.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
According to commit 5f16f3225b ("ext4: atomically set inode->i_flags in
ext4_set_inode_flags()").
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
security/trusted/user/osx setxattr did the same
xattr_name initialization. Move that operation in hfsplus_setxattr().
Tested with security/trusted/user getfattr/setfattr
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
security/trusted/user/osx getxattr did the same
xattr_name initialization. Move that operation in hfsplus_getxattr().
Tested with security/trusted/user getfattr/setfattr
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This doesn't change how the code works, but clearly the curly braces were
intended.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Vyacheslav Dubeyko <slava@dubeyko.com>
Cc: Sougata Santra <sougata@tuxera.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>