Firstly by applying the following with coccinelle's spatch:
@@ expression SB; @@
-SB->s_flags & MS_RDONLY
+sb_rdonly(SB)
to effect the conversion to sb_rdonly(sb), then by applying:
@@ expression A, SB; @@
(
-(!sb_rdonly(SB)) && A
+!sb_rdonly(SB) && A
|
-A != (sb_rdonly(SB))
+A != sb_rdonly(SB)
|
-A == (sb_rdonly(SB))
+A == sb_rdonly(SB)
|
-!(sb_rdonly(SB))
+!sb_rdonly(SB)
|
-A && (sb_rdonly(SB))
+A && sb_rdonly(SB)
|
-A || (sb_rdonly(SB))
+A || sb_rdonly(SB)
|
-(sb_rdonly(SB)) != A
+sb_rdonly(SB) != A
|
-(sb_rdonly(SB)) == A
+sb_rdonly(SB) == A
|
-(sb_rdonly(SB)) && A
+sb_rdonly(SB) && A
|
-(sb_rdonly(SB)) || A
+sb_rdonly(SB) || A
)
@@ expression A, B, SB; @@
(
-(sb_rdonly(SB)) ? 1 : 0
+sb_rdonly(SB)
|
-(sb_rdonly(SB)) ? A : B
+sb_rdonly(SB) ? A : B
)
to remove left over excess bracketage and finally by applying:
@@ expression A, SB; @@
(
-(A & MS_RDONLY) != sb_rdonly(SB)
+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
|
-(A & MS_RDONLY) == sb_rdonly(SB)
+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
)
to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.
Signed-off-by: David Howells <dhowells@redhat.com>
feature, which allows ext4 directories to support over 2 billion
directory entries (assuming ~64 byte file names; in practice, users
will run into practical performance limits first.) This feature was
originally written by the Lustre team, and credit goes to Artem
Blagodarenko from Seagate for getting this feature upstream.
The second major major feature allows ext4 to support extended
attribute values up to 64k. This feature was also originally from
Lustre, and has been enhanced by Tahsin Erdogan from Google with a
deduplication feature so that if multiple files have the same xattr
value (for example, Windows ACL's stored by Samba), only one copy will
be stored on disk for encoding and caching efficiency.
We also have the usual set of bug fixes, cleanups, and optimizations.
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAllhl5AACgkQ8vlZVpUN
gaOiNQf+L23sT9KIQmFwQP38vkBVw67Eo7gBfevmk7oqQLiRppT5mmLzW8EWEDxR
PVaDQXvSZi18wSCAAcCd1ZqeIZk0P6tst0ufnIT60tGlZdUlwSLyrqvV/30axR2g
6kcnv90ZszrQNx5U8q8bMzNrs1KtyPHFCRzavFsBX11WezNSpWnH2in/uxO+t9Jy
F2zlrLUrE2m9AVMH48Dh6LbeaB6pqgr4k3jq1jG4Iqb2h9xgU8OKhs8gL07YS+Qi
5A7s8GIvYQSoZUO9DOOie2f1zhpO0KrhXchyZTJukVQH7TsmFxoSh0vhXnP1Bohu
CNLV6dzetDT0VfmPr1WhVe7lhZeeVw==
=FFkF
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"The first major feature for ext4 this merge window is the largedir
feature, which allows ext4 directories to support over 2 billion
directory entries (assuming ~64 byte file names; in practice, users
will run into practical performance limits first.) This feature was
originally written by the Lustre team, and credit goes to Artem
Blagodarenko from Seagate for getting this feature upstream.
The second major major feature allows ext4 to support extended
attribute values up to 64k. This feature was also originally from
Lustre, and has been enhanced by Tahsin Erdogan from Google with a
deduplication feature so that if multiple files have the same xattr
value (for example, Windows ACL's stored by Samba), only one copy will
be stored on disk for encoding and caching efficiency.
We also have the usual set of bug fixes, cleanups, and optimizations"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (47 commits)
ext4: fix spelling mistake: "prellocated" -> "preallocated"
ext4: fix __ext4_new_inode() journal credits calculation
ext4: skip ext4_init_security() and encryption on ea_inodes
fs: generic_block_bmap(): initialize all of the fields in the temp bh
ext4: change fast symlink test to not rely on i_blocks
ext4: require key for truncate(2) of encrypted file
ext4: don't bother checking for encryption key in ->mmap()
ext4: check return value of kstrtoull correctly in reserved_clusters_store
ext4: fix off-by-one fsmap error on 1k block filesystems
ext4: return EFSBADCRC if a bad checksum error is found in ext4_find_entry()
ext4: return EIO on read error in ext4_find_entry
ext4: forbid encrypting root directory
ext4: send parallel discards on commit completions
ext4: avoid unnecessary stalls in ext4_evict_inode()
ext4: add nombcache mount option
ext4: strong binding of xattr inode references
ext4: eliminate xattr entry e_hash recalculation for removes
ext4: reserve space for xattr entries/names
quota: add get_inode_usage callback to transfer multi-inode charges
ext4: xattr inode deduplication
...
Since only an open file can be mmap'ed, and we only allow open()ing an
encrypted file when its key is available, there is no need to check for
the key again before permitting each mmap().
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Return EAGAIN if any of the following checks fail for direct I/O:
+ i_rwsem is lockable
+ Writing beyond end of file (will trigger allocation)
+ Blocks are not allocated at the write location
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ext4_find_unwritten_pgoff() is used to search for offset of hole or
data in page range [index, end] (both inclusive), and the max number
of pages to search should be at least one, if end == index.
Otherwise the only page is missed and no hole or data is found,
which is not correct.
When block size is smaller than page size, this can be demonstrated
by preallocating a file with size smaller than page size and writing
data to the last block. E.g. run this xfs_io command on a 1k block
size ext4 on x86_64 host.
# xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \
-c "seek -d 0" /mnt/ext4/testfile
wrote 1024/1024 bytes at offset 2048
1 KiB, 1 ops; 0.0000 sec (42.459 MiB/sec and 43478.2609 ops/sec)
Whence Result
DATA EOF
Data at offset 2k was missed, and lseek(2) returned ENXIO.
This is unconvered by generic/285 subtest 07 and 08 on ppc64 host,
where pagesize is 64k. Because a recent change to generic/285
reduced the preallocated file size to smaller than 64k.
Signed-off-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
There is an off-by-one error in loop termination conditions in
ext4_find_unwritten_pgoff() since 'end' may index a page beyond end of
desired range if 'endoff' is page aligned. It doesn't have any visible
effects but still it is good to fix it.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, SEEK_HOLE implementation in ext4 may both return that there's
a hole at some offset although that offset already has data and skip
some holes during a search for the next hole. The first problem is
demostrated by:
xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec)
Whence Result
HOLE 0
Where we can see that SEEK_HOLE wrongly returned offset 0 as containing
a hole although we have written data there. The second problem can be
demonstrated by:
xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k"
-c "seek -h 0" file
wrote 57344/57344 bytes at offset 0
56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec)
wrote 8192/8192 bytes at offset 131072
8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec)
Whence Result
HOLE 139264
Where we can see that hole at offsets 56k..128k has been ignored by the
SEEK_HOLE call.
The underlying problem is in the ext4_find_unwritten_pgoff() which is
just buggy. In some cases it fails to update returned offset when it
finds a hole (when no pages are found or when the first found page has
higher index than expected), in some cases conditions for detecting hole
are just missing (we fail to detect a situation where indices of
returned pages are not contiguous).
Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page
indices and also handle all cases where we got less pages then expected
in one place and handle it properly there.
CC: stable@vger.kernel.org
Fixes: c8c0df241c
CC: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
DAX will return to locking exceptional entry before mapping blocks for a
page fault to fix possible races with concurrent writes. To avoid lock
inversion between exceptional entry lock and transaction start, start
the transaction already in ext4_dax_huge_fault().
Fixes: 9f141d6ef6
Link: http://lkml.kernel.org/r/20170510085419.27601-4-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Return enhanced file attributes from the Ext4 filesystem. This includes
the following:
(1) The inode creation time (i_crtime) as stx_btime, setting STATX_BTIME.
(2) Certain FS_xxx_FL flags are mapped to stx_attribute flags.
This requires that all ext4 inodes have a getattr call, not just some of
them, so to this end, split the ext4_getattr() function and only call part
of it where appropriate.
Example output:
[root@andromeda ~]# touch foo
[root@andromeda ~]# chattr +ai foo
[root@andromeda ~]# /tmp/test-statx foo
statx(foo) = 0
results=fff
Size: 0 Blocks: 0 IO Block: 4096 regular file
Device: 08:12 Inode: 2101950 Links: 1
Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
Access: 2016-02-11 17:08:29.031795451+0000
Modify: 2016-02-11 17:08:29.031795451+0000
Change: 2016-02-11 17:11:11.987790114+0000
Birth: 2016-02-11 17:08:29.031795451+0000
Attributes: 0000000000000030 (-------- -------- -------- -------- -------- -------- -------- --ai----)
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since the introduction of FAULT_FLAG_SIZE to the vm_fault flag, it has
been somewhat painful with getting the flags set and removed at the
correct locations. More than one kernel oops was introduced due to
difficulties of getting the placement correctly.
Remove the flag values and introduce an input parameter to huge_fault
that indicates the size of the page entry. This makes the code easier
to trace and should avoid the issues we see with the fault flags where
removal of the flag was necessary in the fallback paths.
Link: http://lkml.kernel.org/r/148615748258.43180.1690152053774975329.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "1G transparent hugepage support for device dax", v2.
The following series implements support for 1G trasparent hugepage on
x86 for device dax. The bulk of the code was written by Mathew Wilcox a
while back supporting transparent 1G hugepage for fs DAX. I have
forward ported the relevant bits to 4.10-rc. The current submission has
only the necessary code to support device DAX.
Comments from Dan Williams: So the motivation and intended user of this
functionality mirrors the motivation and users of 1GB page support in
hugetlbfs. Given expected capacities of persistent memory devices an
in-memory database may want to reduce tlb pressure beyond what they can
already achieve with 2MB mappings of a device-dax file. We have
customer feedback to that effect as Willy mentioned in his previous
version of these patches [1].
[1]: https://lkml.org/lkml/2016/1/31/52
Comments from Nilesh @ Oracle:
There are applications which have a process model; and if you assume
10,000 processes attempting to mmap all the 6TB memory available on a
server; we are looking at the following:
processes : 10,000
memory : 6TB
pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB
pmd @ 2M page size: 120,000 / 512 = ~240GB
pud @ 1G page size: 240GB / 512 = ~480MB
As you can see with 2M pages, this system will use up an exorbitant
amount of DRAM to hold the page tables; but the 1G pages finally brings
it down to a reasonable level. Memory sizes will keep increasing; so
this number will keep increasing.
An argument can be made to convert the applications from process model
to thread model, but in the real world that may not be always practical.
Hopefully this helps explain the use case where this is valuable.
This patch (of 3):
In preparation for adding the ability to handle PUD pages, convert
vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The
vm_fault structure is extended to include a union of the different page
table pointers that may be needed, and three flag bits are reserved to
indicate which type of pointer is in the union.
[ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()]
Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com
[dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path]
Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to
take a vma and vmf parameter when the vma already resides in vmf.
Remove the vma parameter to simplify things.
[arnd@arndb.de: fix ARM build]
Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pmd_fault() and related functions really only need the vmf parameter since
the additional parameters are all included in the vmf struct. Remove the
additional parameter and simplify pmd_fault() and friends.
Link: http://lkml.kernel.org/r/1484085142-2297-8-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of passing in multiple parameters in the pmd_fault() handler,
a vmf can be passed in just like a fault() handler. This will simplify
code and remove the need for the actual pmd fault handlers to allocate a
vmf. Related functions are also modified to do the same.
[dave.jiang@intel.com: fix issue with xfs_tests stall when DAX option is off]
Link: http://lkml.kernel.org/r/148469861071.195597.3619476895250028518.stgit@djiang5-desk3.ch.intel.com
Link: http://lkml.kernel.org/r/1484085142-2297-7-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Unlike O_DIRECT DAX is not an optional opt-in feature selected by the
application, so we'll have to provide the traditional synchronіzation
of overlapping writes as we do for buffered writes.
This was broken historically for DAX, but got fixed for ext2 and XFS
as part of the iomap conversion. Fix up ext4 as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Now that dax_iomap_fault() calls ->iomap_begin() without entry lock, we
can use transaction starting in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). It also provides us proper retries in case of ENOSPC.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Convert DAX faults to use iomap infrastructure. We would not have to start
transaction in ext4_dax_fault() anymore since ext4_iomap_begin takes
care of that but so far we do that to avoid lock inversion of
transaction start with DAX entry lock which gets acquired in
dax_iomap_fault() before calling ->iomap_begin handler.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Implement DAX writes using the new iomap infrastructure instead of
overloading the direct IO path.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Implement basic iomap_begin function that handles reading and use it for
DAX reads.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Factor out checks of 'from' and whether we are overwriting out of
ext4_file_write_iter() so that the function is easier to follow.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull vfs xattr updates from Al Viro:
"xattr stuff from Andreas
This completes the switch to xattr_handler ->get()/->set() from
->getxattr/->setxattr/->removexattr"
* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: Remove {get,set,remove}xattr inode operations
xattr: Stop calling {get,set,remove}xattr inode operations
vfs: Check for the IOP_XATTR flag in listxattr
xattr: Add __vfs_{get,set,remove}xattr helpers
libfs: Use IOP_XATTR flag for empty directory handling
vfs: Use IOP_XATTR flag for bad-inode handling
vfs: Add IOP_XATTR inode operations flag
vfs: Move xattr_resolve_name to the front of fs/xattr.c
ecryptfs: Switch to generic xattr handlers
sockfs: Get rid of getxattr iop
sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
kernfs: Switch to generic xattr handlers
hfs: Switch to generic xattr handlers
jffs2: Remove jffs2_{get,set,remove}xattr macros
xattr: Remove unnecessary NULL attribute name check
Merge updates from Andrew Morton:
- fsnotify updates
- ocfs2 updates
- all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (127 commits)
console: don't prefer first registered if DT specifies stdout-path
cred: simpler, 1D supplementary groups
CREDITS: update Pavel's information, add GPG key, remove snail mail address
mailmap: add Johan Hovold
.gitattributes: set git diff driver for C source code files
uprobes: remove function declarations from arch/{mips,s390}
spelling.txt: "modeled" is spelt correctly
nmi_backtrace: generate one-line reports for idle cpus
arch/tile: adopt the new nmi_backtrace framework
nmi_backtrace: do a local dump_stack() instead of a self-NMI
nmi_backtrace: add more trigger_*_cpu_backtrace() methods
min/max: remove sparse warnings when they're nested
Documentation/filesystems/proc.txt: add more description for maps/smaps
mm, proc: fix region lost in /proc/self/smaps
proc: fix timerslack_ns CAP_SYS_NICE check when adjusting self
proc: add LSM hook checks to /proc/<tid>/timerslack_ns
proc: relax /proc/<tid>/timerslack_ns capability requirements
meminfo: break apart a very long seq_printf with #ifdefs
seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char
proc: faster /proc/*/status
...
These inode operations are no longer used; remove them.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
To support DAX pmd mappings with unmodified applications, filesystems
need to align an mmap address by the pmd size.
Call thp_get_unmapped_area() from f_op->get_unmapped_area.
Note, there is no change in behavior for a non-DAX file.
Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
do_blockdev_direct_IO() takes care of properly plugging direct IO so
there's no need to plug again inside ext4_file_write_iter().
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we do not allow unlocked (meaning without inode_lock) direct
IO when the file has any pages cached. This check is not needed anymore
as we keep inode lock until ext4_direct_IO_write() and thus can happily
writeback and evict any pages conflicting with current direct IO write.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Create a macro to calculate length + offset -> maximum blocks
This adds more readability.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Merge updates from Andrew Morton:
- a few misc bits
- ocfs2
- most(?) of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
thp: fix comments of __pmd_trans_huge_lock()
cgroup: remove unnecessary 0 check from css_from_id()
cgroup: fix idr leak for the first cgroup root
mm: memcontrol: fix documentation for compound parameter
mm: memcontrol: remove BUG_ON in uncharge_list
mm: fix build warnings in <linux/compaction.h>
mm, thp: convert from optimistic swapin collapsing to conservative
mm, thp: fix comment inconsistency for swapin readahead functions
thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
shmem: split huge pages beyond i_size under memory pressure
thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
khugepaged: add support of collapse for tmpfs/shmem pages
shmem: make shmem_inode_info::lock irq-safe
khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
thp: extract khugepaged from mm/huge_memory.c
shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
shmem: add huge pages support
shmem: get_unmapped_area align huge page
shmem: prepare huge= mount option and sysfs knob
mm, rmap: account shmem thp pages
...
Remove the unused wrappers dax_fault() and dax_pmd_fault(). After this
removal, rename __dax_fault() and __dax_pmd_fault() to dax_fault() and
dax_pmd_fault() respectively, and update all callers.
The dax_fault() and dax_pmd_fault() wrappers were initially intended to
capture some filesystem independent functionality around page faults
(calling sb_start_pagefault() & sb_end_pagefault(), updating file mtime
and ctime).
However, the following commits:
5726b27b09 ("ext2: Add locking for DAX faults")
ea3d7209ca ("ext4: fix races between page faults and hole punching")
added locking to the ext2 and ext4 filesystems after these common
operations but before __dax_fault() and __dax_pmd_fault() were called.
This means that these wrappers are no longer used, and are unlikely to
be used in the future.
XFS has had locking analogous to what was recently added to ext2 and
ext4 since DAX support was initially introduced by:
6b698edeee ("xfs: add DAX file operations support")
Link: http://lkml.kernel.org/r/20160714214049.20075-2-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes the most parts of internal crypto codes.
And then, it modifies and adds some ext4-specific crypt codes to use the generic
facility.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
- Until now, dax has been disabled if media errors were found on
any device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition
is page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent reads/writes
would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX related to
zeroing, writeback, and some size checks.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJXQ4GKAAoJEHr6Yb6juE3/zowP/iclIhgXXXMQJRUHJlePMXC8
15sGZ32JS1ak9g7vrsmNVEDNynfNtiMYdBxtUyRuj6xqgwdZvFk3F55KOCPtaeA1
+yADkgeRkTAcwzmHw9WQVEzBCqyzSisdrwtEfH817qdq9FJdH66x2Kos6i+HeAVr
5Q/e4gs7lKrjf384/QBl+wxNZOndJaQAPd2VRHQqx2A9F33v0ljdwRaUG1r4fjK2
dtmhcZCqdQyuAGXW3piTnZc5ZFc3DPqO4FkEfqkEK3lFOflK0fd8wMsAZRp/Jd0j
GJsgnVSWSqG0Dz476djlG0w8t2p5Jv1g9cKChV+ZZEdFLKWHCOUFqXNj8uI8I4k5
cOEKCHyJ3IwfSHhNQqktEWrQN4T8ZXhWtuc9GuV4UZYuqJqHci6EdR/YsWsJjV+L
lm/qvK4ipDS1pivxOy8KX/iN0z7Io8J9GXpStDx3g8iWjLlh4YYlbJLWeeRepo/z
aPlV/QAKcHiGY6jzLExrZIyCWkzwo6O+0p1Kxerv9/7K/32HWbOodZ+tC8eD+N25
pV69nCGf+u50T2TtIx1+iann4NC1r7zg5yqnT9AgpyZpiwR5joCDzI5sXW+D0rcS
vPtfM84Ccdeq/e6mvfIpZgR0/npQapKnrmUest0J7P2BFPHiFPji1KzZ7M+1aFOo
9R6JdrAj0Sc+FBa+cGzH
=v6Of
-----END PGP SIGNATURE-----
Merge tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull misc DAX updates from Vishal Verma:
"DAX error handling for 4.7
- Until now, dax has been disabled if media errors were found on any
device. This enables the use of DAX in the presence of these
errors by making all sector-aligned zeroing go through the driver.
- The driver (already) has the ability to clear errors on writes that
are sent through the block layer using 'DSMs' defined in ACPI 6.1.
Other misc changes:
- When mounting DAX filesystems, check to make sure the partition is
page aligned. This is a requirement for DAX, and previously, we
allowed such unaligned mounts to succeed, but subsequent
reads/writes would fail.
- Misc/cleanup fixes from Jan that remove unused code from DAX
related to zeroing, writeback, and some size checks"
* tag 'dax-misc-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: fix a comment in dax_zero_page_range and dax_truncate_page
dax: for truncate/hole-punch, do zeroing through the driver if possible
dax: export a low-level __dax_zero_page_range helper
dax: use sb_issue_zerout instead of calling dax_clear_sectors
dax: enable dax in the presence of known media errors (badblocks)
dax: fallback from pmd to pte on error
block: Update blkdev_dax_capable() for consistency
xfs: Add alignment check for DAX mount
ext2: Add alignment check for DAX mount
ext4: Add alignment check for DAX mount
block: Add bdev_dax_supported() for dax mount checks
block: Add vfs_msg() interface
dax: Remove redundant inode size checks
dax: Remove pointless writeback from dax_do_io()
dax: Remove zeroing from dax_io()
dax: Remove dead zeroing code from fault handlers
ext2: Avoid DAX zeroing to corrupt data
ext2: Fix block zeroing in ext2_get_blocks() for DAX
dax: Remove complete_unwritten argument
DAX: move RADIX_DAX_ definitions to dax.c
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a maliciously corrupted file system.
In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.
Finally the usual set of miscellaneous cleanups.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXQ40fAAoJEPL5WVaVDYGjnwMH+wXHASgPfzZgtRInsTG8W/2L
jsmAcMlyMAYIATWMppNtPIq0td49z1dYO0YkKhtPVMwfzu230IFWhGWp93WqP9ve
XYHMmaBorFlMAzWgMKn1K0ExWZlV+ammmcTKgU0kU4qyZp0G/NnMtlXIkSNv2amI
9Mn6R+v97c20gn8e9HWP/IVWkgPr+WBtEXaSGjC7dL6yI8hL+rJMqN82D76oU5ea
vtwzrna/ISijy+etYmQzqHNYNaBKf40+B5HxQZw/Ta3FSHofBwXAyLaeEAr260Mf
V3Eg2NDcKQxiZ3adBzIUvrRnrJV381OmHoguo8Frs8YHTTRiZ0T/s7FGr2Q0NYE=
=7yIM
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Fix a number of bugs, most notably a potential stale data exposure
after a crash and a potential BUG_ON crash if a file has the data
journalling flag enabled while it has dirty delayed allocation blocks
that haven't been written yet. Also fix a potential crash in the new
project quota code and a maliciously corrupted file system.
In addition, fix some DAX-specific bugs, including when there is a
transient ENOSPC situation and races between writes via direct I/O and
an mmap'ed segment that could lead to lost I/O.
Finally the usual set of miscellaneous cleanups"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: pre-zero allocated blocks for DAX IO
ext4: refactor direct IO code
ext4: fix race in transient ENOSPC detection
ext4: handle transient ENOSPC properly for DAX
dax: call get_blocks() with create == 1 for write faults to unwritten extents
ext4: remove unmeetable inconsisteny check from ext4_find_extent()
jbd2: remove excess descriptions for handle_s
ext4: remove unnecessary bio get/put
ext4: silence UBSAN in ext4_mb_init()
ext4: address UBSAN warning in mb_find_order_for_block()
ext4: fix oops on corrupted filesystem
ext4: fix check of dqget() return value in ext4_ioctl_setproject()
ext4: clean up error handling when orphan list is corrupted
ext4: fix hang when processing corrupted orphaned inode list
ext4: remove trailing \n from ext4_warning/ext4_error calls
ext4: fix races between changing inode journal mode and ext4_writepages
ext4: handle unwritten or delalloc buffers before enabling data journaling
ext4: fix jbd2 handle extension in ext4_ext_truncate_extend_restart()
ext4: do not ask jbd2 to write data for delalloc buffers
jbd2: add support for avoiding data writes during transaction commits
...
Fault handlers currently take complete_unwritten argument to convert
unwritten extents after PTEs are updated. However no filesystem uses
this anymore as the code is racy. Remove the unused argument.
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However this way DAX IO can race with page fault to
the same area:
ext4_ext_direct_IO() dax_fault()
dax_io()
get_block() - allocates unwritten extent
copy_from_iter_pmem()
get_block() - converts
unwritten block to
written and zeroes it
out
ext4_convert_unwritten_extents()
So data written with DAX IO gets lost. Similarly dax_new_buf() called
from dax_io() can overwrite data that has been already written to the
block via mmap.
Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well however we would need to
lock-out faults for the whole range written to by DAX IO. And that is
not easy to do without locking-out faults for the whole file which seems
too aggressive.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The kiocb already has the new position, so use that. The only interesting
case is AIO, where we currently don't bother updating ki_pos. We're about
to free the kiocb after we're done, so we might as well update it to make
everyone's life simpler.
While we're at it also return the bytes written argument passed in if
we were successful so that the boilerplate error switch code in the
callers can go away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This will allow us to do per-I/O sync file writes, as required by a lot
of fileservers or storage targets.
XXX: Will need a few additional audits for O_DSYNC
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Messages passed to ext4_warning() or ext4_error() don't need trailing
newlines, because these function add the newlines themselves.
Signed-off-by: Jakub Wilk <jwilk@jwilk.net>
(badly behaved) dentry code in various file systems. These have been
reviewed by Al and the respective file system mtinainers and are going
through the ext4 tree for convenience.
This also has a few ext4 encryption bug fixes that were discovered in
Android testing (yes, we will need to get these sync'ed up with the
fs/crypto code; I'll take care of that). It also has some bug fixes
and a change to ignore the legacy quota options to allow for xfstests
regression testing of ext4's internal quota feature and to be more
consistent with how xfs handles this case.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJXBn4aAAoJEPL5WVaVDYGjHWgH/2wXnlQnC2ndJhblBWtPzprz
OQW4dawdnhxqbTEGUqWe942tZivSb/liu/lF+urCGbWsbgz9jNOCmEAg7JPwlccY
mjzwDvtVq5U4d2rP+JDWXLy/Gi8XgUclhbQDWFVIIIea6fS7IuFWqoVBR+HPMhra
9tEygpiy5lNtJA/hqq3/z9x0AywAjwrYR491CuWreo2Uu1aeKg0YZsiDsuAcGioN
Waa2TgbC/ZZyJuJcPBP8If+VOFAa0ea3F+C/o7Tb9bOqwuz0qSTcaMRgt6eQ2KUt
P4b9Ecp1XLjJTC7IYOknUOScY3lCyREx/Xya9oGZfFNTSHzbOlLBoplCr3aUpYQ=
=/HHR
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bugfixes from Ted Ts'o:
"These changes contains a fix for overlayfs interacting with some
(badly behaved) dentry code in various file systems. These have been
reviewed by Al and the respective file system mtinainers and are going
through the ext4 tree for convenience.
This also has a few ext4 encryption bug fixes that were discovered in
Android testing (yes, we will need to get these sync'ed up with the
fs/crypto code; I'll take care of that). It also has some bug fixes
and a change to ignore the legacy quota options to allow for xfstests
regression testing of ext4's internal quota feature and to be more
consistent with how xfs handles this case"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: ignore quota mount options if the quota feature is enabled
ext4 crypto: fix some error handling
ext4: avoid calling dquot_get_next_id() if quota is not enabled
ext4: retry block allocation for failed DIO and DAX writes
ext4: add lockdep annotations for i_data_sem
ext4: allow readdir()'s of large empty directories to be interrupted
btrfs: fix crash/invalid memory access on fsync when using overlayfs
ext4 crypto: use dget_parent() in ext4_d_revalidate()
ext4: use file_dentry()
ext4: use dget_parent() in ext4_file_open()
nfs: use file_dentry()
fs: add file_dentry()
ext4 crypto: don't let data integrity writebacks fail with ENOMEM
ext4: check if in-inode xattr is corrupted in ext4_expand_extra_isize_ea()
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.
This promise never materialized. And unlikely will.
We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE. And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.
Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.
Let's stop pretending that pages in page cache are special. They are
not.
The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using
script below. For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.
The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.
There are few places in the code where coccinelle didn't reach. I'll
fix them manually in a separate patch. Comments and documentation also
will be addressed with the separate patch.
virtual patch
@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E
@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT
@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE
@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK
@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)
@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)
@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
EXT4 may be used as lower layer of overlayfs and accessing f_path.dentry
can lead to a crash.
Fix by replacing direct access of file->f_path.dentry with the
file_dentry() accessor, which will always return a native object.
Reported-by: Daniel Axtens <dja@axtens.net>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay")
Fixes: ff978b09f9 ("ext4 crypto: move context consistency check to ext4_file_open()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org> # v4.5
In f_op->open() lock on parent is not held, so there's no guarantee that
parent dentry won't go away at any time.
Even after this patch there's no guarantee that 'dir' will stay the parent
of 'inode', but at least it won't be freed while being used.
Fixes: ff978b09f9 ("ext4 crypto: move context consistency check to ext4_file_open()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org> # v4.5
improvements, plus a lot of clean ups and bug fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABCAAGBQJW6c9mAAoJEPL5WVaVDYGjWsEIAJkWUvKB3GgGgP82sKDBP2P8
IbWegO1ICMrSY78BqLI7mLCqggH5JClBgYU3O4VFv8Brj1L9mS5X+vflaDE1j9jj
Ik1KZKtZl1opOwO1L3D4l/ipZAiENUp7NehTtpsFousmz6nMZ5vo6x4t3QSwbUIm
YXpxUIxHEhBcW5i3EDkfYG8305V5oj8HsVf6T98OlWGpBO5VGNMAHvA7CQdQe7Rd
chv70rij5V684bJAEoosEFXVAuOUrxcBqbFA3Nlb432YOPj0ISLx76kw0GIjUYtf
yjoSClbRgwxGzh0jm+yaoYjjm83xbsYbHSsBmh3+/QLMbKTLXeCqR/BiqJavmcM=
=bWpz
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Performance improvements in SEEK_DATA and xattr scalability
improvements, plus a lot of clean ups and bug fixes"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (38 commits)
ext4: clean up error handling in the MMP support
jbd2: do not fail journal because of frozen_buffer allocation failure
ext4: use __GFP_NOFAIL in ext4_free_blocks()
ext4: fix compile error while opening the macro DOUBLE_CHECK
ext4: print ext4 mount option data_err=abort correctly
ext4: fix NULL pointer dereference in ext4_mark_inode_dirty()
ext4: drop unneeded BUFFER_TRACE in ext4_delete_inline_entry()
ext4: fix misspellings in comments.
jbd2: fix FS corruption possibility in jbd2_journal_destroy() on umount path
ext4: more efficient SEEK_DATA implementation
ext4: cleanup handling of bh->b_state in DAX mmap
ext4: return hole from ext4_map_blocks()
ext4: factor out determining of hole size
ext4: fix setting of referenced bit in ext4_es_lookup_extent()
ext4: remove i_ioend_count
ext4: simplify io_end handling for AIO DIO
ext4: move trans handling and completion deferal out of _ext4_get_block
ext4: rename and split get blocks functions
ext4: use i_mutex to serialize unaligned AIO DIO
ext4: pack ioend structure better
...
Using SEEK_DATA in a huge sparse file can easily lead to sotflockups as
ext4_seek_data() iterates hole block-by-block. Fix the problem by using
returned hole size from ext4_map_blocks() and thus skip the hole in one
go.
Update also SEEK_HOLE implementation to follow the same pattern as
SEEK_DATA to make future maintenance easier.
Furthermore we add cond_resched() to both ext4_seek_data() and
ext4_seek_hole() to avoid softlockups in case evil user creates huge
fragmented file and we have to go through lots of extents.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we've used hashed aio_mutex to serialize unaligned AIO DIO.
However the code cleanups that happened after 2011 when the lock was
introduced made aio_mutex acquired at almost the same places where we
already have exclusion using i_mutex. So just use i_mutex for the
exclusion of unaligned AIO DIO.
The change moves waiting for pending unwritten extent conversion under
i_mutex. That makes special handling of O_APPEND writes unnecessary and
also avoids possible livelocking of unaligned AIO DIO with aligned one
(nothing was preventing contiguous stream of aligned AIO DIOs to let
unaligned AIO DIO wait forever).
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
As it is currently written ext4_dax_mkwrite() assumes that the call into
__dax_mkwrite() will not have to do a block allocation so it doesn't create
a journal entry. For a read that creates a zero page to cover a hole
followed by a write that actually allocates storage this is incorrect. The
ext4_dax_mkwrite() -> __dax_mkwrite() -> __dax_fault() path calls
get_blocks() to allocate storage.
Fix this by having the ->page_mkwrite fault handler call ext4_dax_fault()
as this function already has all the logic needed to allocate a journal
entry and call __dax_fault().
Also update the ext2 fault handlers in this same way to remove duplicate
code and keep the logic between ext2 and ext4 the same.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
In the case where the per-file key for the directory is cached, but
root does not have access to the key needed to derive the per-file key
for the files in the directory, we allow the lookup to succeed, so
that lstat(2) and unlink(2) can suceed. However, if a program tries
to open the file, it will get an ENOKEY error.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull final vfs updates from Al Viro:
- The ->i_mutex wrappers (with small prereq in lustre)
- a fix for too early freeing of symlink bodies on shmem (they need to
be RCU-delayed) (-stable fodder)
- followup to dedupe stuff merged this cycle
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: abort dedupe loop if fatal signals are pending
make sure that freeing shmem fast symlinks is RCU-delayed
wrappers for ->i_mutex access
lustre: remove unused declaration
To properly support the new DAX fsync/msync infrastructure filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: Jeff Layton <jlayton@poochiereds.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).
Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.
Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate, hole punching, and other functions
removing blocks from extent tree and for read over page faults. We
cannot easily use i_data_sem for this since that ranks below transaction
start and we need something ranking above it so that it can be held over
the whole truncate / hole punching operation. Also remove various
workarounds we had in the code to reduce race window when page fault
could have created pages with stale mapping information.
Signed-off-by: Jan Kara <jack@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock. We can avoid this by starting the transaction in ext4 before
calling into DAX. The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DAX wants different semantics from any currently-existing ext4 get_block
callback. Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents. So introduce a new ext4_get_block_dax() which
has those semantics.
We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
DAX relies on the get_block function either zeroing newly allocated
blocks before they're findable by subsequent calls to get_block, or
marking newly allocated blocks as unwritten. ext4_get_block() cannot
create unwritten extents, but ext4_get_block_write() can.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use DAX to provide support for huge pages.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h. Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>. We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This update contains:
o A new sparse on-disk inode record format to allow small extents to
be used for inode allocation when free space is fragmented.
o DAX support. This includes minor changes to the DAX core code to
fix problems with lock ordering and bufferhead mapping abuse.
o transaction commit interface cleanup
o removal of various unnecessary XFS specific type definitions
o cleanup and optimisation of freelist preparation before allocation
o various minor cleanups
o bug fixes for
- transaction reservation leaks
- incorrect inode logging in unwritten extent conversion
- mmap lock vs freeze ordering
- remote symlink mishandling
- attribute fork removal issues.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJVkhI0AAoJEK3oKUf0dfod45MQAJCOEkNduBdlfPvTCMPjj/7z
vzcfDdzgKwhpPTMXSDRvw4zDPt3C2FLMBJqxtPpC4sKGKG/8G0kFvw8bDtBag1m9
ru5nI5LaQ6LC5RcU40zxBx1s/L8qYvyfUlxeoOT5lSwN9c6ENGOCQ3bUk4pSKaee
pWDplag9LbfQomW2GHtxd8agMUZEYx0R1vgfv88V8xgPka8CvQo81XUgkb4PcDZV
ugR+wDUsvwMS01aLYBmRFkMXuExNuCJVwtvdTJS+ZWGHzyTpulFoANUW6QT24gAM
eP4yRXN4bv9vXrXpg8JkF25DHsfw4HBwNEL17ZvoB8t3oJp1/NYaH8ce1jS0+I8i
NCtaO+qUqDSTGQZKgmeDPwCciQp54ra9LEdmIJFxpZxiBof9g/tIYEFgRklyFLwR
GZU6Io6VpBa1oTGlC4D1cmG6bdcnhMB9MGVVCbqnB5mRRDKCmVgCyJwusd1pi7Re
G4O6KkFt21O7+fP13VsjP57KoaJzsIgZ/+H3Ff/fJOJ33AKYTRCmwi8+IMi2n5JI
zz+V0AIBQZAx9dlVyENnxufh9eJYcnwta0lUSLCCo91fZKxbo3ktK1kVHNZP5EGs
IMFM1Ka6hibY20rWlR3GH0dfyP5/yNcvNgTMYPKjj9SVjTar1aSfF2rGpkqYXYyH
D4FICbtDgtOc2ClfpI2k
=3x+W
-----END PGP SIGNATURE-----
Merge tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
Pul xfs updates from Dave Chinner:
"There's a couple of small API changes to the core DAX code which
required small changes to the ext2 and ext4 code bases, but otherwise
everything is within the XFS codebase.
This update contains:
- A new sparse on-disk inode record format to allow small extents to
be used for inode allocation when free space is fragmented.
- DAX support. This includes minor changes to the DAX core code to
fix problems with lock ordering and bufferhead mapping abuse.
- transaction commit interface cleanup
- removal of various unnecessary XFS specific type definitions
- cleanup and optimisation of freelist preparation before allocation
- various minor cleanups
- bug fixes for
- transaction reservation leaks
- incorrect inode logging in unwritten extent conversion
- mmap lock vs freeze ordering
- remote symlink mishandling
- attribute fork removal issues"
* tag 'xfs-for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
xfs: don't truncate attribute extents if no extents exist
xfs: clean up XFS_MIN_FREELIST macros
xfs: sanitise error handling in xfs_alloc_fix_freelist
xfs: factor out free space extent length check
xfs: xfs_alloc_fix_freelist() can use incore perag structures
xfs: remove xfs_caddr_t
xfs: use void pointers in log validation helpers
xfs: return a void pointer from xfs_buf_offset
xfs: remove inst_t
xfs: remove __psint_t and __psunsigned_t
xfs: fix remote symlinks on V5/CRC filesystems
xfs: fix xfs_log_done interface
xfs: saner xfs_trans_commit interface
xfs: remove the flags argument to xfs_trans_cancel
xfs: pass a boolean flag to xfs_trans_free_items
xfs: switch remaining xfs_trans_dup users to xfs_trans_roll
xfs: check min blks for random debug mode sparse allocations
xfs: fix sparse inodes 32-bit compile failure
xfs: add initial DAX support
xfs: add DAX IO path support
...
dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.
Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.
Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
This is a pretty massive patch which does a number of different things:
1) The per-inode encryption information is now stored in an allocated
data structure, ext4_crypt_info, instead of directly in the node.
This reduces the size usage of an in-memory inode when it is not
using encryption.
2) We drop the ext4_fname_crypto_ctx entirely, and use the per-inode
encryption structure instead. This remove an unnecessary memory
allocation and free for the fname_crypto_ctx as well as allowing us
to reuse the ctfm in a directory for multiple lookups and file
creations.
3) We also cache the inode's policy information in the ext4_crypt_info
structure so we don't have to continually read it out of the
extended attributes.
4) We now keep the keyring key in the inode's encryption structure
instead of releasing it after we are done using it to derive the
per-inode key. This allows us to test to see if the key has been
revoked; if it has, we prevent the use of the derived key and free
it.
5) When an inode is released (or when the derived key is freed), we
will use memset_explicit() to zero out the derived key, so it's not
left hanging around in memory. This implies that when a user logs
out, it is important to first revoke the key, and then unlink it,
and then finally, to use "echo 3 > /proc/sys/vm/drop_caches" to
release any decrypted pages and dcache entries from the system
caches.
6) All this, and we also shrink the number of lines of code by around
100. :-)
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull third hunk of vfs changes from Al Viro:
"This contains the ->direct_IO() changes from Omar + saner
generic_write_checks() + dealing with fcntl()/{read,write}() races
(mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
repeatedly looking at ->f_flags, which can be changed by fcntl(2),
check ->ki_flags - which cannot) + infrastructure bits for dhowells'
d_inode annotations + Christophs switch of /dev/loop to
vfs_iter_write()"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
block: loop: switch to VFS ITER_BVEC
configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
NFS: Don't use d_inode as a variable name
VFS: Impose ordering on accesses of d_inode and d_flags
VFS: Add owner-filesystem positive/negative dentry checks
nfs: generic_write_checks() shouldn't be done on swapout...
ocfs2: use __generic_file_write_iter()
mirror O_APPEND and O_DIRECT into iocb->ki_flags
switch generic_write_checks() to iocb and iter
ocfs2: move generic_write_checks() before the alignment checks
ocfs2_file_write_iter: stop messing with ppos
udf_file_write_iter: reorder and simplify
fuse: ->direct_IO() doesn't need generic_write_checks()
ext4_file_write_iter: move generic_write_checks() up
xfs_file_aio_write_checks: switch to iocb/iov_iter
generic_write_checks(): drop isblk argument
blkdev_write_iter: expand generic_file_checks() call in there
...
Merge second patchbomb from Andrew Morton:
- the rest of MM
- various misc bits
- add ability to run /sbin/reboot at reboot time
- printk/vsprintf changes
- fiddle with seq_printf() return value
* akpm: (114 commits)
parisc: remove use of seq_printf return value
lru_cache: remove use of seq_printf return value
tracing: remove use of seq_printf return value
cgroup: remove use of seq_printf return value
proc: remove use of seq_printf return value
s390: remove use of seq_printf return value
cris fasttimer: remove use of seq_printf return value
cris: remove use of seq_printf return value
openrisc: remove use of seq_printf return value
ARM: plat-pxa: remove use of seq_printf return value
nios2: cpuinfo: remove use of seq_printf return value
microblaze: mb: remove use of seq_printf return value
ipc: remove use of seq_printf return value
rtc: remove use of seq_printf return value
power: wakeup: remove use of seq_printf return value
x86: mtrr: if: remove use of seq_printf return value
linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
MAINTAINERS: CREDITS: remove Stefano Brivio from B43
.mailmap: add Ricardo Ribalda
CREDITS: add Ricardo Ribalda Delgado
...
The original dax patchset split the ext2/4_file_operations because of the
two NULL splice_read/splice_write in the dax case.
In the vfs if splice_read/splice_write are NULL we then call
default_splice_read/write.
What we do here is make generic_file_splice_read aware of IS_DAX() so the
original ext2/4_file_operations can be used as is.
For write it appears that iter_file_splice_write is just fine. It uses
the regular f_op->write(file,..) or new_sync_write(file, ...).
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
From: Yigal Korman <yigal@plexistor.com>
[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.
A new xfstest is submitted for testing this (generic/080)
[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success. Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
fs/ext4/file.c
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove unused header files and header files which are included in
ext4.h.
Signed-off-by: Sheng Yong <shengyong1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This is a port of the DAX functionality found in the current version of
ext2.
[matthew.r.wilcox@intel.com: heavily tweaked]
[akpm@linux-foundation.org: remap_pages went away]
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 14516bb7bb.
This was causing regression test failures with generic/285 with an ext3
filesystem using CONFIG_EXT4_USE_FOR_EXT23.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
It is ridiculous practice to scan inode block by block, this technique
applicable only for old indirect files. This takes significant amount
of time for really large files. Let's reuse ext4_fiemap which already
traverse inode-tree in most optimal meaner.
TESTCASE:
ftruncate64(fd, 0);
ftruncate64(fd, 1ULL << 40);
/* lseek will spin very long time */
lseek64(fd, 0, SEEK_DATA);
lseek64(fd, 0, SEEK_HOLE);
Original report: https://lkml.org/lkml/2014/10/16/620
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
There is no kind of file which does not supply a page reading function.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Pull vfs updates from Al Viro:
"This the bunch that sat in -next + lock_parent() fix. This is the
minimal set; there's more pending stuff.
In particular, I really hope to get acct.c fixes merged this cycle -
we need that to deal sanely with delayed-mntput stuff. In the next
pile, hopefully - that series is fairly short and localized
(kernel/acct.c, fs/super.c and fs/namespace.c). In this pile: more
iov_iter work. Most of prereqs for ->splice_write with sane locking
order are there and Kent's dio rewrite would also fit nicely on top of
this pile"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (70 commits)
lock_parent: don't step on stale ->d_parent of all-but-freed one
kill generic_file_splice_write()
ceph: switch to iter_file_splice_write()
shmem: switch to iter_file_splice_write()
nfs: switch to iter_splice_write_file()
fs/splice.c: remove unneeded exports
ocfs2: switch to iter_file_splice_write()
->splice_write() via ->write_iter()
bio_vec-backed iov_iter
optimize copy_page_{to,from}_iter()
bury generic_file_aio_{read,write}
lustre: get rid of messing with iovecs
ceph: switch to ->write_iter()
ceph_sync_direct_write: stop poking into iov_iter guts
ceph_sync_read: stop poking into iov_iter guts
new helper: copy_page_from_iter()
fuse: switch to ->write_iter()
btrfs: switch to ->write_iter()
ocfs2: switch to ->write_iter()
xfs: switch to ->write_iter()
...
iter_file_splice_write() - a ->splice_write() instance that gathers the
pipe buffers, builds a bio_vec-based iov_iter covering those and feeds
it to ->write_iter(). A bunch of simple cases coverted to that...
[AV: fixed the braino spotted by Cyrill]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I have been running make namespacecheck to look for unneeded globals, and
found these in ext4.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
unfortunately, Ted's changes to ext4_file_write() are *still* an
incomplete fix - playing with rlimits can let you smuggle an
unaligned request past the checks. So there almost certainly
will be more merge PITA around that place...
[fix from Peter Ujfalusi <peter.ujfalusi@ti.com> folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Al Viro pointed out that locking for O_APPEND writes was problematic,
since the location of the write isn't known until after we take the
i_mutex, which impacts the ext4_unaligned_aio() and s_bitmap_maxbytes
check.
For O_APPEND always assume that the write is unaligned so call
ext4_unwritten_wait(). And to solve the second problem, take the
i_mutex earlier before we start the s_bitmap_maxbytes check.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This commit doesn't actually change anything; it just moves code
around in preparation for some code simplification work.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Copy generic_file_aio_write() into ext4_file_write(). This is part of
a patch series which allows us to simplify ext4_file_write() and
ext4_file_dio_write(), by calling __generic_file_aio_write() directly.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Currently in ext4 there is quite a mess when it comes to naming
unwritten extents. Sometimes we call it uninitialized and sometimes we
refer to it as unwritten.
The right name for the extent which has been allocated but does not
contain any written data is _unwritten_. Other file systems are
using this name consistently, even the buffer head state refers to it as
unwritten. We need to fix this confusion in ext4.
This commit changes every reference to an uninitialized extent (meaning
allocated but unwritten) to unwritten extent. This includes comments,
function names and variable names. It even covers abbreviation of the
word uninitialized (such as uninit) and some misspellings.
This commit does not change any of the code paths at all. This has been
confirmed by comparing md5sums of the assembly code of each object file
after all the function names were stripped from it.
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
We had a number of new features in ext4 during this merge window
(ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
there were many more regression and bug fixes this time around. It
didn't help that xfstests hadn't been fully updated to fully stress
test COLLAPSE_RANGE until after -rc1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJTVIEUAAoJENNvdpvBGATwnKkQANlzQv6BhgzCa0b5Iu0SkHeD
OuLAtPFYE5OVEK22oWT0H76gBi71RHLboHwThd+ZfEeEPvyfs42wY0J/PV/R9dHx
kwhU+MaDDzugfVj3gg29DpYNLQkL/evq0vlNbrRk5je877c2I8JbXV/aAoTVFZoH
NGOsagwBqWCsgL5nSOk/nEZSRX2AzSCkgmOVxylLzFoyTUkX3vZx8G8XtS1zRgbH
b1yOWIK1Ifj7tmBZ4HLpNiK6/NpHAHeHRFiaCQxY0hkLjUeMyVNJfZzXS/Fzp8DP
p1/nm5z9PaFj4nyBC1Wvh9Z6Lj0zQ0ap73LV+w4fHM1SZub3XY+hvyXj/8qMNaSc
lLIGwa2AZFpurbKKn6MZTi5CubVLZs6PZKzDgYURnEcJCgeMujMOxbKekcL5sP9E
Gb6Hh9I/f08HagCRox5O0W7f0/TBY5bFryx5kQQZUtpcRmnY3m7cohSkn6WriwTZ
zYApOZMZkFX5spSeYsfyi8K8wHij/5mXvm7qeqQ0Rj4Ehycd+7jwltOCVXAYN29+
zSKaBaxH2+V7zuGHSxjDFbOOlPotTFNzGmFh08DPTF4Vgnc9uMlLo0Oz8ADFDcT2
JZ4pAFTEREnHOATNl5bAEi8wNrU/Ln9IGhlYCYI9X5BQXjf9oPXcYwQT/lKCb07s
ks8ujfry1R/gjQGuv+LH
=gi42
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"These are regression and bug fixes for ext4.
We had a number of new features in ext4 during this merge window
(ZERO_RANGE and COLLAPSE_RANGE fallocate modes, renameat, etc.) so
there were many more regression and bug fixes this time around. It
didn't help that xfstests hadn't been fully updated to fully stress
test COLLAPSE_RANGE until after -rc1"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (31 commits)
ext4: disable COLLAPSE_RANGE for bigalloc
ext4: fix COLLAPSE_RANGE failure with 1KB block size
ext4: use EINVAL if not a regular file in ext4_collapse_range()
ext4: enforce we are operating on a regular file in ext4_zero_range()
ext4: fix extent merging in ext4_ext_shift_path_extents()
ext4: discard preallocations after removing space
ext4: no need to truncate pagecache twice in collapse range
ext4: fix removing status extents in ext4_collapse_range()
ext4: use filemap_write_and_wait_range() correctly in collapse range
ext4: use truncate_pagecache() in collapse range
ext4: remove temporary shim used to merge COLLAPSE_RANGE and ZERO_RANGE
ext4: fix ext4_count_free_clusters() with EXT4FS_DEBUG and bigalloc enabled
ext4: always check ext4_ext_find_extent result
ext4: fix error handling in ext4_ext_shift_extents
ext4: silence sparse check warning for function ext4_trim_extent
ext4: COLLAPSE_RANGE only works on extent-based files
ext4: fix byte order problems introduced by the COLLAPSE_RANGE patches
ext4: use i_size_read in ext4_unaligned_aio()
fs: disallow all fallocate operation on active swapfile
fs: move falloc collapse range check into the filesystem methods
...
Pull vfs updates from Al Viro:
"The first vfs pile, with deep apologies for being very late in this
window.
Assorted cleanups and fixes, plus a large preparatory part of iov_iter
work. There's a lot more of that, but it'll probably go into the next
merge window - it *does* shape up nicely, removes a lot of
boilerplate, gets rid of locking inconsistencie between aio_write and
splice_write and I hope to get Kent's direct-io rewrite merged into
the same queue, but some of the stuff after this point is having
(mostly trivial) conflicts with the things already merged into
mainline and with some I want more testing.
This one passes LTP and xfstests without regressions, in addition to
usual beating. BTW, readahead02 in ltp syscalls testsuite has started
giving failures since "mm/readahead.c: fix readahead failure for
memoryless NUMA nodes and limit readahead pages" - might be a false
positive, might be a real regression..."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
missing bits of "splice: fix racy pipe->buffers uses"
cifs: fix the race in cifs_writev()
ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure
kill generic_file_buffered_write()
ocfs2_file_aio_write(): switch to generic_perform_write()
ceph_aio_write(): switch to generic_perform_write()
xfs_file_buffered_aio_write(): switch to generic_perform_write()
export generic_perform_write(), start getting rid of generic_file_buffer_write()
generic_file_direct_write(): get rid of ppos argument
btrfs_file_aio_write(): get rid of ppos
kill the 5th argument of generic_file_buffered_write()
kill the 4th argument of __generic_file_aio_write()
lustre: don't open-code kernel_recvmsg()
ocfs2: don't open-code kernel_recvmsg()
drbd: don't open-code kernel_recvmsg()
constify blk_rq_map_user_iov() and friends
lustre: switch to kernel_sendmsg()
ocfs2: don't open-code kernel_sendmsg()
take iov_iter stuff to mm/iov_iter.c
process_vm_access: tidy up a bit
...
filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.
It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ning Qu <quning@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We know that "ret > 0" is true here. These tests were left over from
commit 02afc27fae ('direct-io: Handle O_(D)SYNC AIO') and aren't
needed any more.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It actually goes back to 2004 ([PATCH] Concurrent O_SYNC write support)
when sync_page_range() had been introduced; generic_file_write{,v}() correctly
synced
pos_after_write - written .. pos_after_write - 1
but generic_file_aio_write() synced
pos_before_write .. pos_before_write + written - 1
instead. Which is not the same thing with O_APPEND, obviously.
A couple of years later correct variant had been killed off when
everything switched to use of generic_file_aio_write().
All users of generic_file_aio_write() are affected, and the same bug
has been copied into other instances of ->aio_write().
The fix is trivial; the only subtle point is that generic_write_sync()
ought to be inlined to avoid calculations useless for the majority of
calls.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Call generic_write_sync() from the deferred I/O completion handler if
O_DSYNC is set for a write request. Also make sure various callers
don't call generic_write_sync if the direct I/O code returns
-EIOCBQUEUED.
Based on an earlier patch from Jan Kara <jack@suse.cz> with updates from
Jeff Moyer <jmoyer@redhat.com> and Darrick J. Wong <darrick.wong@oracle.com>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commit 0713ed0cde added
jbd2_journal_file_inode() call into ext4_block_zero_page_range().
However that function gets called from truncate path and thus inode
needn't have jinode attached - that happens in ext4_file_open() but
the file needn't be ever open since mount. Calling
jbd2_journal_file_inode() without jinode attached results in the oops.
We fix the problem by attaching jinode to inode also in ext4_truncate()
and ext4_punch_hole() when we are going to zero out partial blocks.
Reported-by: majianpeng <majianpeng@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Pull second set of VFS changes from Al Viro:
"Assorted f_pos race fixes, making do_splice_direct() safe to call with
i_mutex on parent, O_TMPFILE support, Jeff's locks.c series,
->d_hash/->d_compare calling conventions changes from Linus, misc
stuff all over the place."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
Document ->tmpfile()
ext4: ->tmpfile() support
vfs: export lseek_execute() to modules
lseek_execute() doesn't need an inode passed to it
block_dev: switch to fixed_size_llseek()
cpqphp_sysfs: switch to fixed_size_llseek()
tile-srom: switch to fixed_size_llseek()
proc_powerpc: switch to fixed_size_llseek()
ubi/cdev: switch to fixed_size_llseek()
pci/proc: switch to fixed_size_llseek()
isapnp: switch to fixed_size_llseek()
lpfc: switch to fixed_size_llseek()
locks: give the blocked_hash its own spinlock
locks: add a new "lm_owner_key" lock operation
locks: turn the blocked_list into a hashtable
locks: convert fl_link to a hlist_node
locks: avoid taking global lock if possible when waking up blocked waiters
locks: protect most of the file_lock handling with i_lock
locks: encapsulate the fl_link list handling
locks: make "added" in __posix_lock_file a bool
...
For those file systems(btrfs/ext4/ocfs2/tmpfs) that support
SEEK_DATA/SEEK_HOLE functions, we end up handling the similar
matter in lseek_execute() to update the current file offset
to the desired offset if it is valid, ceph also does the
simliar things at ceph_llseek().
To reduce the duplications, this patch make lseek_execute()
public accessible so that we can call it directly from the
underlying file systems.
Thanks Dave Chinner for this suggestion.
[AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
v2->v1:
- Add kernel-doc comments for lseek_execute()
- Call lseek_execute() in ceph->llseek()
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <chris.mason@fusionio.com>
Cc: Josef Bacik <jbacik@fusionio.com>
Cc: Ben Myers <bpm@sgi.com>
Cc: Ted Tso <tytso@mit.edu>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Sage Weil <sage@inktank.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ext4_lblk_t is just u32 so multiplying it by blocksize can easily
overflow for files larger than 4 GB. Fix that by properly typing the
block offsets before shifting.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Zheng Liu <wenqing.lz@taobao.com>
regression) introduced during the 3.10-rc1 merge window. Also
included is a bug fix relating to allocating blocks after resizing an
ext3 file system when using the ext4 file system driver.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJRkZBlAAoJENNvdpvBGATwLYQP/iWOBs2z93WG23cqkgqvL8o6
ZyeJdgy9dkFCArVDX5SSnGkJXZ3iqIKi5HoTKTJKfytgMzgiDAZcLsIHVv6NczwR
UGhjgS3HEdV5tJ46E6JnpB3NLSb+rAdc5kCdlsbzU46CP+JjFiYEhxVpK7ELuM/G
yctChbIH9FY+1OwxHccacBOaJU2ELhnH6B/8Ry/6gM2H0vfKeTNOdocOHdxvbNqg
ooGjytMfVopMQEfVG8aXtTfy341NFJH5fAYEahCcXxeO9ta6Unj9yOu5JV2wVrTt
39+DBsquGX6AVQsc9IxJ6YAN6ldwWN7l3huE9/AI0o/alwGsfVi5M+M/d1MMjDqf
Fgl2EzzBpZQeKKY9UXNi4LLgYdBiILMgKDOGoRKhRb8ynSSf/JX43+24FvidEi3o
o//J4aR+oSZfaovGAeikqyF1cumayhoNN8MINRN8igIinBiC4GjBFEl/Kl/1eAY/
lREGcsmYPXOkVPpM72waRYlP4GwNdOg4QSEY0SGljpwluO+dYtKQjHXcv/s/xL5v
j3GemzYVyjx4zaq1g3PxGfuD6VKFHr0T6jvzd6cHu17lnPlw9fwznHbEm9BEcXDY
gbGx9u+a2ZTqDwYVALbeoRpf9Zz6DUCse3ts4N3rbkXUQQiBYo7tybfVopIMAukb
CexvidDE/ryJrJJFBwoK
=6cRD
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 update from Ted Ts'o:
"Fixed regressions (two stability regressions and a performance
regression) introduced during the 3.10-rc1 merge window.
Also included is a bug fix relating to allocating blocks after
resizing an ext3 file system when using the ext4 file system driver"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
jbd,jbd2: fix oops in jbd2_journal_put_journal_head()
ext4: revert "ext4: use io_end for multiple bios"
ext4: limit group search loop for non-extent files
ext4: fix fio regression
We (Linux Kernel Performance project) found a regression introduced
by commit:
f7fec032aa ext4: track all extent status in extent status tree
The commit causes about 20% performance decrease in fio random write
test. Profiler shows that rb_next() uses a lot of CPU time. The call
stack is:
rb_next
ext4_es_find_delayed_extent
ext4_map_blocks
_ext4_get_block
ext4_get_block_write
__blockdev_direct_IO
ext4_direct_IO
generic_file_direct_write
__generic_file_aio_write
ext4_file_write
aio_rw_vect_retry
aio_run_iocb
do_io_submit
sys_io_submit
system_call_fastpath
io_submit
td_io_getevents
io_u_queued_complete
thread_main
main
__libc_start_main
The cause is that ext4_es_find_delayed_extent() doesn't have an
upper bound, it keeps searching until a delayed extent is found.
When there are a lots of non-delayed entries in the extent state
tree, ext4_es_find_delayed_extent() may uses a lot of CPU time.
Reported-by: LKP project <lkp@linux.intel.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Pull vfs pile (part one) from Al Viro:
"Assorted stuff - cleaning namei.c up a bit, fixing ->d_name/->d_parent
locking violations, etc.
The most visible changes here are death of FS_REVAL_DOT (replaced with
"has ->d_weak_revalidate()") and a new helper getting from struct file
to inode. Some bits of preparation to xattr method interface changes.
Misc patches by various people sent this cycle *and* ocfs2 fixes from
several cycles ago that should've been upstream right then.
PS: the next vfs pile will be xattr stuff."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (46 commits)
saner proc_get_inode() calling conventions
proc: avoid extra pde_put() in proc_fill_super()
fs: change return values from -EACCES to -EPERM
fs/exec.c: make bprm_mm_init() static
ocfs2/dlm: use GFP_ATOMIC inside a spin_lock
ocfs2: fix possible use-after-free with AIO
ocfs2: Fix oops in ocfs2_fast_symlink_readpage() code path
get_empty_filp()/alloc_file() leave both ->f_pos and ->f_version zero
target: writev() on single-element vector is pointless
export kernel_write(), convert open-coded instances
fs: encode_fh: return FILEID_INVALID if invalid fid_type
kill f_vfsmnt
vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op
nfsd: handle vfs_getattr errors in acl protocol
switch vfs_getattr() to struct path
default SET_PERSONALITY() in linux/elf.h
ceph: prepopulate inodes only when request is aborted
d_hash_and_lookup(): export, switch open-coded instances
9p: switch v9fs_set_create_acl() to inode+fid, do it before d_instantiate()
9p: split dropping the acls from v9fs_set_create_acl()
...
This commit renames ext4_es_find_extent with ext4_es_find_delayed_extent
and improve this function. First, we split input and output parameter.
Second, this function never return the first block of the next delayed
extent after 'es'.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan kara <jack@suse.cz>
This commit refines the extent status tree code.
1) A prefix 'es_' is added to to the extent status tree structure
members.
2) Refactored es_remove_extent() so that __es_remove_extent() can be
used by es_insert_extent() to remove the old extent entry(-ies) before
inserting a new one.
3) Rename extent_status_end() to ext4_es_end()
4) ext4_es_can_be_merged() is define to check whether two extents can
be merged or not.
5) Update and clarified comments.
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
So we can better understand what bits of ext4 are responsible for
long-running jbd2 handles, use jbd2__journal_start() so we can pass
context information for logging purposes.
The recommended way for finding the longer-running handles is:
T=/sys/kernel/debug/tracing
EVENT=$T/events/jbd2/jbd2_handle_stats
echo "interval > 5" > $EVENT/filter
echo 1 > $EVENT/enable
./run-my-fs-benchmark
cat $T/trace > /tmp/problem-handles
This will list handles that were active for longer than 20ms. Having
longer-running handles is bad, because a commit started at the wrong
time could stall for those 20+ milliseconds, which could delay an
fsync() or an O_SYNC operation. Here is an example line from the
trace file describing a handle which lived on for 311 jiffies, or over
1.2 seconds:
postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
dirtied_blocks 0
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
which could cause file system corruptions when performing file punch
operations.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJQ374OAAoJENNvdpvBGATwEGAP/jKUwjQhBZiF0k9dg1kQ5eTz
bdli4fy1vxrEMIOym8IZa4nBQJVCkArwRgjc28gCBD6k9u6X3GPa26vUydsoPfP6
odPdc9c9HtsbYQGuaq1SohID5HfjxHewTcUmCs4X4SpGcSurUcT7eQYWqSuIxFHR
0nKk8NO4EcWh2uqIoGPrc8QpSdor0DXXYYjZmHCeVLH1n6PyoMsnrFMfO9KqMLUL
vNR54CX9n1GRTfAfJNkNzcwfs8IfNkDUyv5hFpDh15tLltogU0TqnlAl3vSeZGSx
vVfhwHmQTK/bJyC3YaoRZqq9CQJVk2f/OTBpJDFY/USaapuitJd6vqbmh7NiRNAN
LaKmFt99MPfwyjEhIA7+J0LCTraAxc536q43oWWK5dAJhWI7DW0lbHARVeQTixNy
KJ1Lp0pmmz1mX8/lugOnK1SPBF525kTaoiz2bWqg4oQgn7mBzUlgj+EV22/6Rq83
TpKOKstl4BiZi8t5AhmFiwqtknCDiT5vUKQNy2kuM/oXtPJID/lM/TJbR5viYD3l
AH3Ef7xj61CynFZ0oBeraGwtXc2BHJpJdWz+8uj0/VhFfC+uNUYapSLFwyiAVZKO
xxaItT3ylfKpa0AWK6HBc2SLuL72SCHAPks06YKFtSyHtr5C8SCcafxU2DSOSi7K
VrhkcH6STa77Br7a1ORt
=9R/D
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 bug fixes from Ted Ts'o:
"Various bug fixes for ext4. Perhaps the most serious bug fixed is one
which could cause file system corruptions when performing file punch
operations."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: avoid hang when mounting non-journal filesystems with orphan list
ext4: lock i_mutex when truncating orphan inodes
ext4: do not try to write superblock on ro remount w/o journal
ext4: include journal blocks in df overhead calcs
ext4: remove unaligned AIO warning printk
ext4: fix an incorrect comment about i_mutex
ext4: fix deadlock in journal_unmap_buffer()
ext4: split off ext4_journalled_invalidatepage()
jbd2: fix assertion failure in jbd2_journal_flush()
ext4: check dioread_nolock on remount
ext4: fix extent tree corruption caused by hole punch
Although I put this in, I now think it was a bad decision. For most
users, there is very little to be done in this case. They get the
message, once per day, with no real context or proposed action. TBH,
it generates support calls when it probably does not need to; the
message sounds more dire than the situation really is.
Just nuke it. Normal investigation via blktrace or whatnot can
reveal poor IO patterns if bad performance is encountered.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
But the kernel decided to call it "origin" instead. Fix most of the
sites.
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ted has sent out a RFC about removing this feature. Eric and Jan
confirmed that both RedHat and SUSE enable this feature in all their
product. David also said that "As far as I know, it's enabled in all
Android kernels that use ext4." So it seems OK for us.
And what's more, as inline data depends its implementation on xattr,
and to be frank, I don't run any test again inline data enabled while
xattr disabled. So I think we should add inline data and remove this
config option in the same release.
[ The savings if you disable CONFIG_EXT4_FS_XATTR is only 27k, which
isn't much in the grand scheme of things. Since no one seems to be
testing this configuration except for some automated compile farms, on
balance we are better removing this config option, and so that it is
effectively always enabled. -- tytso ]
Cc: David Brown <davidb@codeaurora.org>
Cc: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
This patch makes ext4 really support SEEK_DATA/SEEK_HOLE flags. Block-mapped
and extent-mapped files are fully implemented together because ext4_map_blocks
hides this differences.
After applying this patch, it will cause a failure in xfstest #285 when the file
is block-mapped due to block-mapped file isn't support fallocate(2).
I had tried to use ext4_ext_walk_space() to retrieve the offset for a
extent-mapped file. But finally I decide to keep using ext4_map_blocks() to
support SEEK_DATA/SEEK_HOLE because ext4_map_blocks() can hide the difference
between block-mapped file and extent-mapped file. Moreover, in next step,
extent status tree will track all extent status, and we can get all mappings
from this tree. So I think that using ext4_map_blocks() is a better choice.
CC: Hugh Dickins <hughd@google.com>
Signed-off-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().
Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.
Now device drivers can implement this method and obtain nonlinear vma support.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com> #arch/tile
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
BUG #1) All places where we call ext4_flush_completed_IO are broken
because buffered io and DIO/AIO goes through three stages
1) submitted io,
2) completed io (in i_completed_io_list) conversion pended
3) finished io (conversion done)
And by calling ext4_flush_completed_IO we will flush only
requests which were in (2) stage, which is wrong because:
1) punch_hole and truncate _must_ wait for all outstanding unwritten io
regardless to it's state.
2) fsync and nolock_dio_read should also wait because there is
a time window between end_page_writeback() and ext4_add_complete_io()
As result integrity fsync is broken in case of buffered write
to fallocated region:
fsync blkdev_completion
->filemap_write_and_wait_range
->ext4_end_bio
->end_page_writeback
<-- filemap_write_and_wait_range return
->ext4_flush_completed_IO
sees empty i_completed_io_list but pended
conversion still exist
->ext4_add_complete_io
BUG #2) Race window becomes wider due to the 'ext4: completed_io
locking cleanup V4' patch series
This patch make following changes:
1) ext4_flush_completed_io() now first try to flush completed io and when
wait for any outstanding unwritten io via ext4_unwritten_wait()
2) Rename function to more appropriate name.
3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
prevent endless wait
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
AIO/DIO prefix is wrong because it account unwritten extents which
also may be scheduled from buffered write endio
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
greatest note is a speed up for parallel, non-allocating DIO writes,
since we no longer take the i_mutex lock in that case. For bug fixes,
we fix an incorrect overhead calculation which caused slightly
incorrect results for df(1) and statfs(2). We also fixed bugs in the
metadata checksum feature.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABCAAGBQJQE1SzAAoJENNvdpvBGATwzOMQAKxVkaTqkMc+cUahLLUkFdGQ
xnmigHufVuOvv+B8l1p6vbYx+qOftztqXF1WC41j8mTfEDs19jKXv3LI57TJ+bIW
a/Kp1CjMicRs/HGhFPNWp1D+N8J95MTFP6Y9biXSmBBvefg2NSwxpg48yZtjUy1/
Zl0414AqMvTJyqKKOUa++oyl3XlawnzDZ+6a7QPKsrAaDOU5977W4y2tZkNFk84d
qRjTfaiX13aVe7XupgQHe/jtk40Sj38pyGVGiAGlHOZhZtYKE6MzB8OreGiMTy8z
rmJz/CQUsQ8+sbKYhAcDru+bMrKghbO0u78XRz9CY+YpVFF/Xs2QiXoV0ZOGkIm6
eokq7jEV0K+LtDEmM3PkmUPgXfYw5damTv8qWuBUFd4wtVE5x/DmK8AJVMidCAUN
GkVR+rEbbEi7RCwsuac/aKB8baVQCTiJ5tfNTgWh9zll+9GZSk+U71Pp0KdcJGiS
nxitAZ+20hZN2CQctlmaGbCPTPYCWU4hQ3IuMdTlQTQAs8S0y1FtylTRsXcC1eVR
i1hBS/dVw5PVCaqoX79zYrByUymgX0ZaYY6seRT6+U9xPGDCSNJ9mjXZL3fttnOC
d5gsx/pbMIAv52G5Hj6DfsXR2JFmmxsaIzsLtRvKi9q89d84XaZfbUsHYjn4Neym
5lTKaSQHU71cKCxrStHC
=VAVB
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"The usual collection of bug fixes and optimizations. Perhaps of
greatest note is a speed up for parallel, non-allocating DIO writes,
since we no longer take the i_mutex lock in that case.
For bug fixes, we fix an incorrect overhead calculation which caused
slightly incorrect results for df(1) and statfs(2). We also fixed
bugs in the metadata checksum feature."
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
ext4: undo ext4_calc_metadata_amount if we fail to claim space
ext4: don't let i_reserved_meta_blocks go negative
ext4: fix hole punch failure when depth is greater than 0
ext4: remove unnecessary argument from __ext4_handle_dirty_metadata()
ext4: weed out ext4_write_super
ext4: remove unnecessary superblock dirtying
ext4: convert last user of ext4_mark_super_dirty() to ext4_handle_dirty_super()
ext4: remove useless marking of superblock dirty
ext4: fix ext4 mismerge back in January
ext4: remove dynamic array size in ext4_chksum()
ext4: remove unused variable in ext4_update_super()
ext4: make quota as first class supported feature
ext4: don't take the i_mutex lock when doing DIO overwrites
ext4: add a new nolock flag in ext4_map_blocks
ext4: split ext4_file_write into buffered IO and direct IO
ext4: remove an unused statement in ext4_mb_get_buddy_page_lock()
ext4: fix out-of-date comments in extents.c
ext4: use s_csum_seed instead of i_csum_seed for xattr block
ext4: use proper csum calculation in ext4_rename
ext4: fix overhead calculation used by ext4_statfs()
...
The last user of ext4_mark_super_dirty() in ext4_file_open() is so
rare it can well be modifying the superblock properly by journalling
the change. Change it and get rid of ext4_mark_super_dirty() as it's
not needed anymore.
Artem: small amendments.
Artem: tested using xfstests for both journalled and non-journalled ext4.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Tested-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Aligned and overwrite direct I/O can be parallelized. In
ext4_file_dio_write, we first check whether these conditions are
satisfied or not. If so, we take i_data_sem and release i_mutex lock
directly. Meanwhile iocb->private is set to indicate that this is a
dio overwrite, and it will be handled in ext4_ext_direct_IO.
[ Added fix from Dan Carpenter to fix locking bug on the error path. ]
CC: Tao Ma <tm@tao.ma>
CC: Eric Sandeen <sandeen@redhat.com>
CC: Robin Dong <hao.bigrat@gmail.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Use the new functionality in generic_file_llseek_size() to
accept a custom EOF position, and un-cut-and-paste all the
vfs llseek code from ext4.
Also fix up comments on ext4_llseek() to reflect reality.
Signed-off-by: Eric Sandeen <sandeen@redaht.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For ext3/4 htree directories, using the vfs llseek function with
SEEK_END goes to i_size like for any other file, but in reality
we want the maximum possible hash value. Recent changes
in ext4 have cut & pasted generic_file_llseek() back into fs/ext4/dir.c,
but replicating this core code seems like a bad idea, especially
since the copy has already diverged from the vfs.
This patch updates generic_file_llseek_size to accept
both a custom maximum offset, and a custom EOF position. With this
in place, ext4_dir_llseek can pass in the appropriate maximum hash
position for both maxsize and eof, and get what it wants.
As far as I know, this does not fix any bugs - nfs in the kernel
doesn't use SEEK_END, and I don't know of any user who does. But
some ext4 folks seem keen on doing the right thing here, and I can't
really argue.
(Patch also fixes up some comments slightly)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ext4_file_dio_write is defined in order to split buffered IO and
direct IO in ext4. This patch just refactor some stuff in write path.
CC: Tao Ma <tm@tao.ma>
CC: Eric Sandeen <sandeen@redhat.com>
CC: Robin Dong <hao.bigrat@gmail.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The generic_file_aio_write() function returns ssize_t, and
ext4_file_write() returns a ssize_t, so use a ssize_t to collect the
return value from generic_file_aio_write(). It shouldn't matter since
the VFS read/write paths shouldn't allow a read greater than MAX_INT,
but there was previously a bug in the AIO code paths, and it's best if
we use a consistent type so that the return value from
generic_file_aio_write() can't get truncated.
Reported-by: Jouni Siren <jouni.siren@iki.fi>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (97 commits)
jbd2: Unify log messages in jbd2 code
jbd/jbd2: validate sb->s_first in journal_get_superblock()
ext4: let ext4_ext_rm_leaf work with EXT_DEBUG defined
ext4: fix a syntax error in ext4_ext_insert_extent when debugging enabled
ext4: fix a typo in struct ext4_allocation_context
ext4: Don't normalize an falloc request if it can fit in 1 extent.
ext4: remove comments about extent mount option in ext4_new_inode()
ext4: let ext4_discard_partial_buffers handle unaligned range correctly
ext4: return ENOMEM if find_or_create_pages fails
ext4: move vars to local scope in ext4_discard_partial_page_buffers_no_lock()
ext4: Create helper function for EXT4_IO_END_UNWRITTEN and i_aiodio_unwritten
ext4: optimize locking for end_io extent conversion
ext4: remove unnecessary call to waitqueue_active()
ext4: Use correct locking for ext4_end_io_nolock()
ext4: fix race in xattr block allocation path
ext4: trace punch_hole correctly in ext4_ext_map_blocks
ext4: clean up AGGRESSIVE_TEST code
ext4: move variables to their scope
ext4: fix quota accounting during migration
ext4: migrate cleanup
...
This gives ext4 the benefits of unlocked llseek.
Cc: tytso@mit.edu
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
In ext4_file_open, the filesystem records the mountpoint of the first
file that is opened after mounting the filesystem. It does this by
allocating a 64-byte stack buffer, calling d_path() to grab the mount
point through which this file was accessed, and then memcpy()ing 64
bytes into the superblock's s_last_mounted field, starting from the
return value of d_path(), which is stored as "cp". However, if cp >
buf (which it frequently is since path components are prepended
starting at the end of buf) then we can end up copying stack data into
the superblock.
Writing stack variables into the superblock doesn't sound like a great
idea, so use strlcpy instead. Andi Kleen suggested using strlcpy
instead of strncpy.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Since Ext4 has its own lseek we need to make sure it handles
SEEK_HOLE/SEEK_DATA. For now just do the same thing that is done in the generic
case, somebody else can come along and make it do fancy things later. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Trivial conversion. Fixup one error handling case calling vmtruncate()
and remove ->truncate callback. We also fix a bug that IS_IMMUTABLE and
IS_APPEND files could not be truncated during failed writes. In fact, the
test can be completely removed as upper layers do necessary permission
checks for truncate in do_sys_[f]truncate() and may_open() anyway.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
ext4 has a data corruption case when doing non-block-aligned
asynchronous direct IO into a sparse file, as demonstrated
by xfstest 240.
The root cause is that while ext4 preallocates space in the
hole, mappings of that space still look "new" and
dio_zero_block() will zero out the unwritten portions. When
more than one AIO thread is going, they both find this "new"
block and race to zero out their portion; this is uncoordinated
and causes data corruption.
Dave Chinner fixed this for xfs by simply serializing all
unaligned asynchronous direct IO. I've done the same here.
The difference is that we only wait on conversions, not all IO.
This is a very big hammer, and I'm not very pleased with
stuffing this into ext4_file_write(). But since ext4 is
DIO_LOCKING, we need to serialize it at this high level.
I tried to move this into ext4_ext_direct_IO, but by then
we have the i_mutex already, and we will wait on the
work queue to do conversions - which must also take the
i_mutex. So that won't work.
This was originally exposed by qemu-kvm installing to
a raw disk image with a normal sector-63 alignment. I've
tested a backport of this patch with qemu, and it does
avoid the corruption. It is also quite a lot slower
(14 min for package installs, vs. 8 min for well-aligned)
but I'll take slow correctness over fast corruption any day.
Mingming suggested that we can track outstanding
conversions, and wait on those so that non-sparse
files won't be affected, and I've implemented that here;
unaligned AIO to nonsparse files won't take a perf hit.
[tytso@mit.edu: Keep the mutex as a hashed array instead
of bloating the ext4 inode]
[tytso@mit.edu: Fix up namespace issues so that global
variables are protected with an "ext4_" prefix.]
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Currently all filesystems except XFS implement fallocate asynchronously,
while XFS forced a commit. Both of these are suboptimal - in case of O_SYNC
I/O we really want our allocation on disk, especially for the !KEEP_SIZE
case where we actually grow the file with user-visible zeroes. On the
other hand always commiting the transaction is a bad idea for fast-path
uses of fallocate like for example in recent Samba versions. Given
that block allocation is a data plane operation anyway change it from
an inode operation to a file operation so that we have the file structure
available that lets us check for O_SYNC.
This also includes moving the code around for a few of the filesystems,
and remove the already unnedded S_ISDIR checks given that we only wire
up fallocate for regular files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Replace the jbd2_inode structure (which is 48 bytes) with a pointer
and only allocate the jbd2_inode when it is needed --- that is, when
the file system has a journal present and the inode has been opened
for writing. This allows us to further slim down the ext4_inode_info
structure.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
The llseek system call should return EINVAL if passed a seek offset
which results in a write error. What this maximum offset should be
depends on whether or not the huge_file file system feature is set,
and whether or not the file is extent based or not.
If the file has no "EXT4_EXTENTS_FL" flag, the maximum size which can be
written (write systemcall) is different from the maximum size which can be
sought (lseek systemcall).
For example, the following 2 cases demonstrates the differences
between the maximum size which can be written, versus the seek offset
allowed by the llseek system call:
#1: mkfs.ext3 <dev>; mount -t ext4 <dev>
#2: mkfs.ext3 <dev>; tune2fs -Oextent,huge_file <dev>; mount -t ext4 <dev>
Table. the max file size which we can write or seek
at each filesystem feature tuning and file flag setting
+============+===============================+===============================+
| \ File flag| | |
| \ | !EXT4_EXTENTS_FL | EXT4_EXTETNS_FL |
|case \| | |
+------------+-------------------------------+-------------------------------+
| #1 | write: 2194719883264 | write: -------------- |
| | seek: 2199023251456 | seek: -------------- |
+------------+-------------------------------+-------------------------------+
| #2 | write: 4402345721856 | write: 17592186044415 |
| | seek: 17592186044415 | seek: 17592186044415 |
+------------+-------------------------------+-------------------------------+
The differences exist because ext4 has 2 maxbytes which are sb->s_maxbytes
(= extent-mapped maxbytes) and EXT4_SB(sb)->s_bitmap_maxbytes (= block-mapped
maxbytes). Although generic_file_llseek uses only extent-mapped maxbytes.
(llseek of ext4_file_operations is generic_file_llseek which uses
sb->s_maxbytes.)
Therefore we create ext4 llseek function which uses 2 maxbytes.
The new own function originates from generic_file_llseek().
If the file flag, "EXT4_EXTENTS_FL" is not set, the function alters
inode->i_sb->s_maxbytes into EXT4_SB(inode->i_sb)->s_bitmap_maxbytes.
Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
We don't need to set s_dirt in most of the ext4 code when journaling
is enabled. In ext3/4 some of the summary statistics for # of free
inodes, blocks, and directories are calculated from the per-block
group statistics when the file system is mounted or unmounted. As a
result the superblock doesn't have to be updated, either via the
journal or by setting s_dirt. There are a few exceptions, most
notably when resizing the file system, where the superblock needs to
be modified --- and in that case it should be done as a journalled
operation if possible, and s_dirt set only in no-journal mode.
This patch will optimize out some unneeded disk writes when using ext4
with a journal.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
At several places we modify EXT4_I(inode)->i_flags without holding
i_mutex (ext4_do_update_inode, ...). These modifications are racy and
we can lose updates to i_flags. So convert handling of i_flags to use
bitops which are atomic.
https://bugzilla.kernel.org/show_bug.cgi?id=15792
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
quota: stop using QUOTA_OK / NO_QUOTA
dquot: cleanup dquot initialize routine
dquot: move dquot initialization responsibility into the filesystem
dquot: cleanup dquot drop routine
dquot: move dquot drop responsibility into the filesystem
dquot: cleanup dquot transfer routine
dquot: move dquot transfer responsibility into the filesystem
dquot: cleanup inode allocation / freeing routines
dquot: cleanup space allocation / freeing routines
ext3: add writepage sanity checks
ext3: Truncate allocated blocks if direct IO write fails to update i_size
quota: Properly invalidate caches even for filesystems with blocksize < pagesize
quota: generalize quota transfer interface
quota: sb_quota state flags cleanup
jbd: Delay discarding buffers in journal_unmap_buffer
ext3: quota_write cross block boundary behaviour
quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
quota: split out compat_sys_quotactl support from quota.c
quota: split out netlink notification support from quota.c
quota: remove invalid optimization from quota_sync_all
...
Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (36 commits)
ext4: fix up rb_root initializations to use RB_ROOT
ext4: Code cleanup for EXT4_IOC_MOVE_EXT ioctl
ext4: Fix the NULL reference in double_down_write_data_sem()
ext4: Fix insertion point of extent in mext_insert_across_blocks()
ext4: consolidate in_range() definitions
ext4: cleanup to use ext4_grp_offs_to_block()
ext4: cleanup to use ext4_group_first_block_no()
ext4: Release page references acquired in ext4_da_block_invalidatepages
ext4: Fix ext4_quota_write cross block boundary behaviour
ext4: Convert BUG_ON checks to use ext4_error() instead
ext4: Use direct_IO_no_locking in ext4 dio read
ext4: use ext4_get_block_write in buffer write
ext4: mechanical rename some of the direct I/O get_block's identifiers
ext4: make "offset" consistent in ext4_check_dir_entry()
ext4: Handle non empty on-disk orphan link
ext4: explicitly remove inode from orphan list after failed direct io
ext4: fix error handling in migrate
ext4: deprecate obsoleted mount options
ext4: Fix fencepost error in chosing choosing group vs file preallocation.
jbd2: clean up an assertion in jbd2_journal_commit_transaction()
...
Get rid of the initialize dquot operation - it is now always called from
the filesystem and if a filesystem really needs it's own (which none
currently does) it can just call into it's own routine directly.
Rename the now static low-level dquot_initialize helper to __dquot_initialize
and vfs_dq_init to dquot_initialize to have a consistent namespace.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Currently various places in the VFS call vfs_dq_init directly. This means
we tie the quota code into the VFS. Get rid of that and make the
filesystem responsible for the initialization. For most metadata operations
this is a straight forward move into the methods, but for truncate and
open it's a bit more complicated.
For truncate we currently only call vfs_dq_init for the sys_truncate case
because open already takes care of it for ftruncate and open(O_TRUNC) - the
new code causes an additional vfs_dq_init for those which is harmless.
For open the initialization is moved from do_filp_open into the open method,
which means it happens slightly earlier now, and only for regular files.
The latter is fine because we don't need to initialize it for operations
on special files, and we already do it as part of the namespace operations
for directories.
Add a dquot_file_open helper that filesystems that support generic quotas
can use to fill in ->open.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
path to mnt/mnt->mnt_root is no worse than that to
mnt->mnt_parent/mnt->mnt_mountpoint *and* needs no
pinning the sucker down (mnt is not going away and
mnt->mnt_root won't change)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
At several places we modify EXT4_I(inode)->i_state without holding
i_mutex (ext4_release_file, ext4_bmap, ext4_journalled_writepage,
ext4_do_update_inode, ...). These modifications are racy and we can
lose updates to i_state. So convert handling of i_state to use bitops
which are atomic.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* mark struct vm_area_struct::vm_ops as const
* mark vm_ops in AGP code
But leave TTM code alone, something is fishy there with global vm_ops
being used.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The syncing is now properly handled by generic_file_aio_write() so
no special ext4 code is needed.
CC: linux-ext4@vger.kernel.org
CC: tytso@mit.edu
Signed-off-by: Jan Kara <jack@suse.cz>