Commit Graph

297 Commits

Author SHA1 Message Date
Darrick J. Wong
29ac8e856c ocfs2: implement the VFS clone_range, copy_range, and dedupe_range features
Connect the new VFS clone_range, copy_range, and dedupe_range features
to the existing reflink capability of ocfs2.  Compared to the existing
ocfs2 reflink ioctl We have to do things a little differently to support
the VFS semantics (we can clone subranges of a file but we don't clone
xattrs), but the VFS ioctls are more broadly supported.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
v2: Convert inline data files to extents files before reflinking,
and fix i_blocks so that stat(2) output is correct.
v3: Make zero-length dedupe consistent with btrfs behavior.
v4: Use VFS double-inode lock routines and remove MAX_DEDUPE_LEN.
2016-12-10 12:39:45 -08:00
Darrick J. Wong
84e40080bd ocfs2: convert inode refcount test to a helper
Replace the open-coded inode refcount flag test with a helper function
to reduce the potential for bugs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-10 12:39:45 -08:00
Linus Torvalds
101105b171 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
 ">rename2() work from Miklos + current_time() from Deepa"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Replace current_fs_time() with current_time()
  fs: Replace CURRENT_TIME_SEC with current_time() for inode timestamps
  fs: Replace CURRENT_TIME with current_time() for inode timestamps
  fs: proc: Delete inode time initializations in proc_alloc_inode()
  vfs: Add current_time() api
  vfs: add note about i_op->rename changes to porting
  fs: rename "rename2" i_op to "rename"
  vfs: remove unused i_op->rename
  fs: make remaining filesystems use .rename2
  libfs: support RENAME_NOREPLACE in simple_rename()
  fs: support RENAME_NOREPLACE for local filesystems
  ncpfs: fix unused variable warning
2016-10-10 20:16:43 -07:00
Al Viro
3873691e5a Merge remote-tracking branch 'ovl/rename2' into for-linus 2016-10-10 23:02:51 -04:00
Linus Torvalds
97d2116708 Merge branch 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs xattr updates from Al Viro:
 "xattr stuff from Andreas

  This completes the switch to xattr_handler ->get()/->set() from
  ->getxattr/->setxattr/->removexattr"

* 'work.xattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  vfs: Remove {get,set,remove}xattr inode operations
  xattr: Stop calling {get,set,remove}xattr inode operations
  vfs: Check for the IOP_XATTR flag in listxattr
  xattr: Add __vfs_{get,set,remove}xattr helpers
  libfs: Use IOP_XATTR flag for empty directory handling
  vfs: Use IOP_XATTR flag for bad-inode handling
  vfs: Add IOP_XATTR inode operations flag
  vfs: Move xattr_resolve_name to the front of fs/xattr.c
  ecryptfs: Switch to generic xattr handlers
  sockfs: Get rid of getxattr iop
  sockfs: getxattr: Fail with -EOPNOTSUPP for invalid attribute names
  kernfs: Switch to generic xattr handlers
  hfs: Switch to generic xattr handlers
  jffs2: Remove jffs2_{get,set,remove}xattr macros
  xattr: Remove unnecessary NULL attribute name check
2016-10-10 17:11:50 -07:00
Linus Torvalds
abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Al Viro
e55f1d1d13 Merge remote-tracking branch 'jk/vfs' into work.misc 2016-10-08 11:06:08 -04:00
Andreas Gruenbacher
fd50ecaddf vfs: Remove {get,set,remove}xattr inode operations
These inode operations are no longer used; remove them.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-07 21:48:36 -04:00
Al Viro
82c156f853 switch generic_file_splice_read() to use of ->read_iter()
... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:56 -04:00
Deepa Dinamani
078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Jan Kara
31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Ashish Samant
d21c353d5e ocfs2: fix start offset to ocfs2_zero_range_for_truncate()
If we punch a hole on a reflink such that following conditions are met:

1. start offset is on a cluster boundary
2. end offset is not on a cluster boundary
3. (end offset is somewhere in another extent) or
   (hole range > MAX_CONTIG_BYTES(1MB)),

we dont COW the first cluster starting at the start offset.  But in this
case, we were wrongly passing this cluster to
ocfs2_zero_range_for_truncate() to zero out.  This will modify the
cluster in place and zero it in the source too.

Fix this by skipping this cluster in such a scenario.

To reproduce:

1. Create a random file of say 10 MB
     xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
2. Reflink  it
     reflink -f 10MBfile reflnktest
3. Punch a hole at starting at cluster boundary  with range greater that
1MB. You can also use a range that will put the end offset in another
extent.
     fallocate -p -o 0 -l 1048615 reflnktest
4. sync
5. Check the  first cluster in the source file. (It will be zeroed out).
    dd if=10MBfile iflag=direct bs=<cluster size> count=1 | hexdump -C

Link: http://lkml.kernel.org/r/1470957147-14185-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reported-by: Saar Maoz <saar.maoz@oracle.com>
Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Eric Ren <zren@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Linus Torvalds
7f427d3a60 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel filesystem directory handling update from Al Viro.

This is the main parallel directory work by Al that makes the vfs layer
able to do lookup and readdir in parallel within a single directory.
That's a big change, since this used to be all protected by the
directory inode mutex.

The inode mutex is replaced by an rwsem, and serialization of lookups of
a single name is done by a "in-progress" dentry marker.

The series begins with xattr cleanups, and then ends with switching
filesystems over to actually doing the readdir in parallel (switching to
the "iterate_shared()" that only takes the read lock).

A more detailed explanation of the process from Al Viro:
 "The xattr work starts with some acl fixes, then switches ->getxattr to
  passing inode and dentry separately.  This is the point where the
  things start to get tricky - that got merged into the very beginning
  of the -rc3-based #work.lookups, to allow untangling the
  security_d_instantiate() mess.  The xattr work itself proceeds to
  switch a lot of filesystems to generic_...xattr(); no complications
  there.

  After that initial xattr work, the series then does the following:

   - untangle security_d_instantiate()

   - convert a bunch of open-coded lookup_one_len_unlocked() to calls of
     that thing; one such place (in overlayfs) actually yields a trivial
     conflict with overlayfs fixes later in the cycle - overlayfs ended
     up switching to a variant of lookup_one_len_unlocked() sans the
     permission checks.  I would've dropped that commit (it gets
     overridden on merge from #ovl-fixes in #for-next; proper resolution
     is to use the variant in mainline fs/overlayfs/super.c), but I
     didn't want to rebase the damn thing - it was fairly late in the
     cycle...

   - some filesystems had managed to depend on lookup/lookup exclusion
     for *fs-internal* data structures in a way that would break if we
     relaxed the VFS exclusion.  Fixing hadn't been hard, fortunately.

   - core of that series - parallel lookup machinery, replacing
     ->i_mutex with rwsem, making lookup_slow() take it only shared.  At
     that point lookups happen in parallel; lookups on the same name
     wait for the in-progress one to be done with that dentry.

     Surprisingly little code, at that - almost all of it is in
     fs/dcache.c, with fs/namei.c changes limited to lookup_slow() -
     making it use the new primitive and actually switching to locking
     shared.

   - parallel readdir stuff - first of all, we provide the exclusion on
     per-struct file basis, same as we do for read() vs lseek() for
     regular files.  That takes care of most of the needed exclusion in
     readdir/readdir; however, these guys are trickier than lookups, so
     I went for switching them one-by-one.  To do that, a new method
     '->iterate_shared()' is added and filesystems are switched to it
     as they are either confirmed to be OK with shared lock on directory
     or fixed to be OK with that.  I hope to kill the original method
     come next cycle (almost all in-tree filesystems are switched
     already), but it's still not quite finished.

   - several filesystems get switched to parallel readdir.  The
     interesting part here is dealing with dcache preseeding by readdir;
     that needs minor adjustment to be safe with directory locked only
     shared.

     Most of the filesystems doing that got switched to in those
     commits.  Important exception: NFS.  Turns out that NFS folks, with
     their, er, insistence on VFS getting the fuck out of the way of the
     Smart Filesystem Code That Knows How And What To Lock(tm) have
     grown the locking of their own.  They had their own homegrown
     rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink
     is the reader there).  Of course, with VFS getting the fuck out of
     the way, as requested, the actual smarts of the smart filesystem
     code etc. had become exposed...

   - do_last/lookup_open/atomic_open cleanups.  As the result, open()
     without O_CREAT locks the directory only shared.  Including the
     ->atomic_open() case.  Backmerge from #for-linus in the middle of
     that - atomic_open() fix got brought in.

   - then comes NFS switch to saner (VFS-based ;-) locking, killing the
     homegrown "lookup and readdir are writers" kinda-sorta rwsem.  All
     exclusion for sillyunlink/lookup is done by the parallel lookups
     mechanism.  Exclusion between sillyunlink and rmdir is a real rwsem
     now - rmdir being the writer.

     Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel
     now.

   - the rest of the series consists of switching a lot of filesystems
     to parallel readdir; in a lot of cases ->llseek() gets simplified
     as well.  One backmerge in there (again, #for-linus - rockridge
     fix)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits)
  ext4: switch to ->iterate_shared()
  hfs: switch to ->iterate_shared()
  hfsplus: switch to ->iterate_shared()
  hostfs: switch to ->iterate_shared()
  hpfs: switch to ->iterate_shared()
  hpfs: handle allocation failures in hpfs_add_pos()
  gfs2: switch to ->iterate_shared()
  f2fs: switch to ->iterate_shared()
  afs: switch to ->iterate_shared()
  befs: switch to ->iterate_shared()
  befs: constify stuff a bit
  isofs: switch to ->iterate_shared()
  get_acorn_filename(): deobfuscate a bit
  btrfs: switch to ->iterate_shared()
  logfs: no need to lock directory in lseek
  switch ecryptfs to ->iterate_shared
  9p: switch to ->iterate_shared()
  fat: switch to ->iterate_shared()
  romfs, squashfs: switch to ->iterate_shared()
  more trivial ->iterate_shared conversions
  ...
2016-05-17 11:01:31 -07:00
Junxiao Bi
5ee0fbd50f ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang
Commit 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
introduced this issue.  ocfs2_setattr called by chmod command holds
cluster wide inode lock when calling posix_acl_chmod.  This latter
function in turn calls ocfs2_iop_get_acl and ocfs2_iop_set_acl.  These
two are also called directly from vfs layer for getfacl/setfacl commands
and therefore acquire the cluster wide inode lock.  If a remote
conversion request comes after the first inode lock in ocfs2_setattr,
OCFS2_LOCK_BLOCKED will be set.  And this will cause the second call to
inode lock from the ocfs2_iop_get_acl() to block indefinetly.

The deleted version of ocfs2_acl_chmod() calls __posix_acl_chmod() which
does not call back into the filesystem.  Therefore, we restore
ocfs2_acl_chmod(), modify it slightly for locking as needed, and use that
instead.

Fixes: 743b5f1434 ("ocfs2: take inode lock in ocfs2_iop_set/get_acl()")
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-12 15:52:50 -07:00
Al Viro
84695ffee7 Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch
2016-05-02 19:45:47 -04:00
Al Viro
fc64005c93 don't bother with ->d_inode->i_sb - it's always equal to ->d_sb
... and neither can ever be NULL

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-04-10 17:11:51 -04:00
Kirill A. Shutemov
09cbfeaf1a mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
ago with promise that one day it will be possible to implement page
cache with bigger chunks than PAGE_SIZE.

This promise never materialized.  And unlikely will.

We have many places where PAGE_CACHE_SIZE assumed to be equal to
PAGE_SIZE.  And it's constant source of confusion on whether
PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
especially on the border between fs and mm.

Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
breakage to be doable.

Let's stop pretending that pages in page cache are special.  They are
not.

The changes are pretty straight-forward:

 - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;

 - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};

 - page_cache_get() -> get_page();

 - page_cache_release() -> put_page();

This patch contains automated changes generated with coccinelle using
script below.  For some reason, coccinelle doesn't patch header files.
I've called spatch for them manually.

The only adjustment after coccinelle is revert of changes to
PAGE_CAHCE_ALIGN definition: we are going to drop it later.

There are few places in the code where coccinelle didn't reach.  I'll
fix them manually in a separate patch.  Comments and documentation also
will be addressed with the separate patch.

virtual patch

@@
expression E;
@@
- E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
expression E;
@@
- E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
+ E

@@
@@
- PAGE_CACHE_SHIFT
+ PAGE_SHIFT

@@
@@
- PAGE_CACHE_SIZE
+ PAGE_SIZE

@@
@@
- PAGE_CACHE_MASK
+ PAGE_MASK

@@
expression E;
@@
- PAGE_CACHE_ALIGN(E)
+ PAGE_ALIGN(E)

@@
expression E;
@@
- page_cache_get(E)
+ get_page(E)

@@
expression E;
@@
- page_cache_release(E)
+ put_page(E)

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-04-04 10:41:08 -07:00
Ryan Ding
e63890f38a ocfs2: fix ip_unaligned_aio deadlock with dio work queue
In the current implementation of unaligned aio+dio, lock order behave as
follow:

in user process context:
  -> call io_submit()
    -> get i_mutex
		<== window1
      -> get ip_unaligned_aio
        -> submit direct io to block device
    -> release i_mutex
  -> io_submit() return

in dio work queue context(the work queue is created in __blockdev_direct_IO):
  -> release ip_unaligned_aio
		<== window2
    -> get i_mutex
      -> clear unwritten flag & change i_size
    -> release i_mutex

There is a limitation to the thread number of dio work queue.  256 at
default.  If all 256 thread are in the above 'window2' stage, and there
is a user process in the 'window1' stage, the system will became
deadlock.  Since the user process hold i_mutex to wait ip_unaligned_aio
lock, while there is a direct bio hold ip_unaligned_aio mutex who is
waiting for a dio work queue thread to be schedule.  But all the dio
work queue thread is waiting for i_mutex lock in 'window2'.

This case only happened in a test which send a large number(more than
256) of aio at one io_submit() call.

My design is to remove ip_unaligned_aio lock.  Change it to a sync io
instead.  Just like ip_unaligned_aio lock, serialize the unaligned aio
dio.

[akpm@linux-foundation.org: remove OCFS2_IOCB_UNALIGNED_IO, per Junxiao Bi]
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Ryan Ding
f1f973ffce ocfs2: code clean up for direct io
Clean up ocfs2_file_write_iter & ocfs2_prepare_inode_for_write:
 * remove append dio check: it will be checked in ocfs2_direct_IO()
 * remove file hole check: file hole is supported for now
 * remove inline data check: it will be checked in ocfs2_direct_IO()
 * remove the full_coherence check when append dio: we will get the
   inode_lock in ocfs2_dio_get_block, there is no need to fall back to
   buffer io to ensure the coherence semantics.

Now the drop dio procedure is gone.  :)

[akpm@linux-foundation.org: remove unused label]
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-25 16:37:42 -07:00
Al Viro
5955102c99 wrappers for ->i_mutex access
parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-01-22 18:04:28 -05:00
John Haxby
d6364627ef ocfs2: return non-zero st_blocks for inline data
Some versions of tar assume that files with st_blocks == 0 do not
contain any data and will skip reading them entirely.  See also commit
9206c56155 ("ext4: return non-zero st_blocks for inline data").

Signed-off-by: John Haxby <john.haxby@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Acked-by: Gang He <ghe@suse.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-14 16:00:49 -08:00
Tariq Saeed
3d46a44a0c ocfs2: fix BUG_ON() in ocfs2_ci_checkpointed()
PID: 614    TASK: ffff882a739da580  CPU: 3   COMMAND: "ocfs2dc"
  #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
  #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
  #2 [ffff882ecc375af0] oops_end at ffffffff815091d8
  #3 [ffff882ecc375b20] die at ffffffff8101868b
  #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
  #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
  #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
     [exception RIP: ocfs2_ci_checkpointed+208]
     RIP: ffffffffa0a7e940  RSP: ffff882ecc375cf0  RFLAGS: 00010002
     RAX: 0000000000000001  RBX: 000000000000654b  RCX: ffff8812dc83f1f8
     RDX: 00000000000017d9  RSI: ffff8812dc83f1f8  RDI: ffffffffa0b2c318
     RBP: ffff882ecc375d20   R8: ffff882ef6ecfa60   R9: ffff88301f272200
     R10: 0000000000000000  R11: 0000000000000000  R12: ffffffffffffffff
     R13: ffff8812dc83f4f0  R14: 0000000000000000  R15: ffff8812dc83f1f8
     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
  #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
  #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
  #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]
#10 [ffff882ecc375e18] ocfs2_downconvert_thread_do_work at ffffffffa0a85445 [ocfs2]
#11 [ffff882ecc375e68] ocfs2_downconvert_thread at ffffffffa0a854de [ocfs2]
#12 [ffff882ecc375ee8] kthread at ffffffff81090da7
#13 [ffff882ecc375f48] kernel_thread_helper at ffffffff81511884
assert is tripped because the tran is not checkpointed and the lock level is PR.

Some time ago, chmod command had been executed. As result, the following call
chain left the inode cluster lock in PR state, latter on causing the assert.
system_call_fastpath
  -> my_chmod
   -> sys_chmod
    -> sys_fchmodat
     -> notify_change
      -> ocfs2_setattr
       -> posix_acl_chmod
        -> ocfs2_iop_set_acl
         -> ocfs2_set_acl
          -> ocfs2_acl_set_mode
Here is how.
1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
1120 {
1247         ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
..
1258         if (!status && attr->ia_valid & ATTR_MODE) {
1259                 status =  posix_acl_chmod(inode, inode->i_mode);

519 posix_acl_chmod(struct inode *inode, umode_t mode)
520 {
..
539         ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);

287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
288 {
289         return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);

224 int ocfs2_set_acl(handle_t *handle,
225                          struct inode *inode, ...
231 {
..
252                                 ret = ocfs2_acl_set_mode(inode, di_bh,
253                                                          handle, mode);

168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
170 {
183         if (handle == NULL) {
                    >>> BUG: inode lock not held in ex at this point <<<
184                 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
185                                            OCFS2_INODE_UPDATE_CREDITS);

ocfs2_setattr.#1247 we unlock and at #1259 call posix_acl_chmod. When we reach
ocfs2_acl_set_mode.#181 and do trans, the inode cluster lock is not held in EX
mode (it should be). How this could have happended?

We are the lock master, were holding lock EX and have released it in
ocfs2_setattr.#1247.  Note that there are no holders of this lock at
this point.  Another node needs the lock in PR, and we downconvert from
EX to PR.  So the inode lock is PR when do the trans in
ocfs2_acl_set_mode.#184.  The trans stays in core (not flushed to disc).
Now another node want the lock in EX, downconvert thread gets kicked
(the one that tripped assert abovt), finds an unflushed trans but the
lock is not EX (it is PR).  If the lock was at EX, it would have flushed
the trans ocfs2_ci_checkpointed -> ocfs2_start_checkpoint before
downconverting (to NULL) for the request.

ocfs2_setattr must not drop inode lock ex in this code path.  If it
does, takes it again before the trans, say in ocfs2_set_acl, another
cluster node can get in between, execute another setattr, overwriting
the one in progress on this node, resulting in a mode acl size combo
that is a mix of the two.

Orabug: 20189959
Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
bf59e6623a ocfs2: clean up unused local variables in ocfs2_file_write_iter
Since commit 86b9c6f3f8 ("ocfs2: remove filesize checks for sync I/O
journal commit") removes filesize checks for sync I/O journal commit,
variables old_size and old_clusters are not actually used any more.  So
clean them up.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
faaebf18f8 ocfs2: fix several issues of append dio
1) Take rw EX lock in case of append dio.
2) Explicitly treat the error code -EIOCBQUEUED as normal.
3) Set di_bh to NULL after brelse if it may be used again later.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Joseph Qi
512f62acbd ocfs2: fix race between dio and recover orphan
During direct io the inode will be added to orphan first and then
deleted from orphan.  There is a race window that the orphan entry will
be deleted twice and thus trigger the BUG when validating
OCFS2_DIO_ORPHANED_FL in ocfs2_del_inode_from_orphan.

ocfs2_direct_IO_write
    ...
    ocfs2_add_inode_to_orphan
    >>>>>>>> race window.
             1) another node may rm the file and then down, this node
             take care of orphan recovery and clear flag
             OCFS2_DIO_ORPHANED_FL.
             2) since rw lock is unlocked, it may race with another
             orphan recovery and append dio.
    ocfs2_del_inode_from_orphan

So take inode mutex lock when recovering orphans and make rw unlock at the
end of aio write in case of append dio.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Reported-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Ryan Ding
aa1057b3de ocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dio
ocfs2_file_write_iter() is usng the wrong return value ('written').  This
will cause ocfs2_rw_unlock() be called both in write_iter & end_io,
triggering a BUG_ON.

This issue was introduced by commit 7da839c475 ("ocfs2: use
__generic_file_write_iter()").

Orabug: 21612107
Fixes: 7da839c475 ("ocfs2: use __generic_file_write_iter()")
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-04 16:54:41 -07:00
Jan Kara
9c89fe0af8 ocfs2: Handle error from dquot_initialize()
dquot_initialize() can now return error. Handle it where possible.

Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Jan Kara <jack@suse.com>
2015-07-23 20:59:38 +02:00
Jan Kara
6184fc0b8d quota: Propagate error from ->acquire_dquot()
Currently when some error happened in ->acquire_dquot(), dqget() just
returned NULL. That was indistinguishable from a case when e.g. someone
run quotaoff and so was generally silently ignored. However
->acquire_dquot() can fail because of ENOSPC or EIO in which case user
should better know. So propagate error up from ->acquire_dquot properly.

Signed-off-by: Jan Kara <jack@suse.cz>
2015-07-23 20:59:10 +02:00
Linus Torvalds
e4bc13adfd Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-block
Pull cgroup writeback support from Jens Axboe:
 "This is the big pull request for adding cgroup writeback support.

  This code has been in development for a long time, and it has been
  simmering in for-next for a good chunk of this cycle too.  This is one
  of those problems that has been talked about for at least half a
  decade, finally there's a solution and code to go with it.

  Also see last weeks writeup on LWN:

        http://lwn.net/Articles/648292/"

* 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
  writeback, blkio: add documentation for cgroup writeback support
  vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
  writeback: do foreign inode detection iff cgroup writeback is enabled
  v9fs: fix error handling in v9fs_session_init()
  bdi: fix wrong error return value in cgwb_create()
  buffer: remove unusued 'ret' variable
  writeback: disassociate inodes from dying bdi_writebacks
  writeback: implement foreign cgroup inode bdi_writeback switching
  writeback: add lockdep annotation to inode_to_wb()
  writeback: use unlocked_inode_to_wb transaction in inode_congested()
  writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
  writeback: implement [locked_]inode_to_wb_and_lock_list()
  writeback: implement foreign cgroup inode detection
  writeback: make writeback_control track the inode being written back
  writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
  mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
  writeback: implement memcg writeback domain based throttling
  writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
  writeback: implement memcg wb_domain
  writeback: update wb_over_bg_thresh() to use wb_domain aware operations
  ...
2015-06-25 16:00:17 -07:00
WeiWei Wang
fa5a0eb3b0 ocfs2: remove OCFS2_IOCB_SEM lock type in direct io
In ocfs2 direct read/write, OCFS2_IOCB_SEM lock type is used to protect
inode->i_alloc_sem rw semaphore lock in the earlier kernel version.
However, in the latest kernel, inode->i_alloc_sem rw semaphore lock is not
used at all, so OCFS2_IOCB_SEM lock type needs to be removed.

Signed-off-by: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-24 17:49:39 -07:00
Tejun Heo
66114cad64 writeback: separate out include/linux/backing-dev-defs.h
With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it.  c files
which need access to more backing-dev details now include
backing-dev.h directly.  This takes backing-dev.h off the common
include dependency chain making it a lot easier to use it across block
and cgroup.

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
David Howells
2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00
Al Viro
7da839c475 ocfs2: use __generic_file_write_iter()
we can do that now - all we need is to clear IOCB_DIRECT from ->ki_flags in
"can't do dio" case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:30:22 -04:00
Al Viro
2ba48ce513 mirror O_APPEND and O_DIRECT into iocb->ki_flags
... avoiding write_iter/fcntl races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:30:22 -04:00
Al Viro
3309dd04cb switch generic_write_checks() to iocb and iter
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success.  Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Conflicts:
	fs/ext4/file.c
2015-04-11 22:30:21 -04:00
Al Viro
90320251db ocfs2: move generic_write_checks() before the alignment checks
Alignment checks for dio depend upon the range truncation done by
generic_write_checks().  They can be done as soon as we got ocfs2_rw_lock()
and that actually makes ocfs2_prepare_inode_for_write() simpler.

	The only thing to watch out for is restoring the original count
in "unlock and redo without dio" case.  Position doesn't need to be
restored, since we change it only in O_APPEND case and in that case it
will be reassigned anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:30:21 -04:00
Al Viro
5dc3161cb6 ocfs2_file_write_iter: stop messing with ppos
it's &iocb->ki_pos; no need to obfuscate.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:30:21 -04:00
Al Viro
dfea934575 Merge branch 'for-linus' into for-next 2015-04-11 22:29:51 -04:00
Al Viro
0fa6b005af generic_write_checks(): drop isblk argument
all remaining callers are passing 0; some just obscure that fact.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:48 -04:00
Al Viro
5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro
64b4e2526d ocfs2: _really_ sync the right range
"ocfs2 syncs the wrong range" had been broken; prior to it the
code was doing the wrong thing in case of O_APPEND, all right,
but _after_ it we were syncing the wrong range in 100% cases.
*ppos, aka iocb->ki_pos is incremented prior to that point,
so we are always doing sync on the area _after_ the one we'd
written to.

Spotted by Joseph Qi <joseph.qi@huawei.com> back in January;
unfortunately, I'd missed his mail back then ;-/

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 07:18:48 -04:00
Al Viro
9ce5a232b8 ocfs2_file_write_iter: keep return value and current position update in sync
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-08 16:59:12 -04:00
Al Viro
cf1b5ea1c5 [regression] ocfs2: do *not* increment ->ki_pos twice
generic_file_direct_write() already does that.  Broken by
"ocfs2: do not fallback to buffer I/O write if appending"

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-08 16:58:59 -04:00
Christoph Hellwig
66ee59af63 fs: remove ki_nbytes
There is no need to pass the total request length in the kiocb, as
we already get passed in through the iov_iter argument.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-12 23:50:23 -04:00
Joseph Qi
160cc26663 ocfs2: set append dio as a ro compat feature
Intruduce a bit OCFS2_FEATURE_RO_COMPAT_APPEND_DIO and check it in
write flow. If the bit is not set, fall back to the old way.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
3a83b342c8 ocfs2: complete the rest request through buffer io
Complte the rest request thourgh buffer io after direct write performed.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
d943d59dd3 ocfs2: do not fallback to buffer I/O write if appending
Now we can do direct io and do not fallback to buffered IO any more in
case of append O_DIRECT write.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:05 -08:00
Joseph Qi
026749a86e ocfs2: prepare some interfaces used in append direct io
Currently in case of append O_DIRECT write (block not allocated yet),
ocfs2 will fall back to buffered I/O.  This has some disadvantages.
Firstly, it is not the behavior as expected.  Secondly, it will consume
huge page cache, e.g.  in mass backup scenario.  Thirdly, modern
filesystems such as ext4 support this feature.

In this patch set, the direct I/O write doesn't fallback to buffer I/O
write any more because the allocate blocks are enabled in direct I/O now.

This patch (of 9):

Prepare some interfaces which will be used in append O_DIRECT write.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Weiwei Wang <wangww631@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Xuejiufei <xuejiufei@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: alex chen <alex.chen@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-16 17:56:04 -08:00
Linus Torvalds
6bec003528 Merge branch 'for-3.20/bdi' of git://git.kernel.dk/linux-block
Pull backing device changes from Jens Axboe:
 "This contains a cleanup of how the backing device is handled, in
  preparation for a rework of the life time rules.  In this part, the
  most important change is to split the unrelated nommu mmap flags from
  it, but also removing a backing_dev_info pointer from the
  address_space (and inode), and a cleanup of other various minor bits.

  Christoph did all the work here, I just fixed an oops with pages that
  have a swap backing.  Arnd fixed a missing export, and Oleg killed the
  lustre backing_dev_info from staging.  Last patch was from Al,
  unexporting parts that are now no longer needed outside"

* 'for-3.20/bdi' of git://git.kernel.dk/linux-block:
  Make super_blocks and sb_lock static
  mtd: export new mtd_mmap_capabilities
  fs: make inode_to_bdi() handle NULL inode
  staging/lustre/llite: get rid of backing_dev_info
  fs: remove default_backing_dev_info
  fs: don't reassign dirty inodes to default_backing_dev_info
  nfs: don't call bdi_unregister
  ceph: remove call to bdi_unregister
  fs: remove mapping->backing_dev_info
  fs: export inode_to_bdi and use it in favor of mapping->backing_dev_info
  nilfs2: set up s_bdi like the generic mount_bdev code
  block_dev: get bdev inode bdi directly from the block device
  block_dev: only write bdev inode on close
  fs: introduce f_op->mmap_capabilities for nommu mmap support
  fs: kill BDI_CAP_SWAP_BACKED
  fs: deduplicate noop_backing_dev_info
2015-02-12 13:50:21 -08:00
Junxiao Bi
696cdf730f ocfs2: fix uninitialized variable access
Variable "why" is not yet initialized at line 615, fix it.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-10 14:30:28 -08:00