Commit Graph

42157 Commits

Author SHA1 Message Date
Linus Torvalds
4bb0fb57f3 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs bug fixes from Miklos Szeredi:
 "This contains fixes for bugs that appeared in earlier kernels (all are
  marked for -stable)"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: free lower_mnt array in ovl_put_super
  ovl: free stack of paths in ovl_fill_super
  ovl: fix open in stacked overlay
  ovl: fix dentry reference leak
  ovl: use O_LARGEFILE in ovl_copy_up()
2015-10-31 14:49:19 -07:00
Tejun Heo
b33e18f61b fs/writeback, rcu: Don't use list_entry_rcu() for pointer offsetting in bdi_split_work_to_wbs()
bdi_split_work_to_wbs() uses list_for_each_entry_rcu_continue()
to walk @bdi->wb_list.  To set up the initial iteration
condition, it uses list_entry_rcu() to calculate the entry
pointer corresponding to the list head; however, this isn't an
actual RCU dereference and using list_entry_rcu() for it ended
up breaking a proposed list_entry_rcu() change because it was
feeding an non-lvalue pointer into the macro.

Don't use the RCU variant for simple pointer offsetting.  Use
list_entry() instead.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dipankar Sarma <dipankar@in.ibm.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Patrick Marlier <patrick.marlier@gmail.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pranith kumar <bobby.prani@gmail.com>
Link: http://lkml.kernel.org/r/20151027051939.GA19355@mtj.duckdns.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-28 13:17:30 +01:00
Dirk Steinmetz
f2ca379642 namei: permit linking with CAP_FOWNER in userns
Attempting to hardlink to an unsafe file (e.g. a setuid binary) from
within an unprivileged user namespace fails, even if CAP_FOWNER is held
within the namespace. This may cause various failures, such as a gentoo
installation within a lxc container failing to build and install specific
packages.

This change permits hardlinking of files owned by mapped uids, if
CAP_FOWNER is held for that namespace. Furthermore, it improves consistency
by using the existing inode_owner_or_capable(), which is aware of
namespaced capabilities as of 23adbe12ef ("fs,userns: Change
inode_capable to capable_wrt_inode_uidgid").

Signed-off-by: Dirk Steinmetz <public@rsjtdrjgfuzkfg.com>

This is hitting us in Ubuntu during some dpkg upgrades in containers.
When upgrading a file dpkg creates a hard link to the old file to back
it up before overwriting it. When packages upgrade suid files owned by a
non-root user the link isn't permitted, and the package upgrade fails.
This patch fixes our problem.

Tested-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2015-10-27 16:12:35 -05:00
Linus Torvalds
ea1ee5ff1b Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A final set of fixes for 4.3.

  It is (again) bigger than I would have liked, but it's all been
  through the testing mill and has been carefully reviewed by multiple
  parties.  Each fix is either a regression fix for this cycle, or is
  marked stable.  You can scold me at KS.  The pull request contains:

   - Three simple fixes for NVMe, fixing regressions since 4.3.  From
     Arnd, Christoph, and Keith.

   - A single xen-blkfront fix from Cathy, fixing a NULL dereference if
     an error is returned through the staste change callback.

   - Fixup for some bad/sloppy code in nbd that got introduced earlier
     in this cycle.  From Markus Pargmann.

   - A blk-mq tagset use-after-free fix from Junichi.

   - A backing device lifetime fix from Tejun, fixing a crash.

   - And finally, a set of regression/stable fixes for cgroup writeback
     from Tejun"

* 'for-linus' of git://git.kernel.dk/linux-block:
  writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy()
  NVMe: Fix memory leak on retried commands
  block: don't release bdi while request_queue has live references
  nvme: use an integer value to Linux errno values
  blk-mq: fix use-after-free in blk_mq_free_tag_set()
  nvme: fix 32-bit build warning
  writeback: fix incorrect calculation of available memory for memcg domains
  writeback: memcg dirty_throttle_control should be initialized with wb->memcg_completions
  writeback: bdi_writeback iteration must not skip dying ones
  writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
  writeback: laptop_mode_timer_fn() needs rcu_read_lock() around bdi_writeback iteration
  nbd: Add locking for tasks
  xen-blkfront: check for null drvdata in blkback_changed (XenbusStateClosing)
2015-10-24 07:20:57 +09:00
Linus Torvalds
37902bc190 Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I have two more small fixes this week:

  Qu's fix avoids unneeded COW during fallocate, and Christian found a
  memory leak in the error handling of an earlier fix"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix possible leak in btrfs_ioctl_balance()
  btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
2015-10-24 07:17:58 +09:00
Joseph Qi
b67de018b3 ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
dlm_lockres_put will call dlm_lockres_release if it is the last
reference, and then it may call dlm_print_one_lock_resource and
take lockres spinlock.

So unlock lockres spinlock before dlm_lockres_put to avoid deadlock.

Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-23 17:55:10 +09:00
Benjamin Coddington
616fb38fa7 locks: cleanup posix_lock_inode_wait and flock_lock_inode_wait
All callers use locks_lock_inode_wait() instead.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22 14:57:42 -04:00
Benjamin Coddington
4f6563677a Move locks API users to locks_lock_inode_wait()
Instead of having users check for FL_POSIX or FL_FLOCK to call the correct
locks API function, use the check within locks_lock_inode_wait().  This
allows for some later cleanup.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22 14:57:36 -04:00
Benjamin Coddington
e55c34a66f locks: introduce locks_lock_inode_wait()
Users of the locks API commonly call either posix_lock_file_wait() or
flock_lock_file_wait() depending upon the lock type.  Add a new function
locks_lock_inode_wait() which will check and call the correct function for
the type of lock passed in.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-22 14:57:20 -04:00
Geliang Tang
7e26e9ff0a pstore: Fix return type of pstore_is_mounted()
This patch changes return type of pstore_is_mounted from int to bool.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-22 10:57:33 -07:00
Chao Yu
beaa57dd98 f2fs: fix to skip shrinking extent nodes
In f2fs_shrink_extent_tree we should stop shrink flow if we have already
shrunk enough nodes in extent cache.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22 09:39:35 -07:00
Chao Yu
a6be014e1d f2fs: fix error path of ->symlink
Now, in ->symlink of f2fs, we kept the fixed invoking order between
f2fs_add_link and page_symlink since we should init node info firstly
in f2fs_add_link, then such node info can be used in page_symlink.

But we didn't fix to release meta info which was done before page_symlink
in our error path, so this will leave us corrupt symlink entry in its
parent's dentry page. Fix this issue by adding f2fs_unlink in the error
path for removing such linking.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22 09:39:24 -07:00
Chao Yu
7fee740697 f2fs: fix to clear GCed flag for atomic written page
Atomic write page can be GCed, after committing this kind of page, we should
clear the GCed flag for it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-22 09:37:13 -07:00
Geliang Tang
ee1d267423 pstore: add pstore unregister
pstore doesn't support unregistering yet. It was marked as TODO.
This patch adds some code to fix it:
 1) Add functions to unregister kmsg/console/ftrace/pmsg.
 2) Add a function to free compression buffer.
 3) Unmap the memory and free it.
 4) Add a function to unregister pstore filesystem.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Kees Cook <keescook@chromium.org>
[Removed __exit annotation from ramoops_remove(). Reported by Arnd Bergmann]
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-22 08:59:18 -07:00
Jaegeuk Kim
2b246fb0f6 f2fs: don't need to submit bio on error case
If commit_atomic_write is failed, we don't need to submit any bio.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-21 19:05:53 -07:00
Jaegeuk Kim
d7b8b384b0 f2fs: fix leakage of inmemory atomic pages
If we got failure during commit_atomic_write, abort_volatile_write will be
called, but will not drop the inmemory pages due to no FI_ATOMIC_FILE.
Actually, there is no reason to check the flag in abort_volatile_write.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-21 19:04:17 -07:00
Christian Engelmayer
0f89abf56a btrfs: fix possible leak in btrfs_ioctl_balance()
Commit 8eb934591f ("btrfs: check unsupported filters in balance
arguments") adds a jump to exit label out_bargs in case the argument
check fails. At this point in addition to the bargs memory, the
memory for struct btrfs_balance_control has already been allocated.
Ownership of bctl is passed to btrfs_balance() in the good case,
thus the memory is not freed due to the introduced jump. Make sure
that the memory gets freed in any case as necessary. Detected by
Coverity CID 1328378.

Signed-off-by: Christian Engelmayer <cengelma@gmx.at>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21 18:10:02 -07:00
Jaegeuk Kim
f96999c35f f2fs: refactor __find_rev_next_{zero}_bit
This patch refactors __find_rev_next_{zero}_bit which was disabled previously
due to bugs.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-21 15:26:00 -07:00
Martin K. Petersen
25520d55cd block: Inline blk_integrity in struct gendisk
Up until now the_integrity profile has been dynamically allocated and
attached to struct gendisk after the disk has been made active.

This causes problems because NVMe devices need to register the profile
prior to the partition table being read due to a mandatory metadata
buffer requirement. In addition, DM goes through hoops to deal with
preallocating, but not initializing integrity profiles.

Since the integrity profile is small (4 bytes + a pointer), Christoph
suggested moving it to struct gendisk proper. This requires several
changes:

 - Moving the blk_integrity definition to genhd.h.

 - Inlining blk_integrity in struct gendisk.

 - Removing the dynamic allocation code.

 - Adding helper functions which allow gendisk to set up and tear down
   the integrity sysfs dir when a disk is added/deleted.

 - Adding a blk_integrity_revalidate() callback for updating the stable
   pages bdi setting.

 - The calls that depend on whether a device has an integrity profile or
   not now key off of the bi->profile pointer.

 - Simplifying the integrity support routines in DM (Mike Snitzer).

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-21 14:42:42 -06:00
Geliang Tang
1873041152 pstore: add a helper function pstore_register_kmsg
Add a new wrapper function pstore_register_kmsg to keep the
consistency with other similar pstore_register_* functions.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-21 09:27:10 -07:00
Geliang Tang
549b39a9e7 pstore: add vmalloc error check
If vmalloc fails, make write_pmsg return -ENOMEM.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2015-10-21 09:27:10 -07:00
David Howells
146aa8b145 KEYS: Merge the type-specific data with the payload data
Merge the type-specific data with the payload data into one four-word chunk
as it seems pointless to keep them separate.

Use user_key_payload() for accessing the payloads of overloaded
user-defined keys.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: linux-cifs@vger.kernel.org
cc: ecryptfs@vger.kernel.org
cc: linux-ext4@vger.kernel.org
cc: linux-f2fs-devel@lists.sourceforge.net
cc: linux-nfs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: linux-ima-devel@lists.sourceforge.net
2015-10-21 15:18:36 +01:00
Qu Wenruo
0f6925fa29 btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size
Current code will always truncate tailing page if its alloc_start is
smaller than inode size.

For example, the file extent layout is like:
0	4K	8K	16K	32K
|<-----Extent A---------------->|
|<--Inode size: 18K---------->|

But if calling fallocate even for range [0,4K), it will cause btrfs to
re-truncate the range [16,32K), causing COW and a new extent.

0	4K	8K	16K	32K
|///////|	<- Fallocate call range
|<-----Extent A-------->|<--B-->|

The cause is quite easy, just a careless btrfs_truncate_inode() in a
else branch without extra judgment.
Fix it by add judgment on whether the fallocate range is beyond isize.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-20 19:07:29 -07:00
Jaegeuk Kim
67f8cf3cee f2fs: support fiemap for inline_data
There is a FIEMAP_EXTENT_INLINE_DATA, pointed out by Marc.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-20 11:33:21 -07:00
Jaegeuk Kim
1d373a0ef7 f2fs: flush dirty data for bmap
Users expect bmap will give allocated block addresses.
Let's play likewise ext4.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-20 11:33:11 -07:00
Jarkko Sakkinen
37c1c04cca sysfs: added __compat_only_sysfs_link_entry_to_kobj()
Added a new function __compat_only_sysfs_link_group_to_kobj() that adds
a symlink from attribute or group to a kobject. This needed for
maintaining backwards compatibility with PPI attributes in the TPM
driver.

Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Peter Huewe <peterhuewe@gmx.de>
2015-10-19 01:01:19 +02:00
Viresh Kumar
c23fe83138 debugfs: Add debugfs_create_ulong()
Add debugfs_create_ulong() for the users of type 'unsigned long'. These
will be 32 bits long on a 32 bit machine and 64 bits long on a 64 bit
machine.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-18 10:14:39 -07:00
Stephen Boyd
6713e8fb54 debugfs: Add read-only/write-only bool file ops
There aren't any read-only or write-only bool file ops, but there
is a caller of debugfs_create_bool() that calls it with mode
equal to 0400. This leads to the possibility of userspace
modifying the file, so let's use the newly created
debugfs_create_mode() helper here to fix this.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17 22:09:03 -07:00
Stephen Boyd
6db6652abc debugfs: Add read-only/write-only size_t file ops
There aren't any read-only or write-only size_t file ops, but there
is a caller of debugfs_create_size_t() that calls it with mode
equal to 0400. This leads to the possibility of userspace
modifying the file, so let's use the newly created
debugfs_create_mode() helper here to fix this.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17 22:09:03 -07:00
Stephen Boyd
82b7d4fb4e debugfs: Add read-only/write-only x64 file ops
There aren't any read-only or write-only x64 file ops, but there
is a caller of debugfs_create_x64() that calls it with mode equal
to S_IRUGO. This leads to the possibility of userspace modifying
the file, so let's use the newly created debugfs_create_mode()
helper here to fix this.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17 22:09:03 -07:00
Stephen Boyd
b97f679954 debugfs: Consolidate file mode checks in debugfs_create_*()
The code that creates debugfs file with different file ops based
on the file mode is duplicated in each debugfs_create_*() API.
Consolidate that code into debugfs_create_mode(), that takes
three file ops structures so that we don't have to keep
copy/pasting that logic.

Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-17 22:09:03 -07:00
Linus Torvalds
6aa8ca4df0 Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "I have two more bug fixes for btrfs.

  My commit fixes a bug we hit last week at FB, a combination of lots of
  hard links and an admin command to resolve inode numbers.

  Dave is adding checks to make sure balance on current kernels ignores
  filters it doesn't understand.  The penalty for being wrong is just
  doing more work (not crashing etc), but it's a good fix"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  btrfs: fix use after free iterating extrefs
  btrfs: check unsupported filters in balance arguments
2015-10-16 12:55:34 -07:00
Linus Torvalds
3d875182d7 Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
 "6 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  sh: add copy_user_page() alias for __copy_user()
  lib/Kconfig: ZLIB_DEFLATE must select BITREVERSE
  mm, dax: fix DAX deadlocks
  memcg: convert threshold to bytes
  builddeb: remove debian/files before build
  mm, fs: obey gfp_mapping for add_to_page_cache()
2015-10-16 11:42:37 -07:00
Ross Zwisler
0f90cc6609 mm, dax: fix DAX deadlocks
The following two locking commits in the DAX code:

commit 843172978b ("dax: fix race between simultaneous faults")
commit 46c043ede4 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

introduced a number of deadlocks and other issues which need to be fixed
for the v4.3 kernel.  The list of issues in DAX after these commits
(some newly introduced by the commits, some preexisting) can be found
here:

  https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").

This undoes most of the changes introduced by those two commits,
essentially returning us to the DAX locking scheme that was used in
v4.2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-16 11:42:28 -07:00
Michal Hocko
063d99b4fa mm, fs: obey gfp_mapping for add_to_page_cache()
Commit 6afdb859b7 ("mm: do not ignore mapping_gfp_mask in page cache
allocation paths") has caught some users of hardcoded GFP_KERNEL used in
the page cache allocation paths.  This, however, wasn't complete and
there were others which went unnoticed.

Dave Chinner has reported the following deadlock for xfs on loop device:
: With the recent merge of the loop device changes, I'm now seeing
: XFS deadlock on my single CPU, 1GB RAM VM running xfs/073.
:
: The deadlocked is as follows:
:
: kloopd1: loop_queue_read_work
:       xfs_file_iter_read
:       lock XFS inode XFS_IOLOCK_SHARED (on image file)
:       page cache read (GFP_KERNEL)
:       radix tree alloc
:       memory reclaim
:       reclaim XFS inodes
:       log force to unpin inodes
:       <wait for log IO completion>
:
: xfs-cil/loop1: <does log force IO work>
:       xlog_cil_push
:       xlog_write
:       <loop issuing log writes>
:               xlog_state_get_iclog_space()
:               <blocks due to all log buffers under write io>
:               <waits for IO completion>
:
: kloopd1: loop_queue_write_work
:       xfs_file_write_iter
:       lock XFS inode XFS_IOLOCK_EXCL (on image file)
:       <wait for inode to be unlocked>
:
: i.e. the kloopd, with it's split read and write work queues, has
: introduced a dependency through memory reclaim. i.e. that writes
: need to be able to progress for reads make progress.
:
: The problem, fundamentally, is that mpage_readpages() does a
: GFP_KERNEL allocation, rather than paying attention to the inode's
: mapping gfp mask, which is set to GFP_NOFS.
:
: The didn't used to happen, because the loop device used to issue
: reads through the splice path and that does:
:
:       error = add_to_page_cache_lru(page, mapping, index,
:                       GFP_KERNEL & mapping_gfp_mask(mapping));

This has changed by commit aa4d86163e ("block: loop: switch to VFS
ITER_BVEC").

This patch changes mpage_readpage{s} to follow gfp mask set for the
mapping.  There are, however, other places which are doing basically the
same.

lustre:ll_dir_filler is doing GFP_KERNEL from the function which
apparently uses GFP_NOFS for other allocations so let's make this
consistent.

cifs:readpages_get_pages is called from cifs_readpages and
__cifs_readpages_from_fscache called from the same path obeys mapping
gfp.

ramfs_nommu_expand_for_mapping is hardcoding GFP_KERNEL as well
regardless it uses mapping_gfp_mask for the page allocation.

ext4_mpage_readpages is the called from the page cache allocation path
same as read_pages and read_cache_pages

As I've noticed in my previous post I cannot say I would be happy about
sprinkling mapping_gfp_mask all over the place and it sounds like we
should drop gfp_mask argument altogether and use it internally in
__add_to_page_cache_locked that would require all the filesystems to use
mapping gfp consistently which I am not sure is the case here.  From a
quick glance it seems that some file system use it all the time while
others are selective.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-16 11:42:28 -07:00
Linus Torvalds
c7823b6b97 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull ext4 Kconfig description fixup from Jan Kara:
 "A small fixup in description of EXT4_USE_FOR_EXT2 config option"

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: Update EXT4_USE_FOR_EXT2 description
2015-10-15 13:31:00 -07:00
Benjamin Coddington
6ca7d91012 locks: Use more file_inode and fix a comment
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
2015-10-15 09:07:07 -04:00
Martin Schwidefsky
a7b7617493 mm: add architecture primitives for software dirty bit clearing
There are primitives to create and query the software dirty bits
in a pte or pmd. But the clearing of the software dirty bits is done
in common code with x86 specific page table functions.

Add the missing architecture primitives to clear the software dirty
bits to allow the feature to be used on non-x86 systems, e.g. the
s390 architecture.

Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2015-10-14 14:32:05 +02:00
Chris Mason
dc6c5fb3b5 btrfs: fix use after free iterating extrefs
The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs.  It was trying to
get the leaf out of a path after freeing the path:

	btrfs_release_path(path);
	leaf = path->nodes[0];
	item_size = btrfs_item_size_nr(leaf, slot);

The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.

Signed-off-by: Chris Mason <clm@fb.com>
cc: stable@vger.kernel.org # v3.7+
cc: Mark Fasheh <mfasheh@suse.de>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-10-13 18:54:44 -07:00
David Sterba
8eb934591f btrfs: check unsupported filters in balance arguments
We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.

At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. Also in older stable
kernels so that it works with newer userspace tools.

Cc: stable@vger.kernel.org # 3.16+
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-10-13 18:53:03 -07:00
Linus Torvalds
5b5f145527 Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol
bug.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWHUj5AAoJECebzXlCjuG+KIoP/RW5zigAEKqUiD7ycKR91BxD
 9Nt0fqTTrbkGJhKM1/DN4YEjogAHeFW5OnGiLQRUNI/qdy+I1Gyr1kgwGmCCVDt9
 d8AhnxcnXR5SmsQHk7eeUd/rnODetf0bW5YJ8PfFbnC6cmM013nR9ujEccUuCl9M
 hHTp+690Doab00PtWtsjmZv5d+eT1bktY/R2PuQhyQM2CKWh1u4FeNTd1lWE551D
 b1wSvhAGMYVEsQv8+HICDrIQ8loGfH2gpBILERLM2yJlhN1IPU3RmNSAcQpZSaql
 veJYVmHdpMACCLp0Dd3hwWKDYvcQ2lCqKk+Cpd0vLpvZ8J5OjCLC+a2dh0PRIYuf
 pwFCvbWz6dn27/9eXEKbyT2JIeBIl4qwrFjfiRKlNX0c4HGKXaE2gJrY7bxnDxe1
 BatAbEFZ+rxHyPmycaj3JdyOxafmw94XzbT8q2g7tmUCj+pvAI+Pbv6PlwN6W2r7
 aGBZzgd8Y9pT6ZbCB0e413d/t5ulxwkt6vVz9Jze4gfcUrWcqHaqt7AadMl7obUx
 AYPLAVGeHybdKlLvqv42IF2QM8ZhizM0+EnxkjfWLrsa7WbstWX5KLPpm3K80dM7
 98p1ToNQDFcNU8WBZw8AkBpFz4j32RVOkvzWFWbhCo+T3is4BmP16uEEjH90aCCY
 skQKMrq8J1ox33gz5gT7
 =Pkuy
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux

Pull nfsd fixes from Bruce Fields:
 "Two nfsd fixes, one for an RDMA crash, one for a pnfs/block protocol
  bug"

* tag 'nfsd-4.3-2' of git://linux-nfs.org/~bfields/linux:
  svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE
  nfsd/blocklayout: accept any minlength
2015-10-13 11:31:03 -07:00
Jaegeuk Kim
84e4214f08 f2fs: relocate the tracepoint for background_gc
Once f2fs_gc is done, wait_ms is changed once more.
So, its tracepoint would be located after it.

Reported-by: He YunLei <heyunlei@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-13 10:02:01 -07:00
Chao Yu
08b39fbd59 f2fs crypto: fix racing of accessing encrypted page among
different competitors

Since we use different page cache (normally inode's page cache for R/W
and meta inode's page cache for GC) to cache the same physical block
which is belong to an encrypted inode. Writeback of these two page
cache should be exclusive, but now we didn't handle writeback state
well, so there may be potential racing problem:

a)
kworker:				f2fs_gc:
 - f2fs_write_data_pages
  - f2fs_write_data_page
   - do_write_data_page
    - write_data_page
     - f2fs_submit_page_mbio
(page#1 in inode's page cache was queued
in f2fs bio cache, and be ready to write
to new blkaddr)
					 - gc_data_segment
					  - move_encrypted_block
					   - pagecache_get_page
					(page#2 in meta inode's page cache
					was cached with the invalid datas
					of physical block located in new
					blkaddr)
					   - f2fs_submit_page_mbio
					(page#1 was submitted, later, page#2
					with invalid data will be submitted)

b)
f2fs_gc:
 - gc_data_segment
  - move_encrypted_block
   - f2fs_submit_page_mbio
(page#1 in meta inode's page cache was
queued in f2fs bio cache, and be ready
to write to new blkaddr)
					user thread:
					 - f2fs_write_begin
					  - f2fs_submit_page_bio
					(we submit the request to block layer
					to update page#2 in inode's page cache
					with physical block located in new
					blkaddr, so here we may read gabbage
					data from new blkaddr since GC hasn't
					writebacked the page#1 yet)

This patch fixes above potential racing problem for encrypted inode.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-13 09:52:34 -07:00
Chao Yu
ea1a29a0bd f2fs: export ra_nid_pages to sysfs
After finishing building free nid cache, we will try to readahead
asynchronously 4 more pages for the next reloading, the count of
readahead nid pages is fixed.

In some case, like SMR drive, read less sectors with fixed count
each time we trigger RA may be low efficient, since we will face
high seeking overhead, so we'd better let user to configure this
parameter from sysfs in specific workload.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 14:03:43 -07:00
Chao Yu
2db2388fcf f2fs: readahead for free nids building
When there is no free nid in nid cache, all new node allocaters stop their
job to wait for reloading of free nids, however reloading is synchronous as
we will read 4 NAT pages for building nid cache, it cause the long latency.

This patch tries to readahead more NAT pages with READA request flag after
reloading of free nids. It helps to improve performance when users allocate
node id intensively.

Env: Sandisk 32G sd card
time for i in `seq 1 60000`; { echo -n > /mnt/f2fs/$i; echo XXXXXX > /mnt/f2fs/$i;}

Before:
real    0m2.814s
user    0m1.220s
sys     0m1.536s

After:
real    0m2.711s
user    0m1.136s
sys     0m1.568s

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 14:03:22 -07:00
Chao Yu
26879fb101 f2fs: support lower priority asynchronous readahead in ra_meta_pages
Now, we use ra_meta_pages to reads continuous physical blocks as much as
possible to improve performance of following reads. However, ra_meta_pages
uses a synchronous readahead approach by submitting bio with READ, as READ
is with high priority, it can not be used in the case of preloading blocks,
and it's not sure when these RAed pages will be used.

This patch supports asynchronous readahead in ra_meta_pages by tagging bio
with READA flag in order to allow preloading.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 14:03:15 -07:00
Chao Yu
2b947003fa f2fs: don't tag REQ_META for temporary non-meta pages
In recovery or checkpoint flow, we grab pages temperarily in meta inode's
mapping for caching temperary data, actually, datas in these pages were
not meta data of f2fs, but still we tag them with REQ_META flag. However,
lower device like eMMC may do some optimization for data of such type.
So in order to avoid wrong optimization, we'd better remove such flag
for temperary non-meta pages.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 14:01:46 -07:00
Chao Yu
b8c2940048 f2fs: add a tracepoint for f2fs_read_data_pages
This patch adds a tracepoint for f2fs_read_data_pages to trace when pages
are readahead by VFS.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 14:00:34 -07:00
Jaegeuk Kim
a56c7c6fb3 f2fs: set GFP_NOFS for grab_cache_page
For normal inodes, their pages are allocated with __GFP_FS, which can cause
filesystem calls when reclaiming memory.
This can incur a dead lock condition accordingly.

So, this patch addresses this problem by introducing
f2fs_grab_cache_page(.., bool for_write), which calls
grab_cache_page_write_begin() with AOP_FLAG_NOFS.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 13:38:03 -07:00
Jaegeuk Kim
6e2c64ad7c f2fs: fix SSA updates resulting in corruption
The f2fs_collapse_range and f2fs_insert_range changes the block addresses
directly. But that can cause uncovered SSA updates.
In that case, we need to give up to change the block addresses and do buffered
writes to keep filesystem consistency.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 13:38:02 -07:00
Jaegeuk Kim
a125702326 Revert "f2fs: do not skip dentry block writes"
The periodic checkpoint can resolve the previous issue.
So, now we can use this again to improve the reported performance regression:

https://lkml.org/lkml/2015/10/8/20

This reverts commit 15bec0ff5a9ba6d203178fa8772259df6207942a.
2015-10-12 13:38:02 -07:00
Jaegeuk Kim
c912a8298c f2fs: add F2FS_GOING_DOWN_METAFLUSH to test power-failure
This patch introduces F2FS_GOING_DOWN_METAFLUSH which flushes meta pages like
SSA blocks and then blocks all the writes.
This can be used by power-failure tests.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-12 13:37:54 -07:00
Tejun Heo
b817525a4a writeback: bdi_writeback iteration must not skip dying ones
bdi_for_each_wb() is used in several places to wake up or issue
writeback work items to all wb's (bdi_writeback's) on a given bdi.
The iteration is performed by walking bdi->cgwb_tree; however, the
tree only indexes wb's which are currently active.

For example, when a memcg gets associated with a different blkcg, the
old wb is removed from the tree so that the new one can be indexed.
The old wb starts dying from then on but will linger till all its
inodes are drained.  As these dying wb's may still host dirty inodes,
writeback operations which affect all wb's must include them.
bdi_for_each_wb() skipping dying wb's led to sync(2) missing and
failing to sync the inodes belonging to those wb's.

This patch adds a RCU protected @bdi->wb_list which lists all wb's
beloinging to that bdi.  wb's are added on creation and removed on
release rather than on the start of destruction.  bdi_for_each_wb()
usages are replaced with list_for_each[_continue]_rcu() iterations
over @bdi->wb_list and bdi_for_each_wb() and its helpers are removed.

v2: Updated as per Jan.  last_wb ref leak in bdi_split_work_to_wbs()
    fixed and unnecessary list head severing in cgwb_bdi_destroy()
    removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Artem Bityutskiy <dedekind1@gmail.com>
Fixes: ebe41ab0c7 ("writeback: implement bdi_for_each_wb()")
Link: http://lkml.kernel.org/g/1443012552.19983.209.camel@gmail.com
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12 10:31:12 -06:00
Tejun Heo
6fdf860f15 writeback: fix bdi_writeback iteration in wakeup_dirtytime_writeback()
wakeup_dirtytime_writeback() walks and wakes up all wb's of all bdi's;
unfortunately, it was always waking up bdi->wb instead of the wb being
walked.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 001fe6f617 ("writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's")
Reviewed-by: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-10-12 10:31:11 -06:00
Konstantin Khlebnikov
5ffdbe8bf1 ovl: free lower_mnt array in ovl_put_super
This fixes memory leak after umount.

Kmemleak report:

unreferenced object 0xffff8800ba791010 (size 8):
  comm "mount", pid 2394, jiffies 4294996294 (age 53.920s)
  hex dump (first 8 bytes):
    20 1c 13 02 00 88 ff ff                           .......
  backtrace:
    [<ffffffff811f8cd4>] create_object+0x124/0x2c0
    [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0
    [<ffffffff811dffe6>] __kmalloc+0x106/0x340
    [<ffffffffa0152bfc>] ovl_fill_super+0x55c/0x9b0 [overlay]
    [<ffffffff81200ac4>] mount_nodev+0x54/0xa0
    [<ffffffffa0152118>] ovl_mount+0x18/0x20 [overlay]
    [<ffffffff81201ab3>] mount_fs+0x43/0x170
    [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170
    [<ffffffff812233ad>] do_mount+0x22d/0xdf0
    [<ffffffff812242cb>] SyS_mount+0x7b/0xc0
    [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: dd662667e6 ("ovl: add mutli-layer infrastructure")
Cc: <stable@vger.kernel.org> # v4.0+
2015-10-12 17:11:44 +02:00
Konstantin Khlebnikov
0f95502ad8 ovl: free stack of paths in ovl_fill_super
This fixes small memory leak after mount.

Kmemleak report:

unreferenced object 0xffff88003683fe00 (size 16):
  comm "mount", pid 2029, jiffies 4294909563 (age 33.380s)
  hex dump (first 16 bytes):
    20 27 1f bb 00 88 ff ff 40 4b 0f 36 02 88 ff ff   '......@K.6....
  backtrace:
    [<ffffffff811f8cd4>] create_object+0x124/0x2c0
    [<ffffffff817a059b>] kmemleak_alloc+0x7b/0xc0
    [<ffffffff811dffe6>] __kmalloc+0x106/0x340
    [<ffffffffa01b7a29>] ovl_fill_super+0x389/0x9a0 [overlay]
    [<ffffffff81200ac4>] mount_nodev+0x54/0xa0
    [<ffffffffa01b7118>] ovl_mount+0x18/0x20 [overlay]
    [<ffffffff81201ab3>] mount_fs+0x43/0x170
    [<ffffffff81220d34>] vfs_kern_mount+0x74/0x170
    [<ffffffff812233ad>] do_mount+0x22d/0xdf0
    [<ffffffff812242cb>] SyS_mount+0x7b/0xc0
    [<ffffffff817b6bee>] entry_SYSCALL_64_fastpath+0x12/0x76
    [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: a78d9f0d5d ("ovl: support multiple lower layers")
Cc: <stable@vger.kernel.org> # v4.0+
2015-10-12 17:11:43 +02:00
Miklos Szeredi
1c8a47df36 ovl: fix open in stacked overlay
If two overlayfs filesystems are stacked on top of each other, then we need
recursion in ovl_d_select_inode().

I guess d_backing_inode() is supposed to do that.  But currently it doesn't
and that functionality is open coded in vfs_open().  This is now copied
into ovl_d_select_inode() to fix this regression.

Reported-by: Alban Crequy <alban.crequy@gmail.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Fixes: 4bacc9c923 ("overlayfs: Make f_path always point to the overlay...")
Cc: David Howells <dhowells@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
2015-10-12 15:56:20 +02:00
David Howells
ab79efab0a ovl: fix dentry reference leak
In ovl_copy_up_locked(), newdentry is leaked if the function exits through
out_cleanup as this just to out after calling ovl_cleanup() - which doesn't
actually release the ref on newdentry.

The out_cleanup segment should instead exit through out2 as certainly
newdentry leaks - and possibly upper does also, though this isn't caught
given the catch of newdentry.

Without this fix, something like the following is seen:

	BUG: Dentry ffff880023e9eb20{i=f861,n=#ffff880023e82d90} still in use (1) [unmount of tmpfs tmpfs]
	BUG: Dentry ffff880023ece640{i=0,n=bigfile}  still in use (1) [unmount of tmpfs tmpfs]

when unmounting the upper layer after an error occurred in copyup.

An error can be induced by creating a big file in a lower layer with
something like:

	dd if=/dev/zero of=/lower/a/bigfile bs=65536 count=1 seek=$((0xf000))

to create a large file (4.1G).  Overlay an upper layer that is too small
(on tmpfs might do) and then induce a copy up by opening it writably.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # v3.18+
2015-10-12 15:56:20 +02:00
David Howells
0480334fa6 ovl: use O_LARGEFILE in ovl_copy_up()
Open the lower file with O_LARGEFILE in ovl_copy_up().

Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
catching 32-bit userspace dealing with a file large enough that it'll be
mishandled if the application isn't aware that there might be an integer
overflow.  Inside the kernel, there shouldn't be any problems.

Reported-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org> # v3.18+
2015-10-12 15:56:20 +02:00
Trond Myklebust
daf3761c9f namei: results of d_is_negative() should be checked after dentry revalidation
Leandro Awa writes:
 "After switching to version 4.1.6, our parallelized and distributed
  workflows now fail consistently with errors of the form:

  T34: ./regex.c:39:22: error: config.h: No such file or directory

  From our 'git bisect' testing, the following commit appears to be the
  possible cause of the behavior we've been seeing: commit 766c4cbfacd8"

Al Viro says:
 "What happens is that 766c4cbfac got the things subtly wrong.

  We used to treat d_is_negative() after lookup_fast() as "fall with
  ENOENT".  That was wrong - checking ->d_flags outside of ->d_seq
  protection is unreliable and failing with hard error on what should've
  fallen back to non-RCU pathname resolution is a bug.

  Unfortunately, we'd pulled the test too far up and ran afoul of
  another kind of staleness.  The dentry might have been absolutely
  stable from the RCU point of view (and we might be on UP, etc), but
  stale from the remote fs point of view.  If ->d_revalidate() returns
  "it's actually stale", dentry gets thrown away and the original code
  wouldn't even have looked at its ->d_flags.

  What we need is to check ->d_flags where 766c4cbfac does (prior to
  ->d_seq validation) but only use the result in cases where we do not
  discard this dentry outright"

Reported-by: Leandro Awa <lawa@nvidia.com>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
Fixes: 766c4cbfac ("namei: d_is_negative() should be checked...")
Tested-by: Leandro Awa <lawa@nvidia.com>
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-10 10:17:27 -07:00
Linus Torvalds
175d58cfed Merge branch 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "These are small and assorted.  Neil's is the oldest, I dropped the
  ball thinking he was going to send it in"

* 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: support NFSv2 export
  Btrfs: open_ctree: Fix possible memory leak
  Btrfs: fix deadlock when finalizing block group creation
  Btrfs: update fix for read corruption of compressed and shared extents
  Btrfs: send, fix corner case for reference overwrite detection
2015-10-09 16:39:35 -07:00
Jaegeuk Kim
6066d8cdb6 f2fs: merge meta writes as many possible
This patch tries to merge IOs as many as possible when background flusher
conducts flushing the dirty meta pages.

[Before]

...
2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124320, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124560, size = 32768
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 95720, size = 987136
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123928, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123944, size = 8192
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123968, size = 45056
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124064, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 97648, size = 1007616
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123776, size = 8192
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123800, size = 32768
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 124624, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 99616, size = 921600
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123608, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123624, size = 77824
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123792, size = 4096
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 123864, size = 32768
...

[After]

...
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 92168, size = 892928
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 93912, size = 753664
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 95384, size = 716800
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 96784, size = 712704
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 104160, size = 364544
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 104872, size = 356352
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 105568, size = 278528
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 106112, size = 319488
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 106736, size = 258048
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 107240, size = 270336
f2fs_submit_write_bio: dev = (8,18), WRITE_SYNC(MP), META, sector = 107768, size = 180224
...

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:57 -07:00
Jaegeuk Kim
60b99b486b f2fs: introduce a periodic checkpoint flow
This patch introduces a periodic checkpoint feature.
Note that, this is not enforcing to conduct checkpoints very strictly in terms
of trigger timing, instead just hope to help user experiences.
The default value is 60 seconds.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:57 -07:00
Jaegeuk Kim
5c26743474 f2fs: add a tracepoint for background gc
This patch introduces a tracepoint to monitor background gc behaviors.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:57 -07:00
Jaegeuk Kim
6aefd93b01 f2fs: introduce background_gc=sync mount option
This patch introduce background_gc=sync enabling synchronous cleaning in
background.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:57 -07:00
Chao Yu
456b88e4d1 f2fs: introduce a new ioctl F2FS_IOC_WRITE_CHECKPOINT
This patch introduce a new ioctl for those users who want to trigger
checkpoint from userspace through ioctl.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:56 -07:00
Chao Yu
d530d4d8e2 f2fs: support synchronous gc in ioctl
This patch drops in batches gc triggered through ioctl, since user
can easily control the gc by designing the loop around the ->ioctl.

We support synchronous gc by forcing using FG_GC in f2fs_gc, so with
it, user can make sure that in this round all blocks gced were
persistent in the device until ioctl returned.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:56 -07:00
Chao Yu
3342bb303b f2fs: skip searching dirty map if dirty segment is not exist
When searching victim during gc, if there are no dirty segments in
filesystem, we will still take the time to search the whole dirty segment
map, it's not needed, it's better to skip in this condition.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:56 -07:00
Chao Yu
a43f7ec327 f2fs: fix to avoid redundant searching in dirty map during gc
When doing gc, we search a victim in dirty map, starting from position of
last victim, we will reset the current searching position until we touch
the end of dirty map, and then search the whole diryt map. So sometimes we
will search the range [victim, last] twice, it's redundant, this patch
avoids this issue.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:55 -07:00
Chao Yu
5b7ee37414 f2fs: use atomic64_t for extent cache hit stat
Our hit stat of extent cache will increase all the time until remount,
and we use atomic_t type for the stat variable, so it may easily incur
overflow when we query extent cache frequently in a long time running
fs.

So to avoid that, this patch uses atomic64_t for hit stat variables.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:55 -07:00
Jaegeuk Kim
39307a8e24 f2fs: use vmalloc to handle -ENOMEM error
This patch introduces f2fs_kvmalloc to avoid -ENOMEM during mount.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:55 -07:00
Jaegeuk Kim
ab126cfc30 f2fs: should get a victim from retrials
If we do not call get_victim first, we cannot get a new victim for retrial
path.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:55 -07:00
Chao Yu
45fe8492cc f2fs: fix to correct freed section number during gc
This patch fixes to maintain the right section count freed in garbage
collecting when triggering a foreground gc.

Besides, when a foreground gc is running on current selected section, once
we fail to gc one segment, it's better to abandon gcing the left segments
in current section, because anyway we will select next victim for
foreground gc, so gc on the left segments in previous section will become
overhead and also cause the long latency for caller.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:54 -07:00
Chao Yu
345a6b2ee2 f2fs: fix to update {m,c}time correctly when truncating larger
This patch fixes to update ctime and atime correctly when truncating
larger in ->setattr.

The bug is reported by xfstest generic/313 as below:

generic/313 2s ... - output mismatch (see ./results/generic/313.out.bad)
    --- tests/generic/313.out   2015-08-04 15:28:53.430798882 +0800
    +++ results/generic/313.out.bad   2015-09-28 17:04:27.294278016 +0800
    @@ -1,2 +1,4 @@
     QA output created by 313
     Silence is golden
    +ctime not updated after truncate up
    +mtime not updated after truncate up
    ...
    (Run 'diff -u tests/generic/313.out tests/generic/313.out.bad'  to see the entire diff)
Ran: generic/313
Failures: generic/313
Failed 1 of 1 tests

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:54 -07:00
Jaegeuk Kim
90b803e6fb f2fs: do not skip dentry block writes
Previously, we skip dentry block writes when wbc is SYNC_NONE with no memory
pressure and the number of dirty pages is pretty small.

But, we didn't skip for normal data writes, which gives us not much big impact
on overall performance.
Moreover, by skipping some data writes, kworker falls into infinite loop to try
to write blocks, when many dir inodes have only one dentry block.

So, this patch removes skipping data writes.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:54 -07:00
Chao Yu
7223554133 f2fs: remove unneeded f2fs_{,un}lock_op in do_recover_data()
Protecting recovery flow by using cp_rwsem is not needed, since we have
prevent triggering any checkpoint by locking cp_mutex previously.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:54 -07:00
Chao Yu
1d7e10d58a f2fs: fix incorrect bimodal calculation
In update_sit_info, we use div_u64 to handle 'u64 divide u64' case, but
div_u64 can only handle 32-bits divisor, so our divisor with u64 type
passed to div_u64 will overflow, result in the wrong calculation when
show debug info of f2fs as below:

BDF: 464, avg. vblocks: 23509
(BDF should never exceed 100)

So change to use div64_u64 to handle this case correctly.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:53 -07:00
Chao Yu
4abd3f5ac4 f2fs: introduce __try_update_largest_extent
This patch adds a new helper __try_update_largest_extent for cleanup.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:53 -07:00
Nicholas Krause
545fe4210d f2fs: fix error handling for calls to various functions in the function recover_inline_data
This fixes error handling for calls to various functions in the
function  recover_inline_data to check if these particular functions
either return a error code or the boolean value false to signal their
caller they have failed internally and if this arises return false
to signal failure immediately to the caller of recover_inline_data
as we cannot continue after failures to calling either the function
truncate_inline_inode or truncate_blocks.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:53 -07:00
Chao Yu
9cd81ce3c2 f2fs: disallow switch extent_cache option dynamically
Swith extent_cache option dynamically when remount may casue consistency
issue between extent cache and dnode page. Fix in this patch to avoid
that condition.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:53 -07:00
Chao Yu
46c9e1413f f2fs: use correct flag in f2fs_map_blocks()
We introduce F2FS_GET_BLOCK_READ in commit e2b4e2bc88 ("f2fs: fix
incorrect mapping for bmap"), but forget to use this flag in the right
place, fix it.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:52 -07:00
Chao Yu
f9811703fe f2fs: fix to handle io error in ->direct_IO
Here is a oops reported as following message when testing generic/019 of
xfstest:

 ------------[ cut here ]------------
 kernel BUG at /home/yuchao/git/f2fs-dev/segment.c:882!
 invalid opcode: 0000 [#1] SMP
 Modules linked in: zram lz4_compress lz4_decompress f2fs(O) ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4
nf_def
 CPU: 2 PID: 25441 Comm: fio Tainted: G           O    4.3.0-rc1+ #6
 Hardware name: Hewlett-Packard HP Z220 CMT Workstation/1790, BIOS K51 v01.61 05/16/2013
 task: ffff8803f4e85580 ti: ffff8803fd61c000 task.ti: ffff8803fd61c000
 RIP: 0010:[<ffffffffa0784981>]  [<ffffffffa0784981>] new_curseg+0x321/0x330 [f2fs]
 RSP: 0018:ffff8803fd61f918  EFLAGS: 00010246
 RAX: 00000000000007ed RBX: 0000000000000224 RCX: 000000000000001f
 RDX: 0000000000000800 RSI: ffffffffffffffff RDI: ffff8803f56f4300
 RBP: ffff8803fd61f978 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000024 R11: ffff8800d23bbd78 R12: ffff8800d0ef0000
 R13: 0000000000000224 R14: 0000000000000000 R15: 0000000000000001
 FS:  00007f827ff85700(0000) GS:ffff88041ea80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: ffffffffff600000 CR3: 00000003fef17000 CR4: 00000000001406e0
 Stack:
  000007ea00000002 0000000100000001 ffff8803f6456248 000007ed0000002b
  0000000000000224 ffff880404d1aa20 ffff8803fd61f9c8 ffff8800d0ef0000
  ffff8803f6456248 0000000000000001 00000000ffffffff ffffffffa078f358
 Call Trace:
  [<ffffffffa0785b87>] allocate_segment_by_default+0x1a7/0x1f0 [f2fs]
  [<ffffffffa078322c>] allocate_data_block+0x17c/0x360 [f2fs]
  [<ffffffffa0779521>] __allocate_data_block+0x131/0x1d0 [f2fs]
  [<ffffffffa077a995>] f2fs_direct_IO+0x4b5/0x580 [f2fs]
  [<ffffffff811510ae>] generic_file_direct_write+0xae/0x160
  [<ffffffff811518f5>] __generic_file_write_iter+0xd5/0x1f0
  [<ffffffff81151e07>] generic_file_write_iter+0xf7/0x200
  [<ffffffff81319e38>] ? apparmor_file_permission+0x18/0x20
  [<ffffffffa0768480>] ? f2fs_fallocate+0x1190/0x1190 [f2fs]
  [<ffffffffa07684c6>] f2fs_file_write_iter+0x46/0x90 [f2fs]
  [<ffffffff8120b4fe>] aio_run_iocb+0x1ee/0x290
  [<ffffffff81700f7e>] ? mutex_lock+0x1e/0x50
  [<ffffffff8120a1d7>] ? aio_read_events+0x207/0x2b0
  [<ffffffff8120b913>] do_io_submit+0x373/0x630
  [<ffffffff8120a4f6>] ? SyS_io_getevents+0x56/0xb0
  [<ffffffff8120bbe0>] SyS_io_submit+0x10/0x20
  [<ffffffff81703857>] entry_SYSCALL_64_fastpath+0x12/0x6a
 Code: 45 c8 48 8b 78 10 e8 9f 23 bf e0 41 8b 8c 24 cc 03 00 00 89 c7 31 d2 89 c6 89 d8 29 df f7 f1 29 d1 39 cf 0f 83 be fd ff ff eb
 RIP  [<ffffffffa0784981>] new_curseg+0x321/0x330 [f2fs]
  RSP <ffff8803fd61f918>
 ---[ end trace 2e577d7f711ddb86 ]---

The reason is that: in the test of generic/019, we will trigger a manmade
IO error in block layer through debugfs, after that, prefree segment will
no longer be freed, because we always skip doing gc or checkpoint when
there occurs an IO error.

Meanwhile fio with aio engine generated a large number of direct IOs,
which continue allocating spaces in free segment until we run out of them,
eventually, results in panic in new_curseg as no more free segment was
found.

So, this patch changes to return EIO in direct_IO for this condition.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:52 -07:00
Chao Yu
ea58711e88 f2fs: do in batches truncation in truncate_hole
truncate_data_blocks_range can do in batches truncation which makes all
changes in dnode page content, dnode page status, extent cache, block
count updating together.

But previously, truncate_hole() always truncates one block in dnode page
at a time by invoking truncate_data_blocks_range(,1), which make thing
slow.

This patch changes truncate_hole() to do in batches truncation for all
target blocks in one direct node inside truncate_data_blocks_range, which
can make our punch hole operation in ->fallocate more efficent.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:52 -07:00
Fan Li
4d1fa815f2 f2fs: optimize code of f2fs_update_extent_tree_range
Fix 2 potential problems:
1. when largest extent needs to be invalidated, it will be reset in
   __drop_largest_extent, which makes __is_extent_same after always
   return false, and largest extent unchanged. Now we update it properly.

2. when extent is split and the latter part remains in tree, next_en
   should be the latter part instead of next extent of original extent.
   It will cause merge failure if there is in-place update, although
   there is not, I think this fix will still makes codes less ambiguous.

This patch also simplifies codes of invalidating extents, and optimizes the
procedues that split extent into two.
There are a few modifications after last patch:
1. prev_en now is updated properly.
2. more codes and branches are simplified.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:52 -07:00
Fan Li
41a099de3a f2fs: drop largest extent by range
now we update extent by range, fofs may not be on the largest
extent if the new extent overlaps with it. so add a new function
to drop largest extent properly.

Signed-off-by: Fan li <fanofcode.li@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:51 -07:00
Jaegeuk Kim
a7230d16d5 f2fs: check end_io for metapages before making next checkpoint blocks
This patch avoids to produce new checkpoint blocks before the previous meta
pages were written completely.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:51 -07:00
Jaegeuk Kim
569cf1876a f2fs crypto: allocate buffer for decrypting filename
We got dentry pages from high_mem, and its address space directly goes into the
decryption path via f2fs_fname_disk_to_usr.
But, sg_init_one assumes the address is not from high_mem, so we can get this
panic since it doesn't call kmap_high but kunmap_high is triggered at the end.

kernel BUG at ../../../../../../kernel/mm/highmem.c:290!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
...
 (kunmap_high+0xb0/0xb8) from [<c0114534>] (__kunmap_atomic+0xa0/0xa4)
 (__kunmap_atomic+0xa0/0xa4) from [<c035f028>] (blkcipher_walk_done+0x128/0x1ec)
 (blkcipher_walk_done+0x128/0x1ec) from [<c0366c24>] (crypto_cbc_decrypt+0xc0/0x170)
 (crypto_cbc_decrypt+0xc0/0x170) from [<c0367148>] (crypto_cts_decrypt+0xc0/0x114)
 (crypto_cts_decrypt+0xc0/0x114) from [<c035ea98>] (async_decrypt+0x40/0x48)
 (async_decrypt+0x40/0x48) from [<c032ca34>] (f2fs_fname_disk_to_usr+0x124/0x304)
 (f2fs_fname_disk_to_usr+0x124/0x304) from [<c03056fc>] (f2fs_fill_dentries+0xac/0x188)
 (f2fs_fill_dentries+0xac/0x188) from [<c03059c8>] (f2fs_readdir+0x1f0/0x300)
 (f2fs_readdir+0x1f0/0x300) from [<c0218054>] (vfs_readdir+0x90/0xb4)
 (vfs_readdir+0x90/0xb4) from [<c0218418>] (SyS_getdents64+0x64/0xcc)
 (SyS_getdents64+0x64/0xcc) from [<c0105ba0>] (ret_fast_syscall+0x0/0x30)

Cc: <stable@vger.kernel.org>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:51 -07:00
Chao Yu
973163fc0c f2fs: reorganize f2fs_map_blocks
In this patch, we try to reorganize f2fs_map_blocks to make block mapping
flow more clear by using following structure:

/* check status of mapping */

if (unmapped) {
	/* blkaddr == NULL_ADDR || blkaddr == NEW_ADDR */

	if (create) {
		/* write path, handle dio write case here */
		alloc_and_map;
	} else {
		/*
		 * handle read cases from all call paths:
		 *     1. generic read;
		 *     2. dio read;
		 *     3. fiemap;
		 *     4. bmap
		 */
	}
}

/* map buffer_header */

Besides, this patch handles the missing case correctly for dio write:
When we fail in __allocate_data_blocks, then in f2fs_map_blocks, we will
not allocate blocks correctly for preallocated blocks, but returning with
an unmapped buffer head, which will result in failure of dio write.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:51 -07:00
Jaegeuk Kim
514053e454 f2fs: declare f2fs_update_extent_tree_range as static
This function should be static.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:50 -07:00
Chao Yu
9edcdabf36 f2fs: fix overflow of size calculation
We have potential overflow issue when calculating size of object, when
we left shift index with PAGE_CACHE_SHIFT bits, if type of index has only
32-bits space in 32-bit architecture, left shifting will incur overflow,
i.e:

pgoff_t index =  0xFFFFFFFF;
loff_t size = index << PAGE_CACHE_SHIFT;
size: 0xFFFFF000

So we should cast index with 64-bits type to avoid this issue.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:50 -07:00
Chao Yu
100136acfb f2fs: fix incorrect searching position when shrinking extent cache
When shrinking extent cache, we have two steps in the flow:
1) shrink objects which are unreferenced by inodes;
2) shrink objects from LRU list of extent cache.

In step 1, if we haven't shrunk enough number of objects, we will try
step 2, but before that we didn't update the searching position which
may point to last inode index in global extent tree, result in failing
to shrink objects by traversing the all inodes' extent tree.

In this patch, we reset searching position to beginning of global extent
tree for fixing.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:50 -07:00
Chao Yu
c998012b0b f2fs: verify file type early in f2fs_fallocate
This patch changes to verify file type early in f2fs_fallocate for
cleanup, meanwhile this also fixes to add missing verification for
expand_inode_data.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:50 -07:00
Jaegeuk Kim
c5cd29d21c f2fs: no need to lock for update_inode_page all the time
As comment says, we don't need to call f2fs_lock_op in write_inode to prevent
from producing dirty node pages all the time.
That happens only when there is not enough free sections and we can avoid that
by calling balance_fs in prior to that.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:50 -07:00
Jaegeuk Kim
25b93346a6 f2fs: cover number of dirty node pages under node_write lock
This number is referenced by checkpoint under node_write lock.

Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:49 -07:00
Nicholas Krause
538e17e7e9 f2fs: fix incorrect return statement in the function f2fs_ioc_release_volatile_write
This fixes the incorrect return statement at the end of the function
f2fs_ioc_release_volatile_write's body for returning zero as this is
incorrect due to the function call before this return statement to
the function punch_hole being able to fail and we should return this
function's return fail directly in order to signal to callers of the
function f2fs_ioc_release_volatile if a failure arises with this call
to punch_hole fails.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Reviewed-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:49 -07:00
Chao Yu
744288c721 f2fs: trace in batches extent info update
Rename trace_f2fs_update_extent_tree to trace_f2fs_update_extent_tree_range,
then expand and enable it to trace in batches extent info updates.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2015-10-09 16:20:49 -07:00
Christoph Hellwig
8c3ad9cb73 nfsd/blocklayout: accept any minlength
Recent Linux clients have started to send GETLAYOUT requests with
minlength less than blocksize.

Servers aren't really allowed to impose this kind of restriction on
layouts; see RFC 5661 section 18.43.3 for details.

This has been observed to cause indefinite hangs on fsx runs on some
clients.

Cc: stable@vger.kernel.org
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2015-10-09 16:11:40 -04:00
Jens Axboe
fd48ca3849 Linux 4.3-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJWEUxnAAoJEHm+PkMAQRiGYCYH/3gtGkFdvSLi+E1PfI8Qk3ZA
 XuYA4Mj09JBVSmaICeueMTDVrdiq0OE0zPib26GWlF/za13kNU8KgMR3+6XCuLSX
 DiCmh6mwDItoNoSIIUERLqrFHABXz8rZ3gb3uu2+kNN74Cl0piNm1YpFclEEWjMr
 9Wk5fkq+ontnDVUQOvWUxPiUXOJTvdLXBWTRDw1yTdE3RMNwRI2d/hme6Hq++WYV
 tRalZZKQaoB33js9WRVAoLVunvtna+i+/y7VGLj8QyS0+d6ec81Hey2r1/fR/oG4
 bs4ul6vtqeb3IR/PjUqxF59pSrCLEO+qrp9KrTlJNYgr1m1QyjRxWUdy/XhyaWo=
 =gIhN
 -----END PGP SIGNATURE-----

Merge tag 'v4.3-rc4' into for-4.4/core

Linux 4.3-rc4

Pulling in v4.3-rc4 to avoid conflicts with NVMe fixes that have gone
in since for-4.4/core was based.
2015-10-09 10:08:39 -06:00
Linus Torvalds
a0eeb8dd34 NFS client bugfixes for Linux 4.3
Highlights include:
 
 Bugfixes:
 - Fix a use-after-free bug in the RPC/RDMA client
 - Fix a write performance regression
 - Fix up page writeback accounting
 - Don't try to reclaim unused state owners
 - Fix a NFSv4 nograce recovery hang
 - reset states to use open_stateid when returning delegation voluntarily
 - Fix a tracepoint NULL-pointer dereference
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWFIgBAAoJEGcL54qWCgDy3qwQAJrMvwiO0shZe9+PsUZcDIhw
 1CnDmWYafJmpNGK+YEZatI+tdR9pwSYXdfiCGj/Ijfvl1PXUgyVAmNARAB9oFUza
 DVvZjqJ6aiFzeawGC8f2IfwY4XcAy4+BIZOiwp2JafepRnoSgZl24olKbO4cQ7UD
 i5IaDrYYvAxefsUoRogEF19H1y8zC1yUA2aDKrriV6A9rEZSbaZLRfS8BHppXBjY
 w0OP74neD4rnn/rL0YDEdsjiI17W7QwoMk05yzOJH3wQt/Y4Ll/lwLO4y3URpIGF
 wzHzMIeggGPPEM9e1JixPc3Y9F9kCHW8YjGJ3xxY2C6q8vt7dzpaVhh10AxycZtZ
 gcbepjMhoL7gJqu5DQ/0S86Sb5jNaL0KlUDsEnqtOfe3/UiyTJ/f57TMfdscm+wI
 pdyFFtxUHcFueO1a2XuEOuSIUFzFuwIQ2aiHlbu90ev04dd7dqzU0PffhRlzu3tJ
 8+ZHQMbSmotUmhxlpI+VA4rG0JUsaLY09chH5r0NvsXm0LR+z3vX7Q6oONN7IBDv
 5hULj4ecB69smBv+FjQyVUAu0LiahINAGu0p0wEjTdBwFMic5qpVVfhTs8qrkGRZ
 M8RYrANtVhY17fJf5WF7Wyt58icAWRKDHslGdzUav+2VFBfNK1ZeG+QhYYqDNF5k
 SkJsG4iCIN9JazwqfqJI
 =aoNS
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Bugfixes:
   - Fix a use-after-free bug in the RPC/RDMA client
   - Fix a write performance regression
   - Fix up page writeback accounting
   - Don't try to reclaim unused state owners
   - Fix a NFSv4 nograce recovery hang
   - reset states to use open_stateid when returning delegation
     voluntarily
   - Fix a tracepoint NULL-pointer dereference"

* tag 'nfs-for-4.3-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix a tracepoint NULL-pointer dereference
  nfs4: reset states to use open_stateid when returning delegation voluntarily
  NFSv4: Fix a nograce recovery hang
  NFSv4.1: nfs4_opendata_check_deleg needs to handle NFS4_OPEN_CLAIM_DELEG_CUR_FH
  NFSv4: Don't try to reclaim unused state owners
  NFS: Fix a write performance regression
  NFS: Fix up page writeback accounting
  xprtrdma: disconnect and flush cqs before freeing buffers
2015-10-07 08:54:22 +01:00
Anna Schumaker
39d0d3bdf7 NFS: Fix a tracepoint NULL-pointer dereference
Running xfstest generic/013 with the tracepoint nfs:nfs4_open_file
enabled produces a NULL-pointer dereference when calculating fileid and
filehandle of the opened file.  Fix this by checking if state is NULL
before trying to use the inode pointer.

Reported-by: Olga Kornievskaia <aglo@umich.edu>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2015-10-06 18:56:25 -04:00