Commit Graph

56429 Commits

Author SHA1 Message Date
Qu Wenruo
5f527822be btrfs: qgroup: Use generation-aware subtree swap to mark dirty extents
Before this patch, with quota enabled during balance, we need to mark
the whole subtree dirty for quota.

E.g.
OO = Old tree blocks (from file tree)
NN = New tree blocks (from reloc tree)

        File tree (src)		          Reloc tree (dst)
            OO (a)                              NN (a)
           /  \                                /  \
     (b) OO    OO (c)                    (b) NN    NN (c)
        /  \  /  \                          /  \  /  \
       OO  OO OO OO (d)                    OO  OO OO NN (d)

For old balance + quota case, quota will mark the whole src and dst tree
dirty, including all the 3 old tree blocks in reloc tree.

It's doable for small file tree or new tree blocks are all located at
lower level.

But for large file tree or new tree blocks are all located at higher
level, this will lead to mark the whole tree dirty, and be unbelievably
slow.

This patch will change how we handle such balance with quota enabled
case.

Now we will search from (b) and (c) for any new tree blocks whose
generation is equal to @last_snapshot, and only mark them dirty.

In above case, we only need to trace tree blocks NN(b), NN(c) and NN(d).
(NN(a) will be traced when COW happens for nodeptr modification).  And
also for tree blocks OO(b), OO(c), OO(d). (OO(a) will be traced when COW
happens for nodeptr modification.)

For above case, we could skip 3 tree blocks, but for larger tree, we can
skip tons of unmodified tree blocks, and hugely speed up balance.

This patch will introduce a new function,
btrfs_qgroup_trace_subtree_swap(), which will do the following main
work:

1) Read out real root eb
   And setup basic dst_path for later calls
2) Call qgroup_trace_new_subtree_blocks()
   To trace all new tree blocks in reloc tree and their counter
   parts in the file tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
ea49f3e73c btrfs: qgroup: Introduce function to find all new tree blocks of reloc tree
Introduce new function, qgroup_trace_new_subtree_blocks(), to iterate
all new tree blocks in a reloc tree.
So that qgroup could skip unrelated tree blocks during balance, which
should hugely speedup balance speed when quota is enabled.

The function qgroup_trace_new_subtree_blocks() itself only cares about
new tree blocks in reloc tree.

All its main works are:

1) Read out tree blocks according to parent pointers

2) Do recursive depth-first search
   Will call the same function on all its children tree blocks, with
   search level set to current level -1.
   And will also skip all children whose generation is smaller than
   @last_snapshot.

3) Call qgroup_trace_extent_swap() to trace tree blocks

So although we have parameter list related to source file tree, it's not
used at all, but only passed to qgroup_trace_extent_swap().
Thus despite the tree read code, the core should be pretty short and all
about recursive depth-first search.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
25982561db btrfs: qgroup: Introduce function to trace two swaped extents
Introduce a new function, qgroup_trace_extent_swap(), which will be used
later for balance qgroup speedup.

The basis idea of balance is swapping tree blocks between reloc tree and
the real file tree.

The swap will happen in highest tree block, but there may be a lot of
tree blocks involved.

For example:
 OO = Old tree blocks
 NN = New tree blocks allocated during balance

          File tree (257)                  Reloc tree for 257
L2              OO                                NN
              /    \                            /    \
L1          OO      OO (a)                    OO      NN (a)
           / \     / \                       / \     / \
L0       OO   OO OO   OO                   OO   OO NN   NN
                 (b)  (c)                          (b)  (c)

When calling qgroup_trace_extent_swap(), we will pass:
@src_eb = OO(a)
@dst_path = [ nodes[1] = NN(a), nodes[0] = NN(c) ]
@dst_level = 0
@root_level = 1

In that case, qgroup_trace_extent_swap() will search from OO(a) to
reach OO(c), then mark both OO(c) and NN(c) as qgroup dirty.

The main work of qgroup_trace_extent_swap() can be split into 3 parts:

1) Tree search from @src_eb
   It should acts as a simplified btrfs_search_slot().
   The key for search can be extracted from @dst_path->nodes[dst_level]
   (first key).

2) Mark the final tree blocks in @src_path and @dst_path qgroup dirty
   NOTE: In above case, OO(a) and NN(a) won't be marked qgroup dirty.
   They should be marked during preivous (@dst_level = 1) iteration.

3) Mark file extents in leaves dirty
   We don't have good way to pick out new file extents only.
   So we still follow the old method by scanning all file extents in
   the leave.

This function can free us from keeping two pathes, thus later we only need
to care about how to iterate all new tree blocks in reloc tree.

Signed-off-by: Qu Wenruo <wqu@suse.com>
[ copy changelog to function comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
c337e7b02f btrfs: qgroup: Introduce trace event to analyse the number of dirty extents accounted
Number of qgroup dirty extents is directly linked to the performance
overhead, so add a new trace event, trace_qgroup_num_dirty_extents(), to
record how many dirty extents is processed in
btrfs_qgroup_account_extents().

This will be pretty handy to analyze later balance performance
improvement.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:36 +02:00
Qu Wenruo
fa6ac71524 btrfs: relocation: Add basic extent backref related comments for build_backref_tree
fs/btrfs/relocation.c:build_backref_tree() is some code from 2009 era,
although it works pretty fine, it's not that easy to understand.
Especially combined with the complex btrfs backref format.

This patch adds some basic comment for the backref build part of the
code, making it less hard to read, at least for backref searching part.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Omar Sandoval
4779cc0424 Btrfs: get rid of btrfs_symlink_aops
The only aops we define for symlinks are identical to the aops for
regular files. This has been the case since symlink support was added in
commit 2b8d99a723 ("Btrfs: symlinks and hard links"). As far as I can
tell, there wasn't a good reason to have separate aops then, and there
isn't now, so let's just do what most other filesystems do and reuse the
same structure.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Chris Mason
7703bdd8d2 Btrfs: don't clean dirty pages during buffered writes
During buffered writes, we follow this basic series of steps:

again:
	lock all the pages
	wait for writeback on all the pages
	Take the extent range lock
	wait for ordered extents on the whole range
	clean all the pages

	if (copy_from_user_in_atomic() hits a fault) {
		drop our locks
		goto again;
	}

	dirty all the pages
	release all the locks

The extra waiting, cleaning and locking are there to make sure we don't
modify pages in flight to the drive, after they've been crc'd.

If some of the pages in the range were already dirty when the write
began, and we need to goto again, we create a window where a dirty page
has been cleaned and unlocked.  It may be reclaimed before we're able to
lock it again, which means we'll read the old contents off the drive and
lose any modifications that had been pending writeback.

We don't actually need to clean the pages.  All of the other locking in
place makes sure we don't start IO on the pages, so we can just leave
them dirty for the duration of the write.

Fixes: 73d59314e6 (the original btrfs merge)
CC: stable@vger.kernel.org # v4.4+
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
David Sterba
818255feec btrfs: use common helper instead of open coding a bit test
The helper does the same math and we take care about the special case
when flags is 0 too.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Nikolay Borisov
0110a4c434 btrfs: refactor __btrfs_run_delayed_refs loop
Refactor the delayed refs loop by using the newly introduced
btrfs_run_delayed_refs_for_head function. This greatly simplifies
__btrfs_run_delayed_refs and makes it more obvious what is happening.

We now have 1 loop which iterates the existing delayed_heads and then
each selected ref head is processed by the new helper. All existing
semantics of the code are preserved so no functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Nikolay Borisov
e726138676 btrfs: Factor out loop processing all refs of a head
This patch introduces a new helper encompassing the implicit inner loop
in __btrfs_run_delayed_refs which processes all the refs for a given
head. The code is mostly copy/paste, the only difference is that if we
detect a newer reference then -EAGAIN is returned so that callers can
react correctly.

Also, at the end of the loop the head is relocked and
btrfs_merge_delayed_refs is run again to retain the pre-refactoring
semantics.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:35 +02:00
Nikolay Borisov
b1cdbcb53a btrfs: Factor out ref head locking code in __btrfs_run_delayed_refs
This is in preparation to refactor the giant loop in
__btrfs_run_delayed_refs. As a first step define a new function
which implements acquiring a reference to a btrfs_delayed_refs_head and
use it. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
David Sterba
b2fa11547b btrfs: tests: polish ifdefs around testing helper
Avoid the inline ifdefs and use two sections for self-tests enabled and
disabled.

Though there could be no ifdef and unconditional test_bit of
BTRFS_FS_STATE_DUMMY_FS_INFO, the static inline can help to optimize out
any code that would depend on conditions using btrfs_is_testing.

As this is only for the testing code, drop unlikely().

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
David Sterba
a654666a34 btrfs: tests: group declarations of self-test helpers
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
David Sterba
57ec5fb478 btrfs: tests: move testing members of struct btrfs_root to the end
The data used only for tests are better placed at the end of the
structure so that they don't change the structure layout. All new
members of btrfs_root should be placed before.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
David Sterba
9c36396c2a btrfs: tests: add separate stub for find_lock_delalloc_range
The helper find_lock_delalloc_range is now conditionally built static,
dpending on whether the self-tests are enabled or not. There's a macro
that is supposed to hide the export, used only once. To discourage
further use, drop it an add a public wrapper for the helper needed by
tests.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:34 +02:00
Liu Bo
ecf160b424 Btrfs: preftree: use rb_first_cached
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

While resolving indirect refs and missing refs, it always looks for the
first rb entry in a while loop, it's helpful to use rb_first_cached
instead.

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
07e1ce096d Btrfs: extent_map: use rb_first_cached
rb_first_cached() trades an extra pointer "leftmost" for doing the
same job as rb_first() but in O(1).

As evict_inode_truncate_pages() removes all extent mapping by always
looking for the first rb entry, it's helpful to use rb_first_cached
instead.

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
03a1d4c891 Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
rb_first_cached() trades an extra pointer "leftmost" for doing the same job as
rb_first() but in O(1).

Functions manipulating delayed_item need to get the first entry, this converts
it to use rb_first_cached().

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
e3d0396563 Btrfs: delayed-refs: use rb_first_cached for ref_tree
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href->ref_tree need to get the first entry, this
converts href->ref_tree to use rb_first_cached().

For more details about the optimization see patch "Btrfs: delayed-refs:
use rb_first_cached for href_root".

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Liu Bo
5c9d028b3b Btrfs: delayed-refs: use rb_first_cached for href_root
rb_first_cached() trades an extra pointer "leftmost" for doing the same
job as rb_first() but in O(1).

Functions manipulating href_root need to get the first entry, this
converts href_root to use rb_first_cached().

This patch is first in the sequenct of similar updates to other rbtrees
and this is analysis of the expected behaviour and improvements.

There's a common pattern:

while (node = rb_first) {
        entry = rb_entry(node)
        next = rb_next(node)
        rb_erase(node)
        cleanup(entry)
}

rb_first needs to traverse the tree up to logN depth, rb_erase can
completely reshuffle the tree. With the caching we'll skip the traversal
in rb_first.  That's a cached memory access vs looped pointer
dereference trade-off that IMHO has a clear winner.

Measurements show there's not much difference in a sample tree with
10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
caching and pointer chasing are unpredictable though.

Further optimzations can be done to avoid the expensive rb_erase step.
In some cases it's ok to process the nodes in any order, so the tree can
be traversed in post-order, not rebalancing the children nodes and just
calling free. Care must be taken regarding the next node.

Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog from mail discussions ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Josef Bacik
3aa7c7a31c btrfs: wait on caching when putting the bg cache
While testing my backport I noticed there was a panic if I ran
generic/416 generic/417 generic/418 all in a row.  This just happened to
uncover a race where we had outstanding IO after we destroy all of our
workqueues, and then we'd go to queue the endio work on those free'd
workqueues.

This is because we aren't waiting for the caching threads to be done
before freeing everything up, so to fix this make sure we wait on any
outstanding caching that's being done before we free up the block group,
so we're sure to be done with all IO by the time we get to
btrfs_stop_all_workers().  This fixes the panic I was seeing
consistently in testing.

------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:6112!
SMP PTI
Modules linked in:
CPU: 1 PID: 27165 Comm: kworker/u4:7 Not tainted 4.16.0-02155-g3553e54a578d-dirty #875
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Workqueue: btrfs-cache btrfs_cache_helper
RIP: 0010:btrfs_map_bio+0x346/0x370
RSP: 0000:ffffc900061e79d0 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff880071542e00 RCX: 0000000000533000
RDX: ffff88006bb74380 RSI: 0000000000000008 RDI: ffff880078160000
RBP: 0000000000000001 R08: ffff8800781cd200 R09: 0000000000503000
R10: ffff88006cd21200 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8800781cd200 R15: ffff880071542e00
FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000817ffc4 CR3: 0000000078314000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 btree_submit_bio_hook+0x8a/0xd0
 submit_one_bio+0x5d/0x80
 read_extent_buffer_pages+0x18a/0x320
 btree_read_extent_buffer_pages+0xbc/0x200
 ? alloc_extent_buffer+0x359/0x3e0
 read_tree_block+0x3d/0x60
 read_block_for_search.isra.30+0x1a5/0x360
 btrfs_search_slot+0x41b/0xa10
 btrfs_next_old_leaf+0x212/0x470
 caching_thread+0x323/0x490
 normal_work_helper+0xc5/0x310
 process_one_work+0x141/0x340
 worker_thread+0x44/0x3c0
 kthread+0xf8/0x130
 ? process_one_work+0x340/0x340
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40
RIP: btrfs_map_bio+0x346/0x370 RSP: ffffc900061e79d0
---[ end trace 827eb13e50846033 ]---
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:33 +02:00
Jeff Mahoney
fee7acc361 btrfs: keep trim from interfering with transaction commits
Commit 499f377f49 (btrfs: iterate over unused chunk space in FITRIM)
fixed free space trimming, but introduced latency when it was running.
This is due to it pinning the transaction using both a incremented
refcount and holding the commit root sem for the duration of a single
trim operation.

This was to ensure safety but it's unnecessary.  We already hold the the
chunk mutex so we know that the chunk we're using can't be allocated
while we're trimming it.

In order to check against chunks allocated already in this transaction,
we need to check the pending chunks list.  To to that safely without
joining the transaction (or attaching than then having to commit it) we
need to ensure that the dev root's commit root doesn't change underneath
us and the pending chunk lists stays around until we're done with it.

We can ensure the former by holding the commit root sem and the latter
by pinning the transaction.  We do this now, but the critical section
covers the trim operation itself and we don't need to do that.

This patch moves the pinning and unpinning logic into helpers and unpins
the transaction after performing the search and check for pending
chunks.

Limiting the critical section of the transaction pinning improves the
latency substantially on slower storage (e.g. image files over NFS).

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Jeff Mahoney
0be88e367f btrfs: don't attempt to trim devices that don't support it
We check whether any device the file system is using supports discard in
the ioctl call, but then we attempt to trim free extents on every device
regardless of whether discard is supported.  Due to the way we mask off
EOPNOTSUPP, we can end up issuing the trim operations on each free range
on devices that don't support it, just wasting time.

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Jeff Mahoney
d4e329de5e btrfs: iterate all devices during trim, instead of fs_devices::alloc_list
btrfs_trim_fs iterates over the fs_devices->alloc_list while holding the
device_list_mutex.  The problem is that ->alloc_list is protected by the
chunk mutex.  We don't want to hold the chunk mutex over the trim of the
entire file system.  Fortunately, the ->dev_list list is protected by
the dev_list mutex and while it will give us all devices, including
read-only devices, we already just skip the read-only devices.  Then we
can continue to take and release the chunk mutex while scanning each
device.

Fixes: 499f377f49 ("btrfs: iterate over unused chunk space in FITRIM")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Qu Wenruo
6ba9fc8e62 btrfs: Ensure btrfs_trim_fs can trim the whole filesystem
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.

[CAUSE]
Before fstrim_range passed to btrfs_trim_fs(), it gets truncated to
range [0, super->total_bytes).  So later btrfs_trim_fs() will only be
able to trim block groups in range [0, super->total_bytes).

While for btrfs, any bytenr aligned to sectorsize is valid, since btrfs
uses its logical address space, there is nothing limiting the location
where we put block groups.

For filesystem with frequent balance, it's quite easy to relocate all
block groups and bytenr of block groups will start beyond
super->total_bytes.

In that case, btrfs will not trim existing block groups.

[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].

Reported-by: Chris Murphy <lists@colorremedies.com>
Fixes: f4c697e640 ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
CC: <stable@vger.kernel.org> # v4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Qu Wenruo
93bba24d4b btrfs: Enhance btrfs_trim_fs function to handle error better
Function btrfs_trim_fs() doesn't handle errors in a consistent way. If
error happens when trimming existing block groups, it will skip the
remaining blocks and continue to trim unallocated space for each device.

The return value will only reflect the final error from device trimming.

This patch will fix such behavior by:

1) Recording the last error from block group or device trimming
   The return value will also reflect the last error during trimming.
   Make developer more aware of the problem.

2) Continuing trimming if possible
   If we failed to trim one block group or device, we could still try
   the next block group or device.

3) Report number of failures during block group and device trimming
   It would be less noisy, but still gives user a brief summary of
   what's going wrong.

Such behavior can avoid confusion for cases like failure to trim the
first block group and then only unallocated space is trimmed.

Reported-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add bg_ret and dev_ret to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:32 +02:00
Jeff Mahoney
5c06147128 btrfs: fix error handling in btrfs_dev_replace_start
When we fail to start a transaction in btrfs_dev_replace_start, we leave
dev_replace->replace_start set to STARTED but clear ->srcdev and
->tgtdev.  Later, that can result in an Oops in
btrfs_dev_replace_progress when having state set to STARTED or SUSPENDED
implies that ->srcdev is valid.

Also fix error handling when the state is already STARTED or SUSPENDED
while starting.  That, too, will clear ->srcdev and ->tgtdev even though
it doesn't own them.  This should be an impossible case to hit since we
should be protected by the BTRFS_FS_EXCL_OP bit being set.  Let's add an
ASSERT there while we're at it.

Fixes: e93c89c1aa (Btrfs: add new sources for device replace code)
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
zhong jiang
c1766dd782 btrfs: change remove_extent_mapping to return void
remove_extent_mapping uses the variable "ret" for return value, but it
is not modified after initialzation. Further, I find that any of the
callers do not handle the return value and the callees are only simple
functions so the return values does not need to be passed.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Nikolay Borisov
315bed43fe btrfs: handle error of get_old_root
In btrfs_search_old_slot get_old_root is always used with the assumption
it cannot fail. However, this is not true in rare circumstance it can
fail and return null. This will lead to null point dereference when the
header is read. Fix this by checking the return value and properly
handling NULL by setting ret to -EIO and returning gracefully.

Coverity-id: 1087503
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Nikolay Borisov
28bee48982 btrfs: Remove logically dead code from btrfs_orphan_cleanup
In btrfs_orphan_cleanup the final 'if (ret) goto out' cannot ever be
executed. This is due to the last assignment to 'ret' depending on the
return value of btrfs_iget. If an error other than -ENOENT is returned
then the loop is prematurely terminated by 'goto out'.  On the other
hand, if the error value is ENOENT then a subsequent if branch is
executed that always re-assigns 'ret' and in case it's an error just
terminates the loop. No functional changes.

Coverity-id: 1437392
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Liu Bo
4183c52ce8 Btrfs: remove wait_ordered_range in btrfs_evict_inode
When we delete an inode,

btrfs_evict_inode() {
    truncate_inode_pages_final()
        truncate_inode_pages_range()
            lock_page()
            truncate_cleanup_page()
                 btrfs_invalidatepage()
                      wait_on_page_writeback
                           btrfs_lookup_ordered_range()
                 cancel_dirty_page()
           unlock_page()
     ...
     btrfs_wait_ordered_range()
     ...

As VFS has called ->invalidatepage() to get all ordered extents done (if
there are any) and truncated all page cache pages (no dirty pages to
writeback after this step), wait_ordered_range() is just a noop.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Liu Bo
abb57ef3ff Btrfs: skip set_page_dirty if eb pages are already dirty
As long as @eb is marked with EXTENT_BUFFER_DIRTY, all of its pages
are dirty, so no need to set pages dirty again.

Ftrace showed that the loop took 10us on my dev box, so removing this
can save us at least 10us if eb is already dirty and otherwise avoid a
potentially expensive calls to set_page_dirty.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:31 +02:00
Liu Bo
51995c399b Btrfs: assert page dirty bit on extent buffer pages
Just in case that someone breaks the rule that pages are dirty as long
as eb is dirty. The next patch will dirty the pages conditionally.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Liu Bo
98e6b1eb40 Btrfs: remove unnecessary level check in balance_level
In the callchain:

btrfs_search_slot()
   if (level != 0)
      setup_nodes_for_search()
          balance_level()

It is just impossible to have level=0 in balance_level, we can drop the
check.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Liu Bo
3cf5068f3d Btrfs: unify error handling of btrfs_lookup_dir_item
Unify the error handling of directory item lookups using IS_ERR_OR_NULL.
No functional changes.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Liu Bo
3b2fd80160 Btrfs: use args in the correct order for kcalloc in btrfsic_read_block
kcalloc is defined as:

  kcalloc(size_t n, size_t size, gfp_t flags)

Although this won't cause problems in practice, btrfsic_read_block()
uses kcalloc with n and size in the opposite order.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Nikolay Borisov
a27a94c2b0 btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly
Instead of returning an error value and using one of the parameters for
returning the actual object we are interested in just refactor the
function to directly return btrfs_device *. Also bubble up the error
handling for the special BTRFS_ERROR_DEV_MISSING_NOT_FOUND value into
btrfs_rm_device. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:30 +02:00
Nikolay Borisov
6c05040702 btrfs: Make btrfs_find_device_missing_or_by_path return directly a device
This function returns a numeric error value and additionally the
device found in one of its input parameters. Simplify this by making
the function directly return a pointer to btrfs_device. Additionally
adjust the caller to handle the case when we want to remove the
'missing' device and ENOENT is returned to return the expected
positive error value, parsed by progs. Finally, unexport the function
since it's not called outside of volume.c. No functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Nikolay Borisov
b444ad46b2 btrfs: Make btrfs_find_device_by_path return struct btrfs_device
Currently this function returns an error code as well as uses one of
its arguments as a return value for struct btrfs_device. Change the
function so that it returns btrfs_device directly and use the usual
"encode error in pointer" mechanics if something goes wrong. No
functional changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Jeff Mahoney
374b0e2d6b btrfs: fix error handling in free_log_tree
When we hit an I/O error in free_log_tree->walk_log_tree during file system
shutdown we can crash due to there not being a valid transaction handle.

Use btrfs_handle_fs_error when there's no transaction handle to use.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000060
  IP: free_log_tree+0xd2/0x140 [btrfs]
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
  Modules linked in: <modules>
  CPU: 2 PID: 23544 Comm: umount Tainted: G        W        4.12.14-kvmsmall #9 SLE15 (unreleased)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
  task: ffff96bfd3478880 task.stack: ffffa7cf40d78000
  RIP: 0010:free_log_tree+0xd2/0x140 [btrfs]
  RSP: 0018:ffffa7cf40d7bd10 EFLAGS: 00010282
  RAX: 00000000fffffffb RBX: 00000000fffffffb RCX: 0000000000000002
  RDX: 0000000000000000 RSI: ffff96c02f07d4c8 RDI: 0000000000000282
  RBP: ffff96c013cf1000 R08: ffff96c02f07d4c8 R09: ffff96c02f07d4d0
  R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
  R13: ffff96c005e800c0 R14: ffffa7cf40d7bdb8 R15: 0000000000000000
  FS:  00007f17856bcfc0(0000) GS:ffff96c03f600000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000060 CR3: 0000000045ed6002 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? wait_for_writer+0xb0/0xb0 [btrfs]
   btrfs_free_log+0x17/0x30 [btrfs]
   btrfs_drop_and_free_fs_root+0x9a/0xe0 [btrfs]
   btrfs_free_fs_roots+0xc0/0x130 [btrfs]
   ? wait_for_completion+0xf2/0x100
   close_ctree+0xea/0x2e0 [btrfs]
   ? kthread_stop+0x161/0x260
   generic_shutdown_super+0x6c/0x120
   kill_anon_super+0xe/0x20
   btrfs_kill_super+0x13/0x100 [btrfs]
   deactivate_locked_super+0x3f/0x70
   cleanup_mnt+0x3b/0x70
   task_work_run+0x78/0x90
   exit_to_usermode_loop+0x77/0xa6
   do_syscall_64+0x1c5/0x1e0
   entry_SYSCALL_64_after_hwframe+0x42/0xb7
  RIP: 0033:0x7f1784f90827
  RSP: 002b:00007ffdeeb03118 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
  RAX: 0000000000000000 RBX: 0000556a60c62970 RCX: 00007f1784f90827
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556a60c62b50
  RBP: 0000000000000000 R08: 0000000000000005 R09: 00000000ffffffff
  R10: 0000556a60c63900 R11: 0000000000000246 R12: 0000556a60c62b50
  R13: 00007f17854a81c4 R14: 0000000000000000 R15: 0000000000000000
  RIP: free_log_tree+0xd2/0x140 [btrfs] RSP: ffffa7cf40d7bd10
  CR2: 0000000000000060

Fixes: 681ae50917 ("Btrfs: cleanup reserved space when freeing tree log on error")
CC: <stable@vger.kernel.org> # v3.13
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Misono Tomohiro
380fd06640 btrfs: remove redundant variable from btrfs_cross_ref_exist
Since commit d7df2c796d ("Btrfs attach delayed ref updates to delayed
ref heads"), check_delayed_ref() won't return -ENOENT.

In btrfs_cross_ref_exist(), two variables 'ret' and 'ret2' are
originally used to handle -ENOENT error case.  Since the code is not
needed anymore, let's just remove 'ret2'.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Liu Bo
e49aabd973 Btrfs: set leave_spinning in btrfs_get_extent
Unless it's going to read inline extents from btree leaf to page,
btrfs_get_extent won't sleep during the period of holding path lock.

This sets leave_spinning at first and sets path to blocking mode right
before reading inline extent if that's the case.  The benefit is that a
path in spinning mode typically has lower impact (faster) on waiters
rather than that in the blocking mode.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Liu Bo
de2c6615dc Btrfs: fix alignment in declaration and prototype of btrfs_get_extent
This fixes btrfs_get_extent to be consistent with our existing
declaration style.

Note: For the record, indentation styles that are accepted are both,
aligning under the opening ( and tab or double tab indentation on the
next line.  Preferrably not spliting the type or long expressions in the
argument lists.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:29 +02:00
Colin Ian King
29c5e5d496 btrfs: remove unused pointer 'tree' in btrfs_submit_compressed_read
Pointer 'tree' is being assigned but is never used hence it is redundant
and can be removed. This is a leftover from cleanup patch
00032d38ea ("btrfs: drop extent_io_ops::merge_bio_hook
callback").

Cleans up clang warning:

  warning: variable 'tree' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Colin Ian King
d005dbeca0 btrfs: remove unused pointer inode in relink_file_extents
Pointer inode is being assigned but is never used hence it is redundant
and can be removed. It's been unused since the introduction in
38c227d87c ("Btrfs: snapshot-aware defrag").

Cleans up clang warning:

  variable ‘inode’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Su Yue
28c4a3e21a btrfs: defrag: use btrfs_mod_outstanding_extents in cluster_pages_for_defrag
Since commit 8b62f87bad ("Btrfs: rework outstanding_extents"),
manual operations of outstanding_extent in btrfs_inode are replaced by
btrfs_mod_outstanding_extents().
The one in cluster_pages_for_defrag seems to be lost, so replace it
of btrfs_mod_outstanding_extents().

Fixes: 8b62f87bad ("Btrfs: rework outstanding_extents")
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Liu Bo
6aadd9eb74 Btrfs: remove confusing tracepoint in btrfs_add_reserved_bytes
Here we're not releasing any space, but transferring bytes from
->bytes_may_use to ->bytes_reserved. The last change to the code in
commit 18513091af ("btrfs: update btrfs_space_info's
bytes_may_use timely") removed a conditional tracepoint and the logic
changed too but the tracepiont remained.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Liu Bo
c64142807f btrfs: free path at an earlier point in btrfs_get_extent
trace_btrfs_get_extent() has nothing to do with path, place
btrfs_free_path ahead so that we can unlock path on error.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Liu Bo
9688e9a99e Btrfs: use next_state in find_first_extent_bit
As next_state() is already defined to get the next state, use it in
find_first_extent_bit. No functional changes.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:28 +02:00
Qu Wenruo
b72c3aba09 btrfs: locking: Add extra check in btrfs_init_new_buffer() to avoid deadlock
[BUG]
For certain crafted image, whose csum root leaf has missing backref, if
we try to trigger write with data csum, it could cause deadlock with the
following kernel WARN_ON():

  WARNING: CPU: 1 PID: 41 at fs/btrfs/locking.c:230 btrfs_tree_lock+0x3e2/0x400
  CPU: 1 PID: 41 Comm: kworker/u4:1 Not tainted 4.18.0-rc1+ #8
  Workqueue: btrfs-endio-write btrfs_endio_write_helper
  RIP: 0010:btrfs_tree_lock+0x3e2/0x400
  Call Trace:
   btrfs_alloc_tree_block+0x39f/0x770
   __btrfs_cow_block+0x285/0x9e0
   btrfs_cow_block+0x191/0x2e0
   btrfs_search_slot+0x492/0x1160
   btrfs_lookup_csum+0xec/0x280
   btrfs_csum_file_blocks+0x2be/0xa60
   add_pending_csums+0xaf/0xf0
   btrfs_finish_ordered_io+0x74b/0xc90
   finish_ordered_fn+0x15/0x20
   normal_work_helper+0xf6/0x500
   btrfs_endio_write_helper+0x12/0x20
   process_one_work+0x302/0x770
   worker_thread+0x81/0x6d0
   kthread+0x180/0x1d0
   ret_from_fork+0x35/0x40

[CAUSE]
That crafted image has missing backref for csum tree root leaf.  And
when we try to allocate new tree block, since there is no
EXTENT/METADATA_ITEM for csum tree root, btrfs consider it's free slot
and use it.

The extent tree of the image looks like:

  Normal image                      |       This fuzzed image
  ----------------------------------+--------------------------------
  BG 29360128                       | BG 29360128
   One empty slot                   |  One empty slot
  29364224: backref to UUID tree    | 29364224: backref to UUID tree
   Two empty slots                  |  Two empty slots
  29376512: backref to CSUM tree    |  One empty slot (bad type) <<<
  29380608: backref to D_RELOC tree | 29380608: backref to D_RELOC tree
  ...                               | ...

Since bytenr 29376512 has no METADATA/EXTENT_ITEM, when btrfs try to
alloc tree block, it's an valid slot for btrfs.

And for finish_ordered_write, when we need to insert csum, we try to CoW
csum tree root.

By accident, empty slots at bytenr BG_OFFSET, BG_OFFSET + 8K,
BG_OFFSET + 12K is already used by tree block COW for other trees, the
next empty slot is BG_OFFSET + 16K, which should be the backref for CSUM
tree.

But due to the bad type, btrfs can recognize it and still consider it as
an empty slot, and will try to use it for csum tree CoW.

Then in the following call trace, we will try to lock the new tree
block, which turns out to be the old csum tree root which is already
locked:

btrfs_search_slot() called on csum tree root, which is at 29376512
|- btrfs_cow_block()
   |- btrfs_set_lock_block()
   |  |- Now locks tree block 29376512 (old csum tree root)
   |- __btrfs_cow_block()
      |- btrfs_alloc_tree_block()
         |- btrfs_reserve_extent()
            | Now it returns tree block 29376512, which extent tree
            | shows its empty slot, but it's already hold by csum tree
            |- btrfs_init_new_buffer()
               |- btrfs_tree_lock()
                  | Triggers WARN_ON(eb->lock_owner == current->pid)
                  |- wait_event()
                     Wait lock owner to release the lock, but it's
                     locked by ourself, so it will deadlock

[FIX]
This patch will do the lock_owner and current->pid check at
btrfs_init_new_buffer().
So above deadlock can be avoided.

Since such problem can only happen in crafted image, we will still
trigger kernel warning for later aborted transaction, but with a little
more meaningful warning message.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200405
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
Qu Wenruo
65c6e82bec btrfs: Handle owner mismatch gracefully when walking up tree
[BUG]
When mounting certain crafted image, btrfs will trigger kernel BUG_ON()
when trying to recover balance:

  kernel BUG at fs/btrfs/extent-tree.c:8956!
  invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 1 PID: 662 Comm: mount Not tainted 4.18.0-rc1-custom+ #10
  RIP: 0010:walk_up_proc+0x336/0x480 [btrfs]
  RSP: 0018:ffffb53540c9b890 EFLAGS: 00010202
  Call Trace:
   walk_up_tree+0x172/0x1f0 [btrfs]
   btrfs_drop_snapshot+0x3a4/0x830 [btrfs]
   merge_reloc_roots+0xe1/0x1d0 [btrfs]
   btrfs_recover_relocation+0x3ea/0x420 [btrfs]
   open_ctree+0x1af3/0x1dd0 [btrfs]
   btrfs_mount_root+0x66b/0x740 [btrfs]
   mount_fs+0x3b/0x16a
   vfs_kern_mount.part.9+0x54/0x140
   btrfs_mount+0x16d/0x890 [btrfs]
   mount_fs+0x3b/0x16a
   vfs_kern_mount.part.9+0x54/0x140
   do_mount+0x1fd/0xda0
   ksys_mount+0xba/0xd0
   __x64_sys_mount+0x21/0x30
   do_syscall_64+0x60/0x210
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
Extent tree corruption.  In this particular case, reloc tree root's
owner is DATA_RELOC_TREE (should be TREE_RELOC), thus its backref is
corrupted and we failed the owner check in walk_up_tree().

[FIX]
It's pretty hard to take care of every extent tree corruption, but at
least we can remove such BUG_ON() and exit more gracefully.

And since in this particular image, DATA_RELOC_TREE and TREE_RELOC share
the same root (which is obviously invalid), we needs to make
__del_reloc_root() more robust to detect such invalid sharing to avoid
possible NULL dereference as root->node can be NULL in this case.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=200411
Reported-by: Xu Wen <wen.xu@gatech.edu>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
zhong jiang
45128b08f7 btrfs: change btrfs_pin_log_trans to return void
btrfs_pin_log_trans defines the variable "ret" for return value, but it
is not modified after initialization. Further, I find that none of the
callers do handles the return value, so it is safe to drop the unneeded
"ret" and make it return void.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
zhong jiang
556f3ca88e btrfs: change btrfs_free_reserved_bytes to return void
btrfs_free_reserved_bytes uses the variable "ret" for return value,
but it is not modified after initialzation. Further, I find that any of
the callers do not handle the return value, so it is safe to drop the
unneeded "ret" and return void. There are no callees that would need the
function to handle or pass the value either.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
Liu Bo
bee6ec822a Btrfs: remove always true if branch in btrfs_get_extent
@path is always NULL when it comes to the if branch.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:27 +02:00
Qu Wenruo
9c7b0c2e8d btrfs: qgroup: Dirty all qgroups before rescan
[BUG]
In the following case, rescan won't zero out the number of qgroup 1/0:

  $ mkfs.btrfs -fq $DEV
  $ mount $DEV /mnt

  $ btrfs quota enable /mnt
  $ btrfs qgroup create 1/0 /mnt
  $ btrfs sub create /mnt/sub
  $ btrfs qgroup assign 0/257 1/0 /mnt

  $ dd if=/dev/urandom of=/mnt/sub/file bs=1k count=1000
  $ btrfs sub snap /mnt/sub /mnt/snap
  $ btrfs quota rescan -w /mnt
  $ btrfs qgroup show -pcre /mnt
  qgroupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5          16.00KiB     16.00KiB         none         none ---     ---
  0/257      1016.00KiB     16.00KiB         none         none 1/0     ---
  0/258      1016.00KiB     16.00KiB         none         none ---     ---
  1/0        1016.00KiB     16.00KiB         none         none ---     0/257

So far so good, but:

  $ btrfs qgroup remove 0/257 1/0 /mnt
  WARNING: quotas may be inconsistent, rescan needed
  $ btrfs quota rescan -w /mnt
  $ btrfs qgroup show -pcre  /mnt
  qgoupid         rfer         excl     max_rfer     max_excl parent  child
  --------         ----         ----     --------     -------- ------  -----
  0/5          16.00KiB     16.00KiB         none         none ---     ---
  0/257      1016.00KiB     16.00KiB         none         none ---     ---
  0/258      1016.00KiB     16.00KiB         none         none ---     ---
  1/0        1016.00KiB     16.00KiB         none         none ---     ---
	     ^^^^^^^^^^     ^^^^^^^^ not cleared

[CAUSE]
Before rescan we call qgroup_rescan_zero_tracking() to zero out all
qgroups' accounting numbers.

However we don't mark all qgroups dirty, but rely on rescan to do so.

If we have any high level qgroup without children, it won't be marked
dirty during rescan, since we cannot reach that qgroup.

This will cause QGROUP_INFO items of childless qgroups never get updated
in the quota tree, thus their numbers will stay the same in "btrfs
qgroup show" output.

[FIX]
Just mark all qgroups dirty in qgroup_rescan_zero_tracking(), so even if
we have childless qgroups, their QGROUP_INFO items will still get
updated during rescan.

Reported-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Tested-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Omar Sandoval
3293428096 Btrfs: clean up scrub is_dev_replace parameter
struct scrub_ctx has an ->is_dev_replace member, so there's no point in
passing around is_dev_replace where sctx is available.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Anand Jain
1da739678e btrfs: add helper to obtain number of devices with ongoing dev-replace
When the replace is running the fs_devices::num_devices also includes
the replaced device, however in some operations like device delete and
balance it needs the actual num_devices without the repalced devices.
The function btrfs_num_devices() just provides that.

And here is a scenario how balance and repalce items could co-exist:

Consider balance is started and paused, now start the replace followed
by a unmount or system power-cycle. During following mount, the
open_ctree() first restarts the balance so it must check for the device
replace otherwise our num_devices calculation will be wrong.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Anand Jain
16220c467a btrfs: add assertions where number of devices could go below 0
In preparation to add helper function to deduce the num_devices with
replace running, use assert instead of BUG_ON or WARN_ON. The number of
devices would not normally drop to 0 due to other checks so the assert
is sufficient.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, adjust the assert condition ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
zhong jiang
f8b00e0f06 btrfs: remove unneeded NULL checks before kfree
Kfree has taken the NULL pointer into account. So remove the check
before kfree.

The issue is detected with the help of Coccinelle.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Liu Bo
4b6f8e9695 Btrfs: do not unnecessarily pass write_lock_level when processing leaf
As we're going to return right after the call, it's not necessary to get
update the new write_lock_level from unlock_up.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:26 +02:00
Misono Tomohiro
4fd786e6c3 btrfs: Remove 'objectid' member from struct btrfs_root
There are two members in struct btrfs_root which indicate root's
objectid: objectid and root_key.objectid.

They are both set to the same value in __setup_root():

  static void __setup_root(struct btrfs_root *root,
                           struct btrfs_fs_info *fs_info,
                           u64 objectid)
  {
    ...
    root->objectid = objectid;
    ...
    root->root_key.objectid = objecitd;
    ...
  }

and not changed to other value after initialization.

grep in btrfs directory shows both are used in many places:
  $ grep -rI "root->root_key.objectid" | wc -l
  133
  $ grep -rI "root->objectid" | wc -l
  55
 (4.17, inc. some noise)

It is confusing to have two similar variable names and it seems
that there is no rule about which should be used in a certain case.

Since ->root_key itself is needed for tree reloc tree, let's remove
'objecitd' member and unify code to use ->root_key.objectid in all places.

Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
5a2cb25ab9 btrfs: remove a useless return statement in btrfs_block_rsv_add
Since ret must be 0 here, don't have to return.  No functional change
and code readability is not hurt.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
684572df94 btrfs: Remove root parameter from btrfs_insert_dir_item
All callers pass the root tree of dir, we can push that down to the
function itself.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
3a58417486 btrfs: switch update_size to bool in btrfs_block_rsv_migrate and btrfs_rsv_add_bytes
Using true and false here is closer to the expected semantic than using
0 and 1.  No functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Lu Fengqi
a7176f74fa btrfs: simplify the send_in_progress check in btrfs_delete_subvolume
Only when send_in_progress, we have to do something different such as
btrfs_warn() and return -EPERM. Therefore, we could check
send_in_progress first and process error handling, after the
root_item_lock has been got.

Just for better readability. No functional change.

Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15 17:23:25 +02:00
Dan Schatzberg
5571f1e654 fuse: enable caching of symlinks
FUSE file reads are cached in the page cache, but symlink reads are
not. This patch enables FUSE READLINK operations to be cached which
can improve performance of some FUSE workloads.

In particular, I'm working on a FUSE filesystem for access to source
code and discovered that about a 10% improvement to build times is
achieved with this patch (there are a lot of symlinks in the source
tree).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-15 15:43:07 +02:00
Miklos Szeredi
9a2eb24d1a fuse: only invalidate atime in direct read
After sending a synchronous READ request from __fuse_direct_read() we only
need to invalidate atime; none of the other attributes should be changed by
a read().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-15 15:43:06 +02:00
Miklos Szeredi
802dc0497b fuse: don't need GETATTR after every READ
If 'auto_inval_data' mode is active, then fuse_file_read_iter() will call
fuse_update_attributes(), which will check the attribute validity and send
a GETATTR request if some of the attributes are no longer valid.  The page
cache is then invalidated if the size or mtime have changed.

Then, if a READ request was sent and reply received (which is the case if
the data wasn't cached yet, or if the file is opened for O_DIRECT), the
atime attribute is invalidated.

This will result in the next read() also triggering a GETATTR, ...

This can be fixed by only sending GETATTR if the mode or size are invalid,
we don't need to do a refresh if only atime is invalid.

More generally, none of the callers of fuse_update_attributes() need an
up-to-date atime value, so for now just remove STATX_ATIME from the request
mask when attributes are updated for internal use.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-15 15:43:06 +02:00
Miklos Szeredi
2f1e81965f fuse: allow fine grained attr cache invaldation
This patch adds the infrastructure for more fine grained attribute
invalidation.  Currently only 'atime' is invalidated separately.

The use of this infrastructure is extended to the statx(2) interface, which
for now means that if only 'atime' is invalid and STATX_ATIME is not
specified in the mask argument, then no GETATTR request will be generated.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-15 15:43:06 +02:00
David Howells
f0a7d1883d afs: Fix clearance of reply
The recent patch to fix the afs_server struct leak didn't actually fix the
bug, but rather fixed some of the symptoms.  The problem is that an
asynchronous call that holds a resource pointed to by call->reply[0] will
find the pointer cleared in the call destructor, thereby preventing the
resource from being cleaned up.

In the case of the server record leak, the afs_fs_get_capabilities()
function in devel code sets up a call with reply[0] pointing at the server
record that should be altered when the result is obtained, but this was
being cleared before the destructor was called, so the put in the
destructor does nothing and the record is leaked.

Commit f014ffb025 removed the additional ref obtained by
afs_install_server(), but the removal of this ref is actually used by the
garbage collector to mark a server record as being defunct after the record
has expired through lack of use.

The offending clearance of call->reply[0] upon completion in
afs_process_async_call() has been there from the origin of the code, but
none of the asynchronous calls actually use that pointer currently, so it
should be safe to remove (note that synchronous calls don't involve this
function).

Fix this by the following means:

 (1) Revert commit f014ffb025.

 (2) Remove the clearance of reply[0] from afs_process_async_call().

Without this, afs_manage_servers() will suffer an assertion failure if it
sees a server record that didn't get used because the usage count is not 1.

Fixes: f014ffb025 ("afs: Fix afs_server struct leak")
Fixes: 08e0e7c82e ("[AF_RXRPC]: Make the in-kernel AFS filesystem use AF_RXRPC.")
Signed-off-by: David Howells <dhowells@redhat.com>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-15 15:31:47 +02:00
Greg Kroah-Hartman
3a27203102 libnvdimm/dax 4.19-rc8
* Fix a livelock in dax_layout_busy_page() present since v4.18. The
   lockup triggers when truncating an actively mapped huge page out of a
   mapping pinned for direct-I/O.
 
 * Fix mprotect() clobbers of _PAGE_DEVMAP. Broken since v4.5 mprotect()
   clears this flag that is needed to communicate the liveness of device
   pages to the get_user_pages() path.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbwiZhAAoJEB7SkWpmfYgCYFoQAL8ED6c1bfGUPRsWSrTRChU0
 ungVZ/Vf1+2ERd3ivUXPQzahNtqH5EWvEVp0aboVpyJUoVllrztInVS2hxaGJE+e
 w7WnzaXh36MY0kvLpK+Ny1Cxk7qg2rXnmzOAPRVdSUoSvh0TXOn5HFX1i/OdI7WK
 wgJwXraCoyKP9aTItw7oHQy9S36bi1RJVUakOAoEpEx4Vn+fwFxLNIt34G5CRJ+k
 iflicM7CPngxlFzwfoiX9v3DhV7toexk1A4LAzzwypG0Aiqd5tW2FG1lwLMPncNk
 8FezBm9VjkMwzv6hj7nD9UfU2lbh3GqqGDW0cPX1DPSgDxr/4pOLtKcbYWHh6yas
 NtCXk37q90ey3GtD2wYBRkBNly6UWvHJ0d3srtO6ZSl1VN6JQu8rhVhQ6KnON24B
 NcWlEVf2brqf0uaW4byCVbdVfIDp96/qgEvCo1pq3olXwCdDyOBJjYxaBcnu5JDV
 YsItMCJ49AxS/qoCt3vam7vC5TGhfYHL5xJPaF06cdjYvgfqOIV3VQT1ujBx4cvh
 MBFRBKDc6oDiJFgkrdYqHwJfn5fCQVS180Oy5S0AFGsVAzsJalKBZBLx2f2RQn8c
 r+kczvvPjpczEeEqzaqsxTgjowo/75Q8PRXc2PbwQzNxfkHuKf+xxQpnUg0mN6Hf
 w8zPSaCcCs2Wo21Kd/ua
 =VXnU
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes-4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Dan writes:
  "libnvdimm/dax 4.19-rc8

   * Fix a livelock in dax_layout_busy_page() present since v4.18. The
     lockup triggers when truncating an actively mapped huge page out of
     a mapping pinned for direct-I/O.

   * Fix mprotect() clobbers of _PAGE_DEVMAP. Broken since v4.5
     mprotect() clears this flag that is needed to communicate the
     liveness of device pages to the get_user_pages() path."

* tag 'libnvdimm-fixes-4.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  mm: Preserve _PAGE_DEVMAP across mprotect() calls
  filesystem-dax: Fix dax_layout_busy_page() livelock
2018-10-14 08:34:31 +02:00
Richard Weinberger
f8ccb14fd6 ubifs: Fix WARN_ON logic in exit path
ubifs_assert() is not WARN_ON(), so we have to invert
the checks.
Randy faced this warning with UBIFS being a module, since
most users use UBIFS as builtin because UBIFS is the rootfs
nobody noticed so far. :-(
Including me.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: 54169ddd38 ("ubifs: Turn two ubifs_assert() into a WARN_ON()")
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 11:05:02 +02:00
Greg Kroah-Hartman
79fc170b1f Merge branch 'akpm'
Fixes from Andrew:

* akpm:
  fs/fat/fatent.c: add cond_resched() to fat_count_free_clusters()
  mm/thp: fix call to mmu_notifier in set_pmd_migration_entry() v2
  mm/mmap.c: don't clobber partially overlapping VMA with MAP_FIXED_NOREPLACE
  ocfs2: fix a GCC warning
2018-10-13 09:31:19 +02:00
Khazhismel Kumykov
ac081c3be3 fs/fat/fatent.c: add cond_resched() to fat_count_free_clusters()
On non-preempt kernels this loop can take a long time (more than 50 ticks)
processing through entries.

Link: http://lkml.kernel.org/r/20181010172623.57033-1-khazhy@google.com
Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 09:31:03 +02:00
zhong jiang
1cff514a51 ocfs2: fix a GCC warning
Fix the following compile warning:

fs/ocfs2/dlmglue.c:99:30: warning: ‘lockdep_keys’ defined but not used [-Wunused-variable]
 static struct lock_class_key lockdep_keys[OCFS2_NUM_LOCK_TYPES];

Link: http://lkml.kernel.org/r/1536938148-32110-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-13 09:31:02 +02:00
Greg Kroah-Hartman
ed66c252d9 gfs2 4.19 fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbwLzVAAoJENW/n+sDE2U6ZQUP/3mcvCqDjU+YBC/RCuA0Hvxd
 4rtzRqY5jTt4HZCTJ9Qj8aARZ4vN/LjKQuQU8IYhzJQcXzeyDi1V2zH4xGmEDomZ
 o9LqcVUUJO1bS/aykLrbnSEEfjU+zeA3z60Jbj3OHkpL6H9u6q+AwqCZnsIcKBnI
 Z8lgUda6oC3L60xzxpdr6BoYLIMqucuJ0YMyVo6YsKT1PEoP30EQhAoqnnKVfxx7
 +l66OUK4Qw3z45auen9MIOXjn+y7BC0RjFANetsBlAS29PE5X9rXWQ5xnv+9uYCh
 DDXPiMjdkLm2Xv8d5U80QhVuO8vxkgn3lnLvE99cmMmQrU/6pS91q+azgc/98RDN
 ZYCReIihyPbug3+uRziYcBHnI+4c8LrM/Za0TsHDG6gA6ddiiNHvdnJE9Nbk85Le
 dS+jQwA5eiYtGc6styoyk52v6huGTI3oPbffyYTTbOLnT6bSaEv+k0zWWCgTqMv3
 sb0imETfsdd1O/7MxR2kAVtQPdSzhD5GMIbkmlRb8nwP+QXb+FBVtQFPsZaPZklR
 qF//eFXGcoWfM1XerOYgl6VVolvntSc+ivycSEsDGHNSU5zUy2JYNTxmkypBB6pb
 wslVyIvjW2bcnUhjGr9PDSg+yQjaAk+5EyiJmxtJaqCiSfjgTpI4YwNrkfQo5csI
 H/Sq1WeTLjo2KeRb7LpU
 =gGo4
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-4.19.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Andreas writes:
  "gfs2 4.19 fixes

   Fix iomap buffered write support for journaled files"

* tag 'gfs2-4.19.fixes3' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix iomap buffered write support for journaled files (2)
2018-10-13 09:06:27 +02:00
Al Viro
82a6857bf9 compat_ioctl - kill keyboard ioctl handling
all of those are provided only by vt and s390 tty3270; both have
proper ->compat_ioctl()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:50:51 -04:00
Al Viro
969ec01e99 gigaset: add ->compat_ioctl()
... and get rid of COMPAT_IOCTL() for its private ioctls

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:50:50 -04:00
Al Viro
6bbf265892 kill the rest of tty COMPAT_IOCTL() entries
TIOCLINUX is handled by ->compat_ioctl() in the only place that has
native ->ioctl() recognizing it, TIOC{START,STOP} are simply useless
these days - unrecognized compat ioctl won't spew into syslog
anymore.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:50:46 -04:00
Al Viro
7765435030 take compat TIOC[SG]SERIAL treatment into tty_compat_ioctl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:50:44 -04:00
David S. Miller
d864991b22 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts were easy to resolve using immediate context mostly,
except the cls_u32.c one where I simply too the entire HEAD
chunk.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-12 21:38:46 -07:00
Al Viro
3df629d873 gfs2_meta: ->mount() can get NULL dev_name
get in sync with mount_bdev() handling of the same

Reported-by: syzbot+c54f8e94e6bba03b04e9@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-13 00:19:13 -04:00
Al Viro
995f608e7a ntfs: don't open-code ERR_CAST
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-12 22:46:50 -04:00
David Howells
f014ffb025 afs: Fix afs_server struct leak
Fix a leak of afs_server structs.  The routine that installs them in the
various lookup lists and trees gets a ref on leaving the function, whether
it added the server or a server already exists.  It shouldn't increment
the refcount if it added the server.

The effect of this that "rmmod kafs" will hang waiting for the leaked
server to become unused.

Fixes: d2ddc776a4 ("afs: Overhaul volume and server record caching and fileserver rotation")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-12 17:36:40 +02:00
Andreas Gruenbacher
fee5150c48 gfs2: Fix iomap buffered write support for journaled files (2)
It turns out that the fix in commit 6636c3cc56 is bad; the assertion
that the iomap code no longer creates buffer heads is incorrect for
filesystems that set the IOMAP_F_BUFFER_HEAD flag.

Instead, what's happening is that gfs2_iomap_begin_write treats all
files that have the jdata flag set as journaled files, which is
incorrect as long as those files are inline ("stuffed").  We're handling
stuffed files directly via the page cache, which is why we ended up with
pages without buffer heads in gfs2_page_add_databufs.

Fix this by handling stuffed journaled files correctly in
gfs2_iomap_begin_write.

This reverts commit 6636c3cc5690c11631e6366cf9a28fb99c8b25bb.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-10-12 17:14:42 +02:00
Theodore Ts'o
33458eaba4 ext4: fix use-after-free race in ext4_remount()'s error path
It's possible for ext4_show_quota_options() to try reading
s_qf_names[i] while it is being modified by ext4_remount() --- most
notably, in ext4_remount's error path when the original values of the
quota file name gets restored.

Reported-by: syzbot+a2872d6feea6918008a9@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 3.2+
2018-10-12 09:28:09 -04:00
Andreas Gruenbacher
0ddeded4ae gfs2: Pass resource group to rgblk_free
Function rgblk_free can only deal with one resource group at a time, so
pass that resource group is as a parameter.  Several of the callers
already have the resource group at hand, so we only need additional
lookup code in a few places.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:33:07 -05:00
Bob Peterson
c3abc29e54 gfs2: Remove unnecessary gfs2_rlist_alloc parameter
The state parameter of gfs2_rlist_alloc is set to LM_ST_EXCLUSIVE in all
calls, so remove it and hardcode that state in gfs2_rlist_alloc instead.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:32:28 -05:00
Andreas Gruenbacher
ec23df2b0c gfs2: Fix marking bitmaps non-full
Reservations in gfs can span multiple gfs2_bitmaps (but they won't span
multiple resource groups).  When removing a reservation, we want to
clear the GBF_FULL flags of all involved gfs2_bitmaps, not just that of
the first bitmap.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:31:55 -05:00
Andreas Gruenbacher
243fea4df9 gfs2: Fix some minor typos
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:31:21 -05:00
Andreas Gruenbacher
281b4952d1 gfs2: Rename bitmap.bi_{len => bytes}
This field indicates the size of the bitmap in bytes, similar to how the
bi_blocks field indicates the size of the bitmap in blocks.

In count_unlinked, replace an instance of bi_bytes * GFS2_NBBY by
bi_blocks.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:30:43 -05:00
Andreas Gruenbacher
ad89945818 gfs2: Remove unused RGRP_RSRV_MINBYTES definition
This definition is only used to define RGRP_RSRV_MINBLKS, with no
benefit over defining RGRP_RSRV_MINBLKS directly.

In addition, instead of forcing RGRP_RSRV_MINBLKS to be of type u32,
cast it to that type where that type is required.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:29:59 -05:00
Andreas Gruenbacher
21f09c4395 gfs2: Move rs_{sizehint, rgd_gh} fields into the inode
Move the rs_sizehint and rs_rgd_gh fields from struct gfs2_blkreserv
into the inode: they are more closely related to the inode than to a
particular reservation.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:29:14 -05:00
Andreas Gruenbacher
3548fce164 gfs2: Clean up out-of-bounds check in gfs2_rbm_from_block
We already have a function that checks if a block is within a resource
group, so use that in gfs2_rbm_from_block as well.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:28:39 -05:00
Andreas Gruenbacher
f654683dae gfs2: Always check the result of gfs2_rbm_from_block
When gfs2_rbm_from_block fails, the rbm it returns is undefined, so we
always want to make sure gfs2_rbm_from_block has succeeded before
looking at the rbm.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Reviewed-by: Steven Whitehouse <swhiteho@redhat.com>
2018-10-12 07:18:25 -05:00
David Howells
6b3944e42e afs: Fix cell proc list
Access to the list of cells by /proc/net/afs/cells has a couple of
problems:

 (1) It should be checking against SEQ_START_TOKEN for the keying the
     header line.

 (2) It's only holding the RCU read lock, so it can't just walk over the
     list without following the proper RCU methods.

Fix these by using an hlist instead of an ordinary list and using the
appropriate accessor functions to follow it with RCU.

Since the code that adds a cell to the list must also necessarily change,
sort the list on insertion whilst we're at it.

Fixes: 989782dcdc ("afs: Overhaul cell database management")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-12 13:18:57 +02:00
Greg Kroah-Hartman
4718dcad7d xfs: fixes for 4.19-rc7
Update for 4.19-rc7 to fix numerous file clone and deduplication issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbvo1jAAoJEK3oKUf0dfodurwP/3OGvJo11XjE5Wuh5tbcSOUm
 duJdJvN+b0gx8Lb1IL2lHaBEt/Pw3PKi+KPx6d+pd47aflCYNMiU7I1kzejxvq9I
 TBVdhxffAmNBU02VU/qmhucMnKfpVr/UX19f0sHZcfXbZkpIuHrSpI25MNSYqbEs
 bRoMDkgbuu377hCJu6tnBAUm38z4iZdfJaGPVmJmuVMn8JJE8KA3Une/daxg0/HY
 zbCMWP3SGJwqx7SyyeurQSkRS/8GG3LgbnYOz0FlaHfBd4JVJscHOZU08nrYmMAN
 9SOowvahaPkDcyO8c6gRkyE6NIbOsb79718g+HTm4eqzyChnh+jnFTzitcTMuK2c
 vjblXZSDKnN/P4UaEEzIAnQ1Eew4WhLrKuBr+3fCr2bKTGfj0iiX36CjEgk1k+Df
 t1DV/Hj6me6crZpWO7GQZVLnDsiJBzaOsIgpYnB3vcJyof/cgm+5bfGFedDZFdpV
 XTh0oBN8B7oQztd4MvcfQOGXSktrMtq+6RomO8kv77mVtHFnjnroKHq/wIft2nyI
 rs7u7DFmCLyB9Lbm8aoeMPo4+0zxTp9385HSJ+XznHcjGXxkxIIGhhdU3omKxANa
 j3UQB1+VRC4lHSluUXOW7vH0j1tX3gaS3z/yJ+gfcAdizUmdp0h0rMx8c9VR8SEJ
 xv/fpbrcKHyBwuNvKtBm
 =yj1P
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-4.19-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Dave writes:
  "xfs: fixes for 4.19-rc7

   Update for 4.19-rc7 to fix numerous file clone and deduplication issues."

* tag 'xfs-fixes-for-4.19-rc7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix data corruption w/ unaligned reflink ranges
  xfs: fix data corruption w/ unaligned dedupe ranges
  xfs: update ctime and remove suid before cloning files
  xfs: zero posteof blocks when cloning above eof
  xfs: refactor clonerange preparation into a separate helper
2018-10-11 07:17:42 +02:00
Al Viro
e884bce1d9 ext4: don't open-code ERR_CAST
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-10 16:41:40 -04:00
Al Viro
3837d208d8 simplify btrfs_lookup()
d_splice_alias() is fine with ERR_PTR(-E...) for inode

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-10 16:38:27 -04:00
Eric Biggers
691115c351 vfs: require i_size <= SIZE_MAX in kernel_read_file()
On 32-bit systems, the buffer allocated by kernel_read_file() is too
small if the file size is > SIZE_MAX, due to truncation to size_t.

Fortunately, since the 'count' argument to kernel_read() is also
truncated to size_t, only the allocated space is filled; then, -EIO is
returned since 'pos != i_size' after the read loop.

But this is not obvious and seems incidental.  We should be more
explicit about this case.  So, fail early if i_size > SIZE_MAX.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
2018-10-10 12:56:14 -04:00
Colin Ian King
2978d87347 orangefs: rate limit the client not running info message
Currently accessing various /sys/fs/orangefs files will spam the
kernel log with the following info message when the client is not
running:

[  491.489284] sysfs_service_op_show: Client not running :-5:

Rate limit this info message to make it less spammy.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-10 09:54:29 -04:00
Chengguang Xu
052d12766b orangefs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2018-10-10 09:54:25 -04:00
Al Viro
74dd7c97ea ecryptfs_rename(): verify that lower dentries are still OK after lock_rename()
We get lower layer dentries, find their parents, do lock_rename() and
proceed to vfs_rename().  However, we do not check that dentries still
have the same parents and are not unlinked.  Need to check that...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-09 23:33:17 -04:00
Andreas Gruenbacher
dc480feb45 gfs2: Fix iomap buffered write support for journaled files
Commit 64bc06bb32 broke buffered writes to journaled files (chattr
+j): we'll try to journal the buffer heads of the page being written to
in gfs2_iomap_journaled_page_done.  However, the iomap code no longer
creates buffer heads, so we'll BUG() in gfs2_page_add_databufs.  Fix
that by creating buffer heads ourself when needed.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2018-10-09 18:20:13 +02:00
Steve Whitehouse
6ddc5c3ddf gfs2: getlabel support
Add support for the GETFSLABEL ioctl in gfs2.
I tested this patch and it works as expected.

Signed-off-by: Steve Whitehouse <swhiteho@redhat.com>
Tested-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-09 07:06:16 -05:00
Tim Smith
1eb8d73879 GFS2: Flush the GFS2 delete workqueue before stopping the kernel threads
Flushing the workqueue can cause operations to happen which might
call gfs2_log_reserve(), or get stuck waiting for locks taken by such
operations.  gfs2_log_reserve() can io_schedule(). If this happens, it
will never wake because the only thing which can wake it is gfs2_logd()
which was already stopped.

This causes umount of a gfs2 filesystem to wedge permanently if, for
example, the umount immediately follows a large delete operation.

When this occured, the following stack trace was obtained from the
umount command

[<ffffffff81087968>] flush_workqueue+0x1c8/0x520
[<ffffffffa0666e29>] gfs2_make_fs_ro+0x69/0x160 [gfs2]
[<ffffffffa0667279>] gfs2_put_super+0xa9/0x1c0 [gfs2]
[<ffffffff811b7edf>] generic_shutdown_super+0x6f/0x100
[<ffffffff811b7ff7>] kill_block_super+0x27/0x70
[<ffffffffa0656a71>] gfs2_kill_sb+0x71/0x80 [gfs2]
[<ffffffff811b792b>] deactivate_locked_super+0x3b/0x70
[<ffffffff811b79b9>] deactivate_super+0x59/0x60
[<ffffffff811d2998>] cleanup_mnt+0x58/0x80
[<ffffffff811d2a12>] __cleanup_mnt+0x12/0x20
[<ffffffff8108c87d>] task_work_run+0x7d/0xa0
[<ffffffff8106d7d9>] exit_to_usermode_loop+0x73/0x98
[<ffffffff81003961>] syscall_return_slowpath+0x41/0x50
[<ffffffff815a594c>] int_ret_from_sys_call+0x25/0x8f
[<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Tim Smith <tim.smith@citrix.com>
Signed-off-by: Mark Syms <mark.syms@citrix.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-09 07:05:36 -05:00
Borislav Petkov
cf089611f4 proc/vmcore: Fix i386 build error of missing copy_oldmem_page_encrypted()
Lianbo reported a build error with a particular 32-bit config, see Link
below for details.

Provide a weak copy_oldmem_page_encrypted() function which architectures
can override, in the same manner other functionality in that file is
supplied.

Reported-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
CC: x86@kernel.org
Link: http://lkml.kernel.org/r/710b9d95-2f70-eadf-c4a1-c3dc80ee4ebb@redhat.com
2018-10-09 11:57:28 +02:00
Dan Williams
d7782145e1 filesystem-dax: Fix dax_layout_busy_page() livelock
In the presence of multi-order entries the typical
pagevec_lookup_entries() pattern may loop forever:

	while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
				min(end - index, (pgoff_t)PAGEVEC_SIZE),
				indices)) {
		...
		for (i = 0; i < pagevec_count(&pvec); i++) {
			index = indices[i];
			...
		}
		index++; /* BUG */
	}

The loop updates 'index' for each index found and then increments to the
next possible page to continue the lookup. However, if the last entry in
the pagevec is multi-order then the next possible page index is more
than 1 page away. Fix this locally for the filesystem-dax case by
checking for dax-multi-order entries. Going forward new users of
multi-order entries need to be similarly careful, or we need a generic
way to report the page increment in the radix iterator.

Fixes: 5fac7408d8 ("mm, fs, dax: handle layout changes to pinned dax...")
Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <zwisler@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-10-08 11:38:44 -07:00
Andrew Price
4c62bd9cea gfs2: Don't leave s_fs_info pointing to freed memory in init_sbd
When alloc_percpu() fails, sdp gets freed but sb->s_fs_info still points
to the same address. Move the assignment after that error check so that
s_fs_info can only point to a valid sdp or NULL, which is checked for
later in the error path, in gfs2_kill_super().

Reported-by: syzbot+dcb8b3587445007f5808@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-08 07:52:43 -05:00
Amir Goldstein
d0a6a87e40 fanotify: support reporting thread id instead of process id
In order to identify which thread triggered the event in a
multi-threaded program, add the FAN_REPORT_TID flag in fanotify_init to
opt-in for reporting the event creator's thread id information.

Signed-off-by: nixiaoming <nixiaoming@huawei.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-08 13:48:45 +02:00
Chengguang Xu
6fd941784b ext4: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated as
ACL_NOT_CACHED in vfs function inode_init_always() and later will be
updated by calling xxx_init_acl() in specific filesystems.  However,
when default_acl and acl are NULL then they keep the value of
ACL_NOT_CACHED.  This patch changes the code to cache NULL for acl /
default_acl in this case to save unnecessary ACL lookup attempt.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2018-10-06 22:40:34 -04:00
David S. Miller
72438f8cef Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-10-06 14:43:42 -07:00
Lianbo Jiang
992b649a3f kdump, proc/vmcore: Enable kdumping encrypted memory with SME enabled
In the kdump kernel, the memory of the first kernel needs to be dumped
into the vmcore file.

If SME is enabled in the first kernel, the old memory has to be remapped
with the memory encryption mask in order to access it properly.

Split copy_oldmem_page() functionality to handle encrypted memory
properly.

 [ bp: Heavily massage everything. ]

Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: kexec@lists.infradead.org
Cc: tglx@linutronix.de
Cc: mingo@redhat.com
Cc: hpa@zytor.com
Cc: akpm@linux-foundation.org
Cc: dan.j.williams@intel.com
Cc: bhelgaas@google.com
Cc: baiyaowei@cmss.chinamobile.com
Cc: tiwai@suse.de
Cc: brijesh.singh@amd.com
Cc: dyoung@redhat.com
Cc: bhe@redhat.com
Cc: jroedel@suse.de
Link: https://lkml.kernel.org/r/be7b47f9-6be6-e0d1-2c2a-9125bc74b818@redhat.com
2018-10-06 12:09:26 +02:00
Dave Chinner
b39989009b xfs: fix data corruption w/ unaligned reflink ranges
When reflinking sub-file ranges, a data corruption can occur when
the source file range includes a partial EOF block. This shares the
unknown data beyond EOF into the second file at a position inside
EOF, exposing stale data in the second file.

XFS only supports whole block sharing, but we still need to
support whole file reflink correctly.  Hence if the reflink
request includes the last block of the souce file, only proceed with
the reflink operation if it lands at or past the destination file's
current EOF. If it lands within the destination file EOF, reject the
entire request with -EINVAL and make the caller go the hard way.

This avoids the data corruption vector, but also avoids disruption
of returning EINVAL to userspace for the common case of whole file
cloning.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-06 11:44:39 +10:00
Dave Chinner
dceeb47b0e xfs: fix data corruption w/ unaligned dedupe ranges
A deduplication data corruption is Exposed by fstests generic/505 on
XFS. It is caused by extending the block match range to include the
partial EOF block, but then allowing unknown data beyond EOF to be
considered a "match" to data in the destination file because the
comparison is only made to the end of the source file. This corrupts
the destination file when the source extent is shared with it.

XFS only supports whole block dedupe, but we still need to appear to
support whole file dedupe correctly.  Hence if the dedupe request
includes the last block of the souce file, don't include it in the
actual XFS dedupe operation. If the rest of the range dedupes
successfully, then report the partial last block as deduped, too, so
that userspace sees it as a successful dedupe rather than return
EINVAL because we can't dedupe unaligned blocks.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-06 11:44:19 +10:00
Ashish Samant
cbe355f57c ocfs2: fix locking for res->tracking and dlm->tracking_list
In dlm_init_lockres() we access and modify res->tracking and
dlm->tracking_list without holding dlm->track_lock.  This can cause list
corruptions and can end up in kernel panic.

Fix this by locking res->tracking and dlm->tracking_list with
dlm->track_lock instead of dlm->spinlock.

Link: http://lkml.kernel.org/r/1529951192-4686-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
Acked-by: Joseph Qi <jiangqi903@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Jann Horn
f8a00cef17 proc: restrict kernel stack dumps to root
Currently, you can use /proc/self/task/*/stack to cause a stack walk on
a task you control while it is running on another CPU.  That means that
the stack can change under the stack walker.  The stack walker does
have guards against going completely off the rails and into random
kernel memory, but it can interpret random data from your kernel stack
as instruction pointers and stack pointers.  This can cause exposure of
kernel stack contents to userspace.

Restrict the ability to inspect kernel stacks of arbitrary tasks to root
in order to prevent a local attacker from exploiting racy stack unwinding
to leak kernel task stack contents.  See the added comment for a longer
rationale.

There don't seem to be any users of this userspace API that can't
gracefully bail out if reading from the file fails.  Therefore, I believe
that this change is unlikely to break things.  In the case that this patch
does end up needing a revert, the next-best solution might be to fake a
single-entry stack based on wchan.

Link: http://lkml.kernel.org/r/20180927153316.200286-1-jannh@google.com
Fixes: 2ec220e27f ("proc: add /proc/*/stack")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:05 -07:00
Larry Chen
69eb7765b9 ocfs2: fix crash in ocfs2_duplicate_clusters_by_page()
ocfs2_duplicate_clusters_by_page() may crash if one of the extent's pages
is dirty.  When a page has not been written back, it is still in dirty
state.  If ocfs2_duplicate_clusters_by_page() is called against the dirty
page, the crash happens.

To fix this bug, we can just unlock the page and wait until the page until
its not dirty.

The following is the backtrace:

kernel BUG at /root/code/ocfs2/refcounttree.c:2961!
[exception RIP: ocfs2_duplicate_clusters_by_page+822]
__ocfs2_move_extent+0x80/0x450 [ocfs2]
? __ocfs2_claim_clusters+0x130/0x250 [ocfs2]
ocfs2_defrag_extent+0x5b8/0x5e0 [ocfs2]
__ocfs2_move_extents_range+0x2a4/0x470 [ocfs2]
ocfs2_move_extents+0x180/0x3b0 [ocfs2]
? ocfs2_wait_for_recovery+0x13/0x70 [ocfs2]
ocfs2_ioctl_move_extents+0x133/0x2d0 [ocfs2]
ocfs2_ioctl+0x253/0x640 [ocfs2]
do_vfs_ioctl+0x90/0x5f0
SyS_ioctl+0x74/0x80
do_syscall_64+0x74/0x140
entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Once we find the page is dirty, we do not wait until it's clean, rather we
use write_one_page() to write it back

Link: http://lkml.kernel.org/r/20180829074740.9438-1-lchen@suse.com
[lchen@suse.com: update comments]
  Link: http://lkml.kernel.org/r/20180830075041.14879-1-lchen@suse.com
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Larry Chen <lchen@suse.com>
Acked-by: Changwei Ge <ge.changwei@h3c.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-10-05 16:32:04 -07:00
Jan Kara
ccd3c4373e jbd2: fix use after free in jbd2_log_do_checkpoint()
The code cleaning transaction's lists of checkpoint buffers has a bug
where it increases bh refcount only after releasing
journal->j_list_lock. Thus the following race is possible:

CPU0					CPU1
jbd2_log_do_checkpoint()
					jbd2_journal_try_to_free_buffers()
					  __journal_try_to_free_buffer(bh)
  ...
  while (transaction->t_checkpoint_io_list)
  ...
    if (buffer_locked(bh)) {

<-- IO completes now, buffer gets unlocked -->

      spin_unlock(&journal->j_list_lock);
					    spin_lock(&journal->j_list_lock);
					    __jbd2_journal_remove_checkpoint(jh);
					    spin_unlock(&journal->j_list_lock);
					  try_to_free_buffers(page);
      get_bh(bh) <-- accesses freed bh

Fix the problem by grabbing bh reference before unlocking
journal->j_list_lock.

Fixes: dc6e8d669c ("jbd2: don't call get_bh() before calling __jbd2_journal_remove_checkpoint()")
Fixes: be1158cc61 ("jbd2: fold __process_buffer() into jbd2_log_do_checkpoint()")
Reported-by: syzbot+7f4a27091759e2fe7453@syzkaller.appspotmail.com
CC: stable@vger.kernel.org
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-05 18:44:40 -04:00
Al Viro
b07581d2d5 cachefiles: fix the race between cachefiles_bury_object() and rmdir(2)
the victim might've been rmdir'ed just before the lock_rename();
unlike the normal callers, we do not look the source up after the
parents are locked - we know it beforehand and just recheck that it's
still the child of what used to be its parent.  Unfortunately,
the check is too weak - we don't spot a dead directory since its
->d_parent is unchanged, dentry is positive, etc.  So we sail all
the way to ->rename(), with hosting filesystems _not_ expecting
to be asked renaming an rmdir'ed subdirectory.

The fix is easy, fortunately - the lock on parent is sufficient for
making IS_DEADDIR() on child safe.

Cc: stable@vger.kernel.org
Fixes: 9ae326a690 (CacheFiles: A cache that backs onto a mounted filesystem)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-05 14:37:28 -04:00
Bob Peterson
e54c78a27f gfs2: Use fs_* functions instead of pr_* function where we can
Before this patch, various errors and messages were reported using
the pr_* functions: pr_err, pr_warn, pr_info, etc., but that does
not tell you which gfs2 mount had the problem, which is often vital
to debugging. This patch changes the calls from pr_* to fs_* in
most of the messages so that the file system id is printed along
with the message.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-05 11:16:54 -05:00
Bob Peterson
b524abcc01 gfs2: slow the deluge of io error messages
When an io error is hit, it calls gfs2_io_error_bh_i for every
journal buffer it can't write. Since we changed gfs2_io_error_bh_i
recently to withdraw later in the cycle, it sends a flood of
errors to the console. This patch checks for the file system already
being withdrawn, and if so, doesn't send more messages. It doesn't
stop the flood of messages, but it slows it down and keeps it more
reasonable.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-10-05 10:51:11 -05:00
Greg Kroah-Hartman
087f759a41 four small SMB3 fixes: one for stable, the others to address a more recent regression
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlu2jLMACgkQiiy9cAdy
 T1GLMQwAiUaUTtR0MN2ws+IHNnwFX5YnWFa1Cr/6Tgz9WBo5TrrvQcBGhTIUQok2
 7w2+I0rXPO7P7Tq2odWn66BT1/l0o2U10aB4OHYlS7zqfCAp6GuwKGtATivRqjsg
 ENn2mmBRuTmz+XtGmlVklmQpHjn4+MTR2GSLA57oo1fqX5YBeSxcq2UEciCGttRQ
 nAvn923NrPSbGRvL/eDDWGXas8gMATXnrNlN8st2AsxnoFC6CrzUuRM0s/IFgp6r
 Orh0jIlovlT2Q6LdND7n7xXnEVoyUTETG89IjRVwcBJ9LtXrJXpuWo7/sOr11cMy
 sGeUkILNlmkIdLtIlUaZcGaclq/yfeKdMxziBT8tWP6wXIH4LrGCzvZ9jDtrzygU
 EWL2nLsqMJNh0+wsYSZM/m9VFC1ReeB1AgIC2EHu/xzqWoTeL40i2OhKi/732FEX
 H+FlTZUoorhFoI96AMJMrqXjXyqZUZSQ1b5iJ7/jq888Vdey+m7HiMUtxv0+5yzC
 +lQksh8X
 =4PHD
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Steve writes:
  "SMB3 fixes

   four small SMB3 fixes: one for stable, the others to address a more
   recent regression"

* tag '4.19-rc6-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  smb3: fix lease break problem introduced by compounding
  cifs: only wake the thread for the very last PDU in a compound
  cifs: add a warning if we try to to dequeue a deleted mid
  smb2: fix missing files in root share directory listing
2018-10-05 08:27:47 -07:00
Olga Kornievskaia
44f411c353 NFSv4.x: fix lock recovery during delegation recall
Running "./nfstest_delegation --runtest recall26" uncovers that
client doesn't recover the lock when we have an appending open,
where the initial open got a write delegation.

Instead of checking for the passed in open context against
the file lock's open context. Check that the state is the same.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-10-05 09:34:36 -04:00
Darrick J. Wong
7debbf015f xfs: update ctime and remove suid before cloning files
Before cloning into a file, update the ctime and remove sensitive
attributes like suid, just like we'd do for a regular file write.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:05:41 +10:00
Darrick J. Wong
410fdc72b0 xfs: zero posteof blocks when cloning above eof
When we're reflinking between two files and the destination file range
is well beyond the destination file's EOF marker, zero any posteof
speculative preallocations in the destination file so that we don't
expose stale disk contents.  The previous strategy of trying to clear
the preallocations does not work if the destination file has the
PREALLOC flag set.

Uncovered by shared/010.

Reported-by: Zorro Lang <zlang@redhat.com>
Bugzilla-id: https://bugzilla.kernel.org/show_bug.cgi?id=201259
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:04:27 +10:00
Darrick J. Wong
0d41e1d28c xfs: refactor clonerange preparation into a separate helper
Refactor all the reflink preparation steps into a separate helper
that we'll use to land all the upcoming fixes for insufficient input
checks.

This rework also moves the invalidation of the destination range to
the prep function so that it is done before the range is remapped.
This ensures that nobody can access the data in range being remapped
until the remap is complete.

[dgc: fix xfs_reflink_remap_prep() return value and caller check to
handle vfs_clone_file_prep_inodes() returning 0 to mean "nothing to
do". ]

[dgc: make sure length changed by vfs_clone_file_prep_inodes() gets
propagated back to XFS code that does the remapping. ]

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-05 19:04:22 +10:00
Greg Kroah-Hartman
010bd965f9 overlayfs fixes for 4.19-rc7
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW7YOCQAKCRDh3BK/laaZ
 PAmGAQC1/3+mkfHXqyDx+zv4f7A348ZULW26EWFWwMwlEljWowD+PlRFBgUu5v5c
 194mCApZU7hl6o0u07aijjz6m81e6QE=
 =dM89
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Miklos writes:
  "overlayfs fixes for 4.19-rc7

   This update fixes a couple of regressions in the stacked file update
   added in this cycle, as well as some older bugs uncovered by
   syzkaller.

   There's also one trivial naming change that touches other parts of
   the fs subsystem."

* tag 'ovl-fixes-4.19-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix format of setxattr debug
  ovl: fix access beyond unterminated strings
  ovl: make symbol 'ovl_aops' static
  vfs: swap names of {do,vfs}_clone_file_range()
  ovl: fix freeze protection bypass in ovl_clone_file_range()
  ovl: fix freeze protection bypass in ovl_write_iter()
  ovl: fix memory leak on unlink of indexed file
2018-10-04 13:24:38 -07:00
Greg Kroah-Hartman
1b0350c355 XFS fixes for 4.19-rc6
Accumlated regression and bug fixes for 4.19-rc6, including:
 
 o make iomap correctly mark dirty pages for sub-page block sizes
 o fix regression in handling extent-to-btree format conversion errors
 o fix torn log wrap detection for new logs
 o various corrupt inode detection fixes
 o various delalloc state fixes
 o cleanup all the missed transaction cancel cases missed from changes merged
   in 4.19-rc1
 o fix lockdep false positive on transaction allocation
 o fix locking and reference counting on buffer log items
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbtVyHAAoJEK3oKUf0dfodc8YP/1hT+TdXZDBVPo9kXQwmrnra
 8P1J8IEuGj851PwYQobUhifdDh4eQ+PpcYEIfqTSGhtjTg7cqyQOvTo4uUKHLn50
 8mFXQb6JrAFYnjGjorOCde+lpoQrtj3oKASscs7RaL/W1RNffY74UGdiE0yOJnJs
 9IC7TpJT4ZzAgYBgC61whfYWmszLmRuKlVZWgXrM8hkMqliHdfrXeZk7MXTeiflz
 Q5+9eVnCCgTtC0TWgVAkFMvlZs7UtNMWIIn6zt+HQLB7Vms6q4ArpIT+fF45njwG
 94tyTcFT0AcIJ8OIi5i85fOXcDGjTFFf554Yf80PlWGlk3SbXPHFQao/ItDA4S1Z
 eCl2eurUOXlNXKuyzWoV5KjOBiiBicbOwl6eX0K996cGehhEdFOvBihvAGjJy5rm
 c4plTTy6bhYKZWr3JLuaPDlNWDMr/P/aUkiq9tFXyaMmVKNeyxd6UKzIokl5uUfi
 Ycik4Ik8zRgpFEUx5Clvb2W5qH9pqpR0WcFhrcj0mDnJa10TpBWesA0g2F+hM0Jc
 0mSuGxfQeJk7tuVcvsBpPNzgVEPKabvnYbOlGK7HvCSOPRkg0Fhmj+iJK721ccVl
 Ltiz0Y485eAGLKn9kOWKr47A5GAAFpuK29OrD2a8XizHPZ3Sr2CG5X1jI0gtMb8n
 BiBAxO6dojRQATDuTI6w
 =5ylA
 -----END PGP SIGNATURE-----

Merge tag 'xfs-fixes-for-4.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Dave writes:
  "XFS fixes for 4.19-rc6

   Accumlated regression and bug fixes for 4.19-rc6, including:

   o make iomap correctly mark dirty pages for sub-page block sizes
   o fix regression in handling extent-to-btree format conversion errors
   o fix torn log wrap detection for new logs
   o various corrupt inode detection fixes
   o various delalloc state fixes
   o cleanup all the missed transaction cancel cases missed from changes merged
     in 4.19-rc1
   o fix lockdep false positive on transaction allocation
   o fix locking and reference counting on buffer log items"

* tag 'xfs-fixes-for-4.19-rc6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix error handling in xfs_bmap_extents_to_btree
  iomap: set page dirty after partial delalloc on mkwrite
  xfs: remove invalid log recovery first/last cycle check
  xfs: validate inode di_forkoff
  xfs: skip delalloc COW blocks in xfs_reflink_end_cow
  xfs: don't treat unknown di_flags2 as corruption in scrub
  xfs: remove duplicated include from alloc.c
  xfs: don't bring in extents in xfs_bmap_punch_delalloc_range
  xfs: fix transaction leak in xfs_reflink_allocate_cow()
  xfs: avoid lockdep false positives in xfs_trans_alloc
  xfs: refactor xfs_buf_log_item reference count handling
  xfs: clean up xfs_trans_brelse()
  xfs: don't unlock invalidated buf on aborted tx commit
  xfs: remove last of unnecessary xfs_defer_cancel() callers
  xfs: don't crash the vfs on a garbage inline symlink
2018-10-04 09:17:38 -07:00
Miklos Szeredi
1a8f8d2a44 ovl: fix format of setxattr debug
Format has a typo: it was meant to be "%.*s", not "%*s".  But at some point
callers grew nonprintable values as well, so use "%*pE" instead with a
maximized length.

Reported-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3a1e819b4e ("ovl: store file handle of lower inode on copy up")
Cc: <stable@vger.kernel.org> # v4.12
2018-10-04 14:49:10 +02:00
Amir Goldstein
601350ff58 ovl: fix access beyond unterminated strings
KASAN detected slab-out-of-bounds access in printk from overlayfs,
because string format used %*s instead of %.*s.

> BUG: KASAN: slab-out-of-bounds in string+0x298/0x2d0 lib/vsprintf.c:604
> Read of size 1 at addr ffff8801c36c66ba by task syz-executor2/27811
>
> CPU: 0 PID: 27811 Comm: syz-executor2 Not tainted 4.19.0-rc5+ #36
...
>  printk+0xa7/0xcf kernel/printk/printk.c:1996
>  ovl_lookup_index.cold.15+0xe8/0x1f8 fs/overlayfs/namei.c:689

Reported-by: syzbot+376cea2b0ef340db3dd4@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 359f392ca5 ("ovl: lookup index entry for copy up origin")
Cc: <stable@vger.kernel.org> # v4.13
2018-10-04 14:49:10 +02:00
Amir Goldstein
bdd5a46fe3 fanotify: add BUILD_BUG_ON() to count the bits of fanotify constants
Also define the FANOTIFY_EVENT_FLAGS consisting of the extra flags
FAN_ONDIR and FAN_ON_CHILD.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:30:03 +02:00
Amir Goldstein
a39f7ec417 fsnotify: convert runtime BUG_ON() to BUILD_BUG_ON()
The BUG_ON() statements to verify number of bits in ALL_FSNOTIFY_BITS
and ALL_INOTIFY_BITS are converted to build time check of the constant.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:50 +02:00
Amir Goldstein
23c9deeb32 fanotify: deprecate uapi FAN_ALL_* constants
We do not want to add new bits to the FAN_ALL_* uapi constants
because they have been exposed to userspace.  If there are programs
out there using these constants, those programs could break if
re-compiled with modified FAN_ALL_* constants and run on an old kernel.

We deprecate the uapi constants FAN_ALL_* and define new FANOTIFY_*
constants for internal use to replace them. New feature bits will be
added only to the new constants.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:38 +02:00
Amir Goldstein
a72fd224e3 fanotify: simplify handling of FAN_ONDIR
fanotify mark add/remove code jumps through hoops to avoid setting the
FS_ISDIR in the commulative object mask.

That was just papering over a bug in fsnotify() handling of the FS_ISDIR
extra flag. This bug is now fixed, so all the hoops can be removed along
with the unneeded internal flag FAN_MARK_ONDIR.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:29 +02:00
Amir Goldstein
007d1e8395 fsnotify: generalize handling of extra event flags
FS_EVENT_ON_CHILD gets a special treatment in fsnotify() because it is
not a flag specifying an event type, but rather an extra flags that may
be reported along with another event and control the handling of the
event by the backend.

FS_ISDIR is also an "extra flag" and not an "event type" and therefore
desrves the same treatment. With inotify/dnotify backends it was never
possible to set FS_ISDIR in mark masks, so it did not matter.
With fanotify backend, mark adding code jumps through hoops to avoid
setting the FS_ISDIR in the commulative object mask.

Separate the constant ALL_FSNOTIFY_EVENTS to ALL_FSNOTIFY_FLAGS and
ALL_FSNOTIFY_EVENTS, so the latter can be used to test for specific
event types.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-10-04 13:28:02 +02:00
David Howells
46894a1359 rxrpc: Use IPv4 addresses throught the IPv6
AF_RXRPC opens an IPv6 socket through which to send and receive network
packets, both IPv6 and IPv4.  It currently turns AF_INET addresses into
AF_INET-as-AF_INET6 addresses based on an assumption that this was
necessary; on further inspection of the code, however, it turns out that
the IPv6 code just farms packets aimed at AF_INET addresses out to the IPv4
code.

Fix AF_RXRPC to use AF_INET addresses directly when given them.

Fixes: 7b674e390e ("rxrpc: Fix IPv6 support")
Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-04 09:32:28 +01:00
David Howells
66be646bd9 afs: Sort address lists so that they are in logical ascending order
Sort address lists so that they are in logical ascending order rather than
being partially in ascending order of the BE representations of those
values.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-04 09:32:28 +01:00
David Howells
4c19bbdc7f afs: Always build address lists using the helper functions
Make the address list string parser use the helper functions for adding
addresses to an address list so that they end up appropriately sorted.
This will better handles overruns and make them easier to compare.

It also reduces the number of places that addresses are handled, making it
easier to fix the handling.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-04 09:32:27 +01:00
David Howells
68eb64c3d2 afs: Do better max capacity handling on address lists
Note the maximum allocated capacity in an afs_addr_list struct and discard
addresses that would exceed it in afs_merge_fs_addr{4,6}().

Also, since the current maximum capacity is less than 255, reduce the
relevant members to bytes.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-04 09:32:27 +01:00
Ingo Molnar
c0554d2d3d Merge branch 'linus' into x86/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-10-04 08:23:03 +02:00
Wang Shilong
182a79e0c1 ext4: propagate error from dquot_initialize() in EXT4_IOC_FSSETXATTR
We return most failure of dquota_initialize() except
inode evict, this could make a bit sense, for example
we allow file removal even quota files are broken?

But it dosen't make sense to allow setting project
if quota files etc are broken.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-10-03 12:19:21 -04:00
Eric W. Biederman
ae7795bc61 signal: Distinguish between kernel_siginfo and siginfo
Linus recently observed that if we did not worry about the padding
member in struct siginfo it is only about 48 bytes, and 48 bytes is
much nicer than 128 bytes for allocating on the stack and copying
around in the kernel.

The obvious thing of only adding the padding when userspace is
including siginfo.h won't work as there are sigframe definitions in
the kernel that embed struct siginfo.

So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
traditional name for the userspace definition.  While the version that
is used internally to the kernel and ultimately will not be padded to
128 bytes is called kernel_siginfo.

The definition of struct kernel_siginfo I have put in include/signal_types.h

A set of buildtime checks has been added to verify the two structures have
the same field offsets.

To make it easy to verify the change kernel_siginfo retains the same
size as siginfo.  The reduction in size comes in a following change.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-10-03 16:47:43 +02:00
Wang Shilong
dc7ac6c4ca ext4: fix setattr project check in fssetxattr ioctl
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.

This patch try to follow same regular of xfs project
quota:

"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."

Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock, ext4_ioctl_setproject() is only used for
ioctl EXT4_IOC_FSSETXATTR, we have held mnt_want_write_file()
before ext4_ioctl_setflags(), and ext4_ioctl_setproject()
is called after ext4_ioctl_setflags(), we could share
codes, so remove it inside ext4_ioctl_setproject().

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
2018-10-03 10:33:32 -04:00
Greg Kroah-Hartman
73dec82d8d Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Al writes:
  "xattrs regression fix from Andreas; sat in -next for quite a while."

* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  sysfs: Do not return POSIX ACL xattrs via listxattr
2018-10-03 04:21:23 -07:00
Souptick Joarder
401b25aa1a ext4: convert fault handler to use vm_fault_t type
Return type of ext4_page_mkwrite and ext4_filemap_fault are
changed to use vm_fault_t type.

With this patch all the callers of block_page_mkwrite_return()
are changed to handle vm_fault_t. So converting the return type
of block_page_mkwrite_return() to vm_fault_t.

Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Matthew Wilcox <willy@infradead.org>
2018-10-02 22:20:50 -04:00
Lukas Czerner
625ef8a3ac ext4: initialize retries variable in ext4_da_write_inline_data_begin()
Variable retries is not initialized in ext4_da_write_inline_data_begin()
which can lead to nondeterministic number of retries in case we hit
ENOSPC. Initialize retries to zero as we do everywhere else.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Fixes: bc0ca9df3b ("ext4: retry allocation when inline->extent conversion failed")
Cc: stable@kernel.org
2018-10-02 21:18:45 -04:00
Jaegeuk Kim
fb7d70db30 f2fs: clear PageError on the read path
When running fault injection test, I hit somewhat wrong behavior in f2fs_gc ->
gc_data_segment():

0. fault injection generated some PageError'ed pages

1. gc_data_segment
 -> f2fs_get_read_data_page(REQ_RAHEAD)

2. move_data_page
 -> f2fs_get_lock_data_page()
  -> f2f_get_read_data_page()
   -> f2fs_submit_page_read()
    -> submit_bio(READ)
  -> return EIO due to PageError
  -> fail to move data

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-10-02 17:33:10 -07:00
Steve French
7af929d6d0 smb3: fix lease break problem introduced by compounding
Fixes problem (discovered by Aurelien) introduced by recent commit:
commit b24df3e30c
("cifs: update receive_encrypted_standard to handle compounded responses")

which broke the ability to respond to some lease breaks
(lease breaks being ignored is a problem since can block
server response for duration of the lease break timeout).

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-02 18:54:09 -05:00
Ronnie Sahlberg
4e34feb5e9 cifs: only wake the thread for the very last PDU in a compound
For compounded PDUs we whould only wake the waiting thread for the
very last PDU of the compound.
We do this so that we are guaranteed that the demultiplex_thread will
not process or access any of those MIDs any more once the send/recv
thread starts processing.

Else there is a race where at the end of the send/recv processing we
will try to delete all the mids of the compound. If the multiplex
thread still has other mids to process at this point for this compound
this can lead to an oops.

Needed to fix recent commit:
commit 730928c8f4
("cifs: update smb2_queryfs() to use compounding")

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-02 18:53:57 -05:00
Ronnie Sahlberg
ddf83afb9f cifs: add a warning if we try to to dequeue a deleted mid
cifs_delete_mid() is called once we are finished handling a mid and we
expect no more work done on this mid.

Needed to fix recent commit:
commit 730928c8f4
("cifs: update smb2_queryfs() to use compounding")

Add a warning if someone tries to dequeue a mid that has already been
flagged to be deleted.
Also change list_del() to list_del_init() so that if we have similar bugs
resurface in the future we will not oops.

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-10-02 18:12:31 -05:00
Aurelien Aptel
0595751f26 smb2: fix missing files in root share directory listing
When mounting a Windows share that is the root of a drive (eg. C$)
the server does not return . and .. directory entries. This results in
the smb2 code path erroneously skipping the 2 first entries.

Pseudo-code of the readdir() code path:

cifs_readdir(struct file, struct dir_context)
    initiate_cifs_search            <-- if no reponse cached yet
        server->ops->query_dir_first

    dir_emit_dots
        dir_emit                    <-- adds "." and ".." if we're at pos=0

    find_cifs_entry
        initiate_cifs_search        <-- if pos < start of current response
                                         (restart search)
        server->ops->query_dir_next <-- if pos > end of current response
                                         (fetch next search res)

    for(...)                        <-- loops over cur response entries
                                          starting at pos
        cifs_filldir                <-- skip . and .., emit entry
            cifs_fill_dirent
            dir_emit
	pos++

A) dir_emit_dots() always adds . & ..
   and sets the current dir pos to 2 (0 and 1 are done).

Therefore we always want the index_to_find to be 2 regardless of if
the response has . and ..

B) smb1 code initializes index_of_last_entry with a +2 offset

  in cifssmb.c CIFSFindFirst():
		psrch_inf->index_of_last_entry = 2 /* skip . and .. */ +
			psrch_inf->entries_in_buffer;

Later in find_cifs_entry() we want to find the next dir entry at pos=2
as a result of (A)

	first_entry_in_buffer = cfile->srch_inf.index_of_last_entry -
					cfile->srch_inf.entries_in_buffer;

This var is the dir pos that the first entry in the buffer will
have therefore it must be 2 in the first call.

If we don't offset index_of_last_entry by 2 (like in (B)),
first_entry_in_buffer=0 but we were instructed to get pos=2 so this
code in find_cifs_entry() skips the 2 first which is ok for non-root
shares, as it skips . and .. from the response but is not ok for root
shares where the 2 first are actual files

		pos_in_buf = index_to_find - first_entry_in_buffer;
                // pos_in_buf=2
		// we skip 2 first response entries :(
		for (i = 0; (i < (pos_in_buf)) && (cur_ent != NULL); i++) {
			/* go entry by entry figuring out which is first */
			cur_ent = nxt_dir_entry(cur_ent, end_of_smb,
						cfile->srch_inf.info_level);
		}

C) cifs_filldir() skips . and .. so we can safely ignore them for now.

Sample program:

int main(int argc, char **argv)
{
	const char *path = argc >= 2 ? argv[1] : ".";
	DIR *dh;
	struct dirent *de;

	printf("listing path <%s>\n", path);
	dh = opendir(path);
	if (!dh) {
		printf("opendir error %d\n", errno);
		return 1;
	}

	while (1) {
		de = readdir(dh);
		if (!de) {
			if (errno) {
				printf("readdir error %d\n", errno);
				return 1;
			}
			printf("end of listing\n");
			break;
		}
		printf("off=%lu <%s>\n", de->d_off, de->d_name);
	}

	return 0;
}

Before the fix with SMB1 on root shares:

<.>            off=1
<..>           off=2
<$Recycle.Bin> off=3
<bootmgr>      off=4

and on non-root shares:

<.>    off=1
<..>   off=4  <-- after adding .., the offsets jumps to +2 because
<2536> off=5       we skipped . and .. from response buffer (C)
<411>  off=6       but still incremented pos
<file> off=7
<fsx>  off=8

Therefore the fix for smb2 is to mimic smb1 behaviour and offset the
index_of_last_entry by 2.

Test results comparing smb1 and smb2 before/after the fix on root
share, non-root shares and on large directories (ie. multi-response
dir listing):

PRE FIX
=======
pre-1-root VS pre-2-root:
        ERR pre-2-root is missing [bootmgr, $Recycle.Bin]
pre-1-nonroot VS pre-2-nonroot:
        OK~ same files, same order, different offsets
pre-1-nonroot-large VS pre-2-nonroot-large:
        OK~ same files, same order, different offsets

POST FIX
========
post-1-root VS post-2-root:
        OK same files, same order, same offsets
post-1-nonroot VS post-2-nonroot:
        OK same files, same order, same offsets
post-1-nonroot-large VS post-2-nonroot-large:
        OK same files, same order, same offsets

REGRESSION?
===========
pre-1-root VS post-1-root:
        OK same files, same order, same offsets
pre-1-nonroot VS post-1-nonroot:
        OK same files, same order, same offsets

BugLink: https://bugzilla.samba.org/show_bug.cgi?id=13107
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Paulo Alcantara <palcantara@suse.deR>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-10-02 18:06:21 -05:00
Theodore Ts'o
18aded1749 ext4: fix EXT4_IOC_SWAP_BOOT
The code EXT4_IOC_SWAP_BOOT ioctl hasn't been updated in a while, and
it's a bit broken with respect to more modern ext4 kernels, especially
metadata checksums.

Other problems fixed with this commit:

* Don't allow installing a DAX, swap file, or an encrypted file as a
  boot loader.

* Respect the immutable and append-only flags.

* Wait until any DIO operations are finished *before* calling
  truncate_inode_pages().

* Don't swap inode->i_flags, since these flags have nothing to do with
  the inode blocks --- and it will give the IMA/audit code heartburn
  when the inode is evicted.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Reported-by: syzbot+e81ccd4744c6c4f71354@syzkaller.appspotmail.com
2018-10-02 18:21:19 -04:00
Gabriel Krisman Bertazi
799578ab16 ext4: fix build error when DX_DEBUG is defined
Enabling DX_DEBUG triggers the build error below.  info is an attribute
of  the dxroot structure.

linux/fs/ext4/namei.c:2264:12: error: ‘info’
undeclared (first use in this function); did you mean ‘insl’?
	   	  info->indirect_levels));

Fixes: e08ac99fa2 ("ext4: add largedir feature")
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
2018-10-02 12:43:51 -04:00
Theodore Ts'o
f18b2b83a7 ext4: fix argument checking in EXT4_IOC_MOVE_EXT
If the starting block number of either the source or destination file
exceeds the EOF, EXT4_IOC_MOVE_EXT should return EINVAL.

Also fixed the helper function mext_check_coverage() so that if the
logical block is beyond EOF, make it return immediately, instead of
looping until the block number wraps all the away around.  This takes
long enough that if there are multiple threads trying to do pound on
an the same inode doing non-sensical things, it can end up triggering
the kernel's soft lockup detector.

Reported-by: syzbot+c61979f6f2cba5cb3c06@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
2018-10-02 01:34:44 -04:00
Greg Kroah-Hartman
ef0f2584c2 Fixes for v4.19-rc7
- Fix failure-path memory leak in ramoops_init (nixiaoming)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAluxB3cWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJidvEACZlqemGsSVQ4elXWCW9EPqyVSn
 lbPg5ONurG/51J6313Ankgn7PrOI7WRd3iAXUWYByoc0DXn/jgz0i/B1+MtSnH4g
 fMeZ3DMnwmuC4h8/50/xmxbjKj2+vW7tX+978wWbYnvYNC+UXf8CN4J4MwBFmNR3
 DGH+oVCx2MITbIYQ3u5FIUgRJl0sD15GtxHg/l5Ff78dtUJVvlnD6ZAa9/rDBCPu
 DD0IqjJ5BqTmGw98L2tG0I2SDSrC8TGAYdQlZK/k7vHUqCWP6QspCpQQy3x6bh+W
 QRL6gCNEPZIGB+uAVbueCo80zRrx1NltbkbO4n9zn9ItYxpvOXJwGT4peYYXwabO
 nn+cWwRAJGITp9BFKIk/V8p05rpRhw8oeOBHIzwylb3C6bqK3ijnkk9mNSmUnLPG
 FtzRs/cYEMHi7n7+aygm1lNHn98PAWGLmomXLyxIUtsSH1jvFWG+jxLiRih/Oah5
 qjSGw2r681vLKsj2tQV4hpFvWaV83t2cO1QOldWSSkVHKJO9+pglEUGz9gfCvBRz
 bCvA0aPNGmMGTO0faQMB/TG2pMtOK94UU6kBIALxlEeHrtdpoOaQN6g++R7sY3/y
 SWo8BeGlNMSAtkWtd6Y7xxIF5N4PDDc6iMyjHGM/KCptANVu03BWlfOn32tel9jR
 17wiJi6SjqQjHVp7jw==
 =bzfz
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.19-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Kees writes:
  "Pstore fixes for v4.19-rc7

   - Fix failure-path memory leak in ramoops_init (nixiaoming)"

* tag 'pstore-v4.19-rc7' of https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore/ram: Fix failure-path memory leak in ramoops_init
2018-10-01 17:22:36 -07:00
Eric Whitney
f456767d33 ext4: fix reserved cluster accounting at page invalidation time
Add new code to count canceled pending cluster reservations on bigalloc
file systems and to reduce the cluster reservation count on all file
systems using delayed allocation.  This replaces old code in
ext4_da_page_release_reservations that was incorrect.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:33:24 -04:00
Eric Whitney
9fe671496b ext4: adjust reserved cluster count when removing extents
Modify ext4_ext_remove_space() and the code it calls to correct the
reserved cluster count for pending reservations (delayed allocated
clusters shared with allocated blocks) when a block range is removed
from the extent tree.  Pending reservations may be found for the clusters
at the ends of written or unwritten extents when a block range is removed.
If a physical cluster at the end of an extent is freed, it's necessary
to increment the reserved cluster count to maintain correct accounting
if the corresponding logical cluster is shared with at least one
delayed and unwritten extent as found in the extents status tree.

Add a new function, ext4_rereserve_cluster(), to reapply a reservation
on a delayed allocated cluster sharing blocks with a freed allocated
cluster.  To avoid ENOSPC on reservation, a flag is applied to
ext4_free_blocks() to briefly defer updating the freeclusters counter
when an allocated cluster is freed.  This prevents another thread
from allocating the freed block before the reservation can be reapplied.

Redefine the partial cluster object as a struct to carry more state
information and to clarify the code using it.

Adjust the conditional code structure in ext4_ext_remove_space to
reduce the indentation level in the main body of the code to improve
readability.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:25:08 -04:00
Eric Whitney
b6bf9171ef ext4: reduce reserved cluster count by number of allocated clusters
Ext4 does not always reduce the reserved cluster count by the number
of clusters allocated when mapping a delayed extent.  It sometimes
adds back one or more clusters after allocation if delalloc blocks
adjacent to the range allocated by ext4_ext_map_blocks() share the
clusters newly allocated for that range.  However, this overcounts
the number of clusters needed to satisfy future mapping requests
(holding one or more reservations for clusters that have already been
allocated) and premature ENOSPC and quota failures, etc., result.

Ext4 also does not reduce the reserved cluster count when allocating
clusters for non-delayed allocated writes that have previously been
reserved for delayed writes.  This also results in overcounts.

To make it possible to handle reserved cluster accounting for
fallocated regions in the same manner as used for other non-delayed
writes, do the reserved cluster accounting for them at the time of
allocation.  In the current code, this is only done later when a
delayed extent sharing the fallocated region is finally mapped.

Address comment correcting handling of unsigned long long constant
from Jan Kara's review of RFC version of this patch.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:24:08 -04:00
Eric Whitney
0b02f4c0d6 ext4: fix reserved cluster accounting at delayed write time
The code in ext4_da_map_blocks sometimes reserves space for more
delayed allocated clusters than it should, resulting in premature
ENOSPC, exceeded quota, and inaccurate free space reporting.

Fix this by checking for written and unwritten blocks shared in the
same cluster with the newly delayed allocated block.  A cluster
reservation should not be made for a cluster for which physical space
has already been allocated.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:19:37 -04:00
Eric Whitney
1dc0aa46e7 ext4: add new pending reservation mechanism
Add new pending reservation mechanism to help manage reserved cluster
accounting.  Its primary function is to avoid the need to read extents
from the disk when invalidating pages as a result of a truncate, punch
hole, or collapse range operation.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:17:41 -04:00
Eric Whitney
ad431025ae ext4: generalize extents status tree search functions
Ext4 contains a few functions that are used to search for delayed
extents or blocks in the extents status tree.  Rather than duplicate
code to add new functions to search for extents with different status
values, such as written or a combination of delayed and unwritten,
generalize the existing code to search for caller-specified extents
status values.  Also, move this code into extents_status.c where it
is better associated with the data structures it operates upon, and
where it can be more readily used to implement new extents status tree
functions that might want a broader scope for i_es_lock.

Three missing static specifiers in RFC version of patch reported and
fixed by Fengguang Wu <fengguang.wu@intel.com>.

Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2018-10-01 14:10:39 -04:00
Jens Axboe
c0aac682fa This is the 4.19-rc6 release
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEZH8oZUiU471FcZm+ONu9yGCSaT4FAluw4MIACgkQONu9yGCS
 aT7+8xAAiYnc4khUsxeInm3z44WPfRX1+UF51frTNSY5C8Nn5nvRSnTUNLuKkkrz
 8RbwCL6UYyJxF9I/oZdHPsPOD4IxXkQY55tBjz7ZbSBIFEwYM6RJMm8mAGlXY7wq
 VyWA5MhlpGHM9DjrguB4DMRipnrSc06CVAnC+ZyKLjzblzU1Wdf2dYu+AW9pUVXP
 j4r74lFED5djPY1xfqfzEwmYRCeEGYGx7zMqT3GrrF5uFPqj1H6O5klEsAhIZvdl
 IWnJTU2coC8R/Sd17g4lHWPIeQNnMUGIUbu+PhIrZ/lDwFxlocg4BvarPXEdzgYi
 gdZzKBfovpEsSu5RCQsKWG4IGQxY7I1p70IOP9eqEFHZy77qT1YcHVAWrK1Y/bJd
 UA08gUOSzRnhKkNR3+PsaMflUOl9WkpyHECZu394cyRGMutSS50aWkavJPJ/o1Qi
 D/oGqZLLcKFyuNcchG+Met1TzY3LvYEDgSburqwqeUZWtAsGs8kmiiq7qvmXx4zV
 IcgM8ERqJ8mbfhfsXQU7hwydIrPJ3JdIq19RnM5ajbv2Q4C/qJCyAKkQoacrlKR4
 aiow/qvyNrP80rpXfPJB8/8PiWeDtAnnGhM+xySZNlw3t8GR6NYpUkIzf5TdkSb3
 C8KuKg6FY9QAS62fv+5KK3LB/wbQanxaPNruQFGe5K1iDQ5Fvzw=
 =dMl4
 -----END PGP SIGNATURE-----

Merge tag 'v4.19-rc6' into for-4.20/block

Merge -rc6 in, for two reasons:

1) Resolve a trivial conflict in the blk-mq-tag.c documentation
2) A few important regression fixes went into upstream directly, so
   they aren't in the 4.20 branch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

* tag 'v4.19-rc6': (780 commits)
  Linux 4.19-rc6
  MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
  cpufreq: qcom-kryo: Fix section annotations
  perf/core: Add sanity check to deal with pinned event failure
  xen/blkfront: correct purging of persistent grants
  Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
  selftests/powerpc: Fix Makefiles for headers_install change
  blk-mq: I/O and timer unplugs are inverted in blktrace
  dax: Fix deadlock in dax_lock_mapping_entry()
  x86/boot: Fix kexec booting failure in the SEV bit detection code
  bcache: add separate workqueue for journal_write to avoid deadlock
  drm/amd/display: Fix Edid emulation for linux
  drm/amd/display: Fix Vega10 lightup on S3 resume
  drm/amdgpu: Fix vce work queue was not cancelled when suspend
  Revert "drm/panel: Add device_link from panel device to DRM device"
  xen/blkfront: When purging persistent grants, keep them in the buffer
  clocksource/drivers/timer-atmel-pit: Properly handle error cases
  block: fix deadline elevator drain for zoned block devices
  ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
  drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-01 08:58:57 -06:00
Miklos Szeredi
e52a825048 fuse: realloc page array
Writeback caching currently allocates requests with the maximum number of
possible pages, while the actual number of pages per request depends on a
couple of factors that cannot be determined when the request is allocated
(whether page is already under writeback, whether page is contiguous with
previous pages already added to a request).

This patch allows such requests to start with no page allocation (all pages
inline) and grow the page array on demand.

If the max_pages tunable remains the default value, then this will mean
just one allocation that is the same size as before.  If the tunable is
larger, then this adds at most 3 additional memory allocations (which is
generously compensated by the improved performance from the larger
request).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:06 +02:00
Constantine Shulyupin
5da784cce4 fuse: add max_pages to init_out
Replace FUSE_MAX_PAGES_PER_REQ with the configurable parameter max_pages to
improve performance.

Old RFC with detailed description of the problem and many fixes by Mitsuo
Hayasaka (mitsuo.hayasaka.hu@hitachi.com):
 - https://lkml.org/lkml/2012/7/5/136

We've encountered performance degradation and fixed it on a big and complex
virtual environment.

Environment to reproduce degradation and improvement:

1. Add lag to user mode FUSE
Add nanosleep(&(struct timespec){ 0, 1000 }, NULL); to xmp_write_buf in
passthrough_fh.c

2. patch UM fuse with configurable max_pages parameter. The patch will be
provided latter.

3. run test script and perform test on tmpfs
fuse_test()
{

       cd /tmp
       mkdir -p fusemnt
       passthrough_fh -o max_pages=$1 /tmp/fusemnt
       grep fuse /proc/self/mounts
       dd conv=fdatasync oflag=dsync if=/dev/zero of=fusemnt/tmp/tmp \
		count=1K bs=1M 2>&1 | grep -v records
       rm fusemnt/tmp/tmp
       killall passthrough_fh
}

Test results:

passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
	rw,nosuid,nodev,relatime,user_id=0,group_id=0 0 0
1073741824 bytes (1.1 GB) copied, 1.73867 s, 618 MB/s

passthrough_fh /tmp/fusemnt fuse.passthrough_fh \
	rw,nosuid,nodev,relatime,user_id=0,group_id=0,max_pages=256 0 0
1073741824 bytes (1.1 GB) copied, 1.15643 s, 928 MB/s

Obviously with bigger lag the difference between 'before' and 'after'
will be more significant.

Mitsuo Hayasaka, in 2012 (https://lkml.org/lkml/2012/7/5/136),
observed improvement from 400-550 to 520-740.

Signed-off-by: Constantine Shulyupin <const@MakeLinux.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:06 +02:00
Miklos Szeredi
8a7aa286ab fuse: allocate page array more efficiently
When allocating page array for a request the array for the page pointers
and the array for page descriptors are allocated by two separate kmalloc()
calls.  Merge these into one allocation.

Also instead of initializing the request and the page arrays with memset(),
use the zeroing allocation variants.

Reserved requests never carry pages (page array size is zero). Make that
explicit by initializing the page array pointers to NULL and make sure the
assumption remains true by adding a WARN_ON().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:05 +02:00
Miklos Szeredi
ab2257e994 fuse: reduce size of struct fuse_inode
Do this by grouping fields used for cached writes and putting them into a
union with fileds used for cached readdir (with obviously no overlap, since
we don't have hybrid objects).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:05 +02:00
Miklos Szeredi
261aaba72f fuse: use iversion for readdir cache verification
Use the internal iversion counter to make sure modifications of the
directory through this filesystem are not missed by the mtime check (due to
mtime granularity).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:05 +02:00
Miklos Szeredi
7118883b44 fuse: use mtime for readdir cache verification
Store the modification time of the directory in the cache, obtained before
starting to fill the cache.

When reading the cache, verify that the directory hasn't changed, by
checking if current modification time is the same as the one stored in the
cache.

This only needs to be done when the current file position is at the
beginning of the directory, as mandated by POSIX.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:04 +02:00
Miklos Szeredi
3494927e09 fuse: add readdir cache version
Allow the cache to be invalidated when page(s) have gone missing.  In this
case increment the version of the cache and reset to an empty state.

Add a version number to the directory stream in struct fuse_file as well,
indicating the version of the cache it's supposed to be reading.  If the
cache version doesn't match the stream's version, then reset the stream to
the beginning of the cache.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:04 +02:00
Miklos Szeredi
5d7bc7e868 fuse: allow using readdir cache
The cache is only used if it's completed, not while it's still being
filled; this constraint could be lifted later, if it turns out to be
useful.

Introduce state in struct fuse_file that indicates the position within the
cache.  After a seek, reset the position to the beginning of the cache and
search the cache for the current position.  If the current position is not
found in the cache, then fall back to uncached readdir.

It can also happen that page(s) disappear from the cache, in which case we
must also fall back to uncached readdir.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:04 +02:00
Miklos Szeredi
69e3455115 fuse: allow caching readdir
This patch just adds the cache filling functions, which are invoked if
FOPEN_CACHE_DIR flag is set in the OPENDIR reply.

Cache reading and cache invalidation are added by subsequent patches.

The directory cache uses the page cache.  Directory entries are packed into
a page in the same format as in the READDIR reply.  A page only contains
whole entries, the space at the end of the page is cleared.  The page is
locked while being modified.

Multiple parallel readdirs on the same directory can fill the cache; the
only constraint is that continuity must be maintained (d_off of last entry
points to position of current entry).

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-10-01 10:07:04 +02:00
Chao Yu
f847c699cf f2fs: allow out-place-update for direct IO in LFS mode
Normally, DIO uses in-pllace-update, but in LFS mode, f2fs doesn't
allow triggering any in-place-update writes, so we fallback direct
write to buffered write, result in bad performance of large size
write.

This patch adds to support triggering out-place-update for direct IO
to enhance its performance.

Note that it needs to exclude direct read IO during direct write,
since new data writing to new block address will no be valid until
write finished.

storage: zram

time xfs_io -f -d /mnt/f2fs/file -c "pwrite 0 1073741824" -c "fsync"

Before:
real	0m13.061s
user	0m0.327s
sys	0m12.486s

After:
real	0m6.448s
user	0m0.228s
sys	0m6.212s

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:42:50 -07:00
Chao Yu
39a8695824 f2fs: refactor ->page_mkwrite() flow
Thread A				Thread B
- f2fs_vm_page_mkwrite
					- f2fs_setattr
					 - down_write(i_mmap_sem)
					 - truncate_setsize
					 - f2fs_truncate
					 - up_write(i_mmap_sem)
 - f2fs_reserve_block
 reserve NEW_ADDR
 - skip dirty page due to truncation

1. we don't need to rserve new block address for a truncated page.
2. dn.data_blkaddr is used out of node page lock coverage.

Refactor ->page_mkwrite() flow to fix above issues:
- use __do_map_lock() to avoid racing checkpoint()
- lock data page in prior to dnode page
- cover f2fs_reserve_block with i_mmap_sem lock
- wait page writeback before zeroing page

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:41:22 -07:00
Chao Yu
bab475c541 Revert: "f2fs: check last page index in cached bio to decide submission"
There is one case that we can leave bio in f2fs, result in hanging
page writeback waiter.

Thread A				Thread B
- f2fs_write_cache_pages
 - f2fs_submit_page_write
 page #0 cached in bio #0 of cold log
 - f2fs_submit_page_write
 page #1 cached in bio #1 of warm log
					- f2fs_write_cache_pages
					 - f2fs_submit_page_write
					 bio is full, submit bio #1 contain page #1
 - f2fs_submit_merged_write_cond(, page #1)
 fail to submit bio #0 due to page #1 is not in any cached bios.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:39:54 -07:00
Junling Zheng
d440c52d31 f2fs: support superblock checksum
Now we support crc32 checksum for superblock.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:34:18 -07:00
Chao Yu
274bd9ba39 f2fs: add to account skip count of background GC
This patch adds to account skip count of background GC, and show stat
info via 'status' debugfs entry.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:33:34 -07:00
Chao Yu
b63e7be590 f2fs: add to account meta IO
This patch supports to account meta IO, it enables to show write IO
from f2fs more comprehensively via 'status' debugfs entry.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 18:33:21 -07:00
Jaegeuk Kim
095680f24f f2fs: keep lazytime on remount
This patch fixes losing lazytime when remounting f2fs.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-30 16:50:46 -07:00
Dave Chinner
e55ec4ddbe xfs: fix error handling in xfs_bmap_extents_to_btree
Commit 01239d77b9 ("xfs: fix a null pointer dereference in
xfs_bmap_extents_to_btree") attempted to fix a null pointer
dreference when a fuzzing corruption of some kind was found.
This fix was flawed, resulting in assert failures like:

XFS: Assertion failed: ifp->if_broot == NULL, file: fs/xfs/libxfs/xfs_bmap.c, line: 715
.....
Call Trace:
  xfs_bmap_extents_to_btree+0x6b9/0x7b0
  __xfs_bunmapi+0xae7/0xf00
  ? xfs_log_reserve+0x1c8/0x290
  xfs_reflink_remap_extent+0x20b/0x620
  xfs_reflink_remap_blocks+0x7e/0x290
  xfs_reflink_remap_range+0x311/0x530
  vfs_dedupe_file_range_one+0xd7/0xe0
  vfs_dedupe_file_range+0x15b/0x1a0
  do_vfs_ioctl+0x267/0x6c0

The problem is that the error handling code now asserts that the
inode fork is not in btree format before the error handling code
undoes the modifications that put the fork back in extent format.
Fix this by moving the assert back to after the xfs_iroot_realloc()
call that returns the fork to extent format, and clean up the jump
labels to be meaningful.

Also, returning ENOSPC when xfs_btree_get_bufl() fails to
instantiate the buffer that was allocated (the actual fix in the
commit mentioned above) is incorrect. This is a fatal error - only
an invalid block address or a filesystem shutdown can result in
failing to get a buffer here.

Hence change this to EFSCORRUPTED so that the higher layer knows
this was a corruption related failure and should not treat it as an
ENOSPC error.  This should result in a shutdown (via cancelling a
dirty transaction) which is necessary as we do not attempt to clean
up the (invalid) block that we have already allocated.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-10-01 08:11:07 +10:00
Trond Myklebust
c7944ebb9c NFSv4: Fix lookup revalidate of regular files
If we're revalidating an existing dentry in order to open a file, we need
to ensure that we check the directory has not changed before we optimise
away the lookup.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:18 -04:00
Trond Myklebust
5ceb9d7fda NFS: Refactor nfs_lookup_revalidate()
Refactor the code in nfs_lookup_revalidate() as a stepping stone towards
optimising and fixing nfs4_lookup_revalidate().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:18 -04:00
Trond Myklebust
be189f7e7f NFS: Fix dentry revalidation on NFSv4 lookup
We need to ensure that inode and dentry revalidation occurs correctly
on reopen of a file that is already open. Currently, we can end up
not revalidating either in the case of NFSv4.0, due to the 'cached open'
path.
Let's fix that by ensuring that we only do cached open for the special
cases of open recovery and delegation return.

Reported-by: Stan Hu <stanhu@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:18 -04:00
Trond Myklebust
1c6c4b740d NFS: Remove private spinlock in struct nfs_pgio_header
Now that each struct nfs_pgio_header corresponds to one RPC call, we
only have one writer to the struct nfs_pgio_header.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
8d8928d879 NFSv3: Improve NFSv3 performance when server returns no post-op attributes
When the server fails to return post-op attributes, the client's
attempt to place read data directly in the page cache fails, and
so we have to do an extra copy in order to realign the data with
page borders.
This patch attempts to detect servers that don't return post-op
attributes on read (e.g. for pNFS) and adjusts the placement
calculation accordingly.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Anna Schumaker
80f4236886 NFSv4: Split out NFS v4.2 copy completion functions
The convention in the rest of the code is to have a separate function
for anything that might be ifdef-ed out.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Anna Schumaker
000d3f9566 NFS: Reduce indentation of nfs4_recovery_handle_error()
This is to match kernel coding style for switch statements.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Anna Schumaker
35a61606a6 NFS: Reduce indentation of the switch statement in nfs4_reclaim_open_state()
Most places in the kernel tend to line up cases with the switch to
reduce indentation, so move this over to match that style.
Additionally, I handle the (status >= 0) case in the switch so that we
only "goto restart" from a single place after error handling.

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Anna Schumaker
cb7a8384dc NFS: Split out the body of nfs4_reclaim_open_state()
Moving all of this into a new function removes the need for cramped
indentation, making the code overall easier to look at.   I also take
this chance to switch copy recovery over to using
nfs4_stateid_match_other()

Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Tigran Mkrtchyan
10ec57e4c5 nfs4: flex_file: ignore synthetic uid/gid for tightly coupled DSes
for tightly coupled DSes client must ignore provided synthetic uid and
gid as stated in draft-ietf-nfsv4-flex-files-19#section-5.1.

Signed-off-by: Tigran Mkrtchyan <tigran.mkrtchyan@desy.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
943cff67b8 NFSv4.1: Fix the r/wsize checking
The intention of nfs4_session_set_rwsize() was to cap the r/wsize to the
buffer sizes negotiated by the CREATE_SESSION. The initial code had a
bug whereby we would not check the values negotiated by nfs_probe_fsinfo()
(the assumption being that CREATE_SESSION will always negotiate buffer values
that are sane w.r.t. the server's preferred r/wsizes) but would only check
values set by the user in the 'mount' command.

The code was changed in 4.11 to _always_ set the r/wsize, meaning that we
now never use the server preferred r/wsizes. This is the regression that
this patch fixes.
Also rename the function to nfs4_session_limit_rwsize() in order to avoid
future confusion.

Fixes: 033853325f (NFSv4.1 respect server's max size in CREATE_SESSION")
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
ace9fad43a NFSv4: Convert struct nfs4_state to use refcount_t
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
9ae075fdd1 NFSv4: Convert open state lookup to use RCU
Further reduce contention on the inode->i_lock.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
0de43976fb NFS: Convert lookups of the open context to RCU
Reduce contention on the inode->i_lock by ensuring that we use RCU
when looking up the NFS open context.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
6ba0c4e5bb NFS: Simplify internal check for whether file is open for write
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:17 -04:00
Trond Myklebust
1db97eaa0b NFS: Convert lookups of the lock context to RCU
Speed up lookups of an existing lock context by avoiding the inode->i_lock,
and using RCU instead.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
28ced9a84c pNFS: Don't allocate more pages than we need to fit a layoutget response
For the 'files' and 'flexfiles' layout types, we do not expect the reply
to be any larger than 4k. The block and scsi layout types are a little more
greedy, so we keep allocating the maximum response size for now.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
a2791d3a2c pNFS: Don't zero out the array in nfs4_alloc_pages()
We don't need a zeroed out array, since it is immediately being filled.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Trond Myklebust
431f6eb357 SUNRPC: Add a label for RPC calls that require allocation on receive
If the RPC call relies on the receive call allocating pages as buffers,
then let's label it so that we
a) Don't leak memory by allocating pages for requests that do not expect
   this behaviour
b) Can optimise for the common case where calls do not require allocation.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2018-09-30 15:35:16 -04:00
Miguel Ojeda
f0604f6303 Compiler Attributes: ext4: remove local __nonstring definition
Commit 072ebb3bff ("ext4: add nonstring annotations to ext4.h")
introduced a local definition of __nonstring to suppress some false
positives in gcc 8's -Wstringop-truncation.

Since now we support __nonstring for everyone, remove it.

Tested-by: Sedat Dilek <sedat.dilek@gmail.com> # on top of v4.19-rc5, clang 7
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2018-09-30 20:14:04 +02:00
Kees Cook
bac6f6cda2 pstore/ram: Fix failure-path memory leak in ramoops_init
As reported by nixiaoming, with some minor clarifications:

1) memory leak in ramoops_register_dummy():
   dummy_data = kzalloc(sizeof(*dummy_data), GFP_KERNEL);
   but no kfree() if platform_device_register_data() fails.

2) memory leak in ramoops_init():
   Missing platform_device_unregister(dummy) and kfree(dummy_data)
   if platform_driver_register(&ramoops_driver) fails.

I've clarified the purpose of ramoops_register_dummy(), and added a
common cleanup routine for all three failure paths to call.

Reported-by: nixiaoming <nixiaoming@huawei.com>
Cc: stable@vger.kernel.org
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Colin Cross <ccross@android.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-30 10:15:41 -07:00
Greg Kroah-Hartman
9ba6873e16 filesystem-dax for 4.19-rc6
Fix a deadlock in the new for 4.19 dax_lock_mapping_entry() routine.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbsDBrAAoJEB7SkWpmfYgCi54QAIz8yFZI5+5+amG/L/F9mGe4
 sagcSPsk67EzzTDhnKASTlRmpm0+LWzQckY7o/fDRoM0VQVKjXVDke4VTDnHFg7W
 JfZMN24dg6Pcbq3CxuSXMOiWd8vXSnLL2Myin+fQ/kY1rxnIz2ZYNWxCQsLvdPiC
 VKJAbpYlcG41HZPPnRkMaRBxf2INUraSgyHFoehbgvlwLD7YUOzPh9strauutK5M
 xljv2d/yjfaW4U6DhQhUSo+sDYRLGDkbqQw6ZoVqbODA0IXdY6ytiCujLLD9xODg
 lDKF68jCX/+lFIURm8BRpX9iqHvfILC5el61a4bTxjJ6XUf+Ok5vgkeZFDfQKziC
 rLqm09NTQ5Xu0MJ8Ql+5cqAFqBMA7Uy1zF6l8DnGFCtMV/S0H/TgdXWLzHjRXQvE
 18ekLqTcRk5UmPXJYJ829ln0TKTd3zyuVgwuLuGAeO97m431y3K2Q74ncPahgE9+
 W0nduPFTmMikohcKah2P3mQWGtUAYWodQsEs+Y9gJPyoDic+fmjo+mI0xg7CeFL4
 kpfug45i8hdbnlHrHOJ6bz7fRq7CvaaRaI3gOvFfuN2TJVY8Qfs/8JD4HN8F7u+r
 zDPVnvkutaYV1uOOBU4nDzPJ+naVGlpOj1/tsMU4ikj3LbfkfW+gxsr6XGZYPU81
 qYjEfXm60ritFoAA5dVV
 =3l8a
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-fixes2-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Dan writes:
  "filesystem-dax for 4.19-rc6

   Fix a deadlock in the new for 4.19 dax_lock_mapping_entry() routine."

* tag 'libnvdimm-fixes2-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: Fix deadlock in dax_lock_mapping_entry()
2018-09-30 06:19:38 -07:00
Matthew Wilcox
3159f943aa xarray: Replace exceptional entries
Introduce xarray value entries and tagged pointers to replace radix
tree exceptional entries.  This is a slight change in encoding to allow
the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
value entry).  It is also a change in emphasis; exceptional entries are
intimidating and different.  As the comment explains, you can choose
to store values or pointers in the xarray and they are both first-class
citizens.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Josef Bacik <jbacik@fb.com>
2018-09-29 22:47:49 -04:00
Matthew Wilcox
3d0186bb06 Update email address
Redirect some older email addresses that are in the git logs.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-09-29 22:47:48 -04:00
Brian Foster
561295a325 iomap: set page dirty after partial delalloc on mkwrite
The iomap page fault mechanism currently dirties the associated page
after the full block range of the page has been allocated. This
leaves the page susceptible to delayed allocations without ever
being set dirty on sub-page block sized filesystems.

For example, consider a page fault on a page with one preexisting
real (non-delalloc) block allocated in the middle of the page. The
first iomap_apply() iteration performs delayed allocation on the
range up to the preexisting block, the next iteration finds the
preexisting block, and the last iteration attempts to perform
delayed allocation on the range after the prexisting block to the
end of the page. If the first allocation succeeds and the final
allocation fails with -ENOSPC, iomap_apply() returns the error and
iomap_page_mkwrite() fails to dirty the page having already
performed partial delayed allocation. This eventually results in the
page being invalidated without ever converting the delayed
allocation to real blocks.

This problem is reliably reproduced by generic/083 on XFS on ppc64
systems (64k page size, 4k block size). It results in leaked
delalloc blocks on inode reclaim, which triggers an assert failure
in xfs_fs_destroy_inode() and filesystem accounting inconsistency.

Move the set_page_dirty() call from iomap_page_mkwrite() to the
actor callback, similar to how the buffer head implementation works.
The actor callback is called iff ->iomap_begin() returns success, so
ensures the page is dirtied as soon as possible after an allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:51:01 +10:00
Brian Foster
ec2ed0b5e9 xfs: remove invalid log recovery first/last cycle check
One of the first steps of log recovery is to check for the special
case of a zeroed log. If the first cycle in the log is zero or the
tail portion of the log is zeroed, the head is set to the first
instance of cycle 0. xlog_find_zeroed() includes a sanity check that
enforces that the first cycle in the log must be 1 if the last cycle
is 0. While this is true in most cases, the check is not totally
valid because it doesn't consider the case where the filesystem
crashed after a partial/out of order log buffer completion that
wraps around the end of the physical log.

For example, consider a filesystem that has completed most of the
first cycle of the log, reaches the end of the physical log and
splits the next single log buffer write into two in order to wrap
around the end of the log. If these I/Os are reordered, the second
(wrapped) I/O completes and the first happens to fail, the log is
left in a state where the last cycle of the log is 0 and the first
cycle is 2. This causes the xlog_find_zeroed() sanity check to fail
and prevents the filesystem from mounting. This situation has been
reproduced on particular systems via repeated runs of generic/475.

This is an expected state that log recovery already knows how to
deal with, however. Since the log is still partially zeroed, the
head is detected correctly and points to a valid tail. The
subsequent stale block detection clears blocks beyond the head up to
the tail (within a maximum range), with the express purpose of
clearing such out of order writes. As expected, this removes the out
of order cycle 2 blocks at the physical start of the log.

In other words, the only thing that prevents a clean mount and
recovery of the filesystem in this scenario is the specific (last ==
0 && first != 1) sanity check in xlog_find_zeroed(). Since the log
head/tail are now independently validated via cycle, log record and
CRC checks, this highly specific first cycle check is of dubious
value. Remove it and rely on the higher level validation to
determine whether log content is sane and recoverable.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:50:41 +10:00
Eric Sandeen
339e1a3fcd xfs: validate inode di_forkoff
Verify the inode di_forkoff, lifted from xfs_repair's
process_check_inode_forkoff().

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:50:13 +10:00
Christoph Hellwig
f5f3f959b7 xfs: skip delalloc COW blocks in xfs_reflink_end_cow
The iomap direct I/O code issues a single ->end_io call for the whole
I/O request, and if some of the extents cowered needed a COW operation
it will call xfs_reflink_end_cow over the whole range.

When we do AIO writes we drop the iolock after doing the initial setup,
but before the I/O completion.  Between dropping the lock and completing
the I/O we can have a racing buffered write create new delalloc COW fork
extents in the region covered by the outstanding direct I/O write, and
thus see delalloc COW fork extents in xfs_reflink_end_cow.  As
concurrent writes are fundamentally racy and no guarantees are given we
can simply skip those.

This can be easily reproduced with xfstests generic/208 in always_cow
mode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:49:58 +10:00
Eric Sandeen
f369a13cea xfs: don't treat unknown di_flags2 as corruption in scrub
xchk_inode_flags2() currently treats any di_flags2 values that the
running kernel doesn't recognize as corruption, and calls
xchk_ino_set_corrupt() if they are set.  However, it's entirely possible
that these flags were set in some newer kernel and are quite valid,
but ignored in this kernel.

(Validators don't care one bit about unknown di_flags2.)

Call xchk_ino_set_warning instead, because this may or may not actually
indicate a problem.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:49:00 +10:00
YueHaibing
2863c2ebc4 xfs: remove duplicated include from alloc.c
Remove duplicated include xfs_alloc.h

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:48:21 +10:00
Christoph Hellwig
0065b54119 xfs: don't bring in extents in xfs_bmap_punch_delalloc_range
This function is only used to punch out delayed allocations on I/O
failure, which means we need to have read the extents earlier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:47:46 +10:00
Dave Chinner
df30707791 xfs: fix transaction leak in xfs_reflink_allocate_cow()
When xfs_reflink_allocate_cow() allocates a transaction, it drops
the ILOCK to perform the operation. This Introduces a race condition
where another thread modifying the file can perform the COW
allocation operation underneath us. This result in the retry loop
finding an allocated block and jumping straight to the conversion
code. It does not, however, cancel the transaction it holds and so
this gets leaked. This results in a lockdep warning:

================================================
WARNING: lock held when returning to user space!
4.18.5 #1 Not tainted
------------------------------------------------
worker/6123 is leaving the kernel with locks still held!
1 lock held by worker/6123:
 #0: 000000009eab4f1b (sb_internal#2){.+.+}, at: xfs_trans_alloc+0x17c/0x220

And eventually the filesystem deadlocks because it runs out of log
space that is reserved by the leaked transaction and never gets
released.

The logic flow in xfs_reflink_allocate_cow() is a convoluted mess of
gotos - it's no surprise that it has bug where the flow through
several goto jumps then fails to clean up context from a non-obvious
logic path. CLean up the logic flow and make sure every path does
the right thing.

Reported-by: Alexander Y. Fomichev <git.user@gmail.com>
Tested-by: Alexander Y. Fomichev <git.user@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=200981
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[hch: slight refactor]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:47:15 +10:00
Dave Chinner
8683edb775 xfs: avoid lockdep false positives in xfs_trans_alloc
We've had a few reports of lockdep tripping over memory reclaim
context vs filesystem freeze "deadlocks". They all have looked
to be false positives on analysis, but it seems that they are
being tripped because we take freeze references before we run
a GFP_KERNEL allocation for the struct xfs_trans.

We can avoid this false positive vector just by re-ordering the
operations in xfs_trans_alloc(). That is. we need allocate the
structure before we take the freeze reference and enter the GFP_NOFS
allocation context that follows the xfs_trans around. This prevents
lockdep from seeing the GFP_KERNEL allocation inside the transaction
context, and that prevents it from triggering the freeze level vs
alloc context vs reclaim warnings.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-09-29 13:46:21 +10:00
Brian Foster
95808459b1 xfs: refactor xfs_buf_log_item reference count handling
The xfs_buf_log_item structure has a reference counter with slightly
tricky semantics. In the common case, a buffer is logged and
committed in a transaction, committed to the on-disk log (added to
the AIL) and then finally written back and removed from the AIL. The
bli refcount covers two potentially overlapping timeframes:

 1. the bli is held in an active transaction
 2. the bli is pinned by the log

The caveat to this approach is that the reference counter does not
purely dictate the lifetime of the bli. IOW, when a dirty buffer is
physically logged and unpinned, the bli refcount may go to zero as
the log item is inserted into the AIL. Only once the buffer is
written back can the bli finally be freed.

The above semantics means that it is not enough for the various
refcount decrementing contexts to release the bli on decrement to
zero. xfs_trans_brelse(), transaction commit (->iop_unlock()) and
unpin (->iop_unpin()) must all drop the associated reference and
make additional checks to determine if the current context is
responsible for freeing the item.

For example, if a transaction holds but does not dirty a particular
bli, the commit may drop the refcount to zero. If the bli itself is
clean, it is also not AIL resident and must be freed at this time.
The same is true for xfs_trans_brelse(). If the transaction dirties
a bli and then aborts or an unpin results in an abort due to a log
I/O error, the last reference count holder is expected to explicitly
remove the item from the AIL and release it (since an abort means
filesystem shutdown and metadata writeback will never occur).

This leads to fairly complex checks being replicated in a few
different places. Since ->iop_unlock() and xfs_trans_brelse() are
nearly identical, refactor the logic into a common helper that
implements and documents the semantics in one place. This patch does
not change behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:45:26 +10:00
Brian Foster
23420d05e6 xfs: clean up xfs_trans_brelse()
xfs_trans_brelse() is a bit of a historical mess, similar to
xfs_buf_item_unlock(). It is unnecessarily verbose, has snippets of
commented out code, inconsistency with regard to stale items, etc.

Clean up xfs_trans_brelse() to use similar logic and flow as
xfs_buf_item_unlock() with regard to bli reference count handling.
This patch makes no functional changes, but facilitates further
refactoring of the common bli reference count handling code.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:45:02 +10:00
Brian Foster
d9183105ca xfs: don't unlock invalidated buf on aborted tx commit
xfstests generic/388,475 occasionally reproduce assertion failures
in xfs_buf_item_unpin() when the final bli reference is dropped on
an invalidated buffer and the buffer is not locked as it is expected
to be. Invalidated buffers should remain locked on transaction
commit until the final unpin, at which point the buffer is removed
from the AIL and the bli is freed since stale buffers are not
written back.

The assert failures are associated with filesystem shutdown,
typically due to log I/O errors injected by the test. The
problematic situation can occur if the shutdown happens to cause a
race between an active transaction that has invalidated a particular
buffer and an I/O error on a log buffer that contains the bli
associated with the same (now stale) buffer.

Both transaction and log contexts acquire a bli reference. If the
transaction has already invalidated the buffer by the time the I/O
error occurs and ends up aborting due to shutdown, the transaction
and log hold the last two references to a stale bli. If the
transaction cancel occurs first, it treats the buffer as non-stale
due to the aborted state: the bli reference is dropped and the
buffer is released/unlocked. The log buffer I/O error handling
eventually calls into xfs_buf_item_unpin(), drops the final
reference to the bli and treats it as stale. The buffer wasn't left
locked by xfs_buf_item_unlock(), however, so the assert fails and
the buffer is double unlocked. The latter problem is mitigated by
the fact that the fs is shutdown and no further damage is possible.

->iop_unlock() of an invalidated buffer should behave consistently
with respect to the bli refcount, regardless of aborted state. If
the refcount remains elevated on commit, we know the bli is awaiting
an unpin (since it can't be in another transaction) and will be
handled appropriately on log buffer completion. If the final bli
reference of an invalidated buffer is dropped in ->iop_unlock(), we
can assume the transaction has aborted because invalidation implies
a dirty transaction. In the non-abort case, the log would have
acquired a bli reference in ->iop_pin() and prevented bli release at
->iop_unlock() time. In the abort case the item must be freed and
buffer unlocked because it wasn't pinned by the log.

Rework xfs_buf_item_unlock() to simplify the currently circuitous
and duplicate logic and leave invalidated buffers locked based on
bli refcount, regardless of aborted state. This ensures that a
pinned, stale buffer is always found locked when eventually
unpinned.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:44:40 +10:00
Brian Foster
d5a2e2893d xfs: remove last of unnecessary xfs_defer_cancel() callers
Now that deferred operations are completely managed via
transactions, it's no longer necessary to cancel the dfops in error
paths that already cancel the associated transaction. There are a
few such calls lingering throughout the codebase.

Remove all remaining unnecessary calls to xfs_defer_cancel(). This
leaves xfs_defer_cancel() calls in two places. The first is the call
in the transaction cancel path itself, which facilitates this patch.
The second is made via the xfs_defer_finish() error path to provide
consistent error semantics with transaction commit. For example,
xfs_trans_commit() expects an xfs_defer_finish() failure to clean up
the dfops structure before it returns.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:41:58 +10:00
Darrick J. Wong
ae29478766 xfs: don't crash the vfs on a garbage inline symlink
The VFS routine that calls ->get_link blindly copies whatever's returned
into the user's buffer.  If we return a NULL pointer, the vfs will
crash on the null pointer.  Therefore, return -EFSCORRUPTED instead of
blowing up the kernel.

[dgc: clean up with hch's suggestions]

Reported-by: wen.xu@gatech.edu
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2018-09-29 13:40:40 +10:00
Jaegeuk Kim
89d13c3850 f2fs: fix missing up_read
This patch fixes missing up_read call.

Fixes: c9b60788fc ("f2fs: fix to do sanity check with block address in main area")
Cc: <stable@vger.kernel.org> # 4.19+
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-28 19:11:38 -07:00
Jaegeuk Kim
61f7725aa1 f2fs: return correct errno in f2fs_gc
This fixes overriding error number in f2fs_gc.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-28 10:40:46 -07:00
Jaegeuk Kim
edc55aaf0d f2fs: avoid f2fs_bug_on if f2fs_get_meta_page_nofail got EIO
This patch avoids BUG_ON when f2fs_get_meta_page_nofail got EIO during
xfstests/generic/475.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-28 10:39:58 -07:00
Miklos Szeredi
18172b10b6 fuse: extract fuse_emit() helper
Prepare for cache filling by introducing a helper for emitting a single
directory entry.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Miklos Szeredi
d123d8e183 fuse: split out readdir.c
Directory reading code is about to grow larger, so split it out from dir.c
into a new source file.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
be2ff42c5d fuse: Use hash table to link processing request
We noticed the performance bottleneck in FUSE running our Virtuozzo storage
over rdma. On some types of workload we observe 20% of times spent in
request_find() in profiler.  This function is iterating over long requests
list, and it scales bad.

The patch introduces hash table to reduce the number of iterations, we do
in this function. Hash generating algorithm is taken from hash_add()
function, while 256 lines table is used to store pending requests.  This
fixes problem and improves the performance.

Reported-by: Alexey Kuznetsov <kuznet@virtuozzo.com>
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
3a5358d1a1 fuse: kill req->intr_unique
This field is not needed after the previous patch, since we can easily
convert request ID to interrupt request ID and vice versa.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
c59fd85e4f fuse: change interrupt requests allocation algorithm
Using of two unconnected IDs req->in.h.unique and req->intr_unique does not
allow to link requests to a hash table. We need can't use none of them as a
key to calculate hash.

This patch changes the algorithm of allocation of IDs for a request. Plain
requests obtain even ID, while interrupt requests are encoded in the low
bit. So, in next patches we will be able to use the rest of ID bits to
calculate hash, and the hash will be the same for plain and interrupt
requests.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
63825b4e1d fuse: do not take fc->lock in fuse_request_send_background()
Currently, we take fc->lock there only to check for fc->connected.
But this flag is changed only on connection abort, which is very
rare operation.

So allow checking fc->connected under just fc->bg_lock and use this lock
(as well as fc->lock) when resetting fc->connected.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:23 +02:00
Kirill Tkhai
ae2dffa394 fuse: introduce fc->bg_lock
To reduce contention of fc->lock, this patch introduces bg_lock for
protection of fields related to background queue. These are:
max_background, congestion_threshold, num_background, active_background,
bg_queue and blocked.

This allows next patch to make async reads not requiring fc->lock, so async
reads and writes will have better performance executed in parallel.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Kirill Tkhai
2b30a53314 fuse: add locking to max_background and congestion_threshold changes
Functions sequences like request_end()->flush_bg_queue() require that
max_background and congestion_threshold are constant during their
execution. Otherwise, checks like

	if (fc->num_background == fc->max_background)

made in different time may behave not like expected.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Kirill Tkhai
2a23f2b8ad fuse: use READ_ONCE on congestion_threshold and max_background
Since they are of unsigned int type, it's allowed to read them
unlocked during reporting to userspace. Let's underline this fact
with READ_ONCE() macroses.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Kirill Tkhai
e287179afe fuse: use list_first_entry() in flush_bg_queue()
This cleanup patch makes the function to use the primitive
instead of direct dereferencing.

Also, move fiq dereferencing out of cycle, since it's
always constant.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Niels de Vos
88bc7d5097 fuse: add support for copy_file_range()
There are several FUSE filesystems that can implement server-side copy
or other efficient copy/duplication/clone methods. The copy_file_range()
syscall is the standard interface that users have access to while not
depending on external libraries that bypass FUSE.

Signed-off-by: Niels de Vos <ndevos@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-28 16:43:22 +02:00
Miklos Szeredi
908a572b80 fuse: fix blocked_waitq wakeup
Using waitqueue_active() is racy.  Make sure we issue a wake_up()
unconditionally after storing into fc->blocked.  After that it's okay to
optimize with waitqueue_active() since the first wake up provides the
necessary barrier for all waiters, not the just the woken one.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 3c18ef8117 ("fuse: optimize wake_up")
Cc: <stable@vger.kernel.org> # v3.10
2018-09-28 16:43:22 +02:00
Miklos Szeredi
4c316f2f3f fuse: set FR_SENT while locked
Otherwise fuse_dev_do_write() could come in and finish off the request, and
the set_bit(FR_SENT, ...) could trigger the WARN_ON(test_bit(FR_SENT, ...))
in request_end().

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reported-by: syzbot+ef054c4d3f64cd7f7cec@syzkaller.appspotmai
Fixes: 46c34a348b ("fuse: no fc->lock for pqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
2018-09-28 16:43:22 +02:00
Kirill Tkhai
d2d2d4fb1f fuse: Fix use-after-free in fuse_dev_do_write()
After we found req in request_find() and released the lock,
everything may happen with the req in parallel:

cpu0                              cpu1
fuse_dev_do_write()               fuse_dev_do_write()
  req = request_find(fpq, ...)    ...
  spin_unlock(&fpq->lock)         ...
  ...                             req = request_find(fpq, oh.unique)
  ...                             spin_unlock(&fpq->lock)
  queue_interrupt(&fc->iq, req);   ...
  ...                              ...
  ...                              ...
  request_end(fc, req);
    fuse_put_request(fc, req);
  ...                              queue_interrupt(&fc->iq, req);


Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 46c34a348b ("fuse: no fc->lock for pqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
2018-09-28 16:43:21 +02:00
Kirill Tkhai
bc78abbd55 fuse: Fix use-after-free in fuse_dev_do_read()
We may pick freed req in this way:

[cpu0]                                  [cpu1]
fuse_dev_do_read()                      fuse_dev_do_write()
   list_move_tail(&req->list, ...);     ...
   spin_unlock(&fpq->lock);             ...
   ...                                  request_end(fc, req);
   ...                                    fuse_put_request(fc, req);
   if (test_bit(FR_INTERRUPTED, ...))
         queue_interrupt(fiq, req);

Fix that by keeping req alive until we finish all manipulations.

Reported-by: syzbot+4e975615ca01f2277bdd@syzkaller.appspotmail.com
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: 46c34a348b ("fuse: no fc->lock for pqueue parts")
Cc: <stable@vger.kernel.org> # v4.2
2018-09-28 16:43:21 +02:00
Greg Kroah-Hartman
c127e59bee \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAlusxsAACgkQnJ2qBz9k
 QNnfhAgAlvqmbnw9LFG+p0DbByFshuf7r/ygruOGKgWZ3/z2ES6Zg4eUpvgQGK04
 3+c2TWgnAgXRPOLwIQkxndXTCzR8hxueIfpnVTzNildQ2R3PNpVM8Q+DPmVrgfYo
 OOFB/bXig1EKtLmlXdNSHOSlqRc7G+xmG1zx04to1ecYjr8ISp9BvnedfNe4iNkD
 PHoaZsb1Lic8/rLpabXFrYe9wI4udesE3H4uQWckfAW6wL4w7+x9FFEFgoSHmkSL
 AxCDGZnvZAMUqECyYQ806UrihXmXxFbGVKY9LUWOzcn9TyvpiyLLlCsGn9Dpc3vA
 67o2CVIDR+PyqapHY3HAF+jJuuV15Q==
 =vI4P
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Jan writes:
  "an ext2 patch fixing fsync(2) for DAX mounts."

* tag 'for_v4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext2, dax: set ext2_dax_aops for dax files
2018-09-27 21:16:24 +02:00
Jan Kara
f52afc93cd dax: Fix deadlock in dax_lock_mapping_entry()
When dax_lock_mapping_entry() has to sleep to obtain entry lock, it will
fail to unlock mapping->i_pages spinlock and thus immediately deadlock
against itself when retrying to grab the entry lock again. Fix the
problem by unlocking mapping->i_pages before retrying.

Fixes: c2a7d2a115 ("filesystem-dax: Introduce dax_lock_mapping_entry()")
Reported-by: Barret Rhoden <brho@google.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-09-27 10:56:15 -07:00
Amir Goldstein
96a71f21ef fanotify: store fanotify_init() flags in group's fanotify_data
This averts the need to re-generate flags in fanotify_show_fdinfo()
and sets the scene for addition of more upcoming flags without growing
new members to the fanotify_data struct.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-27 15:29:00 +02:00
Chao Yu
4a1728cad6 f2fs: mark inode dirty explicitly in recover_inode()
Mark inode dirty explicitly in the end of recover_inode() to make sure
that all recoverable fields can be persisted later.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
5cd1f387a1 f2fs: fix to recover inode's crtime during POR
Testcase to reproduce this bug:
1. mkfs.f2fs -O extra_attr -O inode_crtime /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. xfs_io -f /mnt/f2fs/file -c "fsync"
5. godown /mnt/f2fs
6. umount /mnt/f2fs
7. mount -t f2fs /dev/sdd /mnt/f2fs
8. xfs_io -f /mnt/f2fs/file -c "statx -r"

stat.btime.tv_sec = 0
stat.btime.tv_nsec = 0

This patch fixes to recover inode creation time fields during
mount.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
7de36cf3e4 f2fs: fix to recover inode's i_gc_failures during POR
inode.i_gc_failures is used to indicate that skip count of migrating
on blocks of inode, we should guarantee it can be recovered in sudden
power-off case.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
19c73a691c f2fs: fix to recover inode's i_flags during POR
Testcase to reproduce this bug:
1. mkfs.f2fs /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr +A /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. lsattr /mnt/f2fs/file

-----------------N- /mnt/f2fs/file

But actually, we expect the corrct result is:

-------A---------N- /mnt/f2fs/file

The reason is we didn't recover inode.i_flags field during mount,
fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Chao Yu
f4474aa6e5 f2fs: fix to recover inode's project id during POR
Testcase to reproduce this bug:
1. mkfs.f2fs -O extra_attr -O project_quota /dev/sdd
2. mount -t f2fs /dev/sdd /mnt/f2fs
3. touch /mnt/f2fs/file
4. sync
5. chattr -p 1 /mnt/f2fs/file
6. xfs_io -f /mnt/f2fs/file -c "fsync"
7. godown /mnt/f2fs
8. umount /mnt/f2fs
9. mount -t f2fs /dev/sdd /mnt/f2fs
10. lsattr -p /mnt/f2fs/file

    0 -----------------N- /mnt/f2fs/file

But actually, we expect the correct result is:

    1 -----------------N- /mnt/f2fs/file

The reason is we didn't recover inode.i_projid field during mount,
fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:33 -07:00
Jaegeuk Kim
0a4daae5ff f2fs: update i_size after DIO completion
This is related to
ee70daaba8 ("xfs: update i_size after unwritten conversion in dio completion")

If we update i_size during dio_write, dio_read can read out stale data, which
breaks xfstests/465.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:45:34 -07:00
Jaegeuk Kim
d83d0f5ba8 f2fs: report ENOENT correctly in f2fs_rename
This fixes wrong error report in f2fs_rename.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-26 12:38:11 -07:00
Chengguang Xu
c6b1867b1d f2fs: fix remount problem of option io_bits
Currently we show mount option "io_bits=%u" as "io_size=%uKB",
it will cause option parsing problem(unrecognized mount option)
in remount.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-25 18:44:56 -07:00
YueHaibing
7d20b6a272 nfsd: remove set but not used variable 'dirp'
Fixes gcc '-Wunused-but-set-variable' warning:

fs/nfsd/vfs.c: In function 'nfsd_create':
fs/nfsd/vfs.c:1279:16: warning:
 variable 'dirp' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:35:13 -04:00
Olga Kornievskaia
e0639dc580 NFSD introduce async copy feature
Upon receiving a request for async copy, create a new kthread.  If we
get asynchronous request, make sure to copy the needed arguments/state
from the stack before starting the copy. Then start the thread and reply
back to the client indicating copy is asynchronous.

nfsd_copy_file_range() will copy in a loop over the total number of
bytes is needed to copy. In case a failure happens in the middle, we
ignore the error and return how much we copied so far. Once done
creating a workitem for the callback workqueue and send CB_OFFLOAD with
the results.

The lifetime of the copy stateid is bound to the vfs copy. This way we
don't need to keep the nfsd_net structure for the callback.  We could
keep it around longer so that an OFFLOAD_STATUS that came late would
still get results, but clients should be able to deal without that.

We handle OFFLOAD_CANCEL by sending a signal to the copy thread and
calling kthread_stop.

A client should cancel any ongoing copies before calling DESTROY_CLIENT;
if not, we return a CLIENT_BUSY error.

If the client is destroyed for some other reason (lease expiration, or
server shutdown), we must clean up any ongoing copies ourselves.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
[colin.king@canonical.com: fix leak in error case]
[bfields@fieldses.org: remove signalling, merge patches]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Olga Kornievskaia
885e2bf3ea NFSD OFFLOAD_CANCEL xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Olga Kornievskaia
6308bc98e8 NFSD OFFLOAD_STATUS xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Olga Kornievskaia
9eb190fca8 NFSD CB_OFFLOAD xdr
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-09-25 20:34:54 -04:00
Greg Kroah-Hartman
a38523185b libnvdimm/dax for 4.19-rc6
* (2) fixes for the dax error handling updates that were merged for
 v4.19-rc1. My mails to Al have been bouncing recently, so I do not have
 his ack but the uaccess change is of the trivial / obviously correct
 variety. The address_space_operations fixes a regression.
 
 * A filesystem-dax fix to correct the zero page lookup to be compatible
  with non-x86 (mips and s390) architectures.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbqoecAAoJEB7SkWpmfYgCaPYP/1Pf2ADt0pOskSk0ixM06UI9
 1lR2g2/ICMc505+oB+wUv9TkZOh9jcIS9o8GfLhgNvP7AU4woRvudeyr4NUc0mdw
 rtHRA6TIimbXa3+O2qMg4gqUjXRxj6urQp5oeQi8mQ/vefZv1aisEw/Klae8klVC
 HGoMFii19WGXXyPM2vNUb2L+JGZt1p/nl/Z8ydPavn1XkIIGb7c+MivDiaemjjgT
 487TmFULgLVhTCtQXlhkH7UCcCQ3+l3yxKaS1/g2hFpWE4LncBIvq8XBPwf5RQSL
 H/5rH/sd30XR2L0NMERxr0ENvCJf2iIf4bIqckODN4L9ojRE8zmBZMsSeRKmHufm
 3ZfTBLHjPUwwKWy7PlKSsFk2J8KjErsqRlfZQSMSJShpEgL1jCjYtuTEtupaegbU
 v8TwzsNDgJ1B6JuKJ7lh7hOF7vUQ/i65xG8SFACvNoiih8RGSW3llra442k2hmFu
 IEMXa9S4tvqHfXUb0J/6hLLi+xoV+KsYPWRiCuovy7t6EfAWUnNuGCldjfsQtZZv
 npHS7F7lkWlSCneDbE4cMdkkwjBKjAw0sjIWDrPVVCoITVe1j9+bEwE9reX1VOS+
 L+PB/WgcVH72MeiQnmPPTcUyEmNgCbku7NEhwrMHJngxuo9HmXE+BN8jdFDYZFfL
 uWDV25XOxviOC9xBosw4
 =mAaH
 -----END PGP SIGNATURE-----

erge tag 'libnvdimm-fixes-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Dan writes:
  "libnvdimm/dax for 4.19-rc6

  * (2) fixes for the dax error handling updates that were merged for
  v4.19-rc1. My mails to Al have been bouncing recently, so I do not have
  his ack but the uaccess change is of the trivial / obviously correct
  variety. The address_space_operations fixes a regression.

  * A filesystem-dax fix to correct the zero page lookup to be compatible
   with non-x86 (mips and s390) architectures."

* tag 'libnvdimm-fixes-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  device-dax: Add missing address_space_operations
  uaccess: Fix is_source param for check_copy_size() in copy_to_iter_mcsafe()
  filesystem-dax: Fix use of zero page
2018-09-25 21:37:41 +02:00
Wei Yongjun
69383c5913 ovl: make symbol 'ovl_aops' static
Fixes the following sparse warning:

fs/overlayfs/inode.c:507:39: warning:
 symbol 'ovl_aops' was not declared. Should it be static?

Fixes: 5b910bd615 ("ovl: fix GPF in swapfile_activate of file from overlayfs over xfs")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-25 20:41:23 +02:00
Chengguang Xu
2aad26fa0a ext2: remove redundant building macro check
If macro CONFIG_QUOTA is not enabled then mount option flag
of usrquota/grpquota will not be set, so we can remove some
building macro check safely in ext2_shwo_options().
Additionally, I think it's better to define EXT2_MOUNT_DAX
regardless macro CONFIG_FS_DAX is enabled just like acl/xattr.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-24 21:34:15 +02:00
Amir Goldstein
a725356b66 vfs: swap names of {do,vfs}_clone_file_range()
Commit 031a072a0b ("vfs: call vfs_clone_file_range() under freeze
protection") created a wrapper do_clone_file_range() around
vfs_clone_file_range() moving the freeze protection to former, so
overlayfs could call the latter.

The more common vfs practice is to call do_xxx helpers from vfs_xxx
helpers, where freeze protecction is taken in the vfs_xxx helper, so
this anomality could be a source of confusion.

It seems that commit 8ede205541 ("ovl: add reflink/copyfile/dedup
support") may have fallen a victim to this confusion -
ovl_clone_file_range() calls the vfs_clone_file_range() helper in the
hope of getting freeze protection on upper fs, but in fact results in
overlayfs allowing to bypass upper fs freeze protection.

Swap the names of the two helpers to conform to common vfs practice
and call the correct helpers from overlayfs and nfsd.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
d9d150ae50 ovl: fix freeze protection bypass in ovl_clone_file_range()
Tested by doing clone on overlayfs while upper xfs+reflink is frozen:

  xfs_io -f /ovl/y
                             fsfreeze -f /xfs
  xfs_io> reflink /ovl/x

Before the fix xfs_io enters xfs_reflink_remap_range() and blocks
in xfs_trans_alloc(). After the fix, xfs_io blocks outside xfs code
in ovl_clone_file_range().

Fixes: 8ede205541 ("ovl: add reflink/copyfile/dedup support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
898cc19d8a ovl: fix freeze protection bypass in ovl_write_iter()
Tested by re-writing to an open overlayfs file while upper ext4 is frozen:

  xfs_io -f /ovl/x
  xfs_io> pwrite 0 4096
                             fsfreeze -f /ext4
  xfs_io> pwrite 0 4096

  WARNING: CPU: 0 PID: 1492 at fs/ext4/ext4_jbd2.c:53 \
           ext4_journal_check_start+0x48/0x82

After the fix, the second write blocks in ovl_write_iter() and avoids
hitting WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE) in
ext4_journal_check_start().

Fixes: 2a92e07edc ("ovl: add ovl_write_iter()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Amir Goldstein
63e1325280 ovl: fix memory leak on unlink of indexed file
The memory leak was detected by kmemleak when running xfstests
overlay/051,053

Fixes: caf70cb2ba ("ovl: cleanup orphan index entries")
Cc: <stable@vger.kernel.org> # v4.13
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-24 10:54:01 +02:00
Dennis Zhou (Facebook)
bdc2491708 blkcg: associate writeback bios with a blkg
One of the goals of this series is to remove a separate reference to
the css of the bio. This can and should be accessed via bio_blkcg. In
this patch, the wbc_init_bio call is changed such that it must be called
after a queue has been associated with the bio.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-21 20:29:11 -06:00
Greg Kroah-Hartman
0eba8697bc This pull request contains fixes for UBIFS:
- A wrong UBIFS assertion in mount code
 - Fix for a NULL pointer deref in mount code
 - Revert of a bad fix for xattrs
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlukuf8WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wRnyD/40jc4PA1pP3dJcXqdqwSeN+vJZ
 YWu307rE+giWWg6vRKa8zu3z614579UC7EE/I9miSrPNPYCi9VTLNzJGNMvn8aSf
 8seMrbhW0kz29T4YZ5fESr6x3hMsSdmIWZkPobruul8oBGmNcUm3Ag3MBM4XbQTm
 5AcWUdXVQUXPROK/Xm04hNa5pJ6wkSu5CxA+mf9YM2ZYYPPdCblxxq3NsDX99vDw
 gZup7JndSzpk+IpZMQIEKUP83vh5YgCU06YPjXi5YIfXl+sQnpH0fyrWm3+1C0Yz
 4T80rd3fm+7+obXdAI29mhcoZ09fGayozfqjU/fV3edW43+pUvRWced13vuLqz9Q
 JannCFeN+v/nB6OPlj4JGkgNunZ5M3r6ZZIQBbY9RJuZgG0/FMbpYb62RY1z/jg9
 YA7e3J/R02i7tHnPbcIz6ngKE0c42VKxTdATaybXVunx5ZPip7txj9OeJbfIYUrr
 CbLMdiqIJUAAheHMUrTiF3jDNwEiMhBZA4dAGDvLmpLuyfMlN8Lwje9BdRd5VNKp
 zwEGUQkDWLxniFYLtmbceFSa4IQT6z70XJn1pqdDvgjxfp8trEqhtp9GMxA2u6TA
 adcug5gUzRWMQ2QuyoKIv6tkjmnII5N2KzhZH9hwBReRJUlKY9WQ9jowcGrPXhbw
 F2prIwBxp8drneOeKg==
 =CGLq
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs

Richard writes:
  "This pull request contains fixes for UBIFS:
   - A wrong UBIFS assertion in mount code
   - Fix for a NULL pointer deref in mount code
   - Revert of a bad fix for xattrs"

* tag 'upstream-4.19-rc4' of git://git.infradead.org/linux-ubifs:
  Revert "ubifs: xattr: Don't operate on deleted inodes"
  ubifs: drop false positive assertion
  ubifs: Check for name being NULL while mounting
2018-09-21 15:29:44 +02:00
Chao Yu
dc4cd1257c f2fs: fix to recover inode's uid/gid during POR
Step to reproduce this bug:
1. logon as root
2. mount -t f2fs /dev/sdd /mnt;
3. touch /mnt/file;
4. chown system /mnt/file; chgrp system /mnt/file;
5. xfs_io -f /mnt/file -c "fsync";
6. godown /mnt;
7. umount /mnt;
8. mount -t f2fs /dev/sdd /mnt;

After step 8) we will expect file's uid/gid are all system, but during
recovery, these two fields were not been recovered, fix it.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 14:50:25 -07:00
Jaegeuk Kim
f84262b086 f2fs: avoid infinite loop in f2fs_alloc_nid
If we have an error in f2fs_build_free_nids, we're able to fall into a loop
to find free nids.

Suggested-by: Chao Yu <chao@kernel.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-20 14:49:19 -07:00
Junxiao Bi
234b69e3e0 ocfs2: fix ocfs2 read block panic
While reading block, it is possible that io error return due to underlying
storage issue, in this case, BH_NeedsValidate was left in the buffer head.
Then when reading the very block next time, if it was already linked into
journal, that will trigger the following panic.

[203748.702517] kernel BUG at fs/ocfs2/buffer_head_io.c:342!
[203748.702533] invalid opcode: 0000 [#1] SMP
[203748.702561] Modules linked in: ocfs2 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sunrpc dm_switch dm_queue_length dm_multipath bonding be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i iw_cxgb4 cxgb4 cxgb3i libcxgbi iw_cxgb3 cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipmi_devintf iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif i2c_core ipmi_si ipmi_msghandler acpi_pad pcspkr sb_edac edac_core lpc_ich mfd_core shpchp sg tg3 ptp pps_core ext4 jbd2 mbcache2 sr_mod cdrom sd_mod ahci libahci megaraid_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[203748.703024] CPU: 7 PID: 38369 Comm: touch Not tainted 4.1.12-124.18.6.el6uek.x86_64 #2
[203748.703045] Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.2 01/28/2015
[203748.703067] task: ffff880768139c00 ti: ffff88006ff48000 task.ti: ffff88006ff48000
[203748.703088] RIP: 0010:[<ffffffffa05e9f09>]  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.703130] RSP: 0018:ffff88006ff4b818  EFLAGS: 00010206
[203748.703389] RAX: 0000000008620029 RBX: ffff88006ff4b910 RCX: 0000000000000000
[203748.703885] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00000000023079fe
[203748.704382] RBP: ffff88006ff4b8d8 R08: 0000000000000000 R09: ffff8807578c25b0
[203748.704877] R10: 000000000f637376 R11: 000000003030322e R12: 0000000000000000
[203748.705373] R13: ffff88006ff4b910 R14: ffff880732fe38f0 R15: 0000000000000000
[203748.705871] FS:  00007f401992c700(0000) GS:ffff880bfebc0000(0000) knlGS:0000000000000000
[203748.706370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[203748.706627] CR2: 00007f4019252440 CR3: 00000000a621e000 CR4: 0000000000060670
[203748.707124] Stack:
[203748.707371]  ffff88006ff4b828 ffffffffa0609f52 ffff88006ff4b838 0000000000000001
[203748.707885]  0000000000000000 0000000000000000 ffff880bf67c3800 ffffffffa05eca00
[203748.708399]  00000000023079ff ffffffff81c58b80 0000000000000000 0000000000000000
[203748.708915] Call Trace:
[203748.709175]  [<ffffffffa0609f52>] ? ocfs2_inode_cache_io_unlock+0x12/0x20 [ocfs2]
[203748.709680]  [<ffffffffa05eca00>] ? ocfs2_empty_dir_filldir+0x80/0x80 [ocfs2]
[203748.710185]  [<ffffffffa05ec0cb>] ocfs2_read_dir_block_direct+0x3b/0x200 [ocfs2]
[203748.710691]  [<ffffffffa05f0fbf>] ocfs2_prepare_dx_dir_for_insert.isra.57+0x19f/0xf60 [ocfs2]
[203748.711204]  [<ffffffffa065660f>] ? ocfs2_metadata_cache_io_unlock+0x1f/0x30 [ocfs2]
[203748.711716]  [<ffffffffa05f4f3a>] ocfs2_prepare_dir_for_insert+0x13a/0x890 [ocfs2]
[203748.712227]  [<ffffffffa05f442e>] ? ocfs2_check_dir_for_entry+0x8e/0x140 [ocfs2]
[203748.712737]  [<ffffffffa061b2f2>] ocfs2_mknod+0x4b2/0x1370 [ocfs2]
[203748.713003]  [<ffffffffa061c385>] ocfs2_create+0x65/0x170 [ocfs2]
[203748.713263]  [<ffffffff8121714b>] vfs_create+0xdb/0x150
[203748.713518]  [<ffffffff8121b225>] do_last+0x815/0x1210
[203748.713772]  [<ffffffff812192e9>] ? path_init+0xb9/0x450
[203748.714123]  [<ffffffff8121bca0>] path_openat+0x80/0x600
[203748.714378]  [<ffffffff811bcd45>] ? handle_pte_fault+0xd15/0x1620
[203748.714634]  [<ffffffff8121d7ba>] do_filp_open+0x3a/0xb0
[203748.714888]  [<ffffffff8122a767>] ? __alloc_fd+0xa7/0x130
[203748.715143]  [<ffffffff81209ffc>] do_sys_open+0x12c/0x220
[203748.715403]  [<ffffffff81026ddb>] ? syscall_trace_enter_phase1+0x11b/0x180
[203748.715668]  [<ffffffff816f0c9f>] ? system_call_after_swapgs+0xe9/0x190
[203748.715928]  [<ffffffff8120a10e>] SyS_open+0x1e/0x20
[203748.716184]  [<ffffffff816f0d5e>] system_call_fastpath+0x18/0xd7
[203748.716440] Code: 00 00 48 8b 7b 08 48 83 c3 10 45 89 f8 44 89 e1 44 89 f2 4c 89 ee e8 07 06 11 e1 48 8b 03 48 85 c0 75 df 8b 5d c8 e9 4d fa ff ff <0f> 0b 48 8b 7d a0 e8 dc c6 06 00 48 b8 00 00 00 00 00 00 00 10
[203748.717505] RIP  [<ffffffffa05e9f09>] ocfs2_read_blocks+0x669/0x7f0 [ocfs2]
[203748.717775]  RSP <ffff88006ff4b818>

Joesph ever reported a similar panic.
Link: https://oss.oracle.com/pipermail/ocfs2-devel/2013-May/008931.html

Link: http://lkml.kernel.org/r/20180912063207.29484-1-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Changwei Ge <ge.changwei@h3c.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:12 +02:00
Dominique Martinet
a1b3d2f217 fs/proc/kcore.c: fix invalid memory access in multi-page read optimization
The 'm' kcore_list item could point to kclist_head, and it is incorrect to
look at m->addr / m->size in this case.

There is no choice but to run through the list of entries for every
address if we did not find any entry in the previous iteration

Reset 'm' to NULL in that case at Omar Sandoval's suggestion.

[akpm@linux-foundation.org: add comment]
Link: http://lkml.kernel.org/r/1536100702-28706-1-git-send-email-asmadeus@codewreck.org
Fixes: bf991c2231 ("proc/kcore: optimize multiple page reads")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Omar Sandoval <osandov@osandov.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-20 22:01:11 +02:00
Richard Weinberger
f061c1cc40 Revert "ubifs: xattr: Don't operate on deleted inodes"
This reverts commit 11a6fc3dc7.
UBIFS wants to assert that xattr operations are only issued on files
with positive link count. The said patch made this operations return
-ENOENT for unlinked files such that the asserts will no longer trigger.
This was wrong since xattr operations are perfectly fine on unlinked
files.
Instead the assertions need to be fixed/removed.

Cc: <stable@vger.kernel.org>
Fixes: 11a6fc3dc7 ("ubifs: xattr: Don't operate on deleted inodes")
Reported-by: Koen Vandeputte <koen.vandeputte@ncentric.com>
Tested-by: Joel Stanley <joel@jms.id.au>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:41 +02:00
Sascha Hauer
d3bdc016c5 ubifs: drop false positive assertion
The following sequence triggers

	ubifs_assert(c, c->lst.taken_empty_lebs > 0);

at the end of ubifs_remount_fs():

mount -t ubifs /dev/ubi0_0 /mnt
echo 1 > /sys/kernel/debug/ubifs/ubi0_0/ro_error
umount /mnt
mount -t ubifs -o ro /dev/ubix_y /mnt
mount -o remount,ro /mnt

The resulting

UBIFS assert failed in ubifs_remount_fs at 1878 (pid 161)

is a false positive. In the case above c->lst.taken_empty_lebs has
never been changed from its initial zero value. This will only happen
when the deferred recovery is done.

Fix this by doing the assertion only when recovery has been done
already.

Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:07 +02:00
Richard Weinberger
37f31b6ca4 ubifs: Check for name being NULL while mounting
The requested device name can be NULL or an empty string.
Check for that and refuse to continue. UBIFS has to do this manually
since we cannot use mount_bdev(), which checks for this condition.

Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Reported-by: syzbot+38bd0f7865e5c6379280@syzkaller.appspotmail.com
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-09-20 21:37:07 +02:00
Chao Yu
1390643d1d jfs: remove redundant dquot_initialize() in jfs_evict_inode()
We don't need to call dquot_initialize() twice in jfs_evict_inode(),
remove one of them for cleanup.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-20 09:28:49 -05:00
Sahitya Tummala
a7d10cf3e4 f2fs: add new idle interval timing for discard and gc paths
This helps to control the frequency of submission of discard and
GC requests independently, based on the need.

Suggested-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-19 15:55:14 -07:00
Toshi Kani
9e796c9db9 ext2, dax: set ext2_dax_aops for dax files
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files.  This is because
'ext2_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext2_dax_aops', since S_DAX flag is set after
ext2_set_aops() in the open path.

Similar to ext4, change ext2_iget() to initialize i_flags before
ext2_set_aops().

Fixes: fb094c9074 ("ext2, dax: introduce ext2_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-19 15:03:04 +02:00
Andreas Gruenbacher
ffc4c92227 sysfs: Do not return POSIX ACL xattrs via listxattr
Commit 786534b92f introduced a regression that caused listxattr to
return the POSIX ACL attribute names even though sysfs doesn't support
POSIX ACLs.  This happens because simple_xattr_list checks for NULL
i_acl / i_default_acl, but inode_init_always initializes those fields
to ACL_NOT_CACHED ((void *)-1).  For example:
    $ getfattr -m- -d /sys
    /sys: system.posix_acl_access: Operation not supported
    /sys: system.posix_acl_default: Operation not supported
Fix this in simple_xattr_list by checking if the filesystem supports POSIX ACLs.

Fixes: 786534b92f ("tmpfs: listxattr should include POSIX ACL xattrs")
Reported-by:  Marc Aurèle La France <tsi@tuyoix.net>
Tested-by: Marc Aurèle La France <tsi@tuyoix.net>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-18 07:30:48 -04:00
Greg Kroah-Hartman
ad3273d5f1 Various ext4 bug fixes; primarily making ext4 more robust against
maliciously crafted file systems, and some DAX fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlufGncACgkQ8vlZVpUN
 gaPwuQf9FKp9yRvjBkjtnH3+s4Ps8do9r067+90y1k2DJMxKoaBUhGSW2MJJ04j+
 5F6Ndp/TZHw+LfPnzsqlrAAoP3CG5+kacfJ7xeVKR0umvACm6rLMsCUct7/rFoSl
 PgzCALFIJvQ9+9shuO9qrgmjJrfrlTVUgR9Mu3WUNEvMFbMjk3FMI8gi5kjjWemE
 G9TDYH2lMH2sL0cWF51I2gOyNXOXrihxe+vP7j6i/rUkV+YLpKZhE1ss3Sfn6pR2
 p/KjnXdupLJpgYLJne9kMrq2r8xYmDfA0S+Dec7nkox5FUOFUHssl3+q8C7cDwO9
 zl6VyVFwybjFRJ/Y59wox6eqVPlIWw==
 =1P1w
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Ted writes:
	Various ext4 bug fixes; primarily making ext4 more robust against
	maliciously crafted file systems, and some DAX fixes.

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4, dax: set ext4_dax_aops for dax files
  ext4, dax: add ext4_bmap to ext4_dax_aops
  ext4: don't mark mmp buffer head dirty
  ext4: show test_dummy_encryption mount option in /proc/mounts
  ext4: close race between direct IO and ext4_break_layouts()
  ext4: fix online resizing for bigalloc file systems with a 1k block size
  ext4: fix online resize's handling of a too-small final block group
  ext4: recalucate superblock checksum after updating free blocks/inodes
  ext4: avoid arithemetic overflow that can trigger a BUG
  ext4: avoid divide by zero fault when deleting corrupted inline directories
  ext4: check to make sure the rename(2)'s destination is not freed
  ext4: add nonstring annotations to ext4.h
2018-09-17 09:13:47 +02:00
Bernd Edlinger
a75e78f21f kernfs: Fix range checks in kernfs_get_target_path
The terminating NUL byte is only there because the buffer is
allocated with kzalloc(PAGE_SIZE, GFP_KERNEL), but since the
range-check is off-by-one, and PAGE_SIZE==PATH_MAX, the
returned string may not be zero-terminated if it is exactly
PATH_MAX characters long.  Furthermore also the initial loop
may theoretically exceed PATH_MAX and cause a fault.

Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-16 22:38:57 +02:00
Toshi Kani
cce6c9f7e6 ext4, dax: set ext4_dax_aops for dax files
Sync syscall to DAX file needs to flush processor cache, but it
currently does not flush to existing DAX files.  This is because
'ext4_da_aops' is set to address_space_operations of existing DAX
files, instead of 'ext4_dax_aops', since S_DAX flag is set after
ext4_set_aops() in the open path.

  New file
  --------
  lookup_open
    ext4_create
      __ext4_new_inode
        ext4_set_inode_flags   // Set S_DAX flag
      ext4_set_aops            // Set aops to ext4_dax_aops

  Existing file
  -------------
  lookup_open
    ext4_lookup
      ext4_iget
        ext4_set_aops          // Set aops to ext4_da_aops
        ext4_set_inode_flags   // Set S_DAX flag

Change ext4_iget() to initialize i_flags before ext4_set_aops().

Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2018-09-15 21:37:59 -04:00
Toshi Kani
94dbb63117 ext4, dax: add ext4_bmap to ext4_dax_aops
Ext4 mount path calls .bmap to the journal inode. This currently
works for the DAX mount case because ext4_iget() always set
'ext4_da_aops' to any regular files.

In preparation to fix ext4_iget() to set 'ext4_dax_aops' for ext4
DAX files, add ext4_bmap() to 'ext4_dax_aops', since bmap works for
DAX inodes.

Fixes: 5f0663bb4a ("ext4, dax: introduce ext4_dax_aops")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
2018-09-15 21:23:41 -04:00
Li Dongyang
fe18d64989 ext4: don't mark mmp buffer head dirty
Marking mmp bh dirty before writing it will make writeback
pick up mmp block later and submit a write, we don't want the
duplicate write as kmmpd thread should have full control of
reading and writing the mmp block.
Another reason is we will also have random I/O error on
the writeback request when blk integrity is enabled, because
kmmpd could modify the content of the mmp block(e.g. setting
new seq and time) while the mmp block is under I/O requested
by writeback.

Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@vger.kernel.org
2018-09-15 17:11:25 -04:00
Eric Biggers
338affb548 ext4: show test_dummy_encryption mount option in /proc/mounts
When in effect, add "test_dummy_encryption" to _ext4_show_options() so
that it is shown in /proc/mounts and other relevant procfs files.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-15 14:28:26 -04:00
Linus Torvalds
3a5af36b6d fixes for four CIFS/SMB3 potential pointer overflow issues, one minor build fix, and a build warning cleanup
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAlucKDwACgkQiiy9cAdy
 T1EhEgwAqgVXTujce2UVPtaFY/MaGmaIwAimh+aYOCAADxLYJHkjtRzHd5PQgf+L
 n55R2hFcMeelWxOMEb/aRmxIKLk8fmJYVWClM2+S7Z79M3GHexDbMS8+oDZnzCTB
 EknvaTbi+vOt4HGABkbJ/jiQCgonmeobon30gLWaYa3XGeYc7ZV5gR+EXL9xSdvh
 +I+x3rSDpm8fQ5njkB7RKgfB+ha4NQqZ6dXlYQzcb0vMO3/lhQ56Ypgn95Jlu5UW
 pcLxUFE1do+JeGvIU1it2SCRJ5499g180Rxucl7X1xFBQ44Qss9QOeWkFTZ8784V
 PIYVZMTUqO0Km4H22qXD8lIY5GjDuAXLYM3AddFkJpbKaw6g++ZsUXSfoA5zRIDn
 10edaPK/hDIQeFaV+ySTN5g/Qh3YFnmY4kDL3t3CRZDe4+DTW/+cmrF3sGkhZDQt
 +nDo0JxJsjNnJW6loB5Lb76lygvsng01owSsYSAChjhYwBvCDgp9/D85pDQ/oYkl
 HKD9tiF3
 =YNSx
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc3-smb3-cifs' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Fixes for four CIFS/SMB3 potential pointer overflow issues, one minor
  build fix, and a build warning cleanup"

* tag '4.19-rc3-smb3-cifs' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: read overflow in is_valid_oplock_break()
  cifs: integer overflow in in SMB2_ioctl()
  CIFS: fix wrapping bugs in num_entries()
  cifs: prevent integer overflow in nxt_dir_entry()
  fs/cifs: require sha512
  fs/cifs: suppress a string overflow warning
2018-09-14 19:33:42 -10:00
Linus Torvalds
589109df31 NFS client bugfixes for Linux 4.19
Stable bugfixes:
 - v4.17+: Fix a tracepoint Oops in initiate_file_draining()
 - v4.17+: Fix a tracepoint Oops in initiate_file_draining()
 - v4.11+: Fix an infinite loop on I/O
 
 Other fixes:
 - Return errors if a waiting layoutget is killed
 - Don't open code clearing of delegation state
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlucGhMACgkQ18tUv7Cl
 QOsMCA/9ET0bbzus25DWcnbT10bpEdtW4p6dR/2ztGqUGe0FVyCyVT70jBKFbnAL
 cO1pqElKLj7TMPhTsxK63dDxXGELXCqDtmsHELaD8jf0h6270KAPariJBQ5+N0ud
 5U5CXswW/zbQekE9GMTDQtAAGBzfht33PavFt2+5oVYTAQ5K6Pwvq2qMoifQxMlk
 wjtVjypz0QjBy5bHBO6XGxX58JIc23EwA0/KDS4cU3vkwDXmEZcVYIUdqJF4gtCz
 JdJdnT4b9ebtbdHENx8rkot3L1VSx6JfW9pPMvxLjxn8IG1rj7zQXtc7kpnoF8RY
 WVGWuf4rn7Zo7YXf11SXNebMYgrljx5/0KcmUtSgSmCCqVUmY1e/a69e0fhkKfxn
 /W/+fYIdC1wG0JXtrCN8eJbGropYj3B8Ln5TBV4LN91hFMI8Lx/4D1lKLoK7RNLJ
 3Q6VmKIhKVMHgK6eAivWyN7X/WzIvAj37a2ix2xSjENUIP+l3ePAekc4f5XGMe8O
 wx6wxgvVSomVrsM565XGcjw+LXUyzlXowS6JhR+Zn6fYmbDkPk0ZMC5HBFidakkB
 YxO7aWjBvyYinOrBMWODeWatt4q50bI6YtQpxxWyDIdRXWbvQDjBILveTh6/sRGQ
 KbA3r6XC/DSMri4Mo/r+92fbBEYV2otnkkYraz04VgBVYYgiJts=
 =C83+
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "These are a handful of fixes for problems that Trond found. Patch #1
  and #3 have the same name, a second issue was found after applying the
  first patch.

  Stable bugfixes:
   - v4.17+: Fix tracepoint Oops in initiate_file_draining()
   - v4.11+: Fix an infinite loop on I/O

  Other fixes:
   - Return errors if a waiting layoutget is killed
   - Don't open code clearing of delegation state"

* tag 'nfs-for-4.19-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Don't open code clearing of delegation state
  NFSv4.1 fix infinite loop on I/O.
  NFSv4: Fix a tracepoint Oops in initiate_file_draining()
  pNFS: Ensure we return the error if someone kills a waiting layoutget
  NFSv4: Fix a tracepoint Oops in initiate_file_draining()
2018-09-14 19:25:28 -10:00
Trond Myklebust
9f0c5124f4 NFS: Don't open code clearing of delegation state
Add a helper for the case when the nfs4 open state has been set to use
a delegation stateid, and we want to revert to using the open stateid.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:27 -04:00
Trond Myklebust
994b15b983 NFSv4.1 fix infinite loop on I/O.
The previous fix broke recovery of delegated stateids because it assumes
that if we did not mark the delegation as suspect, then the delegation has
effectively been revoked, and so it removes that delegation irrespectively
of whether or not it is valid and still in use. While this is "mostly
harmless" for ordinary I/O, we've seen pNFS fail with LAYOUTGET spinning
in an infinite loop while complaining that we're using an invalid stateid
(in this case the all-zero stateid).

What we rather want to do here is ensure that the delegation is always
correctly marked as needing testing when that is the case. So we want
to close the loophole offered by nfs4_schedule_stateid_recovery(),
which marks the state as needing to be reclaimed, but not the
delegation that may be backing it.

Fixes: 0e3d3e5df0 ("NFSv4.1 fix infinite loop on IO BAD_STATEID error")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.11+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:11 -04:00
Trond Myklebust
2edaead69e NFSv4: Fix a tracepoint Oops in initiate_file_draining()
Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
change the test in the tracepoint.

Fixes: ce5624f7e6 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:08 -04:00
Trond Myklebust
d03360aaf5 pNFS: Ensure we return the error if someone kills a waiting layoutget
If someone interrupts a wait on one or more outstanding layoutgets in
pnfs_update_layout() then return the ERESTARTSYS/EINTR error.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:24:08 -04:00
Trond Myklebust
2a534a7473 NFSv4: Fix a tracepoint Oops in initiate_file_draining()
Now that the value of 'ino' can be NULL or an ERR_PTR(), we need to
change the test in the tracepoint.

Fixes: ce5624f7e6 ("NFSv4: Return NFS4ERR_DELAY when a layout fails...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.17+
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-09-14 16:23:16 -04:00
Al Viro
e21120383f move compat handling of tty ioctls to tty_compat_ioctl()
ioctls that are
	* callable only via tty_ioctl()
	* not driver-specific
	* not demand data structure conversions
	* either always need passing arg as is or always demand compat_ptr()
get intercepted in tty_compat_ioctl() from the very beginning and
redirecter to tty_ioctl().  As the result, their entries in fs/compat_ioctl.c
(some of those had been missing, BTW) got removed, as well as
n_tty_compat_ioctl_helper() (now it's never called with any cmd it would accept).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-14 11:12:17 -04:00
Linus Torvalds
48751b562b overlayfs fixes for 4.19-rc4
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW5qpOgAKCRDh3BK/laaZ
 PDCQAQCIKLg0aLeWOkfUO76mBjlp5srKgJfrqpFoyuozD6l2fQEAl/W2x9NOduV+
 PK4sCYMT8SpI0hMrbv9P4zZ683kmaA8=
 =RnZU
 -----END PGP SIGNATURE-----

Merge tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs fixes from Miklos Szeredi:
 "This fixes a regression in the recent file stacking update, reported
  and fixed by Amir Goldstein. The fix is fairly trivial, but involves
  adding a fadvise() f_op and the associated churn in the vfs. As
  discussed on -fsdevel, there are other possible uses for this method,
  than allowing proper stacking for overlays.

  And there's one other fix for a syzkaller detected oops"

* tag 'ovl-fixes-4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
  ovl: fix oopses in ovl_fill_super() failure paths
  ovl: add ovl_fadvise()
  vfs: implement readahead(2) using POSIX_FADV_WILLNEED
  vfs: add the fadvise() file operation
  Documentation/filesystems: update documentation of file_operations
  ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
  ovl: respect FIEMAP_FLAG_SYNC flag
2018-09-13 19:21:40 -10:00
Linus Torvalds
145ea6f10d pstore fixes for v4.19-rc4:
- Handle page-vs-byte offset handling between iomap and vmap (Bin Yang)
 -----BEGIN PGP SIGNATURE-----
 Comment: Kees Cook <kees@outflux.net>
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAluakCIWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJnq1D/9a9PFwxBflN3Wi9G2SMJuCF8y1
 vb+auPyMtk2gpnDnE/GbumwLfB/ZW/1JSBNncCmSnFOjeK3M0m4e0gz0XU2AQaT6
 BTNFZlVe649a6zct59nEQRILG1aPP5UdKv4eE2mJ+XOFEi3DSTMgtsmFhftrngGZ
 Gp8XYaNgf6JIL25fUtfgZd5YiwbR+ZdLZdcd+qlIliJF80rup/QftdgRsPJk0kUS
 b8yE6sWNnP1NDzlUWcAWXy231lewbbUsdYAQ11RLThl8Glogz3hlxLKIbCuKFymd
 IGwSyShnZFczO6qN6Rcvx5o8PIgbTH+1ntQc3s3PNuXOpOmOk7F2YspkAEqLY3//
 4Gy+jphSmg12JAzPy7B1m0UT8nCE0S9qfi7tnQ0Dti0rvu+vS1dyqFlqbTgN9Hb3
 Wfi8u/rNvUT5XqB0afq5v7U6On7mjHfW6PwcwqZJZtEblERqv+DOgRUPjWkack6E
 x64IA0UzlEX2AlWHiBxGcwd55kIvk+xI94VVy2K8QcbWlGeM5Ij7WDMABTkmJn5L
 3I8EwoIpg9qtoabKOywOQuj6m9Dovvo0Dv8juUtHARBUJZC0fEpakTdPZEQoiSBs
 7Vox1UTzY+QqJch1ayEvjmqJNkPsAor7zqfVX4Ub85tuECAEs+MtQDbZCZUTXR7I
 WN5N1QBDpuwC8tupCg==
 =8krP
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore fix from Kees Cook:
 "This fixes a 6 year old pstore bug that everyone just got lucky in
  avoiding, likely due only using page-aligned persistent ram regions:

   - Handle page-vs-byte offset handling between iomap and vmap (Bin Yang)"

* tag 'pstore-v4.19-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: Fix incorrect persistent ram buffer mapping
2018-09-13 18:52:41 -10:00
Bin Yang
831b624df1 pstore: Fix incorrect persistent ram buffer mapping
persistent_ram_vmap() returns the page start vaddr.
persistent_ram_iomap() supports non-page-aligned mapping.

persistent_ram_buffer_map() always adds offset-in-page to the vaddr
returned from these two functions, which causes incorrect mapping of
non-page-aligned persistent ram buffer.

By default ftrace_size is 4096 and max_ftrace_cnt is nr_cpu_ids. Without
this patch, the zone_sz in ramoops_init_przs() is 4096/nr_cpu_ids which
might not be page aligned. If the offset-in-page > 2048, the vaddr will be
in next page. If the next page is not mapped, it will cause kernel panic:

[    0.074231] BUG: unable to handle kernel paging request at ffffa19e0081b000
...
[    0.075000] RIP: 0010:persistent_ram_new+0x1f8/0x39f
...
[    0.075000] Call Trace:
[    0.075000]  ramoops_init_przs.part.10.constprop.15+0x105/0x260
[    0.075000]  ramoops_probe+0x232/0x3a0
[    0.075000]  platform_drv_probe+0x3e/0xa0
[    0.075000]  driver_probe_device+0x2cd/0x400
[    0.075000]  __driver_attach+0xe4/0x110
[    0.075000]  ? driver_probe_device+0x400/0x400
[    0.075000]  bus_for_each_dev+0x70/0xa0
[    0.075000]  driver_attach+0x1e/0x20
[    0.075000]  bus_add_driver+0x159/0x230
[    0.075000]  ? do_early_param+0x95/0x95
[    0.075000]  driver_register+0x70/0xc0
[    0.075000]  ? init_pstore_fs+0x4d/0x4d
[    0.075000]  __platform_driver_register+0x36/0x40
[    0.075000]  ramoops_init+0x12f/0x131
[    0.075000]  do_one_initcall+0x4d/0x12c
[    0.075000]  ? do_early_param+0x95/0x95
[    0.075000]  kernel_init_freeable+0x19b/0x222
[    0.075000]  ? rest_init+0xbb/0xbb
[    0.075000]  kernel_init+0xe/0xfc
[    0.075000]  ret_from_fork+0x3a/0x50

Signed-off-by: Bin Yang <bin.yang@intel.com>
[kees: add comments describing the mapping differences, updated commit log]
Fixes: 24c3d2f342 ("staging: android: persistent_ram: Make it possible to use memory outside of bootmem")
Cc: stable@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-13 09:14:57 -07:00
Dan Carpenter
097f5863b1 cifs: read overflow in is_valid_oplock_break()
We need to verify that the "data_offset" is within bounds.

Reported-by: Dr Silvio Cesare of InfoSect <silvio.cesare@gmail.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2018-09-12 17:13:34 -05:00
Chao Yu
6f5c2ed0a2 f2fs: split IO error injection according to RW
This patch adds to support injecting error for write IO, this can simulate
IO error like fail_make_request or dm_flakey does.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:07:32 -07:00
Chao Yu
7c1a000d46 f2fs: add SPDX license identifiers
Remove the verbose license text from f2fs files and replace them with
SPDX tags.  This does not change the license of any of the code.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:07:10 -07:00
Chengguang Xu
4cb037ec3f f2fs: surround fault_injection related option parsing using CONFIG_F2FS_FAULT_INJECTION
It's a little bit strange when fault_injection related
options fail with -EINVAL which were already disabled
from config, so surround all fault_injection related option
parsing code using CONFIG_F2FS_FAULT_INJECTION. Meanwhile,
slightly change warning message to keep consistency with
option POSIX_ACL and FS_XATTR.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-12 13:05:33 -07:00
Arnd Bergmann
04b72322e8 media: dvb: move compat handlers into drivers
The VIDEO_STILLPICTURE is only implemented by one driver, while
VIDEO_GET_EVENT has two users in tree. In both cases, it is fairly
easy to handle the compat ioctls in the native handler rather
than relying on translation in fs/compat_ioctls.

In effect, this means that now the drivers implement both structure
layouts in both native and compat mode, but I don't see anything
wrong with that.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
8a24280b11 media: dvb: move most compat_ioctl handling into drivers
Most DVB audio and video ioctl commands are completely compatible,
and are implemented by just two drivers: ttpci and ivtv. In both
cases, we can use the same ioctl handler for both native and
compat ioctl handling, and remove the entries from the global
lookup table.

In case of ttpci, this directly hooks into the file_operations
structure, and for ivtv, we have to set the compat_ioctl32
method in v4l2_file_operations. For all I can tell, setting it
to video_ioctl2 will still do the right thing for all commands.

Note that for the VIDEO_STILLPICTURE and VIDEO_GET_EVENT commands,
a translation handler in fs/compat_ioctl.c is still used. This
works because the command numbers are different on 32-bit
systems.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
e6c8320648 media: cec: move compat_ioctl handling to cec-api.c
All the CEC ioctls are compatible, and they are only implemented
in one driver, so we can simply let this driver handle them
natively.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Hans Verkuil <hans.verkuil@cisco.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
b5d3206112 media: dvb: dmxdev: move compat_ioctl handling to dmxdev.c
All dmx ioctls are compatible, and they are only implemented
in one file, so we can replace the list of commands in
fs/compat_ioctl.c with a single line in dmxdev.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Arnd Bergmann
1ccbeeb888 media: dvb: fix compat ioctl translation
The VIDEO_GET_EVENT and VIDEO_STILLPICTURE was added back in 2005 but
it never worked because the command number is wrong.

Using the right command number means we have a better chance of them
actually doing the right thing, though clearly nobody has ever tried
it successfully.

I noticed these while auditing the remaining users of compat_time_t
for y2038 bugs. This one is fine in that regard, it just never did
anything.

Fixes: 6e87abd0b8 ("[DVB]: Add compat ioctl handling.")

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
2018-09-12 11:00:51 -04:00
Dan Carpenter
2d204ee9d6 cifs: integer overflow in in SMB2_ioctl()
The "le32_to_cpu(rsp->OutputOffset) + *plen" addition can overflow and
wrap around to a smaller value which looks like it would lead to an
information leak.

Fixes: 4a72dafa19 ("SMB2 FSCTL and IOCTL worker function")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:32:07 -05:00
Dan Carpenter
56446f218a CIFS: fix wrapping bugs in num_entries()
The problem is that "entryptr + next_offset" and "entryptr + len + size"
can wrap.  I ended up changing the type of "entryptr" because it makes
the math easier when we don't have to do so much casting.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:32:07 -05:00
Dan Carpenter
8ad8aa3535 cifs: prevent integer overflow in nxt_dir_entry()
The "old_entry + le32_to_cpu(pDirInfo->NextEntryOffset)" can wrap
around so I have added a check for integer overflow.

Reported-by: Dr Silvio Cesare of InfoSect <silvio.cesare@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
2018-09-12 09:27:57 -05:00
Matthew Wilcox
b90ca5cc32 filesystem-dax: Fix use of zero page
Use my_zero_pfn instead of ZERO_PAGE(), and pass the vaddr to it instead
of zero so it works on MIPS and s390 who reference the vaddr to select a
zero page.

Cc: <stable@vger.kernel.org>
Fixes: 91d25ba8a6 ("dax: use common 4k zero page for dax mmap reads")
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2018-09-11 21:27:44 -07:00
Wang Shilong
c8e927579e f2fs: fix setattr project check upon fssetxattr ioctl
Currently, project quota could be changed by fssetxattr
ioctl, and existed permission check inode_owner_or_capable()
is obviously not enough, just think that common users could
change project id of file, that could make users to
break project quota easily.

This patch try to follow same regular of xfs project
quota:

"Project Quota ID state is only allowed to change from
within the init namespace. Enforce that restriction only
if we are trying to change the quota ID state.
Everything else is allowed in user namespaces."

Besides that, check and set project id'state should
be an atomic operation, protect whole operation with
inode lock.

Signed-off-by: Wang Shilong <wshilong@ddn.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Zhikang Zhang
b430f72636 f2fs: avoid sleeping under spin_lock
In the call trace below, we might sleep in function dput().

So in order to avoid sleeping under spin_lock, we remove f2fs_mark_inode_dirty_sync
from __try_update_largest_extent && __drop_largest_extent.

BUG: sleeping function called from invalid context at fs/dcache.c:796
Call trace:
	dump_backtrace+0x0/0x3f4
	show_stack+0x24/0x30
	dump_stack+0xe0/0x138
	___might_sleep+0x2a8/0x2c8
	__might_sleep+0x78/0x10c
	dput+0x7c/0x750
	block_dump___mark_inode_dirty+0x120/0x17c
	__mark_inode_dirty+0x344/0x11f0
	f2fs_mark_inode_dirty_sync+0x40/0x50
	__insert_extent_tree+0x2e0/0x2f4
	f2fs_update_extent_tree_range+0xcf4/0xde8
	f2fs_update_extent_cache+0x114/0x12c
	f2fs_update_data_blkaddr+0x40/0x50
	write_data_page+0x150/0x314
	do_write_data_page+0x648/0x2318
	__write_data_page+0xdb4/0x1640
	f2fs_write_cache_pages+0x768/0xafc
	__f2fs_write_data_pages+0x590/0x1218
	f2fs_write_data_pages+0x64/0x74
	do_writepages+0x74/0xe4
	__writeback_single_inode+0xdc/0x15f0
	writeback_sb_inodes+0x574/0xc98
	__writeback_inodes_wb+0x190/0x204
	wb_writeback+0x730/0xf14
	wb_check_old_data_flush+0x1bc/0x1c8
	wb_workfn+0x554/0xf74
	process_one_work+0x440/0x118c
	worker_thread+0xac/0x974
	kthread+0x1a0/0x1c8
	ret_from_fork+0x10/0x1c

Signed-off-by: Zhikang Zhang <zhangzhikang1@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
e1293bdfa0 f2fs: plug readahead IO in readdir()
Add a plug to merge readahead IO in readdir(), expecting it can
reduce bio count before submitting to block layer.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
042be0f849 f2fs: fix to do sanity check with current segment number
https://bugzilla.kernel.org/show_bug.cgi?id=200219

Reproduction way:
- mount image
- run poc code
- umount image

F2FS-fs (loop1): Bitmap was wrongly set, blk:15364
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/segment.c:2061!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 17686 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: update_sit_entry+0x459/0x4e0 [f2fs]
Code: e8 1c b5 fd ff 0f 0b 0f 0b 8b 45 e4 c7 44 24 08 9c 7a 6c f8 c7 44 24 04 bc 4a 6c f8 89 44 24 0c 8b 06 89 04 24 e8 f7 b4 fd ff <0f> 0b 8b 45 e4 0f b6 d2 89 54 24 10 c7 44 24 08 60 7a 6c f8 c7 44
EAX: 00000032 EBX: 000000f8 ECX: 00000002 EDX: 00000001
ESI: d7177000 EDI: f520fe68 EBP: d6477c6c ESP: d6477c34
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010282
CR0: 80050033 CR2: b7fbe000 CR3: 2a99b3c0 CR4: 000406f0
Call Trace:
 f2fs_allocate_data_block+0x124/0x580 [f2fs]
 do_write_page+0x78/0x150 [f2fs]
 f2fs_do_write_node_page+0x25/0xa0 [f2fs]
 __write_node_page+0x2bf/0x550 [f2fs]
 f2fs_sync_node_pages+0x60e/0x6d0 [f2fs]
 ? sync_inode_metadata+0x2f/0x40
 ? f2fs_write_checkpoint+0x28f/0x7d0 [f2fs]
 ? up_write+0x1e/0x80
 f2fs_write_checkpoint+0x2a9/0x7d0 [f2fs]
 ? mark_held_locks+0x5d/0x80
 ? _raw_spin_unlock_irq+0x27/0x50
 kill_f2fs_super+0x68/0x90 [f2fs]
 deactivate_locked_super+0x3d/0x70
 deactivate_super+0x40/0x60
 cleanup_mnt+0x39/0x70
 __cleanup_mnt+0x10/0x20
 task_work_run+0x81/0xa0
 exit_to_usermode_loop+0x59/0xa7
 do_fast_syscall_32+0x1f5/0x22c
 entry_SYSENTER_32+0x53/0x86
EIP: 0xb7f95c51
Code: c1 1e f7 ff ff 89 e5 8b 55 08 85 d2 8b 81 64 cd ff ff 74 02 89 02 5d c3 8b 0c 24 c3 8b 1c 24 c3 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
EAX: 00000000 EBX: 0871ab90 ECX: bfb2cd00 EDX: 00000000
ESI: 00000000 EDI: 0871ab90 EBP: 0871ab90 ESP: bfb2cd7c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000246
Modules linked in: f2fs(O) crc32_generic bnep rfcomm bluetooth ecdh_generic snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq pcbc joydev aesni_intel snd_seq_device aes_i586 snd_timer crypto_simd snd cryptd soundcore mac_hid serio_raw video i2c_piix4 parport_pc ppdev lp parport hid_generic psmouse usbhid hid e1000 [last unloaded: f2fs]
---[ end trace d423f83982cfcdc5 ]---

The reason is, different log headers using the same segment, once
one log's next block address is used by another log, it will cause
panic as above.

Main area: 24 segs, 24 secs 24 zones
  - COLD  data: 0, 0, 0
  - WARM  data: 1, 1, 1
  - HOT   data: 20, 20, 20
  - Dir   dnode: 22, 22, 22
  - File   dnode: 22, 22, 22
  - Indir nodes: 21, 21, 21

So this patch adds sanity check to detect such condition to avoid
this issue.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
4a70e25544 f2fs: fix memory leak of percpu counter in fill_super()
In fill_super -> init_percpu_info, we should destroy percpu counter
in error path, otherwise memory allcoated for percpu counter will
leak.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Chao Yu
0b2103e886 f2fs: fix memory leak of write_io in fill_super()
It needs to release memory allocated for sbi->write_io in error path,
otherwise, it will cause memory leak.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 17:09:32 -07:00
Al Viro
77021f8bab presence of RS485 ioctls has been unconditional since 2014
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-11 19:46:07 -04:00
Chengguang Xu
313ed62a3d f2fs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Chao Yu
1378752b99 f2fs: fix to flush all dirty inodes recovered in readonly fs
generic/417 reported as blow:

------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:695!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 21697 Comm: umount Tainted: G        W  O      4.18.0-rc2+ #39
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]
Call Trace:
 ? _raw_spin_unlock+0x2c/0x50
 evict+0xa8/0x170
 dispose_list+0x34/0x40
 evict_inodes+0x118/0x120
 generic_shutdown_super+0x41/0x100
 ? rcu_read_lock_sched_held+0x97/0xa0
 kill_block_super+0x22/0x50
 kill_f2fs_super+0x6f/0x80 [f2fs]
 deactivate_locked_super+0x3d/0x70
 deactivate_super+0x40/0x60
 cleanup_mnt+0x39/0x70
 __cleanup_mnt+0x10/0x20
 task_work_run+0x81/0xa0
 exit_to_usermode_loop+0x59/0xa7
 do_fast_syscall_32+0x1f5/0x22c
 entry_SYSENTER_32+0x53/0x86
EIP: f2fs_evict_inode+0x556/0x580 [f2fs]

It can simply reproduced with scripts:

Enable quota feature during mkfs.

Testcase1:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. xfs_io -f /mnt/f2fs/file -c "pwrite 0 4k" -c "fsync"
4. godown /mnt/f2fs
5. umount /mnt/f2fs
6. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
7. umount /mnt/f2fs

Testcase2:
1. mkfs.f2fs /dev/zram0
2. mount -t f2fs /dev/zram0 /mnt/f2fs
3. touch /mnt/f2fs/file
4. create process[pid = x] do:
	a) open /mnt/f2fs/file;
	b) unlink /mnt/f2fs/file
5. godown -f /mnt/f2fs
6. kill process[pid = x]
7. umount /mnt/f2fs
8. mount -t f2fs -o ro /dev/zram0 /mnt/f2fs
9. umount /mnt/f2fs

The reason is: during recovery, i_{c,m}time of inode will be updated, then
the inode can be set dirty w/o being tracked in sbi->inode_list[DIRTY_META]
global list, so later write_checkpoint will not flush such dirty inode into
node page.

Once umount is called, sync_filesystem() in generic_shutdown_super() will
skip syncng dirty inodes due to sb_rdonly check, leaving dirty inodes
there.

To solve this issue, during umount, add remove SB_RDONLY flag in
sb->s_flags, to make sure sync_filesystem() will not be skipped.

Signed-off-by: Chao Yu <yuchao0@huawei.com>

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Yunlei He
cda9cc595f f2fs: report error if quota off error during umount
Now, we depend on fsck to ensure quota file data is ok,
so we scan whole partition if checkpoint without umount
flag. It's same for quota off error case, which may make
quota file data inconsistent.

generic/019 reports below error:

 __quota_error: 1160 callbacks suppressed
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 Quota error (device zram1): write_blk: dquota write failed
 Quota error (device zram1): qtree_write_dquot: Error -28 occurred while creating quota
 VFS: Busy inodes after unmount of zram1. Self-destruct in 5 seconds.  Have a nice day...

If we failed in below path due to fail to write dquot block, we will miss
to release quota inode, fix it.

- f2fs_put_super
 - f2fs_quota_off_umount
  - f2fs_quota_off
   - f2fs_quota_sync   <-- failed
   - dquot_quota_off   <-- missed to call

Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-11 13:16:03 -07:00
Eric W. Biederman
961366a019 signal: Remove the siginfo paramater from kernel_dqueue_signal
None of the callers use the it so remove it.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-09-11 21:19:14 +02:00
Ross Zwisler
b1f382178d ext4: close race between direct IO and ext4_break_layouts()
If the refcount of a page is lowered between the time that it is returned
by dax_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called.  This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-11 13:31:16 -04:00
Chengguang Xu
02645bcdfc jfs: remove quota option from ignore list
We treat quota option as usrquota, so remove quota option from ignore
list.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-10 15:36:40 -05:00
Al Viro
702ec3072a hidp: fix compat_ioctl
1) no point putting it into fs/compat_ioctl.c when you handle it in
your ->compat_ioctl() anyway.
2) HIDPCONNADD is *not* COMPATIBLE_IOCTL() stuff at all - it does
layout massage (pointer-chasing there)
3) use compat_ptr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:07 -04:00
Al Viro
89c0c24b4f cmtp: fix compat_ioctl
Use compat_ptr().  And don't mess with fs/compat_ioctl.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:06 -04:00
Al Viro
cc04f6e242 bnep: fix compat_ioctl
use compat_ptr() properly and don't bother with fs/compat_ioctl.c -
it's all handled in ->compat_ioctl() anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:06 -04:00
Al Viro
0976d4e1dc compat_ioctl: trim the pointless includes
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:40:57 -04:00
Miklos Szeredi
8c25741aaa ovl: fix oopses in ovl_fill_super() failure paths
ovl_free_fs() dereferences ofs->workbasedir and ofs->upper_mnt in cases when
those might not have been initialized yet.

Fix the initialization order for these fields.

Reported-by: syzbot+c75f181dc8429d2eb887@syzkaller.appspotmail.com
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc:  <stable@vger.kernel.org> # v4.15
Fixes: 95e6d4177c ("ovl: grab reference to workbasedir early")
Fixes: a9075cdb46 ("ovl: factor out ovl_free_fs() helper")
2018-09-10 12:55:49 +02:00
Stefan Metzmacher
5890184d2b fs/cifs: require sha512
This got lost in commit 0fdfef9aa7,
which removed CONFIG_CIFS_SMB311.

Signed-off-by: Stefan Metzmacher <metze@samba.org>
Fixes: 0fdfef9aa7 ("smb3: simplify code by removing CONFIG_CIFS_SMB311")
CC: Stable <stable@vger.kernel.org>
CC: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-09 00:04:27 -05:00
Stephen Rothwell
bcfb84a996 fs/cifs: suppress a string overflow warning
A powerpc build of cifs with gcc v8.2.0 produces this warning:

fs/cifs/cifssmb.c: In function ‘CIFSSMBNegotiate’:
fs/cifs/cifssmb.c:605:3: warning: ‘strncpy’ writing 16 bytes into a region of size 1 overflows the destination [-Wstringop-overflow=]
   strncpy(pSMB->DialectsArray+count, protocols[i].name, 16);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since we are already doing a strlen() on the source, change the strncpy
to a memcpy().

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-09 00:02:39 -05:00
David Howells
ecfe951f0c afs: Fix cell specification to permit an empty address list
Fix the cell specification mechanism to allow cells to be pre-created
without having to specify at least one address (the addresses will be
upcalled for).

This allows the cell information preload service to avoid the need to issue
loads of DNS lookups during boot to get the addresses for each cell (500+
lookups for the 'standard' cell list[*]).  The lookups can be done later as
each cell is accessed through the filesystem.

Also remove the print statement that prints a line every time a new cell is
added.

[*] There are 144 cells in the list.  Each cell is first looked up for an
    SRV record, and if that fails, for an AFSDB record.  These get a list
    of server names, each of which then has to be looked up to get the
    addresses for that server.  E.g.:

	dig srv _afs3-vlserver._udp.grand.central.org

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-07 16:39:44 -07:00
Jaegeuk Kim
5ce805869c f2fs: submit bio after shutdown
Sometimes, some merged IOs could get a chance to be submitted, resulting in
system hang in shutdown test. This issues IOs all the time after shutdown.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-07 14:26:43 -07:00
Linus Torvalds
a12ed06ba2 Two rbd patches to complete support for images within namespaces that
went into -rc1 and a use-after-free fix.
 
 The rbd changes have been sitting in a branch for quite a while but
 couldn't be included into the -rc1 pull request because of a pending
 wire protocol backwards compatibility fixup that only got committed
 early this week.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAluSrJYTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi/N8B/4sZzRCJMCejvU/yRq91NlaPDrxbVHh
 nfICZ/8Fsy/fmvK8NWNyHcCIWx+nWrbCvCJMj0fxWMhk/1t75yC+TdyCJnyuhsQU
 V/CPTs9BTdwrSUiTB83/n/ukGL6mpESk0CQ1er/l1EO6FnNOXvgzHDnCqUQZLdzU
 1aRcx5JQWWo/QlCmzt2KWENhfQRMvLAtf04F5cUuR+JTrMjwWia6MAuRGuOhVQkW
 XIlFNakBKab89Vod1pmA7BrG/+sHXCpVGX6sjAp9vQUWO3WWKBRnNtVwo9dPSHah
 hBR8IzOkihw7HfTlINWVpiR69nTfM80PQHXJkFSp36E6Sfq8EShRpFIZ
 =pga5
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "Two rbd patches to complete support for images within namespaces that
  went into -rc1 and a use-after-free fix.

  The rbd changes have been sitting in a branch for quite a while but
  couldn't be included into the -rc1 pull request because of a pending
  wire protocol backwards compatibility fixup that only got committed
  early this week"

* tag 'ceph-for-4.19-rc3' of https://github.com/ceph/ceph-client:
  rbd: support cloning across namespaces
  rbd: factor out get_parent_info()
  ceph: avoid a use-after-free in ceph_destroy_options()
2018-09-07 10:57:59 -07:00
Linus Torvalds
d042a240a8 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluSNfQACgkQnJ2qBz9k
 QNloSAf/RpsqUnmQvJKK7hQUVNMCQP/Kf3KND5iN5RfMbhU9r7tzERkNvqhdA6QZ
 uoPi8dEecI+ihY5F8ddyw1Chaou4MToWKdNz4ojwJXVrN6bb+pq+xj0hTvT5FjFh
 iM1JXHtSEk6W+CnXPE5CycrZppIHxJfJxeaWg7av5Zyc4nkTesxtG8PycMBxROW8
 detUcJt15VGBswi19udztf7XY/lwDwUQ9LwC0W5B+o8pKIwuN3ENMVVOeAriAyoy
 hXTpPA8twBhM7i8D/1eppDCkYLTr08bquNsDpn8kUEf2RxcxiFJuDLOeXiH3sQRq
 BZmf/QIIRA8R+SPeFiuxY/795FDC6Q==
 =CWu1
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull fsnotify fix from Jan Kara:
 "A small fsnotify fix from Amir"

* tag 'for_v4.19-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  fsnotify: fix ignore mask logic in fsnotify()
2018-09-07 10:54:46 -07:00
Dominique Martinet
b4dc44b3ca 9p locks: fix glock.client_id leak in do_lock
the 9p client code overwrites our glock.client_id pointing to a static
buffer by an allocated string holding the network provided value which
we do not care about; free and reset the value as appropriate.

This is almost identical to the leak in v9fs_file_getlock() fixed by
Al Viro in commit ce85dd58ad ("9p: we are leaking glock.client_id
in v9fs_file_getlock()"), which was returned as an error by a coverity
false positive -- while we are here attempt to make the code slightly
more robust to future change of the net/9p/client code and hopefully
more clear to coverity that there is no problem.

Link: http://lkml.kernel.org/r/1536339057-21974-5-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:52:35 +09:00
Dominique Martinet
e02a53d92e 9p: acl: fix uninitialized iattr access
iattr is passed to v9fs_vfs_setattr_dotl which does send various
values from iattr over the wire, even if it tells the server to
only look at iattr.ia_valid fields this could leak some stack data.

Link: http://lkml.kernel.org/r/1536339057-21974-2-git-send-email-asmadeus@codewreck.org
Addresses-Coverity-ID: 1195601 ("Uninitalized scalar variable")
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:51:50 +09:00
Dinu-Razvan Chis-Serban
5e172f75e5 9p locks: add mount option for lock retry interval
The default P9_LOCK_TIMEOUT can be too long for some users exporting
a local file system to a guest VM (30s), make this configurable at
mount time.

Link: http://lkml.kernel.org/r/1536295827-3181-1-git-send-email-asmadeus@codewreck.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195727
Signed-off-by: Dinu-Razvan Chis-Serban <justcsdr@gmail.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:40:06 +09:00
Gertjan Halkes
2803cf4379 9p: do not trust pdu content for stat item size
v9fs_dir_readdir() could deadloop if a struct was sent with a size set
to -2

Link: http://lkml.kernel.org/r/1536134432-11997-1-git-send-email-asmadeus@codewreck.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88021
Signed-off-by: Gertjan Halkes <gertjan@google.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:40:06 +09:00
Gustavo A. R. Silva
426d5a0f97 9p: fix spelling mistake in fall-through annotation
Replace "fallthough" with a proper "fall through" annotation.

This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough

Link: http://lkml.kernel.org/r/20180903193806.GA11258@embeddedor.com
Addresses-Coverity-ID: 402012 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:39:47 +09:00
Jan Kara
1abefb0274 udf: Drop pack pragma from udf_sb.h
Drop pack pragma. The header file defines only in-memory structures.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:23 +02:00
Jan Kara
694538b5d7 udf: Drop freed bitmap / table support
We don't support Free Space Table and Free Space Bitmap as specified by
UDF standard for writing as we don't support erasing blocks before
overwriting them. Just drop the handling of these structures as
partition descriptor checking code already makes sure such filesystems
can be mounted only read-only.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Jan Kara
b085fbe2ef udf: Fix crash during mount
Fix a crash during an attempt to mount a filesystem that has both
Unallocated Space Table and Unallocated Space Bitmap. Such filesystem
actually violates the UDF standard so we just have to properly detect
such situation and refuse to mount such filesystem read-write. When we
are at it, verify also other constraints on the allocation information
mandated by the standard.

Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Jan Kara
a9ad01bc75 udf: Prevent write-unsupported filesystem to be remounted read-write
There are certain filesystem features which we support for reading but
not for writing. We properly refuse to mount such filesystems read-write
however for some features (such as read-only partitions), we don't check
for these features when remounting the filesystem from read-only to
read-write. Thus such filesystems could be remounted read-write leading
to strange behavior (most likely crashes).

Fix the problem by marking in superblock whether the filesystem has some
features that are supported in read-only mode and check this flag during
remount.

Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-07 10:32:22 +02:00
Linus Torvalds
c6ff25ce35 four small SMB3 fixes, three for stable, and one minor debug clarification
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluRb84ACgkQiiy9cAdy
 T1GBuAv/ZkXC5cxNpE84dRLii4ey+Y0u5ip3VIyzxwburDDxQh99zWTs8FdyYe1f
 5CYD4PKcNajVceNUg1EdfkNC4ss21I2HcxujquCo8gZJqrt5lHvZXELJ31d8ovXG
 Y2MRl5+KZKLx1sBgzsGJf3aZOneObp7EEAnL/bjeziX7caD6uO/F52MXcMgWrpoM
 krvWSMzS5iY+jRltvPLhTzUmfbaPoS86FRNIHHOiA8AgQLvx3CT4lL+kJOHv1bHX
 haZc9zKy1iUU+yK05vnNLOHVlPeZDa8j/Q8lcEfTrDVa0J/je4DkI9HsB1X54vz0
 65JluAf4G4vaSYU/hnLaWt4PZ7owjTr7fJlu0c8TE6aPqAZEYYoVhjbrdXg7OI4j
 9nUsoEY1hyBH2X5qStvpb9GiZbhR19bhcvnvOAgZ7p4VnyKfexe7o2QHTCPPAxaP
 uYf8OeHjmnYF3MQI0s6rA0TPdZxc77acgrZQWUoOMPN/bwkyP1bSfA1aURJAsx/k
 OodjAog1
 =WRXo
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Four small SMB3 fixes, three for stable, and one minor debug
  clarification"

* tag '4.19-rc2-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: connect to servername instead of IP for IPC$ share
  smb3: check for and properly advertise directory lease support
  smb3: minor debugging clarifications in rfc1001 len processing
  SMB3: Backup intent flag missing for directory opens with backupuid mounts
  fs/cifs: don't translate SFM_SLASH (U+F026) to backslash
2018-09-06 15:39:11 -07:00
Chengguang Xu
e8d4ceeb34 jfs: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated
as ACL_NOT_CACHED in vfs function inode_init_always() and later
will be updated by calling xxx_init_acl() in specific filesystems.
Howerver, when default_acl and acl are NULL then they keep the value
of ACL_NOT_CACHED, this patch tries to cache NULL for acl/default_acl
in this case.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2018-09-06 11:56:26 -05:00
Linus Torvalds
5404525b98 for-4.19-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAluRLa8ACgkQxWXV+ddt
 WDvc+BAAqxTMVngZ60WfktXzsS56OB6fu/R3DORgYcSZ0BCD4zTwoDlCjLhrCK6E
 cmC+BMj+AspDQYiYESwGyFcN10sK0X7w7fa3wypTc4GNWxpkRm0Z6zT/kCvLUhdI
 NlkMqAfsZ9N6iIXcR0qOxI7G55e3mpXPZGdFTk5rmDTv/9TqU0TMp9s8Zw5scn6R
 ctdE+iE0lpRfNjF8ZDH1BtYIV4g2X81sZF/fkGz621HQfMTCjjPHFdlz+jlirBaf
 BrYR4w4zjVuMKd3ZC5FHffVchbkvt29h6fAr4sEpJTwFJwd8pjI7GuPYWDQ918NB
 TGX6EUP6usQqDK2zD405jCS6MbMshJm3uh5kmEpeNgK/tKJTln8Sbef/Xs93yIn2
 +k9BMKOIcUHHBiv6PgCaZomcWCpii2S2u6vncqCnNuI4wK1RN3gHJc5YPhJArlrB
 NUFJiTCQE6LWYOP2Hw+rggcrtBxli0bX7Mqp5FYFVdh5KBvolJE1o3B/JS8qpqRF
 u0dPwbLHtTpTpXM5EfmM8a45S+DxuxTDBh3vdoAOM9LN/ivpeqqnFbHrIGmrTMjo
 pQJ8aTrCwYMEMNu6oCV1cniFrOYRZ439hYjg524MjVXYCRyxhzAdVmVTEBaLjWCW
 9GlGqEC7YZY2wLi5lPEGqxsIaVVELpettJB9KbBKmYB47VFWEf0=
 =fu93
 -----END PGP SIGNATURE-----

Merge tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix for improper fsync after hardlink

 - fix for a corruption during file deduplication

 - use after free fixes

 - RCU warning fix

 - fix for buffered write to nodatacow file

* tag 'for-4.19-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
  btrfs: use after free in btrfs_quota_enable
  btrfs: btrfs_shrink_device should call commit transaction at the end
  btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
  Btrfs: fix data corruption when deduplicating between different files
  Btrfs: sync log after logging new name
  Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space
2018-09-06 09:04:45 -07:00
Ilya Dryomov
8aaff15168 ceph: avoid a use-after-free in ceph_destroy_options()
syzbot reported a use-after-free in ceph_destroy_options(), called from
ceph_mount().  The problem was that create_fs_client() consumed the opt
pointer on some errors, but not on all of them.  Make sure it always
consumes both libceph and ceph options.

Reported-by: syzbot+8ab6f1042021b4eed062@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
2018-09-06 16:18:04 +02:00
Jaegeuk Kim
0ded69f632 f2fs: avoid wrong decrypted data from disk
1. Create a file in an encrypted directory
2. Do GC & drop caches
3. Read stale data before its bio for metapage was not issued yet

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:41:07 -07:00
Chao Yu
22d7ea1364 Revert "f2fs: use printk_ratelimited for f2fs_msg"
Don't limit printing log, so that we will not miss any key messages.

This reverts commit a36c106dff.

In addition, we use printk_ratelimited to avoid too many log prints.
- error injection
- discard submission failure

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:41:06 -07:00
Sahitya Tummala
abde73c718 f2fs: fix unnecessary periodic wakeup of discard thread when dev is busy
When dev is busy, discard thread wake up timeout can be aligned with the
exact time that it needs to wait for dev to come out of busy. This helps
to avoid unnecessary periodic wakeups and thus save some power.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:32 -07:00
Chao Yu
7d20c8abb2 f2fs: fix to avoid NULL pointer dereference on se->discard_map
https://bugzilla.kernel.org/show_bug.cgi?id=200951

These is a NULL pointer dereference issue reported in bugzilla:

Hi,
in the setup there is a SATA SSD connected to a SATA-to-USB bridge.

The disc is "Samsung SSD 850 PRO 256G" which supports TRIM.
There are four partitions:
 sda1: FAT  /boot
 sda2: F2FS /
 sda3: F2FS /home
 sda4: F2FS

The bridge is ASMT1153e which uses the "uas" driver.
There is no TRIM pass-through, so, when mounting it reports:
 mounting with "discard" option, but the device does not support discard

The USB host is USB3.0 and UASP capable. It is the one on RK3399.

Given this everything works fine, except there is no TRIM support.

In order to enable TRIM a new UDEV rule is added [1]:
 /etc/udev/rules.d/10-sata-bridge-trim.rules:
 ACTION=="add|change", ATTRS{idVendor}=="174c", ATTRS{idProduct}=="55aa", SUBSYSTEM=="scsi_disk", ATTR{provisioning_mode}="unmap"
After reboot any F2FS write hangs forever and dmesg reports:
 Unable to handle kernel NULL pointer dereference

Also tested on a x86_64 system: works fine even with TRIM enabled.
 same disc
 same bridge
 different usb host controller
 different cpu architecture
 not root filesystem

Regards,
  Vicenç.

[1] Post #5 in https://bbs.archlinux.org/viewtopic.php?id=236280

 Unable to handle kernel NULL pointer dereference at virtual address 000000000000003e
 Mem abort info:
   ESR = 0x96000004
   Exception class = DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000626e3122
 [000000000000003e] pgd=0000000000000000
 Internal error: Oops: 96000004 [#1] SMP
 Modules linked in: overlay snd_soc_hdmi_codec rc_cec dw_hdmi_i2s_audio dw_hdmi_cec snd_soc_simple_card snd_soc_simple_card_utils snd_soc_rockchip_i2s rockchip_rga snd_soc_rockchip_pcm rockchipdrm videobuf2_dma_sg v4l2_mem2mem rtc_rk808 videobuf2_memops analogix_dp videobuf2_v4l2 videobuf2_common dw_hdmi dw_wdt cec rc_core videodev drm_kms_helper media drm rockchip_thermal rockchip_saradc realtek drm_panel_orientation_quirks syscopyarea sysfillrect sysimgblt fb_sys_fops dwmac_rk stmmac_platform stmmac pwm_bl squashfs loop crypto_user gpio_keys hid_kensington
 CPU: 5 PID: 957 Comm: nvim Not tainted 4.19.0-rc1-1-ARCH #1
 Hardware name: Sapphire-RK3399 Board (DT)
 pstate: 00000005 (nzcv daif -PAN -UAO)
 pc : update_sit_entry+0x304/0x4b0
 lr : update_sit_entry+0x108/0x4b0
 sp : ffff00000ca13bd0
 x29: ffff00000ca13bd0 x28: 000000000000003e
 x27: 0000000000000020 x26: 0000000000080000
 x25: 0000000000000048 x24: ffff8000ebb85cf8
 x23: 0000000000000253 x22: 00000000ffffffff
 x21: 00000000000535f2 x20: 00000000ffffffdf
 x19: ffff8000eb9e6800 x18: ffff8000eb9e6be8
 x17: 0000000007ce6926 x16: 000000001c83ffa8
 x15: 0000000000000000 x14: ffff8000f602df90
 x13: 0000000000000006 x12: 0000000000000040
 x11: 0000000000000228 x10: 0000000000000000
 x9 : 0000000000000000 x8 : 0000000000000000
 x7 : 00000000000535f2 x6 : ffff8000ebff3440
 x5 : ffff8000ebff3440 x4 : ffff8000ebe3a6c8
 x3 : 00000000ffffffff x2 : 0000000000000020
 x1 : 0000000000000000 x0 : ffff8000eb9e5800
 Process nvim (pid: 957, stack limit = 0x0000000063a78320)
 Call trace:
  update_sit_entry+0x304/0x4b0
  f2fs_invalidate_blocks+0x98/0x140
  truncate_node+0x90/0x400
  f2fs_remove_inode_page+0xe8/0x340
  f2fs_evict_inode+0x2b0/0x408
  evict+0xe0/0x1e0
  iput+0x160/0x260
  do_unlinkat+0x214/0x298
  __arm64_sys_unlinkat+0x3c/0x68
  el0_svc_handler+0x94/0x118
  el0_svc+0x8/0xc
 Code: f9400800 b9488400 36080140 f9400f01 (387c4820)
 ---[ end trace a0f21a307118c477 ]---

The reason is it is possible to enable discard flag on block queue via
UDEV, but during mount, f2fs will initialize se->discard_map only if
this flag is set, once the flag is set after mount, f2fs may dereference
NULL pointer on se->discard_map.

So this patch does below changes to fix this issue:
- initialize and update se->discard_map all the time.
- don't clear DISCARD option if device has no QUEUE_FLAG_DISCARD flag
during mount.
- don't issue small discard on zoned block device.
- introduce some functions to enhance the readability.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Tested-by: Vicente Bergas <vicencb@gmail.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:31 -07:00
Chengguang Xu
1618e6e297 f2fs: add additional sanity check in f2fs_acl_from_disk()
Add additinal sanity check for irregular case(e.g. corruption).
If size of extended attribution is smaller than size of acl header,
then return -EINVAL.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-09-05 13:40:31 -07:00
Ryusuke Konishi
ae98043f5f nilfs2: convert to SPDX license tags
Remove the verbose license text from NILFS2 files and replace them with
SPDX tags.  This does not change the license of any of the code.

Link: http://lkml.kernel.org/r/1535624528-5982-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-09-04 16:45:02 -07:00
Alexander Popov
c8d126275a fs/proc: Show STACKLEAK metrics in the /proc file system
Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
shows the maximum kernel stack consumption for the current and previous
syscalls. Although this information is not precise, it can be useful for
estimating the STACKLEAK performance impact for your workloads.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Alexander Popov <alex.popov@linux.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-09-04 10:35:48 -07:00
Jason A. Donenfeld
578bdaabd0 crypto: speck - remove Speck
These are unused, undesired, and have never actually been used by
anybody. The original authors of this code have changed their mind about
its inclusion. While originally proposed for disk encryption on low-end
devices, the idea was discarded [1] in favor of something else before
that could really get going. Therefore, this patch removes Speck.

[1] https://marc.info/?l=linux-crypto-vger&m=153359499015659

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Eric Biggers <ebiggers@google.com>
Cc: stable@vger.kernel.org
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-09-04 11:35:03 +08:00
Theodore Ts'o
5f8c10936f ext4: fix online resizing for bigalloc file systems with a 1k block size
An online resize of a file system with the bigalloc feature enabled
and a 1k block size would be refused since ext4_resize_begin() did not
understand s_first_data_block is 0 for all bigalloc file systems, even
when the block size is 1k.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-03 22:25:01 -04:00
Theodore Ts'o
f0a459dec5 ext4: fix online resize's handling of a too-small final block group
Avoid growing the file system to an extent so that the last block
group is too small to hold all of the metadata that must be stored in
the block group.

This problem can be triggered with the following reproducer:

umount /mnt
mke2fs -F -m0 -b 4096 -t ext4 -O resize_inode,^has_journal \
	-E resize=1073741824 /tmp/foo.img 128M
mount /tmp/foo.img /mnt
truncate --size 1708M /tmp/foo.img
resize2fs /dev/loop0 295400
umount /mnt
e2fsck -fy /tmp/foo.img

Reported-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-03 22:19:43 -04:00
Amir Goldstein
d54f4fba88 fanotify: add API to attach/detach super block mark
Add another mark type flag FAN_MARK_FILESYSTEM for add/remove/flush
of super block mark type.

A super block watch gets all events on the filesystem, regardless of
the mount from which the mark was added, unless an ignore mask exists
on either the inode or the mount where the event was generated.

Only one of FAN_MARK_MOUNT and FAN_MARK_FILESYSTEM mark type flags
may be provided to fanotify_mark() or no mark type flag for inode mark.

Cc: <linux-api@vger.kernel.org>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:16:11 +02:00
Amir Goldstein
60f7ed8c7c fsnotify: send path type events to group with super block marks
Send events to group if super block mark mask matches the event
and unless the same group has an ignore mask on the vfsmount or
the inode on which the event occurred.

Soon, fanotify backend is going to support super block marks and
fanotify backend only supports path type events.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:04 +02:00
Amir Goldstein
1e6cb72399 fsnotify: add super block object type
Add the infrastructure to attach a mark to a super_block struct
and detach all attached marks when super block is destroyed.

This is going to be used by fanotify backend to setup super block
marks.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 15:14:01 +02:00
Jann Horn
9da3f2b740 x86/fault: BUG() when uaccess helpers fault on kernel addresses
There have been multiple kernel vulnerabilities that permitted userspace to
pass completely unchecked pointers through to userspace accessors:

 - the waitid() bug - commit 96ca579a1e ("waitid(): Add missing
   access_ok() checks")
 - the sg/bsg read/write APIs
 - the infiniband read/write APIs

These don't happen all that often, but when they do happen, it is hard to
test for them properly; and it is probably also hard to discover them with
fuzzing. Even when an unmapped kernel address is supplied to such buggy
code, it just returns -EFAULT instead of doing a proper BUG() or at least
WARN().

Try to make such misbehaving code a bit more visible by refusing to do a
fixup in the pagefault handler code when a userspace accessor causes a #PF
on a kernel address and the current context isn't whitelisted.

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
2018-09-03 15:12:09 +02:00
Amir Goldstein
9bdda4e9cf fsnotify: fix ignore mask logic in fsnotify()
Commit 92183a4289 ("fsnotify: fix ignore mask logic in
send_to_group()") acknoledges the use case of ignoring an event on
an inode mark, because of an ignore mask on a mount mark of the same
group (i.e. I want to get all events on this file, except for the events
that came from that mount).

This change depends on correctly merging the inode marks and mount marks
group lists, so that the mount mark ignore mask would be tested in
send_to_group(). Alas, the merging of the lists did not take into
account the case where event in question is not in the mask of any of
the mount marks.

To fix this, completely remove the tests for inode and mount event masks
from the lists merging code.

Fixes: 92183a4289 ("fsnotify: fix ignore mask logic in send_to_group")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 14:57:41 +02:00
Chengguang Xu
59fed3bf8a ext2: cache NULL when both default_acl and acl are NULL
default_acl and acl of newly created inode will be initiated as
ACL_NOT_CACHED in vfs function inode_init_always() and later will be
updated by calling xxx_init_acl() in specific filesystems.  However,
when default_acl and acl are NULL, then they keep the value of
ACL_NOT_CACHED. This patch changes the code to cache NULL for acl /
default_acl in this case to save unnecessary ACL lookup in future.

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 11:05:03 +02:00
Colin Ian King
849fe89ce6 udf: remove unused variables group_start and nr_groups
Variables group_start and nr_groups are being assigned but are never used
hence they are redundant and can be removed.

Cleans up clang warning:
variable 'group_start' set but not used [-Wunused-but-set-variable]
variable 'nr_groups' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-09-03 11:04:49 +02:00
Amir Goldstein
b833a36603 ovl: add ovl_fadvise()
Implement stacked fadvise to fix syscalls readahead(2) and fadvise64(2)
on an overlayfs file.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: d1d04ef857 ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-09-03 09:43:10 +02:00
Thomas Werschlein
395a2076b4 cifs: connect to servername instead of IP for IPC$ share
This patch is required allows access to a Microsoft fileserver failover
cluster behind a 1:1 NAT firewall.

The change also provides stronger context for authentication and share
connection (see MS-SMB2 3.3.5.7 and MS-SRVS 3.1.6.8) as noted by
Tom Talpey, and addresses comments about the buffer size for the UNC
made by Aurélien Aptel.

Signed-off-by: Thomas Werschlein <thomas.werschlein@geo.uzh.ch>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Tom Talpey <ttalpey@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
2018-09-02 23:21:42 -05:00
Steve French
f801568332 smb3: check for and properly advertise directory lease support
Although servers will typically ignore unsupported features,
we should advertise the support for directory leases (as
Windows e.g. does) in the negotiate protocol capabilities we
pass to the server, and should check for the server capability
(CAP_DIRECTORY_LEASING) before sending a lease request for an
open of a directory.  This will prevent us from accidentally
sending directory leases to SMB2.1 or SMB2 server for example.

Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-09-02 23:21:42 -05:00
Steve French
25f2573512 smb3: minor debugging clarifications in rfc1001 len processing
I ran into some cases where server was returning the wrong length
on frames but I couldn't easily match them to the command in the
network trace (or server logs) since I need the command and/or
multiplex id to find the offending SMB2/SMB3 command.  Add these
two fields to the log message. In the case of padding too much
it may not be a problem in all cases but might have correlated
to a network disconnect case in some problems we have been
looking at. In the case of frame too short is even more important.

Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
2018-09-02 23:21:42 -05:00
Steve French
5e19697b56 SMB3: Backup intent flag missing for directory opens with backupuid mounts
When "backup intent" is requested on the mount (e.g. backupuid or
backupgid mount options), the corresponding flag needs to be set
on opens of directories (and files) but was missing in some
places causing access denied trying to enumerate and backup
servers.

Fixes kernel bugzilla #200953
https://bugzilla.kernel.org/show_bug.cgi?id=200953

Reported-and-tested-by: <whh@rubrik.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
2018-09-02 23:21:42 -05:00
Jon Kuhn
c15e3f19a6 fs/cifs: don't translate SFM_SLASH (U+F026) to backslash
When a Mac client saves an item containing a backslash to a file server
the backslash is represented in the CIFS/SMB protocol as as U+F026.
Before this change, listing a directory containing an item with a
backslash in its name will return that item with the backslash
represented with a true backslash character (U+005C) because
convert_sfm_character mapped U+F026 to U+005C when interpretting the
CIFS/SMB protocol response.  However, attempting to open or stat the
path using a true backslash will result in an error because
convert_to_sfm_char does not map U+005C back to U+F026 causing the
CIFS/SMB request to be made with the backslash represented as U+005C.

This change simply prevents the U+F026 to U+005C conversion from
happenning.  This is analogous to how the code does not do any
translation of UNI_SLASH (U+F000).

Signed-off-by: Jon Kuhn <jkuhn@barracuda.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-09-02 23:21:42 -05:00
Linus Torvalds
501dacbc24 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:
 "A small set of updates for core code:

   - Prevent tracing in functions which are called from trace patching
     via stop_machine() to prevent executing half patched function trace
     entries.

   - Remove old GCC workarounds

   - Remove pointless includes of notifier.h"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Remove workaround for unreachable warnings from old GCC
  notifier: Remove notifier header file wherever not used
  watchdog: Mark watchdog touch functions as notrace
2018-09-02 09:41:45 -07:00
Theodore Ts'o
4274f516d4 ext4: recalucate superblock checksum after updating free blocks/inodes
When mounting the superblock, ext4_fill_super() calculates the free
blocks and free inodes and stores them in the superblock.  It's not
strictly necessary, since we don't use them any more, but it's nice to
keep them roughly aligned to reality.

Since it's not critical for file system correctness, the code doesn't
call ext4_commit_super().  The problem is that it's in
ext4_commit_super() that we recalculate the superblock checksum.  So
if we're not going to call ext4_commit_super(), we need to call
ext4_superblock_csum_set() to make sure the superblock checksum is
consistent.

Most of the time, this doesn't matter, since we end up calling
ext4_commit_super() very soon thereafter, and definitely by the time
the file system is unmounted.  However, it doesn't work in this
sequence:

mke2fs -Fq -t ext4 /dev/vdc 128M
mount /dev/vdc /vdc
cp xfstests/git-versions /vdc
godown /vdc
umount /vdc
mount /dev/vdc
tune2fs -l /dev/vdc

With this commit, the "tune2fs -l" no longer fails.

Reported-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
2018-09-01 14:42:14 -04:00
Theodore Ts'o
bcd8e91f98 ext4: avoid arithemetic overflow that can trigger a BUG
A maliciously crafted file system can cause an overflow when the
results of a 64-bit calculation is stored into a 32-bit length
parameter.

https://bugzilla.kernel.org/show_bug.cgi?id=200623

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-09-01 12:45:04 -04:00
Amir Goldstein
5b910bd615 ovl: fix GPF in swapfile_activate of file from overlayfs over xfs
Since overlayfs implements stacked file operations, the underlying
filesystems are not supposed to be exposed to the overlayfs file,
whose f_inode is an overlayfs inode.

Assigning an overlayfs file to swap_file results in an attempt of xfs
code to dereference an xfs_inode struct from an ovl_inode pointer:

 CPU: 0 PID: 2462 Comm: swapon Not tainted
 4.18.0-xfstests-12721-g33e17876ea4e #3402
 RIP: 0010:xfs_find_bdev_for_inode+0x23/0x2f
 Call Trace:
  xfs_iomap_swapfile_activate+0x1f/0x43
  __se_sys_swapon+0xb1a/0xee9

Fix this by not assigning the real inode mapping to f_mapping, which
will cause swapon() to return an error (-EINVAL). Although it makes
sense not to allow setting swpafile on an overlayfs file, some users
may depend on it, so we may need to fix this up in the future.

Keeping f_mapping pointing to overlay inode mapping will cause O_DIRECT
open to fail. Fix this by installing ovl_aops with noop_direct_IO in
overlay inode mapping.

Keeping f_mapping pointing to overlay inode mapping will cause other
a_ops related operations to fail (e.g. readahead()). Those will be
fixed by follow up patches.

Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: f7c72396d0 ("ovl: add O_DIRECT support")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-08-30 17:08:35 +02:00
Amir Goldstein
80d3481081 ovl: respect FIEMAP_FLAG_SYNC flag
Stacked overlayfs fiemap operation broke xfstests that test delayed
allocation (with "_test_generic_punch -d"), because ovl_fiemap()
failed to write dirty pages when requested.

Fixes: 9e142c4102 ("ovl: add ovl_fiemap()")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2018-08-30 17:08:35 +02:00
Mukesh Ojha
13ba17bee1 notifier: Remove notifier header file wherever not used
The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
2018-08-30 12:56:40 +02:00
Linus Torvalds
f3f106dac0 \n
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAluGfbAACgkQnJ2qBz9k
 QNmkVQgAjv/tCHStwoQ4Hhr6q5wLU9apAW5WC7Or2MyNfqoJsKgyujpikwFMa0xY
 wkVqlITYavvUKl/nRlQ6hpZtAY7hi1Y7GvIYs2ci6QO4YAUuBoH7qPLKqyZYzXx+
 mBaC68885nMMDqIHvsCcLurwTiDer6XQXXPDmpMoO9g4kyVuhm2e0/M7CPaA4Ue9
 WEfBPBROSNdRH7Wtww/MUvtfd+9mezdlpUQZVVO5cdhAVQ8pzW2g2piYtwIpaY7E
 DiJ1zcgaqCnni+b69x4wz8TwB0q8wZJjrPfJgf4X7mIGBZrw80yNRlevfe5SxwdJ
 mkMDvjaD0TEItvpjgTZReXaAjFOWTA==
 =E61M
 -----END PGP SIGNATURE-----

Merge tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs

Pull misc fs fixes from Jan Kara:

 - make UDF to properly mount media created by Win7

 - make isofs to properly refuse devices with large physical block size

 - fix a Spectre gadget in quotactl(2)

 - fix a warning in fsnotify code hit by syzkaller

* tag 'for_v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  udf: Fix mounting of Win7 created UDF filesystems
  udf: Remove dead code from udf_find_fileset()
  fs/quota: Fix spectre gadget in do_quotactl
  fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
  isofs: reject hardware sector size > 2048 bytes
  fsnotify: fix false positive warning on inode delete
2018-08-29 14:56:45 -07:00
Arnd Bergmann
4faea239e5 y2038: utimes: Rework #ifdef guards for compat syscalls
After changing over to 64-bit time_t syscalls, many architectures will
want compat_sys_utimensat() but not respective handlers for utime(),
utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to
complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that
support CONFIG_COMPAT set it, but future 64-bit architectures will not
(tile would not have needed it either, but got removed).

As older 32-bit architectures get converted to using CONFIG_64BIT_TIME,
they will have to use __ARCH_WANT_SYS_UTIME32 instead of
__ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't
need either of them as they never had a utime syscall.

Since the compat_utimbuf structure is now required outside of
CONFIG_COMPAT, I'm moving it into compat_time.h.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
changed from last version:
- renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
2018-08-29 15:42:23 +02:00
Arnd Bergmann
185cfaf764 y2038: Compile utimes()/futimesat() conditionally
There are four generations of utimes() syscalls: utime(), utimes(),
futimesat() and utimensat(), each one being a superset of the previous
one. For y2038 support, we have to add another one, which is the same
as the existing utimensat() but always passes 64-bit times_t based
timespec values.

There are currently 10 architectures that only use utimensat(), two
that use utimes(), futimesat() and utimensat() but not utime(), and 11
architectures that have all four, and those define __ARCH_WANT_SYS_UTIME
in order to get a sys_utime implementation. Since all the new
architectures only want utimensat(), moving all the legacy entry points
into a common __ARCH_WANT_SYS_UTIME guard simplifies the logic. Only alpha
and ia64 grow a tiny bit as they now also get an unused sys_utime(),
but it didn't seem worth the extra complexity of adding yet another
ifdef for those.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:23 +02:00
Arnd Bergmann
a4f7a30046 y2038: Change sys_utimensat() to use __kernel_timespec
When 32-bit architectures get changed to support 64-bit time_t,
utimensat() needs to use the new __kernel_timespec structure as its
argument.

The older utime(), utimes() and futimesat() system calls don't need a
corresponding change as they are no longer used on C libraries that have
64-bit time support.

As we do for the other syscalls that have timespec arguments, we reuse
the 'compat' syscall entry points to implement the traditional four
interfaces, and only leave the new utimensat() as a native handler,
so that the same code gets used on both 32-bit and 64-bit kernels
on each syscall.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:22 +02:00
Arnd Bergmann
caf6f9c8a3 asm-generic: Remove unneeded __ARCH_WANT_SYS_LLSEEK macro
The sys_llseek sytem call is needed on all 32-bit architectures and
none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard
and simplify the include/asm-generic/unistd.h header further.

Since 32-bit tasks can run either natively or in compat mode on 64-bit
architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT.

There are a few 64-bit architectures that also reference sys_llseek
in their 64-bit ABI (e.g. sparc), but I verified that those all
select CONFIG_COMPAT, so the #if check is still correct here. It's
a bit odd to include it in the syscall table though, as it's the
same as sys_lseek() on 64-bit, but with strange calling conventions.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:21 +02:00
Arnd Bergmann
82b355d161 y2038: Remove newstat family from default syscall set
We have four generations of stat() syscalls:
- the oldstat syscalls that are only used on the older architectures
- the newstat family that is used on all 64-bit architectures but
  lacked support for large files on 32-bit architectures.
- the stat64 family that is used mostly on 32-bit architectures to
  replace newstat
- statx() to replace all of the above, adding 64-bit timestamps among
  other things.

We already compile stat64 only on those architectures that need it,
but newstat is always built, including on those that don't reference
it. This adds a new __ARCH_WANT_NEW_STAT symbol along the lines of
__ARCH_WANT_OLD_STAT and __ARCH_WANT_STAT64 to control compilation of
newstat. All architectures that need it use an explict define, the
others now get a little bit smaller, and future architecture (including
64-bit targets) won't ever see it.

Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:20 +02:00
Dominique Martinet
81c99089bc v9fs_dir_readdir: fix double-free on p9stat_read error
p9stat_read will call p9stat_free on error, we should only free the
struct content on success.

There also is no need to "p9stat_init" st as the read function will
zero the whole struct for us anyway, so clean up the code a bit while
we are here.

Link: http://lkml.kernel.org/r/1535410108-20650-1-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Reported-by: syzbot+d4252148d198410b864f@syzkaller.appspotmail.com
2018-08-29 13:39:57 +09:00
Bob Peterson
4f36cb36c9 gfs2: Don't set GFS2_RDF_UPTODATE when the lvb is updated
The GFS2_RDF_UPTODATE flag in the rgrp is used to determine when
a rgrp buffer is valid. It's cleared when the glock is invalidated,
signifying that the buffer data is now invalid. But before this
patch, function update_rgrp_lvb was setting the flag when it
determined it had a valid lvb. But that's an invalid assumption:
just because you have a valid lvb doesn't mean you have valid
buffers. After all, another node may have made the lvb valid,
and this node just fetched it from the glock via dlm.

Consider this scenario:
1. The file system is mounted with RGRPLVB option.
2. In gfs2_inplace_reserve it locks the rgrp glock EX, but thanks
   to GL_SKIP, it skips the gfs2_rgrp_bh_get.
3. Since loops == 0 and the allocation target (ap->target) is
   bigger than the largest known chunk of blocks in the rgrp
   (rs->rs_rbm.rgd->rd_extfail_pt) it skips that rgrp and bypasses
   the call to gfs2_rgrp_bh_get there as well.
4. update_rgrp_lvb sees the lvb MAGIC number is valid, so bypasses
   gfs2_rgrp_bh_get, but it still sets sets GFS2_RDF_UPTODATE due
   to this invalid assumption.
5. The next time update_rgrp_lvb is called, it sees the bit is set
   and just returns 0, assuming both the lvb and rgrp are both
   uptodate. But since this is a smaller allocation, or space has
   been freed by another node, thus adjusting the lvb values,
   it decides to use the rgrp for allocations, with invalid rd_free
   due to the fact it was never updated.

This patch changes update_rgrp_lvb so it doesn't set the UPTODATE
flag anymore. That way, it has no choice but to fetch the latest
values.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
2018-08-28 12:51:08 -05:00
Bob Peterson
72244b6bc7 gfs2: improve debug information when lvb mismatches are found
Before this patch, gfs2_rgrp_bh_get would check for lvb mismatches,
but it wouldn't tell you what was actually wrong. This patch adds
more information to help us debug it. It also makes rgrp consistency
checks dump any bad rgrps, and the rgrp dump code dump any lvbs
as well as the rgrp itself.

Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
2018-08-28 12:51:08 -05:00
Theodore Ts'o
4d982e25d0 ext4: avoid divide by zero fault when deleting corrupted inline directories
A specially crafted file system can trick empty_inline_dir() into
reading past the last valid entry in a inline directory, and then run
into the end of xattr marker. This will trigger a divide by zero
fault.  Fix this by using the size of the inline directory instead of
dir->i_size.

Also clean up error reporting in __ext4_check_dir_entry so that the
message is clearer and more understandable --- and avoids the division
by zero trap if the size passed in is zero.  (I'm not sure why we
coded it that way in the first place; printing offset % size is
actually more confusing and less useful.)

https://bugzilla.kernel.org/show_bug.cgi?id=200933

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-08-27 09:22:45 -04:00
Arnd Bergmann
9afc5eee65 y2038: globally rename compat_time to old_time32
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old				new
---				---
compat_time_t			old_time32_t
struct compat_timeval		struct old_timeval32
struct compat_timespec		struct old_timespec32
struct compat_itimerspec	struct old_itimerspec32
ns_to_compat_timeval()		ns_to_old_timeval32()
get_compat_itimerspec64()	get_old_itimerspec32()
put_compat_itimerspec64()	put_old_itimerspec32()
compat_get_timespec64()		get_old_timespec32()
compat_put_timespec64()		put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-27 14:48:48 +02:00
Theodore Ts'o
b50282f324 ext4: check to make sure the rename(2)'s destination is not freed
If the destination of the rename(2) system call exists, the inode's
link count (i_nlinks) must be non-zero.  If it is, the inode can end
up on the orphan list prematurely, leading to all sorts of hilarity,
including a use-after-free.

https://bugzilla.kernel.org/show_bug.cgi?id=200931

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: stable@vger.kernel.org
2018-08-27 01:47:09 -04:00
Theodore Ts'o
072ebb3bff ext4: add nonstring annotations to ext4.h
This suppresses some false positives in gcc 8's -Wstringop-truncation

Suggested by Miguel Ojeda (hopefully the __nonstring definition will
eventually get accepted in the compiler-gcc.h header file).

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2018-08-27 01:15:11 -04:00
Linus Torvalds
aba16dc5cf Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax
Pull IDA updates from Matthew Wilcox:
 "A better IDA API:

      id = ida_alloc(ida, GFP_xxx);
      ida_free(ida, id);

  rather than the cumbersome ida_simple_get(), ida_simple_remove().

  The new IDA API is similar to ida_simple_get() but better named.  The
  internal restructuring of the IDA code removes the bitmap
  preallocation nonsense.

  I hope the net -200 lines of code is convincing"

* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
  ida: Change ida_get_new_above to return the id
  ida: Remove old API
  test_ida: check_ida_destroy and check_ida_alloc
  test_ida: Convert check_ida_conv to new API
  test_ida: Move ida_check_max
  test_ida: Move ida_check_leaf
  idr-test: Convert ida_check_nomem to new API
  ida: Start new test_ida module
  target/iscsi: Allocate session IDs from an IDA
  iscsi target: fix session creation failure handling
  drm/vmwgfx: Convert to new IDA API
  dmaengine: Convert to new IDA API
  ppc: Convert vas ID allocation to new IDA API
  media: Convert entity ID allocation to new IDA API
  ppc: Convert mmu context allocation to new IDA API
  Convert net_namespace to new IDA API
  cb710: Convert to new IDA API
  rsxx: Convert to new IDA API
  osd: Convert to new IDA API
  sd: Convert to new IDA API
  ...
2018-08-26 11:48:42 -07:00
Linus Torvalds
d207ea8e74 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Thomas Gleixner:
 "Kernel:
   - Improve kallsyms coverage
   - Add x86 entry trampolines to kcore
   - Fix ARM SPE handling
   - Correct PPC event post processing

  Tools:
   - Make the build system more robust
   - Small fixes and enhancements all over the place
   - Update kernel ABI header copies
   - Preparatory work for converting libtraceevnt to a shared library
   - License cleanups"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  tools arch: Update arch/x86/lib/memcpy_64.S copy used in 'perf bench mem memcpy'
  tools arch x86: Update tools's copy of cpufeatures.h
  perf python: Fix pyrf_evlist__read_on_cpu() interface
  perf mmap: Store real cpu number in 'struct perf_mmap'
  perf tools: Remove ext from struct kmod_path
  perf tools: Add gzip_is_compressed function
  perf tools: Add lzma_is_compressed function
  perf tools: Add is_compressed callback to compressions array
  perf tools: Move the temp file processing into decompress_kmodule
  perf tools: Use compression id in decompress_kmodule()
  perf tools: Store compression id into struct dso
  perf tools: Add compression id into 'struct kmod_path'
  perf tools: Make is_supported_compression() static
  perf tools: Make decompress_to_file() function static
  perf tools: Get rid of dso__needs_decompress() call in __open_dso()
  perf tools: Get rid of dso__needs_decompress() call in symbol__disassemble()
  perf tools: Get rid of dso__needs_decompress() call in read_object_code()
  tools lib traceevent: Change to SPDX License format
  perf llvm: Allow passing options to llc in addition to clang
  perf parser: Improve error message for PMU address filters
  ...
2018-08-26 11:25:21 -07:00
Linus Torvalds
2923b27e54 libnvdimm-for-4.19_dax-memory-failure
* memory_failure() gets confused by dev_pagemap backed mappings. The
   recovery code has specific enabling for several possible page states
   that needs new enabling to handle poison in dax mappings. Teach
   memory_failure() about ZONE_DEVICE pages.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
 OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
 75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
 5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
 BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
 3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
 F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
 T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
 MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
 oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
 /kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
 nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
 =Ftop
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm memory-failure update from Dave Jiang:
 "As it stands, memory_failure() gets thoroughly confused by dev_pagemap
  backed mappings. The recovery code has specific enabling for several
  possible page states and needs new enabling to handle poison in dax
  mappings.

  In order to support reliable reverse mapping of user space addresses:

   1/ Add new locking in the memory_failure() rmap path to prevent races
      that would typically be handled by the page lock.

   2/ Since dev_pagemap pages are hidden from the page allocator and the
      "compound page" accounting machinery, add a mechanism to determine
      the size of the mapping that encompasses a given poisoned pfn.

   3/ Given pmem errors can be repaired, change the speculatively
      accessed poison protection, mce_unmap_kpfn(), to be reversible and
      otherwise allow ongoing access from the kernel.

  A side effect of this enabling is that MADV_HWPOISON becomes usable
  for dax mappings, however the primary motivation is to allow the
  system to survive userspace consumption of hardware-poison via dax.
  Specifically the current behavior is:

     mce: Uncorrected hardware memory error in user-access at af34214200
     {1}[Hardware Error]: It has been corrected by h/w and requires no further action
     mce: [Hardware Error]: Machine check events logged
     {1}[Hardware Error]: event severity: corrected
     Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
     [..]
     Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
     mce: Memory error not recovered
     <reboot>

  ...and with these changes:

     Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
     Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
     Memory failure: 0x20cb00: recovery action for dax page: Recovered

  Given all the cross dependencies I propose taking this through
  nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
  folks"

* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm, pmem: Restore page attributes when clearing errors
  x86/memory_failure: Introduce {set, clear}_mce_nospec()
  x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
  mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  filesystem-dax: Introduce dax_lock_mapping_entry()
  mm, memory_failure: Collect mapping size in collect_procs()
  mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  mm, dev_pagemap: Do not clear ->mapping on final put
  mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
  filesystem-dax: Set page->index
  device-dax: Set page->index
  device-dax: Enable page_mapping()
  device-dax: Convert to vmf_insert_mixed and vm_fault_t
2018-08-25 18:43:59 -07:00
Linus Torvalds
828bf6e904 libnvdimm-for-4.19_misc
Collection of misc libnvdimm patches for 4.19 submission
 * Adding support to read locked nvdimm capacity.
 
 * Change test code to make DSM failure code injection an override.
 
 * Add support for calculate maximum contiguous area for namespace.
 
 * Add support for queueing a short ARS when there is on going ARS for
   nvdimm.
 
 * Allow NULL to be passed in to ->direct_access() for kaddr and
   pfn params.
 
 * Improve smart injection support for nvdimm emulation testing.
 
 * Fix test code that supports for emulating controller temperature.
 
 * Fix hang on error before devm_memremap_pages()
 
 * Fix a bug that causes user memory corruption when data returned
   to user for ars_status.
 
 * Maintainer updates for Ross Zwisler emails and adding Jan Kara to fsdax.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9uUIACgkQYGjFFmlT
 OErL+xAAgWHSGs8w98VtYA9kLDeTYEXutq93wJZQoBu/FMAXuuU3hYmQYnOQU87h
 KKYmfDkeusaih1R3IX7mzlegnnzSfQ6MraNSV76M43noJHbRTunknCPZH6ebp4fo
 b/eljvWlZF/idM+7YcsnoFMnHSRj2pjJGXmKQDlKedHD+KMxpmk6zEl2s5Y0zvPU
 4U7UQLtk3D5IIpLNsLEmxge32BfvNf5IzoSO1aZp7Eqk0+U5Tq3Sq/Tjmd+J0RKt
 6WH5yA6NqXQgBh+ayHsYU8YX62RqnbKQZXqVxD35OH64zJEUefnP1fpt9pmaZ9eL
 43BPMkpM09eLAikO2ET3/3c2k6h3h9ttz1sH8t/hiroCtfmxs3XgskY06hxpKjZV
 EbN+BUmut5Mr+zzYitRr3dbK2aHPVU9IbU7jUw/1Tz23rq3kU5iI7SHHv1b/eWup
 1Cr77Z1M6HB8VBhjnJ+R607sbRrnKQUOV7fGzAaIskyUOTWsEvIgTh/6MRiaj9MD
 5HXIgc/0y9E+G93s7MsUWwzpB7J6E7EGoybST2SKPtqwtDMPsBNeWRjyA9quBCoN
 u1s+e+lWHYutqRW0eisDTTlq3nJwPijSx1nnzhJxw9s1EkCXz3f7KRZhyH1C79Co
 7wjiuvKQ79e/HI/oXvGmTnv5lbLEpWYyJ3U3KIFfoUqugeyhr0k=
 =5p2n
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dave Jiang:
 "Collection of misc libnvdimm patches for 4.19 submission:

   - Adding support to read locked nvdimm capacity.

   - Change test code to make DSM failure code injection an override.

   - Add support for calculate maximum contiguous area for namespace.

   - Add support for queueing a short ARS when there is on going ARS for
     nvdimm.

   - Allow NULL to be passed in to ->direct_access() for kaddr and pfn
     params.

   - Improve smart injection support for nvdimm emulation testing.

   - Fix test code that supports for emulating controller temperature.

   - Fix hang on error before devm_memremap_pages()

   - Fix a bug that causes user memory corruption when data returned to
     user for ars_status.

   - Maintainer updates for Ross Zwisler emails and adding Jan Kara to
     fsdax"

* tag 'libnvdimm-for-4.19_misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: fix ars_status output length calculation
  device-dax: avoid hang on error before devm_memremap_pages()
  tools/testing/nvdimm: improve emulation of smart injection
  filesystem-dax: Do not request kaddr and pfn when not required
  md/dm-writecache: Don't request pointer dummy_addr when not required
  dax/super: Do not request a pointer kaddr when not required
  tools/testing/nvdimm: kaddr and pfn can be NULL to ->direct_access()
  s390, dcssblk: kaddr and pfn can be NULL to ->direct_access()
  libnvdimm, pmem: kaddr and pfn can be NULL to ->direct_access()
  acpi/nfit: queue issuing of ars when an uc error notification comes in
  libnvdimm: Export max available extent
  libnvdimm: Use max contiguous area for namespace size
  MAINTAINERS: Add Jan Kara for filesystem DAX
  MAINTAINERS: update Ross Zwisler's email address
  tools/testing/nvdimm: Fix support for emulating controller temperature
  tools/testing/nvdimm: Make DSM failure code injection an override
  acpi, nfit: Prefer _DSM over _LSR for namespace label reads
  libnvdimm: Introduce locked DIMM capacity support
2018-08-25 18:13:10 -07:00
Linus Torvalds
db84abf5f8 This pull request contains a single fix for UBIFS:
- Remove an empty file from UBIFS source
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAluBE84WHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7wc73EACHsV8cp3Atkg18yZYQq3t8uOZG
 lZApUWTDr2SpWW88/OJ+RAFKGflKQqSiS1y1HYd2s8is4yYMA/XBto5pWvGqg03P
 GsFGzjMqRztZtNwg7n7JUD0Sq9WLoB1BimuMAG1b8vjTH2uuCxJZVyVi7ZF5W5Oc
 065ebVblFhziI7jddRh/Gpwa2dvx3SlKZa0VImiJd6Mw9AnjO4grU+rOofswXTjX
 AKVjeRvNp0DuxJkxaqO9mH7mug87Owi4co4DBLCJu7pPEq2kvFRHQQtn6QKCdtRk
 gyYBD/ZGkhNYz3wPoQxlHM0bGVAToKtjcT1/T61lrlXKjGJu8S8IHC/raEyCB8yb
 Cz3XxMCH7LXyS3eaov4sZAiogENsoP7i20E2AKwlGil65gx4DVR814MTAi5m0AwN
 Kpv7GHRG9oMEar378Rrv8xptRJhj6nwCxAYuBwmPFlK7d95qMQ9hLjsvYp3TLuzn
 7ihUzT8m/GZG1qrlkkk24Bjm834dCVRKPBJJ4beh3YG8/mj/h6OlQGs0NPRIF6O1
 JknbQqXOz6UZ1UsvgtDK7p8GuHztKA4K+3Qv4BWwIRFisX+gTWKB8IveGPpouUaa
 K8NMOkbw8RsatxzM4DNZBPl2YNjvHuuTGudNqMn3pytsi4Y8y5YKdkpFVgZH1V25
 QmQZbhicHSlfwWR0FQ==
 =QoL7
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc1-fix' of git://git.infradead.org/linux-ubifs

Pull UBIFS fix from Richard Weinberger:
 "Remove an empty file from UBIFS source"

* tag 'upstream-4.19-rc1-fix' of git://git.infradead.org/linux-ubifs:
  ubifs: Remove empty file.h
2018-08-25 13:27:35 -07:00
Linus Torvalds
04faac10fc three small SMB3 fixes, one for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAluAdzkACgkQiiy9cAdy
 T1GjlAv/SOsNm2sj9Bcq/Z/CpPoFRFoJBFsLeReF78QdbF/+eUFuQJvq0aIK8BmV
 0NRmvlQk9oCGQfWN0pLWeRn7a7xUqMQ7HYKSS1fzW1O+kJGlyA1cFjCbiZe4py6m
 AFSlpraPTtL7isX/ZMOyZ1D7YKMj4Fq5wndcHPSnMQXI2GlaUAip5k/zamXUbMmo
 dFaGDkXc67un6Y/04v18LsKJtOHgbVIAES2OgO0sjqiwp0cnGATsZl/OGzvsTo31
 brstBum/0Ig2Mpr+5IXa4QFoP+naNXDyhv+D69huETwsMSImnjGL6L/GAgUUO5/t
 sDN6bpQdM9wqpckNuFcV9hKbBHQ1nZhzr0gUjRnmN8CJFHOgI2Ndo4J4N9slnWo9
 wEXJV7RN5/VQh3Ozb7m+DpAYo4K4r3Oexxa3MG7+IFY8t89O39PBc0dEQ6O8ztaJ
 NCcubQVpSFxC7fwVjY56IHHofyznS7JLKMAcCe3+ssOHwJt0PZr/iqdBNHC9W4ZE
 7fcNZtct
 =jMLB
 -----END PGP SIGNATURE-----

Merge tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Three small SMB3 fixes, one for stable"

* tag '4.19-rc-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module version number for cifs.ko to 2.12
  cifs: check kmalloc before use
  cifs: check if SMB2 PDU size has been padded and suppress the warning
  cifs: create a define for how many iovs we need for an SMB2_open()
2018-08-25 13:17:53 -07:00
Colin Ian King
e0fcfe1f1a hpfs: remove unnecessary checks on the value of r when assigning error code
At the point where r is being checked for different values, r is always
going to be equal to 2 as the previous if statements jump to end or end1
if r is not 2.  Hence the assignment to err can be simplified to just
err an assignment without any checks on the value or r.

Detected by CoverityScan, CID#1226737 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-25 12:42:33 -07:00
Linus Torvalds
4def196360 Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull namespace fixes from Eric Biederman:
 "This is a set of four fairly obvious bug fixes:

   - a switch from d_find_alias to d_find_any_alias because the xattr
     code perversely takes a dentry

   - two mutex vs copy_to_user fixes from Jann Horn

   - a fix to use a sanitized size not the size userspace passed in from
     Christian Brauner"

* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  getxattr: use correct xattr length
  sys: don't hold uts_sem while accessing userspace memory
  userns: move user access out of the mutex
  cap_inode_getsecurity: use d_find_any_alias() instead of d_find_alias()
2018-08-24 09:25:39 -07:00
Misono Tomohiro
b6fdfbff07 btrfs: Fix suspicious RCU usage warning in btrfs_debug_in_rcu
Commit 672d599041 ("btrfs: Use wrapper macro for rcu string to remove
duplicate code") replaces some open coded RCU string handling with macro.

It turns out that btrfs_debug_in_rcu() is used for the first time and
the macro lacks lock/unlock of RCU string for non-debug case (i.e. when
the message is not printed), leading to suspicious RCU usage warning
when CONFIG_PROVE_RCU is on.

Fix this by adding a wrapper to call lock/unlock for the non-debug case
too.

Fixes: 672d599041 ("btrfs: Use wrapper macro for rcu string to remove duplicate code")
Reported-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Misono Tomohiro <misono.tomohiro@jp.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-24 14:09:43 +02:00
Richard Weinberger
6e5461d774 ubifs: Remove empty file.h
This empty file sneaked into the tree by mistake.
Remove it.

Fixes: 6eb61d587f ("ubifs: Pass struct ubifs_info to ubifs_assert()")
Signed-off-by: Richard Weinberger <richard@nod.at>
2018-08-24 13:50:07 +02:00
Jan Kara
ee4af50ca9 udf: Fix mounting of Win7 created UDF filesystems
Win7 is creating UDF filesystems with single partition with number 8192.
Current partition descriptor scanning code does not handle this well as
it incorrectly assumes that partition numbers will form mostly contiguous
space of small numbers. This results in unmountable media due to errors
like:

UDF-fs: error (device dm-1): udf_read_tagged: tag version 0x0000 != 0x0002 || 0x0003, block 0
UDF-fs: warning (device dm-1): udf_fill_super: No fileset found

Fix the problem by handling partition descriptors in a way that sparse
partition numbering does not matter.

Reported-and-tested-by: jean-luc malet <jeanluc.malet@gmail.com>
CC: stable@vger.kernel.org
Fixes: 7b78fd02fb
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-24 11:13:32 +02:00
Jan Kara
82c82ab658 udf: Remove dead code from udf_find_fileset()
Remove dead code and slightly simplify code in udf_find_fileset().

Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-24 11:13:32 +02:00
Linus Torvalds
33e17876ea Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of MM

 - various misc fixes and tweaks

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  mm: Change return type int to vm_fault_t for fault handlers
  lib/fonts: convert comments to utf-8
  s390: ebcdic: convert comments to UTF-8
  treewide: convert ISO_8859-1 text comments to utf-8
  drivers/gpu/drm/gma500/: change return type to vm_fault_t
  docs/core-api: mm-api: add section about GFP flags
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: split memory management API to a separate file
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: kill trailing whitespace in kernel-api.rst
  mm/util: add kernel-doc for kvfree
  mm/util: make strndup_user description a kernel-doc comment
  fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
  treewide: correct "differenciate" and "instanciate" typos
  fs/afs: use new return type vm_fault_t
  drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
  mm: soft-offline: close the race against page allocation
  mm: fix race on soft-offlining free huge pages
  namei: allow restricted O_CREAT of FIFOs and regular files
  hfs: prevent crash on exit from failed search
  ...
2018-08-23 19:20:12 -07:00
Souptick Joarder
2b74030354 mm: Change return type int to vm_fault_t for fault handlers
Use new return type vm_fault_t for fault handler.  For now, this is just
documenting that the function returns a VM_FAULT value rather than an
errno.  Once all instances are converted, vm_fault_t will become a
distinct type.

Ref-> commit 1c8f422059 ("mm: change return type to vm_fault_t")

The aim is to change the return type of finish_fault() and
handle_mm_fault() to vm_fault_t type.  As part of that clean up return
type of all other recursively called functions have been changed to
vm_fault_t type.

The places from where handle_mm_fault() is getting invoked will be
change to vm_fault_t type but in a separate patch.

vmf_error() is the newly introduce inline function in 4.17-rc6.

[akpm@linux-foundation.org: don't shadow outer local `ret' in __do_huge_pmd_anonymous_page()]
Link: http://lkml.kernel.org/r/20180604171727.GA20279@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:44 -07:00
Arnd Bergmann
a2036a1ef2 fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
Without CONFIG_MMU, we get a build warning:

  fs/proc/vmcore.c:228:12: error: 'vmcoredd_mmap_dumps' defined but not used [-Werror=unused-function]
   static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,

The function is only referenced from an #ifdef'ed caller, so
this uses the same #ifdef around it.

Link: http://lkml.kernel.org/r/20180525213526.2117790-1-arnd@arndb.de
Fixes: 7efe48df8a ("vmcore: append device dumps to vmcore as elf notes")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Souptick Joarder
0722f18620 fs/afs: use new return type vm_fault_t
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the
function returns a VM_FAULT value rather than an errno.  Once all
instances are converted, vm_fault_t will become a distinct type.

See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.

Link: http://lkml.kernel.org/r/20180702152017.GA3780@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Salvatore Mesoraca
30aba6656f namei: allow restricted O_CREAT of FIFOs and regular files
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag.  The purpose
is to make data spoofing attacks harder.  This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection.  This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.

This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:

CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489

This list is not meant to be complete.  It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector.  In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.

[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
  Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
  Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Ernesto A. Fernández
dc2572791d hfs: prevent crash on exit from failed search
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer.  Fix
this to prevent a crash.

Link: http://lkml.kernel.org/r/53d9749a029c41b4016c495fc5838c9dba3afc52.1530294815.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Cc: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Ernesto A. Fernandez
aba93a92f4 hfsplus: prevent crash on exit from failed search
hfs_find_exit() expects fd->bnode to be NULL after a search has failed.
hfs_brec_insert() may instead set it to an error-valued pointer.  Fix
this to prevent a crash.

Link: http://lkml.kernel.org/r/803590a35221fbf411b2c141419aea3233a6e990.1530294813.git.ernesto.mnd.fernandez@gmail.com
Signed-off-by: Ernesto A. Fernandez <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Ernesto A. Fernández
a7ec7a4193 hfsplus: fix NULL dereference in hfsplus_lookup()
An HFS+ filesystem can be mounted read-only without having a metadata
directory, which is needed to support hardlinks.  But if the catalog
data is corrupted, a directory lookup may still find dentries claiming
to be hardlinks.

hfsplus_lookup() does check that ->hidden_dir is not NULL in such a
situation, but mistakenly does so after dereferencing it for the first
time.  Reorder this check to prevent a crash.

This happens when looking up corrupted catalog data (dentry) on a
filesystem with no metadata directory (this could only ever happen on a
read-only mount).  Wen Xu sent the replication steps in detail to the
fsdevel list: https://bugzilla.kernel.org/show_bug.cgi?id=200297

Link: http://lkml.kernel.org/r/20180712215344.q44dyrhymm4ajkao@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Wen Xu <wen.xu@gatech.edu>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:42 -07:00
Linus Torvalds
53a01c9a5f NFS client updates for Linux 4.19
Stable bufixes:
 - v3.17+: Fix an off-by-one in bl_map_stripe()
 - v4.9+: NFSv4 client live hangs after live data migration recovery
 - v4.18+: xprtrdma: Fix disconnect regression
 - v4.14+: Fix locking in pnfs_generic_recover_commit_reqs
 - v4.9+: Fix a sleep in atomic context in nfs4_callback_sequence()
 
 Features:
 - Add support for asynchronous server-side COPY operations
 
 Other bugfixes and cleanups:
 - Optitmizations and fixes involving NFS v4.1 / pNFS layout handling
 - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
 - Immediately reschedule writeback when the server replies with an error
 - Fix excessive attribute revalidation in nfs_execute_ok()
 - Add error checking to nfs_idmap_prepare_message()
 - Use new vm_fault_t return type
 - Return a delegation when reclaiming one that the server has recalled
 - Referrals should inherit proto setting from parents
 - Make rpc_auth_create_args a const
 - Improvements to rpc_iostats tracking
 - Fix a potential reference leak when there is an error processing a callback
 - Fix rmdir / mkdir / rename nlink accounting
 - Fix updating inode change attribute
 - Fix error handling in nfsn4_sp4_select_mode()
 - Use an appropriate work queue for direct-write completion
 - Don't busy wait if NFSv4 session draining is interrupted
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlt/CYIACgkQ18tUv7Cl
 QOu8gBAA0xQWmgRoG6oIdYUxvgYqhuJmMqC4SU1E6mCJ93xEuUSvEFw51X+84KCt
 r6UPkp/bKiVe3EIinKTplIzuxgggXNG0EQmO46FYNTl7nqpN85ffLsQoWsiD23fp
 j8afqKPFR2zfhHXLKQC7k1oiOpwGqJ+EJWgIW4llE80pSNaErEoEaDqSPds5thMN
 dHEjjLr8ef6cbBux6sSPjwWGNbE82uoSu3MDuV2+e62hpGkgvuEYo1vyE6ujeZW5
 MUsmw+AHZkwro0msTtNBOHcPZAS0q/2UMPzl1tsDeCWNl2mugqZ6szQLSS2AThKq
 Zr6iK9Q5dWjJfrQHcjRMnYJB+SCX1SfPA7ASuU34opwcWPjecbS9Q92BNTByQYwN
 o9ngs2K0mZfqpYESMAmf7Il134cCBrtEp3skGko2KopJcYcE5YUFhdKihi1yQQjU
 UbOOubMpQk8vY9DpDCAwGbICKwUZwGvq27uuUWL20kFVDb1+jvfHwcV4KjRAJo/E
 J9aFtU+qOh4rMPMnYlEVZcAZBGfenlv/DmBl1upRpjzBkteUpUJsAbCmGyAk4616
 3RECasehgsjNCQpFIhv3FpUkWzP5jt0T3gRr1NeY6WKJZwYnHEJr9PtapS+EIsCT
 tB5DvvaJqFtuHFOxzn+KlGaxdSodHF7klOq7NM3AC0cX8AkWqaU=
 =8+9t
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "These patches include adding async support for the v4.2 COPY
  operation. I think Bruce is planning to send the server patches for
  the next release, but I figured we could get the client side out of
  the way now since it's been in my tree for a while. This shouldn't
  cause any problems, since the server will still respond with
  synchronous copies even if the client requests async.

  Features:
   - Add support for asynchronous server-side COPY operations

  Stable bufixes:
   - Fix an off-by-one in bl_map_stripe() (v3.17+)
   - NFSv4 client live hangs after live data migration recovery (v4.9+)
   - xprtrdma: Fix disconnect regression (v4.18+)
   - Fix locking in pnfs_generic_recover_commit_reqs (v4.14+)
   - Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+)

  Other bugfixes and cleanups:
   - Optimizations and fixes involving NFS v4.1 / pNFS layout handling
   - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
   - Immediately reschedule writeback when the server replies with an
     error
   - Fix excessive attribute revalidation in nfs_execute_ok()
   - Add error checking to nfs_idmap_prepare_message()
   - Use new vm_fault_t return type
   - Return a delegation when reclaiming one that the server has
     recalled
   - Referrals should inherit proto setting from parents
   - Make rpc_auth_create_args a const
   - Improvements to rpc_iostats tracking
   - Fix a potential reference leak when there is an error processing a
     callback
   - Fix rmdir / mkdir / rename nlink accounting
   - Fix updating inode change attribute
   - Fix error handling in nfsn4_sp4_select_mode()
   - Use an appropriate work queue for direct-write completion
   - Don't busy wait if NFSv4 session draining is interrupted"

* tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits)
  pNFS: Remove unwanted optimisation of layoutget
  pNFS/flexfiles: ff_layout_pg_init_read should exit on error
  pNFS: Treat RECALLCONFLICT like DELAY...
  pNFS: When updating the stateid in layoutreturn, also update the recall range
  NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
  NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
  NFSv4: Fix a typo in nfs4_init_channel_attrs()
  NFSv4: Don't busy wait if NFSv4 session draining is interrupted
  NFS recover from destination server reboot for copies
  NFS add a simple sync nfs4_proc_commit after async COPY
  NFS handle COPY ERR_OFFLOAD_NO_REQS
  NFS send OFFLOAD_CANCEL when COPY killed
  NFS export nfs4_async_handle_error
  NFS handle COPY reply CB_OFFLOAD call race
  NFS add support for asynchronous COPY
  NFS COPY xdr handle async reply
  NFS OFFLOAD_CANCEL xdr
  NFS CB_OFFLOAD xdr
  NFS: Use an appropriate work queue for direct-write completion
  NFSv4: Fix error handling in nfs4_sp4_select_mode()
  ...
2018-08-23 16:03:58 -07:00
Linus Torvalds
9157141c95 A mistake on my part caused me to tag my branch 6 commits too early,
missing Chuck's fixes for the problem with callbacks over GSS from
 multi-homed servers, and a smaller fix from Laura Abbott.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbftA8AAoJECebzXlCjuG+QPMQALieEKkX0YoqRhPz5G+RrWFy
 KgOBFAoiRcjFQD6wMt9FzD6qYEZqSJ+I2b+K5N3BkdyDDQu845iD0wK0zBGhMgLm
 7ith85nphIMbe18+5jPorqAsI9RlfBQjiSGw1MEx5dicLQQzTObHL5q+l5jcWna4
 jWS3yUKv1URpOsR1hIryw74ktSnhuH8n//zmntw8aWrCkq3hnXOZK/agtYxZ7Viv
 V3kiQsiNpL2FPRcHN7ejhLUTnRkkuD2iYKrzP/SpTT/JfdNEUXlMhKkAySogNpus
 nvR9X7hwta8Lgrt7PSB9ibFTXtCupmuICg5mbDWy6nXea2NvpB01QhnTzrlX17Eh
 Yfk/18z95b6Qs1v4m3SI8ESmyc6l5dMZozLudtHzifyCqooWZriEhCR1PlQfQ/FJ
 4cYQ8U/qiMiZIJXL7N2wpSoSaWR5bqU1rXen29Np1WEDkiv4Nf5u2fsCXzv0ZH2C
 ReWpNkbnNxsNiKpp4geBZtlcSEU1pk+1PqE0MagTdBV3iptiUHRSP4jR7qLnc0zT
 J1lCvU7Fodnt9vNSxMpt2Jd6XxQ6xtx7n6aMQAiYFnXDs+hP2hPnJVCScnYW3L6R
 2r1sHRKKeoOzCJ2thw+zu4lOwMm7WPkJPWAYfv90reWkiKoy2vG0S9P7wsNGoJuW
 fuEjB2b9pow1Ffynat6q
 =JnLK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from
  multi-homed servers.

  The only new feature is a minor bit of protocol (change_attr_type)
  which the client doesn't even use yet.

  Other than that, various bugfixes and cleanup"

* tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits)
  sunrpc: Add comment defining gssd upcall API keywords
  nfsd: Remove callback_cred
  nfsd: Use correct credential for NFSv4.0 callback with GSS
  sunrpc: Extract target name into svc_cred
  sunrpc: Enable the kernel to specify the hostname part of service principals
  sunrpc: Don't use stack buffer with scatterlist
  rpc: remove unneeded variable 'ret' in rdma_listen_handler
  nfsd: use true and false for boolean values
  nfsd: constify write_op[]
  fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
  NFSD: Handle full-length symlinks
  NFSD: Refactor the generic write vector fill helper
  svcrdma: Clean up Read chunk path
  svcrdma: Avoid releasing a page in svc_xprt_release()
  nfsd: Mark expected switch fall-through
  sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
  nfsd: fix leaked file lock with nfs exported overlayfs
  nfsd: don't advertise a SCSI layout for an unsupported request_queue
  nfsd: fix corrupted reply to badly ordered compound
  nfsd: clarify check_op_ordering
  ...
2018-08-23 16:00:10 -07:00
Linus Torvalds
6f7948f566 This pull request contains updates for both UBI and UBIFS:
- Year 2038 preparations
 - New UBI feature to skip CRC checks of static volumes
 - A new Kconfig option to disable xattrs in UBIFS
 - Lots of fixes in UBIFS, found by our new test framework
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCAA0FiEEdgfidid8lnn52cLTZvlZhesYu8EFAlt9zFkWHHJpY2hhcmRA
 c2lnbWEtc3Rhci5hdAAKCRBm+VmF6xi7waiuD/oDYzerOLe0R7n2sRT9zjtY8kCx
 LuizRvDYUlmMynI6EVahfyJy2IixcDmXOklGdxJqUkN5igDC/FORWdQjv2X9y56d
 qZ2dlS8aBvI0ZKBG2ew4VP1H67CXtCw8H9fE32SGotPmxKRUQqt2vKqo+vgQfapH
 eSVPrOaoqoRh+/ieumYXsvFdEUWpa66G3tVMFe4znu+kYRBbGzSszxpuq1ukIls2
 P9wewqbWAZpqn+n9A9+RBIv81g+jH87/acfjK2L7/lT9wsFO7BQGKi373dPbnTa5
 9WsjGEd+Gt0kb4Kjh5QegY97bPqWjmaMj1BLqeQVpSbQqpzkiFMf9GW5+h3XqAfO
 hM1zzgONZMxHdZSKH0bWzIRQbvU6v0d9C4J/elfFuH9ke2XscrxjOtZZQbtbGeYj
 tE7FWoZnB8euXubulGAUBKofzWe+gItBe9+iA29EBETNOemrJyHyKjO0Fe9ze5p2
 bfVFvN62kHz4ZCJoinwO/OpXnCuA91xrVocLOOIreb4dkZ/kqP+YZWFf70FcE1o5
 sPAbAUu+hfb2LbpktEdZHHbhoupfCnJokzfboJMX0NWKRtFXJDONjogJYTFUjrpW
 eXS+55+WFHoLWtx9J2IVmcb3cQrj/W/4J83kSg99cUkVjGpil50zmtzhq9bHzsLc
 wazngueP7kW2l9bSSg==
 =gCyp
 -----END PGP SIGNATURE-----

Merge tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI/UBIFS updates from Richard Weinberger:

 - Year 2038 preparations

 - New UBI feature to skip CRC checks of static volumes

 - A new Kconfig option to disable xattrs in UBIFS

 - Lots of fixes in UBIFS, found by our new test framework

* tag 'upstream-4.19-rc1' of git://git.infradead.org/linux-ubifs: (21 commits)
  ubifs: Set default assert action to read-only
  ubifs: Allow setting assert action as mount parameter
  ubifs: Rework ubifs_assert()
  ubifs: Pass struct ubifs_info to ubifs_assert()
  ubifs: Turn two ubifs_assert() into a WARN_ON()
  ubi: expose the volume CRC check skip flag
  ubi: provide a way to skip CRC checks
  ubifs: Use kmalloc_array()
  ubifs: Check data node size before truncate
  Revert "UBIFS: Fix potential integer overflow in allocation"
  ubifs: Add comment on c->commit_sem
  ubifs: introduce Kconfig symbol for xattr support
  ubifs: use swap macro in swap_dirty_idx
  ubifs: tnc: use monotonic znode timestamp
  ubifs: use timespec64 for inode timestamps
  ubifs: xattr: Don't operate on deleted inodes
  ubifs: gc: Fix typo
  ubifs: Fix memory leak in lprobs self-check
  ubi: Initialize Fastmap checkmapping correctly
  ubifs: Fix synced_i_size calculation for xattr inodes
  ...
2018-08-23 15:58:04 -07:00
Steve French
7753e38286 cifs: update internal module version number for cifs.ko to 2.12
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:11:10 -05:00
Nicholas Mc Guire
126c97f4d0 cifs: check kmalloc before use
The kmalloc was not being checked - if it fails issue a warning
and return -ENOMEM to the caller.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: b8da344b74 ("cifs: dynamic allocation of ntlmssp blob")
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
cc: Stable <stable@vger.kernel.org>`
2018-08-23 15:10:49 -05:00
Ronnie Sahlberg
e6c47dd0da cifs: check if SMB2 PDU size has been padded and suppress the warning
Some SMB2/3 servers, Win2016 but possibly others too, adds padding
not only between PDUs in a compound but also to the final PDU.
This padding extends the PDU to a multiple of 8 bytes.

Check if the unexpected length looks like this might be the case
and avoid triggering the log messages for :

  "SMB2 server sent bad RFC1001 len %d not %d\n"

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:10:46 -05:00
Ronnie Sahlberg
4d8dfafc5c cifs: create a define for how many iovs we need for an SMB2_open()
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2018-08-23 15:10:40 -05:00
Christian Brauner
82c9a927bc getxattr: use correct xattr length
When running in a container with a user namespace, if you call getxattr
with name = "system.posix_acl_access" and size % 8 != 4, then getxattr
silently skips the user namespace fixup that it normally does resulting in
un-fixed-up data being returned.
This is caused by posix_acl_fix_xattr_to_user() being passed the total
buffer size and not the actual size of the xattr as returned by
vfs_getxattr().
This commit passes the actual length of the xattr as returned by
vfs_getxattr() down.

A reproducer for the issue is:

  touch acl_posix

  setfacl -m user:0:rwx acl_posix

and the compile:

  #define _GNU_SOURCE
  #include <errno.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/types.h>
  #include <unistd.h>
  #include <attr/xattr.h>

  /* Run in user namespace with nsuid 0 mapped to uid != 0 on the host. */
  int main(int argc, void **argv)
  {
          ssize_t ret1, ret2;
          char buf1[128], buf2[132];
          int fret = EXIT_SUCCESS;
          char *file;

          if (argc < 2) {
                  fprintf(stderr,
                          "Please specify a file with "
                          "\"system.posix_acl_access\" permissions set\n");
                  _exit(EXIT_FAILURE);
          }
          file = argv[1];

          ret1 = getxattr(file, "system.posix_acl_access",
                          buf1, sizeof(buf1));
          if (ret1 < 0) {
                  fprintf(stderr, "%s - Failed to retrieve "
                                  "\"system.posix_acl_access\" "
                                  "from \"%s\"\n", strerror(errno), file);
                  _exit(EXIT_FAILURE);
          }

          ret2 = getxattr(file, "system.posix_acl_access",
                          buf2, sizeof(buf2));
          if (ret2 < 0) {
                  fprintf(stderr, "%s - Failed to retrieve "
                                  "\"system.posix_acl_access\" "
                                  "from \"%s\"\n", strerror(errno), file);
                  _exit(EXIT_FAILURE);
          }

          if (ret1 != ret2) {
                  fprintf(stderr, "The value of \"system.posix_acl_"
                                  "access\" for file \"%s\" changed "
                                  "between two successive calls\n", file);
                  _exit(EXIT_FAILURE);
          }

          for (ssize_t i = 0; i < ret2; i++) {
                  if (buf1[i] == buf2[i])
                          continue;

                  fprintf(stderr,
                          "Unexpected different in byte %zd: "
                          "%02x != %02x\n", i, buf1[i], buf2[i]);
                  fret = EXIT_FAILURE;
          }

          if (fret == EXIT_SUCCESS)
                  fprintf(stderr, "Test passed\n");
          else
                  fprintf(stderr, "Test failed\n");

          _exit(fret);
  }
and run:

  ./tester acl_posix

On a non-fixed up kernel this should return something like:

  root@c1:/# ./t
  Unexpected different in byte 16: ffffffa0 != 00
  Unexpected different in byte 17: ffffff86 != 00
  Unexpected different in byte 18: 01 != 00

and on a fixed kernel:

  root@c1:~# ./t
  Test passed

Cc: stable@vger.kernel.org
Fixes: 2f6f0654ab ("userns: Convert vfs posix_acl support to use kuids and kgids")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=199945
Reported-by: Colin Watson <cjwatson@ubuntu.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2018-08-23 20:42:57 +02:00
Dan Carpenter
b9b8a41ade btrfs: use after free in btrfs_quota_enable
The issue here is that btrfs_commit_transaction() frees "trans" on both
the error and the success path.  So the problem would be if
btrfs_commit_transaction() succeeds, and then qgroup_rescan_init()
fails.  That means that "ret" is non-zero and "trans" is non-NULL and it
leads to a use after free inside the btrfs_end_transaction() macro.

Fixes: 340f1aa27f ("btrfs: qgroups: Move transaction management inside btrfs_quota_enable/disable")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:27 +02:00
Anand Jain
801660b040 btrfs: btrfs_shrink_device should call commit transaction at the end
Test case btrfs/164 reports use-after-free:

[ 6712.084324] general protection fault: 0000 [#1] PREEMPT SMP
..
[ 6712.195423]  btrfs_update_commit_device_size+0x75/0xf0 [btrfs]
[ 6712.201424]  btrfs_commit_transaction+0x57d/0xa90 [btrfs]
[ 6712.206999]  btrfs_rm_device+0x627/0x850 [btrfs]
[ 6712.211800]  btrfs_ioctl+0x2b03/0x3120 [btrfs]

Reason for this is that btrfs_shrink_device adds the resized device to
the fs_devices::resized_devices after it has called the last commit
transaction.

So the list fs_devices::resized_devices is not empty when
btrfs_shrink_device returns.  Now the parent function
btrfs_rm_device calls:

        btrfs_close_bdev(device);
        call_rcu(&device->rcu, free_device_rcu);

and then does the transactio ncommit. It goes through the
fs_devices::resized_devices in btrfs_update_commit_device_size and
leads to use-after-free.

Fix this by making sure btrfs_shrink_device calls the last needed
btrfs_commit_transaction before the return. This is consistent with what
the grow counterpart does and this makes sure the on-disk state is
persistent when the function returns.

Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Tested-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:27 +02:00
Lu Fengqi
a5b7f4295e btrfs: fix qgroup_free wrong num_bytes in btrfs_subvolume_reserve_metadata
After btrfs_qgroup_reserve_meta_prealloc(), num_bytes will be assigned
again by btrfs_calc_trans_metadata_size(). Once block_rsv fails, we
can't properly free the num_bytes of the previous qgroup_reserve. Use a
separate variable to store the num_bytes of the qgroup_reserve.

Delete the comment for the qgroup_reserved that does not exist and add a
comment about use_global_rsv.

Fixes: c4c129db5d ("btrfs: drop unused parameter qgroup_reserved")
CC: stable@vger.kernel.org # 4.18+
Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Filipe Manana
de02b9f6bb Btrfs: fix data corruption when deduplicating between different files
If we deduplicate extents between two different files we can end up
corrupting data if the source range ends at the size of the source file,
the source file's size is not aligned to the filesystem's block size
and the destination range does not go past the size of the destination
file size.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ xfs_io -f -c "pwrite -S 0x6b 0 2518890" /mnt/foo
  # The first byte with a value of 0xae starts at an offset (2518890)
  # which is not a multiple of the sector size.
  $ xfs_io -c "pwrite -S 0xae 2518890 102398" /mnt/foo

  # Confirm the file content is full of bytes with values 0x6b and 0xae.
  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
  11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # Create a second file with a length not aligned to the sector size,
  # whose bytes all have the value 0x6b, so that its extent(s) can be
  # deduplicated with the first file.
  $ xfs_io -f -c "pwrite -S 0x6b 0 557771" /mnt/bar

  # Now deduplicate the entire second file into a range of the first file
  # that also has all bytes with the value 0x6b. The destination range's
  # end offset must not be aligned to the sector size and must be less
  # then the offset of the first byte with the value 0xae (byte at offset
  # 2518890).
  $ xfs_io -c "dedupe /mnt/bar 0 1957888 557771" /mnt/foo

  # The bytes in the range starting at offset 2515659 (end of the
  # deduplication range) and ending at offset 2519040 (start offset
  # rounded up to the block size) must all have the value 0xae (and not
  # replaced with 0x00 values). In other words, we should have exactly
  # the same data we had before we asked for deduplication.
  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11467540 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b ae ae ae ae ae ae
  11467560 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # Unmount the filesystem and mount it again. This guarantees any file
  # data in the page cache is dropped.
  $ umount /dev/sdb
  $ mount /dev/sdb /mnt

  $ od -t x1 /mnt/foo
  0000000 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
  *
  11461300 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 00 00 00 00 00
  11461320 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  11470000 ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae ae
  *
  11777540 ae ae ae ae ae ae ae ae
  11777550

  # The bytes in range 2515659 to 2519040 have a value of 0x00 and not a
  # value of 0xae, data corruption happened due to the deduplication
  # operation.

So fix this by rounding down, to the sector size, the length used for the
deduplication when the following conditions are met:

  1) Source file's range ends at its i_size;
  2) Source file's i_size is not aligned to the sector size;
  3) Destination range does not cross the i_size of the destination file.

Fixes: e1d227a42e ("btrfs: Handle unaligned length in extent_same")
CC: stable@vger.kernel.org # 4.2+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Filipe Manana
d4682ba03e Btrfs: sync log after logging new name
When we add a new name for an inode which was logged in the current
transaction, we update the inode in the log so that its new name and
ancestors are added to the log. However when we do this we do not persist
the log, so the changes remain in memory only, and as a consequence, any
ancestors that were created in the current transaction are updated such
that future calls to btrfs_inode_in_log() return true. This leads to a
subsequent fsync against such new ancestor directories returning
immediately, without persisting the log, therefore after a power failure
the new ancestor directories do not exist, despite fsync being called
against them explicitly.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/A
  $ mkdir /mnt/B
  $ mkdir /mnt/A/C
  $ touch /mnt/B/foo
  $ xfs_io -c "fsync" /mnt/B/foo
  $ ln /mnt/B/foo /mnt/A/C/foo
  $ xfs_io -c "fsync" /mnt/A
  <power failure>

After the power failure, directory "A" does not exist, despite the explicit
fsync on it.

Instead of fixing this by changing the behaviour of the explicit fsync on
directory "A" to persist the log instead of doing nothing, make the logging
of the new file name (which happens when creating a hard link or renaming)
persist the log. This approach not only is simpler, not requiring addition
of new fields to the inode in memory structure, but also gives us the same
behaviour as ext4, xfs and f2fs (possibly other filesystems too).

A test case for fstests follows soon.

Fixes: 12fcfd22fe ("Btrfs: tree logging unlink/rename fixes")
Reported-by: Vijay Chidambaram <vvijay03@gmail.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2018-08-23 17:37:26 +02:00
Chuck Lever
a26dd64f54 nfsd: Remove callback_cred
Clean up: The global callback_cred is no longer used, so it can be
removed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Chuck Lever
cb25e7b293 nfsd: Use correct credential for NFSv4.0 callback with GSS
I've had trouble when operating a multi-homed Linux NFS server with
Kerberos using NFSv4.0. Lately, I've seen my clients reporting
this (and then hanging):

May  9 11:43:26 manet kernel: NFS: NFSv4 callback contains invalid cred

The client-side commit f11b2a1cfb ("nfs4: copy acceptor name from
context to nfs_client") appears to be related, but I suspect this
problem has been going on for some time before that.

RFC 7530 Section 3.3.3 says:
> For Kerberos V5, nfs/hostname would be a server principal in the
> Kerberos Key Distribution Center database.  This is the same
> principal the client acquired a GSS-API context for when it issued
> the SETCLIENTID operation ...

In other words, an NFSv4.0 client expects that the server will use
the same GSS principal for callback that the client used to
establish its lease. For example, if the client used the service
principal "nfs@server.domain" to establish its lease, the server
is required to use "nfs@server.domain" when performing NFSv4.0
callback operations.

The Linux NFS server currently does not. It uses a common service
principal for all callback connections. Sometimes this works as
expected, and other times -- for example, when the server is
accessible via multiple hostnames -- it won't work at all.

This patch scrapes the target name from the client credential,
and uses that for the NFSv4.0 callback credential. That should
be correct much more often.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Chuck Lever
9abdda5dda sunrpc: Extract target name into svc_cred
NFSv4.0 callback needs to know the GSS target name the client used
when it established its lease. That information is available from
the GSS context created by gssproxy. Make it available in each
svc_cred.

Note this will also give us access to the real target service
principal name (which is typically "nfs", but spec does not require
that).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Linus Torvalds
fe6f0ed0da f2fs-for-4.19-rc1
In this round, we've tuned f2fs to improve general performance by serializing
 block allocation and enhancing discard flows like fstrim which avoids user IO
 contention. And we've added fsync_mode=nobarrier which gives an option to user
 where it skips issuing cache_flush commands to underlying flash storage. And
 there are many bug fixes related to fuzzed images, revoked atomic writes, quota
 ops, and minor direct IO.
 
 Enhancement:
  - add fsync_mode=nobarrier which bypasses cache_flush command
  - enhance the discarding flow which avoids user IOs and issues in LBA order
  - readahead some encrypted blocks during GC
  - enable in-memory inode checksum to verify the blocks if F2FS_CHECK_FS is set
  - enhance nat_bits behavior
  - set -o discard by default
  - set REQ_RAHEAD to bio in ->readpages
 
 Bug fixes:
  - fix a corner case to corrupt atomic_writes revoking flow
  - revisit i_gc_rwsem to fix race conditions
  - fix some dio behaviors captured by xfstests
  - correct handling errors given by quota-related failures
  - add many sanity check flows to avoid fuzz test failures
  - add more error number propagation to their callers
  - fix several corner cases to continue fault injection w/ shutdown loop
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlt82U4ACgkQQBSofoJI
 UNJTLQ/+PhewnNa5tDfUgWdFnUFz3h9/NcC677l0OplOOUNxA8iSa1xamlKf/nf9
 sB5ey0I7oBF8zQGxfndhHQfi6fpfUcNMr14hm+TS/3+d54xLJmiVShD5fjNSV2vB
 Ur0xoozuQDwYF1e3QKdBQjFqaCf78VheTr3aWxyv22/Sg+PYylZJ2K8rHTB7mGPU
 UG0aRnKrP3FPRjL7Q3m0Vm6b6eZ5uNdNrFfjgn/8yuQQ9V197K8vwSbPAsR5/pOh
 miCQXyM708NgEYJRWkWmC/rDSQdU0/h/mGnJWrBrbceW62QefGOgd2jcVfmthHJa
 ZXpj+BEG5bYpCCxGxF6N+u0e28OKonCwO/uvL8YAd5icN7yXtsKzoF1CCuXxOYf1
 9K5SMylCTSyrs/+LV8CJoT2ya8w0l0w+R/txUYn8UT+4AgqU+chS2kJeXqw9tcHB
 WLFs/rnAyofWCI/8frVBmJY+zA1ZZvTqs/lmVYrtJUkiOcMTq34WICBUAEFKV452
 BM5dcu21bSIkapYispEt4Rr7o4P4HHMQ+N1i2yUZMFCz5T0RyzdybeS5THk2yVzd
 L0kxfYU+zHigNX51ez8+Z7DyDLDBp6jkD0e66x73bUK9TGPH+ZbnAL6gwLikmD3M
 +VxYl5nyW/3bxx1HdfK1Xwd4/wYMNBmdtn5NC50oZ+jpB0h8YCE=
 =zgbL
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs updates from Jaegeuk Kim:
 "In this round, we've tuned f2fs to improve general performance by
  serializing block allocation and enhancing discard flows like fstrim
  which avoids user IO contention. And we've added fsync_mode=nobarrier
  which gives an option to user where it skips issuing cache_flush
  commands to underlying flash storage. And there are many bug fixes
  related to fuzzed images, revoked atomic writes, quota ops, and minor
  direct IO.

  Enhancements:
   - add fsync_mode=nobarrier which bypasses cache_flush command
   - enhance the discarding flow which avoids user IOs and issues in
     LBA order
   - readahead some encrypted blocks during GC
   - enable in-memory inode checksum to verify the blocks if
     F2FS_CHECK_FS is set
   - enhance nat_bits behavior
   - set -o discard by default
   - set REQ_RAHEAD to bio in ->readpages

  Bug fixes:
   - fix a corner case to corrupt atomic_writes revoking flow
   - revisit i_gc_rwsem to fix race conditions
   - fix some dio behaviors captured by xfstests
   - correct handling errors given by quota-related failures
   - add many sanity check flows to avoid fuzz test failures
   - add more error number propagation to their callers
   - fix several corner cases to continue fault injection w/ shutdown
     loop"

* tag 'f2fs-for-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (89 commits)
  f2fs: readahead encrypted block during GC
  f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
  f2fs: fix performance issue observed with multi-thread sequential read
  f2fs: fix to skip verifying block address for non-regular inode
  f2fs: rework fault injection handling to avoid a warning
  f2fs: support fault_type mount option
  f2fs: fix to return success when trimming meta area
  f2fs: fix use-after-free of dicard command entry
  f2fs: support discard submission error injection
  f2fs: split discard command in prior to block layer
  f2fs: wake up gc thread immediately when gc_urgent is set
  f2fs: fix incorrect range->len in f2fs_trim_fs()
  f2fs: refresh recent accessed nat entry in lru list
  f2fs: fix avoid race between truncate and background GC
  f2fs: avoid race between zero_range and background GC
  f2fs: fix to do sanity check with block address in main area v2
  f2fs: fix to do sanity check with inline flags
  f2fs: fix to reset i_gc_failures correctly
  f2fs: fix invalid memory access
  f2fs: fix to avoid broken of dnode block list
  ...
2018-08-22 13:29:39 -07:00
Miklos Szeredi
6faf05c2b2 ovl: set I_CREATING on inode being created
...otherwise there will be list corruption due to inode_sb_list_add() being
called for inode already on the sb list.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Fixes: e950564b97 ("vfs: don't evict uninitialized inode")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 13:15:25 -07:00
Linus Torvalds
cd9b44f907 Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:

 - the rest of MM

 - procfs updates

 - various misc things

 - more y2038 fixes

 - get_maintainer updates

 - lib/ updates

 - checkpatch updates

 - various epoll updates

 - autofs updates

 - hfsplus

 - some reiserfs work

 - fatfs updates

 - signal.c cleanups

 - ipc/ updates

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (166 commits)
  ipc/util.c: update return value of ipc_getref from int to bool
  ipc/util.c: further variable name cleanups
  ipc: simplify ipc initialization
  ipc: get rid of ids->tables_initialized hack
  lib/rhashtable: guarantee initial hashtable allocation
  lib/rhashtable: simplify bucket_table_alloc()
  ipc: drop ipc_lock()
  ipc/util.c: correct comment in ipc_obtain_object_check
  ipc: rename ipcctl_pre_down_nolock()
  ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
  ipc: reorganize initialization of kern_ipc_perm.seq
  ipc: compute kern_ipc_perm.id under the ipc lock
  init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
  fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
  adfs: use timespec64 for time conversion
  kernel/sysctl.c: fix typos in comments
  drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
  fork: don't copy inconsistent signal handler state to child
  signal: make get_signal() return bool
  signal: make sigkill_pending() return bool
  ...
2018-08-22 12:34:08 -07:00
Arnd Bergmann
3e811f053a fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
get_seconds() is deprecated in favor of ktime_get_real_seconds(), which
returns a 64-bit timestamp.

In the SYSV file system, the superblock timestamp is only 32 bits wide,
and it is used to check whether a file system is clean, so the best
solution seems to be to force a wraparound and explicitly convert it to an
unsigned 32-bit value.

This is independent of the inode timestamps that are also 32-bit wide on
disk and that come from current_time().

Link: http://lkml.kernel.org/r/20180713145236.3152513-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:51 -07:00
Arnd Bergmann
d9edcbc42c adfs: use timespec64 for time conversion
We just truncate the seconds to 32-bit in one place now, so this can
trivially be converted over to using timespec64 consistently.

Link: http://lkml.kernel.org/r/20180620100133.4035614-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:51 -07:00
Arnd Bergmann
f423420c23 fat: propagate 64-bit inode timestamps
Now that we pass down 64-bit timestamps from VFS, we just need to convert
that correctly into on-disk timestamps.  To make that work correctly, this
changes the last use of time_to_tm() in the kernel to time64_to_tm(),
which also lets use remove that deprecated interfaces.

Similarly, the time_t use in fat_time_fat2unix() truncates the timestamp
on the way in, which can be avoided by using types that are wide enough to
hold the intermediate values during the conversion.

[hirofumi@mail.parknet.co.jp: remove useless temporary variable, needless long long]
Link: http://lkml.kernel.org/r/20180619153646.3637529-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
OGAWA Hirofumi
0afa962666 fat: validate ->i_start before using
On corrupted FATfs may have invalid ->i_start.  To handle it, this checks
->i_start before using, and return proper error code.

Link: http://lkml.kernel.org/r/87o9f8y1t5.fsf_-_@mail.parknet.co.jp
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Tested-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Wentao Wang
f663b5b38f fat: add FITRIM ioctl for FAT file system
Add FITRIM ioctl for FAT file system

[witallwang@gmail.com: use u64s]
  Link: http://lkml.kernel.org/r/87h8l37hub.fsf@mail.parknet.co.jp
[hirofumi@mail.parknet.co.jp: bug fixes, coding style fixes, add signal check]
Link: http://lkml.kernel.org/r/87fu10anhj.fsf@mail.parknet.co.jp
Signed-off-by: Wentao Wang <witallwang@gmail.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Jann Horn
a13f085d11 reiserfs: fix broken xattr handling (heap corruption, bad retval)
This fixes the following issues:

- When a buffer size is supplied to reiserfs_listxattr() such that each
  individual name fits, but the concatenation of all names doesn't fit,
  reiserfs_listxattr() overflows the supplied buffer.  This leads to a
  kernel heap overflow (verified using KASAN) followed by an out-of-bounds
  usercopy and is therefore a security bug.

- When a buffer size is supplied to reiserfs_listxattr() such that a
  name doesn't fit, -ERANGE should be returned.  But reiserfs instead just
  truncates the list of names; I have verified that if the only xattr on a
  file has a longer name than the supplied buffer length, listxattr()
  incorrectly returns zero.

With my patch applied, -ERANGE is returned in both cases and the memory
corruption doesn't happen anymore.

Credit for making me clean this code up a bit goes to Al Viro, who pointed
out that the ->actor calling convention is suboptimal and should be
changed.

Link: http://lkml.kernel.org/r/20180802151539.5373-1-jannh@google.com
Fixes: 48b32a3553 ("reiserfs: use generic xattr handlers")
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Jeff Mahoney <jeffm@suse.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Arnd Bergmann
8b73ce6a4b reiserfs: change j_timestamp type to time64_t
This uses the deprecated time_t type but is write-only, and could be
removed, but as Jeff explains, having a timestamp can be usefule for
post-mortem analysis in crash dumps.

In order to remove one of the last instances of time_t, this changes the
type to time64_t, same as j_trans_start_time.

Link: http://lkml.kernel.org/r/20180622133315.221210-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Arnd Bergmann
5b1d149c89 reiserfs: remove obsolete print_time function
Before linux-2.4.6, print_time() was used to pretty-print an inode time
when running reiserfs in user space, after that it has become obsolete and
is still a bit incorrect: It behaves differently on 32-bit and 64-bit
machines, and uses a static buffer to hold a string, which could lead to
undefined behavior if we ever called this from multiple places
simultaneously.

Since we always want to treat the timestamps as 'unsigned' anyway, simply
printing them as an integer is both simpler and safer while avoiding the
deprecated time_t type.

Link: http://lkml.kernel.org/r/20180620142522.27639-3-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Arnd Bergmann
34d082604a reiserfs: use monotonic time for j_trans_start_time
Using CLOCK_REALTIME time_t timestamps breaks on 32-bit systems in 2038,
and gives surprising results with a concurrent settimeofday().

This changes the reiserfs journal timestamps to use ktime_get_seconds()
instead, which makes it use a 64-bit CLOCK_MONOTONIC stamp.

In the procfs output, the monotonic timestamp needs to be converted back
to CLOCK_REALTIME to keep the existing ABI.

Link: http://lkml.kernel.org/r/20180620142522.27639-2-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Ernesto A. Fernández
f168d9fd63 hfsplus: drop ACL support
The HFS+ Access Control Lists have not worked at all for the past five
years, and nobody seems to have noticed.  Besides, POSIX draft ACLs are
not compatible with MacOS.  Drop the feature entirely.

Link: http://lkml.kernel.org/r/20180714190608.wtnmmtjqeyladkut@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Ernesto A. Fernández
afd6c9e1f5 hfsplus: fix decomposition of Hangul characters
Files created under macOS cannot be opened under linux if their names
contain Korean characters, and vice versa.

The Korean alphabet is special because its normalization is done without a
table.  The module deals with it correctly when composing, but forgets
about it for the decomposition.

Fix this using the Hangul decomposition function provided in the Unicode
Standard.  The code fits a bit awkwardly because it requires a buffer,
while all the other normalizations are returned as pointers to the
decomposition table.  This is actually also a bug because reordering may
still be needed, but for now leave it as it is.

The patch will cause trouble for Hangul filenames already created by the
module in the past.  This shouldn't really be concern because its main
purpose was always sharing with macOS.  If a user actually needs to access
such a file the nodecompose mount option should be enough.

Link: http://lkml.kernel.org/r/20180717220951.p6qqrgautc4pxvzu@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Ting-Chang Hou <tchou@synology.com>
Tested-by: Ting-Chang Hou <tchou@synology.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Ernesto A. Fernández
31651c6071 hfsplus: avoid deadlock on file truncation
After an extent is removed from the extent tree, the corresponding bits
are also cleared from the block allocation file.  This is currently done
without releasing the tree lock.

The problem is that the allocation file has extents of its own; if it is
fragmented enough, some of them may be in the extent tree as well, and
hfsplus_get_block() will try to take the lock again.

To avoid deadlock, only hold the extent tree lock during the actual tree
operations.

Link: http://lkml.kernel.org/r/20180709202549.auxwkb6memlegb4a@eaf
Signed-off-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reported-by: Anatoly Trosinenko <anatoly.trosinenko@gmail.com>
Cc: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Tetsuo Handa
7464726cb5 hfsplus: don't return 0 when fill_super() failed
syzbot is reporting NULL pointer dereference at mount_fs() [1].  This is
because hfsplus_fill_super() is by error returning 0 when
hfsplus_fill_super() detected invalid filesystem image, and mount_bdev()
is returning NULL because dget(s->s_root) == NULL if s->s_root == NULL,
and mount_fs() is accessing root->d_sb because IS_ERR(root) == false if
root == NULL.  Fix this by returning -EINVAL when hfsplus_fill_super()
detected invalid filesystem image.

[1] https://syzkaller.appspot.com/bug?id=21acb6850cecbc960c927229e597158cf35f33d0

Link: http://lkml.kernel.org/r/d83ce31a-874c-dd5b-f790-41405983a5be@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reported-by: syzbot <syzbot+01ffaf5d9568dd1609f7@syzkaller.appspotmail.com>
Reviewed-by: Ernesto A. Fernández <ernesto.mnd.fernandez@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:50 -07:00
Souptick Joarder
c8ed98cd88 fs/nilfs2/file.c: use new return type vm_fault_t
Use new return type vm_fault_t for page_mkwrite handler.

Link: http://lkml.kernel.org/r/1529555928-2411-1-git-send-email-konishi.ryusuke@lab.ntt.co.jp
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Arnd Bergmann
21a1a52dbd nilfs2: use 64-bit superblock timstamps
The mount time field in the superblock uses a 64-bit timestamp, but
calling get_seconds() may truncate the current time to 32 bits.

This changes it to ktime_get_real_seconds() to avoid the potential
overflow.

Link: http://lkml.kernel.org/r/20180620075041.4154396-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: David Howells <dhowells@redhat.com>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
cbf6898fd6 autofs: add AUTOFS_EXP_FORCED flag
The userspace automount(8) daemon is meant to perform a forced expire when
sent a SIGUSR2.

But since the expiration is routed through the kernel and the kernel
doesn't send an expire request if the mount is busy this hasn't worked at
least since autofs version 5.

Add an AUTOFS_EXP_FORCED flag to allow implemention of the feature and
bump the protocol version so user space can check if it's implemented if
needed.

Link: http://lkml.kernel.org/r/152937734715.21213.6594007182776598970.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
e5c85e1fe1 autofs: make expire flags usage consistent with v5 params
Make the usage of the expire flags consistent by naming the expire flags
the same as it is named in the version 5 miscelaneous ioctl parameters and
only check the bit flags when needed.

Link: http://lkml.kernel.org/r/152937734046.21213.9454131988766280028.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
571bc35c42 autofs: make autofs_expire_indirect() static
autofs_expire_indirect() isn't used outside of fs/autofs/expire.c so make
it static.

Link: http://lkml.kernel.org/r/152937733512.21213.10509996499623738446.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
5d30517d67 autofs: make autofs_expire_direct() static
autofs_expire_direct() isn't used outside of fs/autofs/expire.c so make it
static.

Link: http://lkml.kernel.org/r/152937732944.21213.11821977712410930973.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
d1055565bd autofs: fix clearing AUTOFS_EXP_LEAVES in autofs_expire_indirect()
The expire flag AUTOFS_EXP_LEAVES is cleared before the second call to
should_expire() in autofs_expire_indirect() but the parameter passed in
the second call is incorrect.

Fortunately AUTOFS_EXP_LEAVES expire flag has not been used for a long
time but might be needed in the future so fix it rather than remove the
expire leaves functionality.

Link: http://lkml.kernel.org/r/152937732410.21213.7447294898147765076.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
2fd9944f0f autofs: fix inconsistent use of now variable
The global variable "now" in fs/autofs/expire.c is used in an inconsistent
way, sometimes using jiffies directly, and sometimes using the "now"
variable, and setting it isn't done consistently either.

But the autofs dentry info last_used field is only updated during path
walks or during expire so jiffies can be used directly and the global
variable "now" removed.

Link: http://lkml.kernel.org/r/152937731702.21213.7371321165189170865.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Ian Kent
d4d79b8195 autofs: fix directory and symlink access
Depending on how it is configured the autofs user space daemon can leave
in use mounts mounted at exit and re-connect to them at start up.  But for
this to work best the state of the autofs file system needs to be left
intact over the restart.

Also, at system shutdown, mounts in an autofs file system might be
umounted exposing a mount point trigger for which subsequent access can
lead to a hang.  So recent versions of automount(8) now does its best to
set autofs file system mounts catatonic at shutdown.

When autofs file system mounts are catatonic it's currently possible to
create and remove directories and symlinks which can be a problem at
restart, as described above.

So return EACCES in the directory, symlink and unlink methods if the
autofs file system is catatonic.

Link: http://lkml.kernel.org/r/152902119090.4144.9561910674530214291.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Davidlohr Bueso
992991c03c fs/eventpoll.c: simplify ep_is_linked() callers
Instead of having each caller pass the rdllink explicitly, just have
ep_is_linked() pass it while the callers just need the epi pointer.  This
helper is all about the rdllink, and this change, furthermore, improves
the function's self documentation.

Link: http://lkml.kernel.org/r/20180727053432.16679-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Davidlohr Bueso
679abf381a fs/eventpoll.c: loosen irq safety in ep_poll()
Similar to other calls, ep_poll() is not called with interrupts disabled,
and we can therefore avoid the irq save/restore dance and just disable
local irqs.  In fact, the call should never be called in irq context at
all, considering that the only path is

epoll_wait(2) -> do_epoll_wait() -> ep_poll().

When running on a 2 socket 40-core (ht) IvyBridge a common pipe based
epoll_wait(2) microbenchmark, the following performance improvements are
seen:

    # threads       vanilla         dirty
	 1          1805587	    2106412
	 2          1854064	    2090762
	 4          1805484	    2017436
	 8          1751222	    1974475
	 16         1725299	    1962104
	 32         1378463	    1571233
	 64          787368	     900784

Which is a pretty constantly near 15%.

Also add a lockdep check such that we detect any mischief before
deadlocking.

Link: http://lkml.kernel.org/r/20180727053432.16679-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jason Baron <jbaron@akamai.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Davidlohr Bueso
514056d506 fs/eventpoll.c: simply CONFIG_NET_RX_BUSY_POLL ifdefery
... 'tis easier on the eye.

[akpm@linux-foundation.org: use inlines rather than macros]
Link: http://lkml.kernel.org/r/20180725185620.11020-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:49 -07:00
Davidlohr Bueso
92e6417840 s/epoll: robustify irq safety with lockdep_assert_irqs_enabled()
Sprinkle lockdep_assert_irqs_enabled() checks in the functions that do not
save and restore interrupts when dealing with the ep->wq.lock.  These are
ep_scan_ready_list() and those called by epoll_ctl(): ep_insert, ep_modify
and ep_remove.

[akpm@linux-foundation.org: remove too-obvious comments]
Link: http://lkml.kernel.org/r/20180721183127.3busfa335zlcjeox@linux-r8p5
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Davidlohr Bueso
304b18b8d6 fs/epoll: loosen irq safety in epoll_insert() and epoll_remove()
Both functions are similar to the context of ep_modify(), called via
epoll_ctl(2).  Just like ep_modify(), saving and restoring interrupts is
an overkill in these calls as it will never be called with irqs disabled.
While ep_remove() can be called directly from EPOLL_CTL_DEL, it can also
be called when releasing the file, but this also complies with the above.

Link: http://lkml.kernel.org/r/20180720172956.2883-3-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Davidlohr Bueso
002b343669 fs/epoll: loosen irq safety in ep_scan_ready_list()
Patch series "fs/epoll: loosen irq safety when possible".

Both patches replace saving+restoring interrupts when taking the ep->lock
(now the waitqueue lock), with just disabling local irqs.  This shows
immediate performance benefits in patch 1 for an epoll workload running on
Xen.  The main concern we need to have with this sort of changes in epoll
is the ep_poll_callback() which is passed to the wait queue wakeup and is
done very often under irq context, this patch does not touch this call.

Patches have been tested pretty heavily with the customer workload,
microbenchmarks, ltp testcases and two high level workloads that use epoll
under the hood: nginx and libevent benchmarks.

This patch (of 2):

Saving and restoring interrupts in ep_scan_ready_list() is an
overkill as it is never called with interrupts disabled. Loosen
this to simply disabling local irqs such that archs where managing
irqs is expensive or virtual environments. This patch yields
some throughput improvements on a workload that is epoll intensive
running on a single Xen DomU.

1 Job	 7500	-->    8800 enq/s  (+17%)
2 Jobs	14000   -->   15200 enq/s  (+8%)
3 Jobs	20500	-->   22300 enq/s  (+8%)
4 Jobs	25000   -->   28000 enq/s  (+8-12)%

On bare metal:

For a 2-socket 40-core (ht) IvyBridge on a few workloads, unfortunately I
don't have a xen environment and the results for Xen I do have (which
numbers are in patch 1) I don't have the actual workload, so cannot
compare them directly.

1) Different configurations were used for a epoll_wait (pipes io)
   microbench (http://linux-scalability.org/epoll/epoll-test.c) and shows
   around a 7-10% improvement in overall total number of times the
   epoll_wait() loops when using both regular and nested epolls, so very
   raw numbers, but measurable nonetheless.

# threads	vanilla		dirty
     1		1677717		1805587
     2		1660510		1854064
     4		1610184		1805484
     8		1577696		1751222
     16		1568837		1725299
     32		1291532		1378463
     64		 752584		 787368

   Note that stddev is pretty small.

2) Another pipe test, which shows no real measurable improvement.
   (http://www.xmailserver.org/linux-patches/pipetest.c)

Link: http://lkml.kernel.org/r/20180720172956.2883-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Matthew Wilcox
c430d1e848 userfaultfd: use fault_wqh lock
The userfaultfd code currently uses the unlocked waitqueue helpers for
managing fault_wqh, but instead of holding the waitqueue lock for this
waitqueue around these calls, it the waitqueue lock of
fault_pending_wq, which is a different waitqueue instance.  Given that
the waitqueue is not exposed to the rest of the kernel this actually
works ok at the moment, but prevents the userfaultfd locking rules from
being enforced using lockdep.

Switch to the internally locked waitqueue helpers instead.  This means
that the lock inside fault_wqh now nests inside the fault_pending_wqh
lock, but that's not a problem since it was entirely unused before.

[hch@lst.de: slight changelog updates]
[rppt@linux.vnet.ibm.com: spotted changelog spellos]
Link: http://lkml.kernel.org/r/20171214152344.6880-3-hch@lst.de
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Christoph Hellwig
ee8ef0a4b1 epoll: use the waitqueue lock to protect ep->wq
Patch series "waitqueue lockdep annotation", v3.

This series adds a strategic lockdep_assert_held to __wake_up_common to
ensure callers really do hold the wait_queue_head lock when calling the
unlocked wake_up variants.  It turns out epoll did not do this for a
fairly common path (hit all the time by systemd during bootup), so the
second patch fixed this instance as well.

This patch (of 3):

The epoll code currently uses the unlocked waitqueue helpers for managing
ep->wq, but instead of holding the waitqueue lock around these calls, it
uses its own ep->lock spinlock.  Given that the waitqueue is not exposed
to the rest of the kernel this actually works ok at the moment, but
prevents the epoll locking rules from being enforced using lockdep.
Remove ep->lock and use the waitqueue lock to not only reduce the size of
struct eventpoll but also to make sure we can assert locking invariants in
the waitqueue code.

Link: http://lkml.kernel.org/r/20171214152344.6880-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:47 -07:00
Omar Sandoval
23c85094fe proc/kcore: add vmcoreinfo note to /proc/kcore
The vmcoreinfo information is useful for runtime debugging tools, not just
for crash dumps.  A lot of this information can be determined by other
means, but this is much more convenient, and it only adds a page at most
to the file.

Link: http://lkml.kernel.org/r/fddbcd08eed76344863303878b12de1c1e2a04b6.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
bf991c2231 proc/kcore: optimize multiple page reads
The current code does a full search of the segment list every time for
every page.  This is wasteful, since it's almost certain that the next
page will be in the same segment.  Instead, check if the previous segment
covers the current page before doing the list search.

Link: http://lkml.kernel.org/r/fd346c11090cf93d867e01b8d73a6567c5ac6361.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
37e949bd52 proc/kcore: clean up ELF header generation
Currently, the ELF file header, program headers, and note segment are
allocated all at once, in some icky code dating back to 2.3.  Programs
tend to read the file header, then the program headers, then the note
segment, all separately, so this is a waste of effort.  It's cleaner and
more efficient to handle the three separately.

Link: http://lkml.kernel.org/r/19c92cbad0e11f6103ff3274b2e7a7e51a1eb74b.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
3673fb08db proc/kcore: hold lock during read
Now that we're using an rwsem, we can hold it during the entirety of
read_kcore() and have a common return path.  This is preparation for the
next change.

[akpm@linux-foundation.org: fix locking bug reported by Tetsuo Handa]
Link: http://lkml.kernel.org/r/d7cfbc1e8a76616f3b699eaff9df0a2730380534.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
b66fb005c9 proc/kcore: fix memory hotplug vs multiple opens race
There's a theoretical race condition that will cause /proc/kcore to miss
a memory hotplug event:

CPU0                              CPU1
// hotplug event 1
kcore_need_update = 1

open_kcore()                      open_kcore()
    kcore_update_ram()                kcore_update_ram()
        // Walk RAM                       // Walk RAM
        __kcore_update_ram()              __kcore_update_ram()
            kcore_need_update = 0

// hotplug event 2
kcore_need_update = 1
                                              kcore_need_update = 0

Note that CPU1 set up the RAM kcore entries with the state after hotplug
event 1 but cleared the flag for hotplug event 2.  The RAM entries will
therefore be stale until there is another hotplug event.

This is an extremely unlikely sequence of events, but the fix makes the
synchronization saner, anyways: we serialize the entire update sequence,
which means that whoever clears the flag will always succeed in replacing
the kcore list.

Link: http://lkml.kernel.org/r/6106c509998779730c12400c1b996425df7d7089.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
0b172f845f proc/kcore: replace kclist_lock rwlock with rwsem
Now we only need kclist_lock from user context and at fs init time, and
the following changes need to sleep while holding the kclist_lock.

Link: http://lkml.kernel.org/r/521ba449ebe921d905177410fee9222d07882f0d.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
bf53183164 proc/kcore: don't grab lock for memory hotplug notifier
The memory hotplug notifier kcore_callback() only needs kclist_lock to
prevent races with __kcore_update_ram(), but we can easily eliminate that
race by using an atomic xchg() in __kcore_update_ram().  This is
preparation for converting kclist_lock to an rwsem.

Link: http://lkml.kernel.org/r/0a4bc89f4dbde8b5b2ea309f7b4fb6a85fe29df2.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Omar Sandoval
a8dd9c4df1 proc/kcore: don't grab lock for kclist_add()
Patch series "/proc/kcore improvements", v4.

This series makes a few improvements to /proc/kcore.  It fixes a couple of
small issues in v3 but is otherwise the same.  Patches 1, 2, and 3 are
prep patches.  Patch 4 is a fix/cleanup.  Patch 5 is another prep patch.
Patches 6 and 7 are optimizations to ->read().  Patch 8 makes it possible
to enable CRASH_CORE on any architecture, which is needed for patch 9.
Patch 9 adds vmcoreinfo to /proc/kcore.

This patch (of 9):

kclist_add() is only called at init time, so there's no point in grabbing
any locks.  We're also going to replace the rwlock with a rwsem, which we
don't want to try grabbing during early boot.

While we're here, mark kclist_add() with __init so that we'll get a
warning if it's called from non-init code.

Link: http://lkml.kernel.org/r/98208db1faf167aa8b08eebfa968d95c70527739.1531953780.git.osandov@fb.com
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Bhupesh Sharma <bhsharma@redhat.com>
Tested-by: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bhupesh Sharma <bhsharma@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
James Morse
df865e8337 fs/proc/kcore.c: use __pa_symbol() for KCORE_TEXT list entries
elf_kcore_store_hdr() uses __pa() to find the physical address of
KCORE_RAM or KCORE_TEXT entries exported as program headers.

This trips CONFIG_DEBUG_VIRTUAL's checks, as the KCORE_TEXT entries are
not in the linear map.

Handle these two cases separately, using __pa_symbol() for the KCORE_TEXT
entries.

Link: http://lkml.kernel.org/r/20180711131944.15252-1-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Souptick Joarder
36f062042b fs/proc/vmcore.c: use new typedef vm_fault_t
Use new return type vm_fault_t for fault handler in struct
vm_operations_struct.  For now, this is just documenting that the function
returns a VM_FAULT value rather than an errno.  Once all instances are
converted, vm_fault_t will become a distinct type.

See 1c8f422059 ("mm: change return type to vm_fault_t") for reference.

Link: http://lkml.kernel.org/r/20180702153325.GA3875@jordon-HP-15-Notebook-PC
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ganesh Goudar <ganeshgr@chelsio.com>
Cc: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Alexey Dobriyan
9a27e97aaa proc: use "unsigned int" in /proc/stat hook
Number of CPUs is never high enough to force 64-bit arithmetic.
Save couple of bytes on x86_64.

Link: http://lkml.kernel.org/r/20180627200710.GC18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Alexey Dobriyan
891ae71dc4 proc: spread "const" a bit
Link: http://lkml.kernel.org/r/20180627200614.GB18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Alexey Dobriyan
f6d2f584d8 proc: use macro in /proc/latency hook
->latency_record is defined as

	struct latency_record[LT_SAVECOUNT];

so use the same macro whie iterating.

Link: http://lkml.kernel.org/r/20180627200534.GA18434@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:46 -07:00
Alexey Dobriyan
41089b6d3e proc: save 2 atomic ops on write to "/proc/*/attr/*"
Code checks if write is done by current to its own attributes.
For that get/put pair is unnecessary as it can be done under RCU.

Note: rcu_read_unlock() can be done even earlier since pointer to a task
is not dereferenced. It depends if /proc code should look scary or not:

	rcu_read_lock();
	task = pid_task(...);
	rcu_read_unlock();
	if (!task)
		return -ESRCH;
	if (task != current)
		return -EACCESS:

P.S.: rename "length" variable.	Code like this

	length = -EINVAL;

should not exist.

Link: http://lkml.kernel.org/r/20180627200218.GF18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Alexey Dobriyan
a44937fe4e proc: put task earlier in /proc/*/fail-nth
Link: http://lkml.kernel.org/r/20180627195427.GE18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Alexey Dobriyan
8d48b2e044 proc: smaller readlock section in readdir("/proc")
Readdir context is thread local, so ->pos is thread local,
move it out of readlock.

Link: http://lkml.kernel.org/r/20180627195339.GD18113@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Arnd Bergmann
bdf228a272 fs/proc/uptime.c: use ktime_get_boottime_ts64
get_monotonic_boottime() is deprecated and uses the old timespec type.
Let's convert /proc/uptime to use ktime_get_boottime_ts64().

Link: http://lkml.kernel.org/r/20180620081746.282742-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Alexey Dobriyan
2d6e4e822a proc: fixup PDE allocation bloat
24074a35c5 ("proc: Make inline name size calculation automatic")
started to put PDE allocations into kmalloc-256 which is unnecessary as
~40 character names are very rare.

Put allocation back into kmalloc-192 cache for 64-bit non-debug builds.

Put BUILD_BUG_ON to know when PDE size has gotten out of control.

[adobriyan@gmail.com: fix BUILD_BUG_ON breakage on powerpc64]
  Link: http://lkml.kernel.org/r/20180703191602.GA25521@avx2
Link: http://lkml.kernel.org/r/20180617215732.GA24688@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Dennis Zhou (Facebook)
7e8a6304d5 /proc/meminfo: add percpu populated pages count
Currently, percpu memory only exposes allocation and utilization
information via debugfs.  This more or less is only really useful for
understanding the fragmentation and allocation information at a per-chunk
level with a few global counters.  This is also gated behind a config.
BPF and cgroup, for example, have seen an increase in use causing
increased use of percpu memory.  Let's make it easier for someone to
identify how much memory is being used.

This patch adds the "Percpu" stat to meminfo to more easily look up how
much percpu memory is in use.  This number includes the cost for all
allocated backing pages and not just insight at the per a unit, per chunk
level.  Metadata is excluded.  I think excluding metadata is fair because
the backing memory scales with the numbere of cpus and can quickly
outweigh the metadata.  It also makes this calculation light.

Link: http://lkml.kernel.org/r/20180807184723.74919-1-dennisszhou@gmail.com
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:45 -07:00
Andrew Morton
a670468f5e mm: zero out the vma in vma_init()
Rather than in vm_area_alloc().  To ensure that the various oddball
stack-based vmas are in a good state.  Some of the callers were zeroing
them out, others were not.

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Vlastimil Babka
258f669e7e mm: /proc/pid/smaps_rollup: convert to single value seq_file
The /proc/pid/smaps_rollup file is currently implemented via the
m_start/m_next/m_stop seq_file iterators shared with the other maps files,
that iterate over vma's.  However, the rollup file doesn't print anything
for each vma, only accumulate the stats.

There are some issues with the current code as reported in [1] - the
accumulated stats can get skewed if seq_file start()/stop() op is called
multiple times, if show() is called multiple times, and after seeks to
non-zero position.

Patch [1] fixed those within existing design, but I believe it is
fundamentally wrong to expose the vma iterators to the seq_file mechanism
when smaps_rollup shows logically a single set of values for the whole
address space.

This patch thus refactors the code to provide a single "value" at offset
0, with vma iteration to gather the stats done internally.  This fixes the
situations where results are skewed, and simplifies the code, especially
in show_smap(), at the expense of somewhat less code reuse.

[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

[vbabka@suse.c: use seq_file infrastructure]
  Link: http://lkml.kernel.org/r/bf4525b0-fd5b-4c4c-2cb3-adee3dd95a48@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Daniel Colascione <dancol@google.com>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Vlastimil Babka
f1547959d9 mm: /proc/pid/smaps: factor out common stats printing
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out from show_smap() printing the parts of output
that are common for both variants, which is the bulk of the gathered
memory stats.

[vbabka@suse.cz: add const, per Alexey]
  Link: http://lkml.kernel.org/r/b45f319f-cd04-337b-37f8-77f99786aa8a@suse.cz
Link: http://lkml.kernel.org/r/20180723111933.15443-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Vlastimil Babka
8e68d689af mm: /proc/pid/smaps: factor out mem stats gathering
To prepare for handling /proc/pid/smaps_rollup differently from
/proc/pid/smaps factor out vma mem stats gathering from show_smap() - it
will be used by both.

Link: http://lkml.kernel.org/r/20180723111933.15443-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Vlastimil Babka
871305bb20 mm: /proc/pid/*maps remove is_pid and related wrappers
Patch series "cleanups and refactor of /proc/pid/smaps*".

The recent regression in /proc/pid/smaps made me look more into the code.
Especially the issues with smaps_rollup reported in [1] as explained in
Patch 4, which fixes them by refactoring the code.  Patches 2 and 3 are
preparations for that.  Patch 1 is me realizing that there's a lot of
boilerplate left from times where we tried (unsuccessfuly) to mark thread
stacks in the output.

Originally I had also plans to rework the translation from
/proc/pid/*maps* file offsets to the internal structures.  Now the offset
means "vma number", which is not really stable (vma's can come and go
between read() calls) and there's an extra caching of last vma's address.
My idea was that offsets would be interpreted directly as addresses, which
would also allow meaningful seeks (see the ugly seek_to_smaps_entry() in
tools/testing/selftests/vm/mlock2.h).  However loff_t is (signed) long
long so that might be insufficient somewhere for the unsigned long
addresses.

So the result is fixed issues with skewed /proc/pid/smaps_rollup results,
simpler smaps code, and a lot of unused code removed.

[1] https://marc.info/?l=linux-mm&m=151927723128134&w=2

This patch (of 4):

Commit b76437579d ("procfs: mark thread stack correctly in
proc/<pid>/maps") introduced differences between /proc/PID/maps and
/proc/PID/task/TID/maps to mark thread stacks properly, and this was
also done for smaps and numa_maps.  However it didn't work properly and
was ultimately removed by commit b18cb64ead ("fs/proc: Stop trying to
report thread stacks").

Now the is_pid parameter for the related show_*() functions is unused
and we can remove it together with wrapper functions and ops structures
that differ for PID and TID cases only in this parameter.

Link: http://lkml.kernel.org/r/20180723111933.15443-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Colascione <dancol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:44 -07:00
Ian Kent
0633da48f0 autofs: fix autofs_sbi() does not check super block type
autofs_sbi() does not check the superblock magic number to verify it has
been given an autofs super block.

Link: http://lkml.kernel.org/r/153475422934.17131.7563724552005298277.stgit@pluto.themaw.net
Reported-by: <syzbot+87c3c541582e56943277@syzkaller.appspotmail.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-22 10:52:43 -07:00
Jeremy Cline
7b6924d94a fs/quota: Fix spectre gadget in do_quotactl
'type' is user-controlled, so sanitize it after the bounds check to
avoid using it in speculative execution. This covers the following
potential gadgets detected with the help of smatch:

* fs/ext4/super.c:5741 ext4_quota_read() warn: potential spectre issue
  'sb_dqopt(sb)->files' [r]
* fs/ext4/super.c:5778 ext4_quota_write() warn: potential spectre issue
  'sb_dqopt(sb)->files' [r]
* fs/f2fs/super.c:1552 f2fs_quota_read() warn: potential spectre issue
  'sb_dqopt(sb)->files' [r]
* fs/f2fs/super.c:1608 f2fs_quota_write() warn: potential spectre issue
  'sb_dqopt(sb)->files' [r]
* fs/quota/dquot.c:412 mark_info_dirty() warn: potential spectre issue
  'sb_dqopt(sb)->info' [w]
* fs/quota/dquot.c:933 dqinit_needed() warn: potential spectre issue
  'dquots' [r]
* fs/quota/dquot.c:2112 dquot_commit_info() warn: potential spectre
  issue 'dqopt->ops' [r]
* fs/quota/dquot.c:2362 vfs_load_quota_inode() warn: potential spectre
  issue 'dqopt->files' [w] (local cap)
* fs/quota/dquot.c:2369 vfs_load_quota_inode() warn: potential spectre
  issue 'dqopt->ops' [w] (local cap)
* fs/quota/dquot.c:2370 vfs_load_quota_inode() warn: potential spectre
  issue 'dqopt->info' [w] (local cap)
* fs/quota/quota.c:110 quota_getfmt() warn: potential spectre issue
  'sb_dqopt(sb)->info' [r]
* fs/quota/quota_v2.c:84 v2_check_quota_file() warn: potential spectre
  issue 'quota_magics' [w]
* fs/quota/quota_v2.c:85 v2_check_quota_file() warn: potential spectre
  issue 'quota_versions' [w]
* fs/quota/quota_v2.c:96 v2_read_file_info() warn: potential spectre
  issue 'dqopt->info' [r]
* fs/quota/quota_v2.c:172 v2_write_file_info() warn: potential spectre
  issue 'dqopt->info' [r]

Additionally, a quick inspection indicates there are array accesses with
'type' in quota_on() and quota_off() functions which are also addressed
by this.

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-22 18:17:48 +02:00
Jeremy Cline
64d9d13828 fs/quota: Replace XQM_MAXQUOTAS usage with MAXQUOTAS
XQM_MAXQUOTAS and MAXQUOTAS are, it appears, equivalent. Replace all
usage of XQM_MAXQUOTAS and remove it along with the unused XQM_*QUOTA
definitions.

Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-22 18:17:29 +02:00
Matthew Wilcox
0f0a0e54a2 devpts: Convert to new IDA API
ida_alloc_max() matches what this driver wants to do.  Also removes a
call to ida_pre_get().  We no longer need the protection of the mutex,
so convert pty_count to an atomic_t and remove the mutex entirely.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-21 23:54:17 -04:00
Matthew Wilcox
169b480e4c fs: Convert namespace IDAs to new API
We don't need to keep track of the starting value; the IDA is efficient.

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-21 23:54:17 -04:00
Matthew Wilcox
5a66847e44 fs: Convert unnamed_dev_ida to new API
The new API is much easier for this user.  Also add kerneldoc for
get_anon_bdev().

Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-21 23:54:16 -04:00
Linus Torvalds
ad1d697358 fuse update for 4.19
This contains various bug fixes and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3xvGwAKCRDh3BK/laaZ
 PKECAP9qUpdtQ5RaIL/y9OGZzJLSZbBZuK3LGNY2u2B3EfrSjgEAvhkhXyOQgvVi
 kgYLNszbg/C+w8U4Xc5GWB6cjNm6rwE=
 =GJI7
 -----END PGP SIGNATURE-----

Merge tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse update from Miklos Szeredi:
 "Various bug fixes and cleanups"

* tag 'fuse-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: reduce allocation size for splice_write
  fuse: use kvmalloc to allocate array of pipe_buffer structs.
  fuse: convert last timespec use to timespec64
  fs: fuse: Adding new return type vm_fault_t
  fuse: simplify fuse_abort_conn()
  fuse: Add missed unlock_page() to fuse_readpages_fill()
  fuse: Don't access pipe->buffers without pipe_lock()
  fuse: fix initial parallel dirops
  fuse: Fix oops at process_init_reply()
  fuse: umount should wait for all requests
  fuse: fix unlocked access to processing queue
  fuse: fix double request_end()
2018-08-21 18:47:36 -07:00
Linus Torvalds
d9a185f8b4 overlayfs update for 4.19
This contains two new features:
 
  1) Stack file operations: this allows removal of several hacks from the
     VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.
 
  2) Metadata only copy-up: when file is on lower layer and only metadata is
     modified (except size) then only copy up the metadata and continue to
     use the data from the lower file.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
 PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
 fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
 =QDXH
 -----END PGP SIGNATURE-----

Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs

Pull overlayfs updates from Miklos Szeredi:
 "This contains two new features:

   - Stack file operations: this allows removal of several hacks from
     the VFS, proper interaction of read-only open files with copy-up,
     possibility to implement fs modifying ioctls properly, and others.

   - Metadata only copy-up: when file is on lower layer and only
     metadata is modified (except size) then only copy up the metadata
     and continue to use the data from the lower file"

* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
  ovl: Enable metadata only feature
  ovl: Do not do metacopy only for ioctl modifying file attr
  ovl: Do not do metadata only copy-up for truncate operation
  ovl: add helper to force data copy-up
  ovl: Check redirect on index as well
  ovl: Set redirect on upper inode when it is linked
  ovl: Set redirect on metacopy files upon rename
  ovl: Do not set dentry type ORIGIN for broken hardlinks
  ovl: Add an inode flag OVL_CONST_INO
  ovl: Treat metacopy dentries as type OVL_PATH_MERGE
  ovl: Check redirects for metacopy files
  ovl: Move some dir related ovl_lookup_single() code in else block
  ovl: Do not expose metacopy only dentry from d_real()
  ovl: Open file with data except for the case of fsync
  ovl: Add helper ovl_inode_realdata()
  ovl: Store lower data inode in ovl_inode
  ovl: Fix ovl_getattr() to get number of blocks from lower
  ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
  ovl: Copy up meta inode data from lowest data inode
  ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
  ...
2018-08-21 18:19:09 -07:00
Linus Torvalds
c22fc16d17 Changes since last update:
- Fix an uninitialized variable
 - Don't use obviously garbage AG header counters to calculate
   transaction reservations
 - Trigger icount recalculation on bad icount when monting.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAlty8tsACgkQ+H93GTRK
 tOtoKw/+OeCaY6jZc2JoztBwLSUJsMYQ0R8Wsj5GRb4bVp9b0zes7RJMFU03nCtj
 XuE4Rhdsx+6+QZQKxTq/Z6lrKHEjF0kL1EVGHtL46Inr+Z+Rr4bLBG6NV1o0dg7B
 CR1IqW5vYcZ7Vrk9ko/RXVXtuCIxBS5jSW/S/uFT95Y4lVMAf/2asR/OoYt5ZVE3
 17CUfWRifiSGoBQpjtfZd63F23XlEEusiErC5iS9rUbE2qC9FxP9EuvoUP5M/n01
 nLS34Fjw7X739AiwHbf10fQPOvBr7atTazCXskjy4gbwqIWTmuhbF4ieTU1OfTI8
 ozhvYomBYLiZbsEYBhVCs09VEnIfHmf2HoLh//efGE8VEvoQllxdn/g2PQekoPAn
 M7VnRUXCTvaLI8IE2d3Ed1VWm0OTea09xqEiNpB0XGjegim9pXuf6t/zbe4R0vJy
 YLBgQT8XRPw5ZgCnBbxvZOXXxQtAqKnqZzYSWGxlHJhhduKVeKMqerhP0nn0ui8g
 wAOmOe3XEoyLfSY8WY0ACEEEA00pAwErerwVEFLCpaKTh5GOY4i3OBdqcZOtXacn
 f5oIeG9HZKAXKkOTGwpq1zGHTOYhz4mxAYhodRFiEE8rXHDa9odUWQ/iG0zgZaO6
 19xznXjXkVWVg0QJqQJi6SbEkkrAEFtFRYH+VPTgWM/1tg47a14=
 =+0Eq
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Darrick Wong:

 - Fix an uninitialized variable

 - Don't use obviously garbage AG header counters to calculate
   transaction reservations

 - Trigger icount recalculation on bad icount when mounting

* tag 'xfs-4.19-merge-7' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: fix WARN_ON_ONCE on uninitialized variable
  xfs: sanity check ag header values in xrep_calc_ag_resblks
  xfs: recalculate summary counters at mount time if icount is bad
2018-08-21 18:15:47 -07:00
Linus Torvalds
0214f46b3a Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull core signal handling updates from Eric Biederman:
 "It was observed that a periodic timer in combination with a
  sufficiently expensive fork could prevent fork from every completing.
  This contains the changes to remove the need for that restart.

  This set of changes is split into several parts:

   - The first part makes PIDTYPE_TGID a proper pid type instead
     something only for very special cases. The part starts using
     PIDTYPE_TGID enough so that in __send_signal where signals are
     actually delivered we know if the signal is being sent to a a group
     of processes or just a single process.

   - With that prep work out of the way the logic in fork is modified so
     that fork logically makes signals received while it is running
     appear to be received after the fork completes"

* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
  signal: Don't send signals to tasks that don't exist
  signal: Don't restart fork when signals come in.
  fork: Have new threads join on-going signal group stops
  fork: Skip setting TIF_SIGPENDING in ptrace_init_task
  signal: Add calculate_sigpending()
  fork: Unconditionally exit if a fatal signal is pending
  fork: Move and describe why the code examines PIDNS_ADDING
  signal: Push pid type down into complete_signal.
  signal: Push pid type down into __send_signal
  signal: Push pid type down into send_signal
  signal: Pass pid type into do_send_sig_info
  signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
  signal: Pass pid type into group_send_sig_info
  signal: Pass pid and pid type into send_sigqueue
  posix-timers: Noralize good_sigevent
  signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
  pid: Implement PIDTYPE_TGID
  pids: Move the pgrp and session pid pointers from task_struct to signal_struct
  kvm: Don't open code task_pid in kvm_vcpu_ioctl
  pids: Compute task_tgid using signal->leader_pid
  ...
2018-08-21 13:47:29 -07:00
Trond Myklebust
0af4c8be97 pNFS: Remove unwanted optimisation of layoutget
If we knew that the file was empty, we wouldn't be asking for a layout.
Any optimisation here is already done before calling pnfs_update_layout().
As it stands, we sometimes end up doing an unnecessary inband read to
the MDS even when holding a layout.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-21 13:39:08 -04:00
Trond Myklebust
1c1aeaf143 pNFS/flexfiles: ff_layout_pg_init_read should exit on error
If we get an error while retrieving the layout, then we should
report it rather than falling back to I/O through the MDS.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-21 13:39:05 -04:00
Eric Sandeen
09a4e0be58 isofs: reject hardware sector size > 2048 bytes
The largest block size supported by isofs is ISOFS_BLOCK_SIZE (2048), but
isofs_fill_super calls sb_min_blocksize and sets the blocksize to the
device's logical block size if it's larger than what we ended up with after
option parsing.

If for some reason we try to mount a hard 4k device as an isofs filesystem,
we'll set opt.blocksize to 4096, and when we try to read the superblock
we found via:

        block = iso_blknum << (ISOFS_BLOCK_BITS - s->s_blocksize_bits)

with s_blocksize_bits greater than ISOFS_BLOCK_BITS, we'll have a negative
shift and the bread will fail somewhat cryptically:

  isofs_fill_super: bread failed, dev=sda, iso_blknum=17, block=-2147483648

It seems best to just catch and clearly reject mounts of such a device.

Reported-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-21 11:37:41 +02:00
Chao Yu
6aa58d8ad2 f2fs: readahead encrypted block during GC
During GC, for each encrypted block, we will read block synchronously
into meta page, and then submit it into current cold data log area.

So this block read model with 4k granularity can make poor performance,
like migrating non-encrypted block, let's readahead encrypted block
as well to improve migration performance.

To implement this, we choose meta page that its index is old block
address of the encrypted block, and readahead ciphertext into this
page, later, if readaheaded page is still updated, we will load its
data into target meta page, and submit the write IO.

Note that for OPU, truncation, deletion, we need to invalid meta
page after we invalid old block address, to make sure we won't load
invalid data from target meta page during encrypted block migration.

for ((i = 0; i < 1000; i++))
do {
        xfs_io -f /mnt/f2fs/dir/$i -c "pwrite 0 128k" -c "fsync";
} done

for ((i = 0; i < 1000; i+=2))
do {
        rm /mnt/f2fs/dir/$i;
} done

ret = ioctl(fd, F2FS_IOC_GARBAGE_COLLECT, 0);

Before:
              gc-6549  [001] d..1 214682.212797: block_rq_insert: 8,32 RA 32768 () 786400 + 64 [gc]
              gc-6549  [001] d..1 214682.212802: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.213892: block_bio_queue: 8,32 R 67494144 + 8 [gc]
              gc-6549  [001] .... 214682.213899: block_getrq: 8,32 R 67494144 + 8 [gc]
              gc-6549  [001] .... 214682.213902: block_plug: [gc]
              gc-6549  [001] d..1 214682.213905: block_rq_insert: 8,32 R 4096 () 67494144 + 8 [gc]
              gc-6549  [001] d..1 214682.213908: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.226405: block_bio_queue: 8,32 R 67494152 + 8 [gc]
              gc-6549  [001] .... 214682.226412: block_getrq: 8,32 R 67494152 + 8 [gc]
              gc-6549  [001] .... 214682.226414: block_plug: [gc]
              gc-6549  [001] d..1 214682.226417: block_rq_insert: 8,32 R 4096 () 67494152 + 8 [gc]
              gc-6549  [001] d..1 214682.226420: block_unplug: [gc] 1
              gc-6549  [001] .... 214682.226904: block_bio_queue: 8,32 R 67494160 + 8 [gc]
              gc-6549  [001] .... 214682.226910: block_getrq: 8,32 R 67494160 + 8 [gc]
              gc-6549  [001] .... 214682.226911: block_plug: [gc]
              gc-6549  [001] d..1 214682.226914: block_rq_insert: 8,32 R 4096 () 67494160 + 8 [gc]
              gc-6549  [001] d..1 214682.226916: block_unplug: [gc] 1

After:
              gc-5678  [003] .... 214327.025906: block_bio_queue: 8,32 R 67493824 + 8 [gc]
              gc-5678  [003] .... 214327.025908: block_bio_backmerge: 8,32 R 67493824 + 8 [gc]
              gc-5678  [003] .... 214327.025915: block_bio_queue: 8,32 R 67493832 + 8 [gc]
              gc-5678  [003] .... 214327.025917: block_bio_backmerge: 8,32 R 67493832 + 8 [gc]
              gc-5678  [003] .... 214327.025923: block_bio_queue: 8,32 R 67493840 + 8 [gc]
              gc-5678  [003] .... 214327.025925: block_bio_backmerge: 8,32 R 67493840 + 8 [gc]
              gc-5678  [003] .... 214327.025932: block_bio_queue: 8,32 R 67493848 + 8 [gc]
              gc-5678  [003] .... 214327.025934: block_bio_backmerge: 8,32 R 67493848 + 8 [gc]
              gc-5678  [003] .... 214327.025941: block_bio_queue: 8,32 R 67493856 + 8 [gc]
              gc-5678  [003] .... 214327.025943: block_bio_backmerge: 8,32 R 67493856 + 8 [gc]
              gc-5678  [003] .... 214327.025953: block_bio_queue: 8,32 R 67493864 + 8 [gc]
              gc-5678  [003] .... 214327.025955: block_bio_backmerge: 8,32 R 67493864 + 8 [gc]
              gc-5678  [003] .... 214327.025962: block_bio_queue: 8,32 R 67493872 + 8 [gc]
              gc-5678  [003] .... 214327.025964: block_bio_backmerge: 8,32 R 67493872 + 8 [gc]
              gc-5678  [003] .... 214327.025970: block_bio_queue: 8,32 R 67493880 + 8 [gc]
              gc-5678  [003] .... 214327.025972: block_bio_backmerge: 8,32 R 67493880 + 8 [gc]
              gc-5678  [003] .... 214327.026000: block_bio_queue: 8,32 WS 34123776 + 2048 [gc]
              gc-5678  [003] .... 214327.026019: block_getrq: 8,32 WS 34123776 + 2048 [gc]
              gc-5678  [003] d..1 214327.026021: block_rq_insert: 8,32 R 131072 () 67493632 + 256 [gc]
              gc-5678  [003] d..1 214327.026023: block_unplug: [gc] 1
              gc-5678  [003] d..1 214327.026026: block_rq_issue: 8,32 R 131072 () 67493632 + 256 [gc]
              gc-5678  [003] .... 214327.026046: block_plug: [gc]

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Jaegeuk Kim
6f8d445506 f2fs: avoid fi->i_gc_rwsem[WRITE] lock in f2fs_gc
The f2fs_gc() called by f2fs_balance_fs() requires to be called outside of
fi->i_gc_rwsem[WRITE], since f2fs_gc() can try to grab it in a loop.

If it hits the miximum retrials in GC, let's give a chance to release
gc_mutex for a short time in order not to go into live lock in the worst
case.

Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Jaegeuk Kim
853137cef4 f2fs: fix performance issue observed with multi-thread sequential read
This reverts the commit - "b93f771 - f2fs: remove writepages lock"
to fix the drop in sequential read throughput.

Test: ./tiotest -t 32 -d /data/tio_tmp -f 32 -b 524288 -k 1 -k 3 -L
device: UFS

Before -
read throughput: 185 MB/s
total read requests: 85177 (of these ~80000 are 4KB size requests).
total write requests: 2546 (of these ~2208 requests are written in 512KB).

After -
read throughput: 758 MB/s
total read requests: 2417 (of these ~2042 are 512KB reads).
total write requests: 2701 (of these ~2034 requests are written in 512KB).

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2018-08-20 23:13:42 -07:00
Linus Torvalds
7140ad3898 Updates for v4.19:
- Restructure of lockdep and latency tracers
 
    This is the biggest change. Joel Fernandes restructured the hooks
    from irqs and preemption disabling and enabling. He got rid of
    a lot of the preprocessor #ifdef mess that they caused.
 
    He turned both lockdep and the latency tracers to use trace events
    inserted in the preempt/irqs disabling paths. But unfortunately,
    these started to cause issues in corner cases. Thus, parts of the
    code was reverted back to where lockde and the latency tracers
    just get called directly (without using the trace events).
    But because the original change cleaned up the code very nicely
    we kept that, as well as the trace events for preempt and irqs
    disabling, but they are limited to not being called in NMIs.
 
  - Have trace events use SRCU for "rcu idle" calls. This was required
    for the preempt/irqs off trace events. But it also had to not
    allow them to be called in NMI context. Waiting till Paul makes
    an NMI safe SRCU API.
 
  - New notrace SRCU API to allow trace events to use SRCU.
 
  - Addition of mcount-nop option support
 
  - SPDX headers replacing GPL templates.
 
  - Various other fixes and clean ups.
 
  - Some fixes are marked for stable, but were not fully tested
    before the merge window opened.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCW3ruhRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qiM7AP47NhYdSnCFCRUJfrt6PovXmQtuCHt3
 c3QMoGGdvzh9YAEAqcSXwh7uLhpHUp1LjMAPkXdZVwNddf4zJQ1zyxQ+EAU=
 =vgEr
 -----END PGP SIGNATURE-----

Merge tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing updates from Steven Rostedt:

 - Restructure of lockdep and latency tracers

   This is the biggest change. Joel Fernandes restructured the hooks
   from irqs and preemption disabling and enabling. He got rid of a lot
   of the preprocessor #ifdef mess that they caused.

   He turned both lockdep and the latency tracers to use trace events
   inserted in the preempt/irqs disabling paths. But unfortunately,
   these started to cause issues in corner cases. Thus, parts of the
   code was reverted back to where lockdep and the latency tracers just
   get called directly (without using the trace events). But because the
   original change cleaned up the code very nicely we kept that, as well
   as the trace events for preempt and irqs disabling, but they are
   limited to not being called in NMIs.

 - Have trace events use SRCU for "rcu idle" calls. This was required
   for the preempt/irqs off trace events. But it also had to not allow
   them to be called in NMI context. Waiting till Paul makes an NMI safe
   SRCU API.

 - New notrace SRCU API to allow trace events to use SRCU.

 - Addition of mcount-nop option support

 - SPDX headers replacing GPL templates.

 - Various other fixes and clean ups.

 - Some fixes are marked for stable, but were not fully tested before
   the merge window opened.

* tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
  tracing: Fix SPDX format headers to use C++ style comments
  tracing: Add SPDX License format tags to tracing files
  tracing: Add SPDX License format to bpf_trace.c
  blktrace: Add SPDX License format header
  s390/ftrace: Add -mfentry and -mnop-mcount support
  tracing: Add -mcount-nop option support
  tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
  tracing: Handle CC_FLAGS_FTRACE more accurately
  Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
  Uprobes: Simplify uprobe_register() body
  tracepoints: Free early tracepoints after RCU is initialized
  uprobes: Use synchronize_rcu() not synchronize_sched()
  tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
  ftrace: Remove unused pointer ftrace_swapper_pid
  tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
  tracing/irqsoff: Handle preempt_count for different configs
  tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
  tracing: irqsoff: Account for additional preempt_disable
  trace: Use rcu_dereference_raw for hooks from trace-event subsystem
  tracing/kprobes: Fix within_notrace_func() to check only notrace functions
  ...
2018-08-20 18:32:00 -07:00
Linus Torvalds
0a78ac4b9b The main things are support for cephx v2 authentication protocol and
basic support for rbd images within namespaces (myself).  Also included
 y2038 conversion patches from Arnd, a pile of miscellaneous fixes from
 Chengguang and Zheng's feature bit infrastructure for the filesystem.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlt62CkTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizfhB/0c/rz6frunc6EcZMWuBNzlOIOktJ/m
 MEbPGjCxMAsmidO1rqHHYF4iEN5hr+3AWTbtIL2m6wkqYVdg3FjmNaAYB27AdQMG
 kH9bLfrKIew72/NZqXfm25yjY/86kIt8t91kay4Lchc97tSYhnFSnku7iAX2HTND
 TMhq/1O/GvEyw/RmqnenJEQqFJvKnfgPPQm6W8sM2bH0T5j+EXmDT/Rv+90LogFR
 J4+pZkHqDfvyMb1WJ5MkumohytbRVzRNKcMpOvjquJSqUgtgZa2JdrIsypDqSNKY
 nUT6jGGlxoSbHCqRwDJoFEJOlh5A9RwKqYxNuM2a/vs9u7HpvdCK/Iah
 =AtgY
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main things are support for cephx v2 authentication protocol and
  basic support for rbd images within namespaces (myself).

  Also included are y2038 conversion patches from Arnd, a pile of
  miscellaneous fixes from Chengguang and Zheng's feature bit
  infrastructure for the filesystem"

* tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits)
  ceph: don't drop message if it contains more data than expected
  ceph: support cephfs' own feature bits
  crush: fix using plain integer as NULL warning
  libceph: remove unnecessary non NULL check for request_key
  ceph: refactor error handling code in ceph_reserve_caps()
  ceph: refactor ceph_unreserve_caps()
  ceph: change to void return type for __do_request()
  ceph: compare fsc->max_file_size and inode->i_size for max file size limit
  ceph: add additional size check in ceph_setattr()
  ceph: add additional offset check in ceph_write_iter()
  ceph: add additional range check in ceph_fallocate()
  ceph: add new field max_file_size in ceph_fs_client
  libceph: weaken sizeof check in ceph_x_verify_authorizer_reply()
  libceph: check authorizer reply/challenge length before reading
  libceph: implement CEPHX_V2 calculation mode
  libceph: add authorizer challenge
  libceph: factor out encrypt_authorizer()
  libceph: factor out __ceph_x_decrypt()
  libceph: factor out __prepare_write_connect()
  libceph: store ceph_auth_handshake pointer in ceph_connection
  ...
2018-08-20 18:26:55 -07:00
Jan Kara
d3bc0fa841 fsnotify: fix false positive warning on inode delete
When inode is getting deleted and someone else holds reference to a mark
attached to the inode, we just detach the connector from the inode. In
that case fsnotify_put_mark() called from fsnotify_destroy_marks() will
decide to recalculate mask for the inode and __fsnotify_recalc_mask()
will WARN about invalid connector type:

WARNING: CPU: 1 PID: 12015 at fs/notify/mark.c:139
__fsnotify_recalc_mask+0x2d7/0x350 fs/notify/mark.c:139

Actually there's no reason to warn about detached connector in
__fsnotify_recalc_mask() so just silently skip updating the mask in such
case.

Reported-by: syzbot+c34692a51b9a6ca93540@syzkaller.appspotmail.com
Fixes: 3ac70bfcde ("fsnotify: add helper to get mask from connector")
Signed-off-by: Jan Kara <jack@suse.cz>
2018-08-20 13:55:45 +02:00
Linus Torvalds
a18d783fed Driver core patches for 4.19-rc1
Here are all of the driver core and related patches for 4.19-rc1.
 
 Nothing huge here, just a number of small cleanups and the ability to
 now stop the deferred probing after init happens.
 
 All of these have been in linux-next for a while with only a merge issue
 reported.  That merge issue is in fs/sysfs/group.c and Stephen has
 posted the diff of what it should be to resolve this.  I'll follow up
 with that diff to this pull request.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW3g86Q8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynyXQCePaZSW8wft4b7nLN8RdZ98ATBru0Ani10lrJa
 HQeQJRNbWU1AZ0ym7695
 =tOaH
 -----END PGP SIGNATURE-----

Merge tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull driver core updates from Greg KH:
 "Here are all of the driver core and related patches for 4.19-rc1.

  Nothing huge here, just a number of small cleanups and the ability to
  now stop the deferred probing after init happens.

  All of these have been in linux-next for a while with only a merge
  issue reported"

* tag 'driver-core-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (21 commits)
  base: core: Remove WARN_ON from link dependencies check
  drivers/base: stop new probing during shutdown
  drivers: core: Remove glue dirs from sysfs earlier
  driver core: remove unnecessary function extern declare
  sysfs.h: fix non-kernel-doc comment
  PM / Domains: Stop deferring probe at the end of initcall
  iommu: Remove IOMMU_OF_DECLARE
  iommu: Stop deferring probe at end of initcalls
  pinctrl: Support stopping deferred probe after initcalls
  dt-bindings: pinctrl: add a 'pinctrl-use-default' property
  driver core: allow stopping deferred probe after init
  driver core: add a debugfs entry to show deferred devices
  sysfs: Fix internal_create_group() for named group updates
  base: fix order of OF initialization
  linux/device.h: fix kernel-doc notation warning
  Documentation: update firmware loader fallback reference
  kobject: Replace strncpy with memcpy
  drivers: base: cacheinfo: use OF property_read_u32 instead of get_property,read_number
  kernfs: Replace strncpy with memcpy
  device: Add #define dev_fmt similar to #define pr_fmt
  ...
2018-08-18 11:44:53 -07:00
Ingo Molnar
5804b11034 perf/core improvements ad fixes:
kernel:
 
 . kallsyms, x86: Export addresses of PTI entry trampolines (Alexander Shishkin)
 
 . kallsyms: Simplify update_iter_mod() (Adrian Hunter)
 
 . x86: Add entry trampolines to kcore (Adrian Hunter)
 
 Hardware tracing:
 
 . Fix auxtrace queue resize (Adrian Hunter)
 
 Arch specific:
 
 . Fix uninitialized ARM SPE record error variable (Kim Phillips)
 
 . Fix trace event post-processing in powerpc (Sandipan Das)
 
 Build:
 
 . Fix check-headers.sh AND list path of execution (Alexander Kapshuk)
 
 . Remove -mcet and -fcf-protection when building the python binding
   with older clang versions (Arnaldo Carvalho de Melo)
 
 . Make check-headers.sh check based on kernel dir (Jiri Olsa)
 
 . Move syscall_64.tbl check into check-headers.sh (Jiri Olsa)
 
 Infrastructure:
 
 . Check for null when copying nsinfo.  (Benno Evers)
 
 Libraries:
 
 . Rename libtraceevent prefixes, prep work for making it a shared
   library generaly available (Tzvetomir Stoyanov (VMware))
 
 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELb9bqkb7Te0zijNb1lAW81NSqkAFAlt0OP8ACgkQ1lAW81NS
 qkD0lg/8Dcjd6bCgHzrRJYcR4VUNgHq4LnpJEp3VVvtV29AmptVW3hOCF6siuwDI
 H8rtMzE0gflMVHae310qTaUIzo1A6/ugoRwxUKLKU8aWkA1ikl9iJn6uaTttOCIG
 H4a/mExILDicGfxMk6kAPdyDbYr7r+1UoF58asrWVjPQNPxoJSALJCPtMnLK7Cn8
 qMZN68TqIL5zifbRe6UHKCH/SmDuowVmEIz4Nin3QtwKFPH+I02TtSdYkNWTC5WK
 o469/zy9cceA6a8Q+bVEUP0OD1mU8BvRnZogOiZ5SdMiYZlDkFSqG5MzgTtJUXQC
 DxOKkANdWu7zON/KywGDX8kcIBySzd5toTiXLvsHNNhxR8pT3bU57QvrscqLIMX+
 SVbCR03h5EwLeJopvDKZjwtcSwCWp1aCXCrmLP04tcB1zi+mUIRohbzf2vZDPfwo
 IRtoHvPAgV9UAA7M+UAAtNc7G0Gg/K2c6LlfiuxDhgjk9Jqg4NbOz3cn6rvXtGQX
 B4RM9pdhoNi9tqrXNYCnLzinzKtPhWjyEbb0FBlgvXTNQNiDVjhrkt3pC1fH1zK9
 GX1F6L78x24z3bZl5hUyYXmxOkktpAeprjICXAikYvwwige6aNENVLhCI/fVFHcx
 ByxXbFMom/dSIgU0tBfdqktd7ZmSLm94obSuA6BW8/htf2JIztM=
 =W1go
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-for-mingo-4.19-20180815' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent

Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:

kernel:

- kallsyms, x86: Export addresses of PTI entry trampolines (Alexander Shishkin)

- kallsyms: Simplify update_iter_mod() (Adrian Hunter)

- x86: Add entry trampolines to kcore (Adrian Hunter)

Hardware tracing:

- Fix auxtrace queue resize (Adrian Hunter)

Arch specific:

- Fix uninitialized ARM SPE record error variable (Kim Phillips)

- Fix trace event post-processing in powerpc (Sandipan Das)

Build:

- Fix check-headers.sh AND list path of execution (Alexander Kapshuk)

- Remove -mcet and -fcf-protection when building the python binding
  with older clang versions (Arnaldo Carvalho de Melo)

- Make check-headers.sh check based on kernel dir (Jiri Olsa)

- Move syscall_64.tbl check into check-headers.sh (Jiri Olsa)

Infrastructure:

- Check for null when copying nsinfo.  (Benno Evers)

Libraries:

- Rename libtraceevent prefixes, prep work for making it a shared
  library generaly available (Tzvetomir Stoyanov (VMware))

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-18 13:11:51 +02:00
Linus Torvalds
1f7a4c73a7 Pull request for inclusion in 4.19, take two
This tag is the same as 9p-for-4.19 without the two MAINTAINERS patches
 
 Contains mostly fixes (6 to be backported to stable) and a few changes,
 here is the breakdown:
  * Rework how fids are attributed by replacing some custom tracking in a
 list by an idr (f28cdf0430)
  * For packet-based transports (virtio/rdma) validate that the packet
 length matches what the header says (f984579a01)
  * A few race condition fixes found by syzkaller (9f476d7c54,
 430ac66eb4)
  * Missing argument check when NULL device is passed in sys_mount
 (10aa14527f)
  * A few virtio fixes (23cba9cbde, 31934da810, d28c756cae)
  * Some spelling and style fixes
 
 ----------------------------------------------------------------
 Chirantan Ekbote (1):
       9p/net: Fix zero-copy path in the 9p virtio transport
 
 Colin Ian King (1):
       fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
 
 Jean-Philippe Brucker (1):
       net/9p: fix error path of p9_virtio_probe
 
 Matthew Wilcox (4):
       9p: Fix comment on smp_wmb
       9p: Change p9_fid_create calling convention
       9p: Replace the fidlist with an IDR
       9p: Embed wait_queue_head into p9_req_t
 
 Souptick Joarder (1):
       fs/9p/vfs_file.c: use new return type vm_fault_t
 
 Stephen Hemminger (1):
       9p: fix whitespace issues
 
 Tomas Bortoli (5):
       net/9p/client.c: version pointer uninitialized
       net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
       net/9p/trans_fd.c: fix race by holding the lock
       9p: validate PDU length
       9p: fix multiple NULL-pointer-dereferences
 
 jiangyiwen (2):
       net/9p/virtio: Fix hard lockup in req_done
       9p/virtio: fix off-by-one error in sg list bounds check
 
 piaojun (5):
       net/9p/client.c: add missing '\n' at the end of p9_debug()
       9p/net/protocol.c: return -ENOMEM when kmalloc() failed
       net/9p/trans_virtio.c: fix some spell mistakes in comments
       fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
       net/9p/trans_virtio.c: add null terminal for mount tag
 
  fs/9p/v9fs.c            |   2 +-
  fs/9p/vfs_file.c        |   2 +-
  fs/9p/xattr.c           |   6 ++++--
  include/net/9p/client.h |  11 ++++-------
  net/9p/client.c         | 119 +++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------------------------------
  net/9p/protocol.c       |   2 +-
  net/9p/trans_fd.c       |  22 +++++++++++++++-------
  net/9p/trans_rdma.c     |   4 ++++
  net/9p/trans_virtio.c   |  66 +++++++++++++++++++++++++++++++++++++---------------------------
  net/9p/trans_xen.c      |   3 +++
  net/9p/util.c           |   1 -
  12 files changed, 122 insertions(+), 116 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 
 iF0EABECAB0WIQQ8idm2ZSicIMLgzKqoqIItDqvwPAUCW3ElNwAKCRCoqIItDqvw
 PMzfAKCkCYFyNC89vcpxcCNsK7rFQ1qKlwCgoaBpZDdegOu0jMB7cyKwAWrB0LM=
 =h3T0
 -----END PGP SIGNATURE-----

Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "This contains mostly fixes (6 to be backported to stable) and a few
  changes, here is the breakdown:

   - rework how fids are attributed by replacing some custom tracking in
     a list by an idr

   - for packet-based transports (virtio/rdma) validate that the packet
     length matches what the header says

   - a few race condition fixes found by syzkaller

   - missing argument check when NULL device is passed in sys_mount

   - a few virtio fixes

   - some spelling and style fixes"

* tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits)
  net/9p/trans_virtio.c: add null terminal for mount tag
  9p/virtio: fix off-by-one error in sg list bounds check
  9p: fix whitespace issues
  9p: fix multiple NULL-pointer-dereferences
  fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
  9p: validate PDU length
  net/9p/trans_fd.c: fix race by holding the lock
  net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
  net/9p/virtio: Fix hard lockup in req_done
  net/9p/trans_virtio.c: fix some spell mistakes in comments
  9p/net: Fix zero-copy path in the 9p virtio transport
  9p: Embed wait_queue_head into p9_req_t
  9p: Replace the fidlist with an IDR
  9p: Change p9_fid_create calling convention
  9p: Fix comment on smp_wmb
  net/9p/client.c: version pointer uninitialized
  fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
  net/9p: fix error path of p9_virtio_probe
  9p/net/protocol.c: return -ENOMEM when kmalloc() failed
  net/9p/client.c: add missing '\n' at the end of p9_debug()
  ...
2018-08-17 17:27:58 -07:00
Linus Torvalds
6ada4e2826 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - a few Y2038 fixes

 - ntfs fixes

 - arch/sh tweaks

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
  mm/hmm.c: remove unused variables align_start and align_end
  fs/userfaultfd.c: remove redundant pointer uwq
  mm, vmacache: hash addresses based on pmd
  mm/list_lru: introduce list_lru_shrink_walk_irq()
  mm/list_lru.c: pass struct list_lru_node* as an argument to __list_lru_walk_one()
  mm/list_lru.c: move locking from __list_lru_walk_one() to its caller
  mm/list_lru.c: use list_lru_walk_one() in list_lru_walk_node()
  mm, swap: make CONFIG_THP_SWAP depend on CONFIG_SWAP
  mm/sparse: delete old sparse_init and enable new one
  mm/sparse: add new sparse_init_nid() and sparse_init()
  mm/sparse: move buffer init/fini to the common place
  mm/sparse: use the new sparse buffer functions in non-vmemmap
  mm/sparse: abstract sparse buffer allocations
  mm/hugetlb.c: don't zero 1GiB bootmem pages
  mm, page_alloc: double zone's batchsize
  mm/oom_kill.c: document oom_lock
  mm/hugetlb: remove gigantic page support for HIGHMEM
  mm, oom: remove sleep from under oom_lock
  kernel/dma: remove unsupported gfp_mask parameter from dma_alloc_from_contiguous()
  mm/cma: remove unsupported gfp_mask parameter from cma_alloc()
  ...
2018-08-17 16:49:31 -07:00
Colin Ian King
5241d47274 fs/userfaultfd.c: remove redundant pointer uwq
Pointer uwq is being assigned but is never used hence it is redundant
and can be removed.

Cleans up clang warning:
  warning: variable 'uwq' set but not used [-Wunused-but-set-variable]

Link: http://lkml.kernel.org/r/20180717090802.18357-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:32 -07:00
Kirill Tkhai
9b996468cf mm: add SHRINK_EMPTY shrinker methods return value
We need to distinguish the situations when shrinker has very small
amount of objects (see vfs_pressure_ratio() called from
super_cache_count()), and when it has no objects at all.  Currently, in
the both of these cases, shrinker::count_objects() returns 0.

The patch introduces new SHRINK_EMPTY return value, which will be used
for "no objects at all" case.  It's is a refactoring mostly, as
SHRINK_EMPTY is replaced by 0 by all callers of do_shrink_slab() in this
patch, and all the magic will happen in further.

Link: http://lkml.kernel.org/r/153063069574.1818.11037751256699341813.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00
Kirill Tkhai
c92e8e10ca fs: propagate shrinker::id to list_lru
Add list_lru::shrinker_id field and populate it by registered shrinker
id.

This will be used to set correct bit in memcg shrinkers map by lru code
in next patches, after there appeared the first related to memcg element
in list_lru.

Link: http://lkml.kernel.org/r/153063059758.1818.14866596416857717800.stgit@localhost.localdomain
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Tested-by: Shakeel Butt <shakeelb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Li RongQing <lirongqing@baidu.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Sahitya Tummala <stummala@codeaurora.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-17 16:20:31 -07:00